Neo-Brutalist Benchmark Board

Run prefix: 2026-03-02T12-44-57Z. Real benchmark outputs only.

Result KPIs

total runs 6
completed 6
resolved 0
avg score 0
within target 6
within hard 6
report found 0
patch applied 0

Retro Method Cards

Selection

Deterministic task-instance sampling.
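
The board does not say how the sampling is made deterministic. One common approach, sketched here as an assumption, ranks instances by a hash of a fixed seed plus the instance id, so the same inputs give the same sample regardless of input order or platform. The function name, the seed string, and the instance ids are hypothetical.

```python
import hashlib


def sample_instances(instance_ids, k, seed="internal-swe-bench-v1"):
    """Pick k instances in a way that depends only on (seed, ids).

    Hypothetical helper: the seed value and the hash-ranking scheme are
    assumptions, not the benchmark's documented method.
    """
    def rank_key(instance_id):
        digest = hashlib.sha256(f"{seed}:{instance_id}".encode("utf-8"))
        return digest.hexdigest()

    # De-duplicate, order by hash, then take the first k.
    return sorted(set(instance_ids), key=rank_key)[:k]


if __name__ == "__main__":
    pool = ["backend-auth-refresh", "ui-settings-flow", "hypothetical-task-3"]
    print(sample_instances(pool, k=2))
```

Changing the seed gives a different but equally reproducible sample.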

Generation

Tracked tokens, cost, retries, and status.
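
A minimal sketch of per-run tracking, assuming the model call is a plain callable that returns `(text, tokens)` and that cost is a flat price per token. `GenerationRecord`, `generate_with_tracking`, `call_model`, and `price_per_token` are all hypothetical names; the board does not describe its actual client or pricing.

```python
from dataclasses import dataclass


@dataclass
class GenerationRecord:
    status: str = "pending"
    tokens: int = 0
    cost: float = 0.0
    retries: int = 0
    output: str = ""


def generate_with_tracking(call_model, prompt, price_per_token, max_retries=2):
    """Call the model, retrying on any exception, and record the outcome."""
    record = GenerationRecord()
    for attempt in range(max_retries + 1):
        record.retries = attempt  # number of failed attempts so far
        try:
            text, tokens = call_model(prompt)
        except Exception:
            # Assumption: every error is retryable. A real client would
            # likely distinguish transient from permanent failures.
            continue
        record.tokens += tokens
        record.cost += tokens * price_per_token
        record.output = text
        record.status = "completed"
        return record
    record.status = "failed"
    return record
```

The record fields map onto the ledger's status and cost columns; the retry count is kept separately so a run that completed only after retries can still be told apart.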

Patch safety

Extract, normalize, validate before harness.

Ranking

Score first, then cost/tokens tie-break.
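
A minimal sketch of that ordering as a sort key: higher score first, then lower cost, then fewer tokens. The final fallback to `model_id` is an assumption added so that full ties resolve deterministically; the board does not state what it does in that case. The sample runs are made-up values, not benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    model_id: str
    score: float
    cost: float
    tokens: int


def rank_runs(runs):
    """Best run first: score descending, then cost, tokens, model_id ascending."""
    return sorted(runs, key=lambda r: (-r.score, r.cost, r.tokens, r.model_id))


if __name__ == "__main__":
    # Hypothetical data for illustration only.
    runs = [
        RunResult("model-b", 0.5, 0.0030, 300),
        RunResult("model-a", 0.5, 0.0027, 270),
        RunResult("model-c", 0.8, 0.0100, 900),
    ]
    for run in rank_runs(runs):
        print(run.model_id, run.score, run.cost, run.tokens)
```

In this run every score is 0, so the ordering falls through entirely to the tie-break keys.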

Winners

Task

backend-auth-refresh

Winner: claude-opus-4.6

Final score: 0

Cost: $0.002670

Tokens: 267

Task

ui-settings-flow

Winner: claude-opus-4.6

Final score: 0

Cost: $0.002650

Tokens: 265

Model Leaderboard

model_id                    runs  completion_rate  resolved_rate  avg_score  best_score  avg_cost   total_tokens  avg_duration  patch_invalid  patch_apply_failed  review_trigger_rate  avg_confidence  report_found_rate  patch_applied_rate
claude-opus-4.6             2     100.0%           0.0%           0          0           $0.002660  532           0 ms          0              0                   100.0%               0.4714          0.0%               0.0%
claude-sonnet-4-6           2     100.0%           0.0%           0          0           $0.002660  532           0 ms          0              0                   100.0%               0.4714          0.0%               0.0%
gpt-5.3-codex-thinking-mid  2     100.0%           0.0%           0          0           $0.002660  532           0 ms          0              0                   100.0%               0.4714          0.0%               0.0%
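
As a sanity check on the leaderboard, the per-model figures can be recomputed from the Run Ledger. The sketch below assumes plain arithmetic means, which the board does not confirm; the `aggregate` helper and the dict layout are hypothetical. The two input rows are the `claude-opus-4.6` ledger entries.

```python
from collections import defaultdict


def aggregate(runs):
    """Group ledger rows by model and compute a few leaderboard columns."""
    by_model = defaultdict(list)
    for run in runs:
        by_model[run["model"]].append(run)

    rows = {}
    for model, items in by_model.items():
        n = len(items)
        rows[model] = {
            "runs": n,
            "completion_rate": sum(r["status"] == "completed" for r in items) / n,
            "avg_score": sum(r["score"] for r in items) / n,
            "best_score": max(r["score"] for r in items),
            "avg_cost": sum(r["cost"] for r in items) / n,
        }
    return rows


ledger = [
    {"model": "claude-opus-4.6", "task": "backend-auth-refresh",
     "status": "completed", "score": 0, "cost": 0.002670},
    {"model": "claude-opus-4.6", "task": "ui-settings-flow",
     "status": "completed", "score": 0, "cost": 0.002650},
]

if __name__ == "__main__":
    row = aggregate(ledger)["claude-opus-4.6"]
    print(f"runs={row['runs']} completion={row['completion_rate']:.1%} "
          f"avg_cost=${row['avg_cost']:.6f}")
```

Under that assumption the two ledger costs average to $0.002660, matching the `avg_cost` column.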

Run Ledger

status · score · cost · duration

2026-03-02T12-44-57Z_internal-swe-bench-v1_backend-auth-refresh_claude-opus-4.6_3f111e46

task backend-auth-refresh

model claude-opus-4.6

status completed

score 0

cost $0.002670

duration 0 ms

2026-03-02T12-44-57Z_internal-swe-bench-v1_backend-auth-refresh_claude-sonnet-4-6_e83cd447

task backend-auth-refresh

model claude-sonnet-4-6

status completed

score 0

cost $0.002670

duration 0 ms

2026-03-02T12-44-57Z_internal-swe-bench-v1_backend-auth-refresh_gpt-5.3-codex-thinking-mid_d3bf2754

task backend-auth-refresh

model gpt-5.3-codex-thinking-mid

status completed

score 0

cost $0.002670

duration 0 ms

2026-03-02T12-44-57Z_internal-swe-bench-v1_ui-settings-flow_claude-opus-4.6_a9c944bd

task ui-settings-flow

model claude-opus-4.6

status completed

score 0

cost $0.002650

duration 0 ms

2026-03-02T12-44-57Z_internal-swe-bench-v1_ui-settings-flow_claude-sonnet-4-6_f242aec8

task ui-settings-flow

model claude-sonnet-4-6

status completed

score 0

cost $0.002650

duration 0 ms

2026-03-02T12-44-57Z_internal-swe-bench-v1_ui-settings-flow_gpt-5.3-codex-thinking-mid_ea8f3ad7

task ui-settings-flow

model gpt-5.3-codex-thinking-mid

status completed

score 0

cost $0.002650

duration 0 ms

Artifacts