Neo-Brutalist Benchmark Board
Run prefix: 2026-03-02T12-44-57Z. Real benchmark outputs only.
Result KPIs
Retro Method Cards
Selection
Deterministic task-instance sampling.
Generation
Tracked tokens, cost, retries, and status.
Patch safety
Extract, normalize, and validate patches before the harness runs.
Ranking
Rank by score first, then break ties by cost/tokens.
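The ranking rule above can be sketched as a sort key. This is a minimal sketch, not the board's actual implementation: `RunResult` and `rank_runs` are hypothetical names, and both the ordering of cost before tokens and the final tie-break on `model_id` are assumptions. The board states only that score comes first and that cost/tokens break ties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Hypothetical per-run record; field names are assumptions."""
    model_id: str
    score: float
    cost_usd: float
    tokens: int


def rank_runs(runs):
    """Sort best-first: higher score, then lower cost, then fewer tokens.

    The trailing model_id key is an assumed deterministic fallback so that
    exact ties always resolve the same way.
    """
    return sorted(
        runs,
        key=lambda r: (-r.score, r.cost_usd, r.tokens, r.model_id),
    )


# Three runs that tie on score, cost, and tokens, as on this board.
runs = [
    RunResult("gpt-5.3-codex-thinking-mid", 0, 0.002670, 267),
    RunResult("claude-sonnet-4-6", 0, 0.002670, 267),
    RunResult("claude-opus-4.6", 0, 0.002670, 267),
]

winner = rank_runs(runs)[0]
print(winner.model_id)
```

When every run ties on score, cost, and tokens, only the fallback key decides the winner, so a result like this says nothing about which model performed better.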
Winners
Task: backend-auth-refresh
Winner: claude-opus-4.6
Final score: 0
Cost: $0.002670
Tokens: 267
Task: ui-settings-flow
Winner: claude-opus-4.6
Final score: 0
Cost: $0.002650
Tokens: 265
Model Leaderboard
| model_id | runs | completion_rate | resolved_rate | avg_score | best_score | avg_cost | total_tokens | avg_duration | patch_invalid | patch_apply_failed | review_trigger_rate | avg_confidence | report_found_rate | patch_applied_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| claude-opus-4.6 | 2 | 100.0% | 0.0% | 0 | 0 | $0.002660 | 532 | 0 ms | 0 | 0 | 100.0% | 0.4714 | 0.0% | 0.0% |
| claude-sonnet-4-6 | 2 | 100.0% | 0.0% | 0 | 0 | $0.002660 | 532 | 0 ms | 0 | 0 | 100.0% | 0.4714 | 0.0% | 0.0% |
| gpt-5.3-codex-thinking-mid | 2 | 100.0% | 0.0% | 0 | 0 | $0.002660 | 532 | 0 ms | 0 | 0 | 100.0% | 0.4714 | 0.0% | 0.0% |
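The leaderboard columns read as per-model aggregates of the per-run records. The sketch below shows one way `runs`, `completion_rate`, `avg_score`, `best_score`, `avg_cost`, and `total_tokens` might be derived. The record keys and the `aggregate` helper are hypothetical. The two sample records reuse the claude-opus-4.6 cost and token figures from the Winners cards.

```python
from collections import defaultdict
from statistics import mean

# Sample per-run records (keys are assumed, values taken from the Winners cards).
sample_runs = [
    {"model_id": "claude-opus-4.6", "task": "backend-auth-refresh",
     "status": "completed", "score": 0, "cost": 0.002670, "tokens": 267},
    {"model_id": "claude-opus-4.6", "task": "ui-settings-flow",
     "status": "completed", "score": 0, "cost": 0.002650, "tokens": 265},
]


def aggregate(run_records):
    """Group run records by model_id and compute leaderboard-style columns."""
    by_model = defaultdict(list)
    for record in run_records:
        by_model[record["model_id"]].append(record)

    rows = {}
    for model_id, records in by_model.items():
        completed = sum(r["status"] == "completed" for r in records)
        rows[model_id] = {
            "runs": len(records),
            "completion_rate": completed / len(records),
            "avg_score": mean(r["score"] for r in records),
            "best_score": max(r["score"] for r in records),
            "avg_cost": mean(r["cost"] for r in records),
            "total_tokens": sum(r["tokens"] for r in records),
        }
    return rows


row = aggregate(sample_runs)["claude-opus-4.6"]
print(f"avg_cost=${row['avg_cost']:.6f} total_tokens={row['total_tokens']}")
```

With these two records the averages come out to the values in the table row for claude-opus-4.6: (0.002670 + 0.002650) / 2 = 0.002660 and 267 + 265 = 532.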
Run Ledger
status · score · cost · duration
| run | task | model | status | score | cost | duration |
|---|---|---|---|---|---|---|
| 2026-03-02T12-44-57Z_internal-swe-bench-v1_backend-auth-refresh_claude-opus-4.6_3f111e46 | backend-auth-refresh | claude-opus-4.6 | completed | 0 | $0.002670 | 0 ms |
| 2026-03-02T12-44-57Z_internal-swe-bench-v1_backend-auth-refresh_claude-sonnet-4-6_e83cd447 | backend-auth-refresh | claude-sonnet-4-6 | completed | 0 | $0.002670 | 0 ms |
| 2026-03-02T12-44-57Z_internal-swe-bench-v1_backend-auth-refresh_gpt-5.3-codex-thinking-mid_d3bf2754 | backend-auth-refresh | gpt-5.3-codex-thinking-mid | completed | 0 | $0.002670 | 0 ms |
| 2026-03-02T12-44-57Z_internal-swe-bench-v1_ui-settings-flow_claude-opus-4.6_a9c944bd | ui-settings-flow | claude-opus-4.6 | completed | 0 | $0.002650 | 0 ms |
| 2026-03-02T12-44-57Z_internal-swe-bench-v1_ui-settings-flow_claude-sonnet-4-6_f242aec8 | ui-settings-flow | claude-sonnet-4-6 | completed | 0 | $0.002650 | 0 ms |
| 2026-03-02T12-44-57Z_internal-swe-bench-v1_ui-settings-flow_gpt-5.3-codex-thinking-mid_ea8f3ad7 | ui-settings-flow | gpt-5.3-codex-thinking-mid | completed | 0 | $0.002650 | 0 ms |
Artifacts
Summary files
Per-run JSON
- 2026-03-02T12-44-57Z_internal-swe-bench-v1_backend-auth-refresh_claude-opus-4.6_3f111e46 (claude-opus-4.6 · completed)
- 2026-03-02T12-44-57Z_internal-swe-bench-v1_backend-auth-refresh_claude-sonnet-4-6_e83cd447 (claude-sonnet-4-6 · completed)
- 2026-03-02T12-44-57Z_internal-swe-bench-v1_backend-auth-refresh_gpt-5.3-codex-thinking-mid_d3bf2754 (gpt-5.3-codex-thinking-mid · completed)
- 2026-03-02T12-44-57Z_internal-swe-bench-v1_ui-settings-flow_claude-opus-4.6_a9c944bd (claude-opus-4.6 · completed)
- 2026-03-02T12-44-57Z_internal-swe-bench-v1_ui-settings-flow_claude-sonnet-4-6_f242aec8 (claude-sonnet-4-6 · completed)
- 2026-03-02T12-44-57Z_internal-swe-bench-v1_ui-settings-flow_gpt-5.3-codex-thinking-mid_ea8f3ad7 (gpt-5.3-codex-thinking-mid · completed)
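The run identifiers listed above appear to follow a five-part, underscore-separated layout: run prefix, suite, task, model, and a short hash. The sketch below splits an identifier on that assumption. The layout is inferred from the IDs shown here, not from a documented schema. `RunId`, `parse_run_id`, and `parse_run_prefix` are hypothetical names, and reading the prefix as a UTC timestamp is a guess based on its trailing `Z`.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class RunId(NamedTuple):
    """Assumed components of a run identifier."""
    run_prefix: str
    suite: str
    task: str
    model_id: str
    short_hash: str


def parse_run_id(run_id: str) -> RunId:
    """Split a run ID on underscores; fields themselves use hyphens and dots."""
    parts = run_id.split("_")
    if len(parts) != 5:
        raise ValueError(f"expected 5 underscore-separated fields, got {len(parts)}")
    return RunId(*parts)


def parse_run_prefix(run_prefix: str) -> datetime:
    """Interpret the prefix as a UTC timestamp with hyphens in the time part."""
    parsed = datetime.strptime(run_prefix, "%Y-%m-%dT%H-%M-%SZ")
    return parsed.replace(tzinfo=timezone.utc)


example = parse_run_id(
    "2026-03-02T12-44-57Z_internal-swe-bench-v1_backend-auth-refresh_claude-opus-4.6_3f111e46"
)
print(example.task, example.model_id, parse_run_prefix(example.run_prefix).isoformat())
```

This split breaks if a task or model name ever contains an underscore, so a real pipeline would more likely read these fields from the per-run JSON than from the file name.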