Generated: 2026-03-02T12:46:19.292681+00:00

Total runs: 6

| run_id | task_id | model_id | status | final_score | cost_usd | total_tokens | review_reasons |
|---|---|---|---|---|---|---|---|
| 2026-03-02T12-44-57Z_internal-swe-bench-v1_backend-auth-refresh_claude-opus-4.6_3f111e46 | backend-auth-refresh | claude-opus-4.6 | completed | 0.0 | 0.00267 | 267 | top2_quality_gap_lte_2.0, auto_confidence_lt_threshold |
| 2026-03-02T12-44-57Z_internal-swe-bench-v1_backend-auth-refresh_claude-sonnet-4-6_e83cd447 | backend-auth-refresh | claude-sonnet-4-6 | completed | 0.0 | 0.00267 | 267 | top2_quality_gap_lte_2.0, auto_confidence_lt_threshold |
| 2026-03-02T12-44-57Z_internal-swe-bench-v1_backend-auth-refresh_gpt-5.3-codex-thinking-mid_d3bf2754 | backend-auth-refresh | gpt-5.3-codex-thinking-mid | completed | 0.0 | 0.00267 | 267 | top2_quality_gap_lte_2.0, auto_confidence_lt_threshold |
| 2026-03-02T12-44-57Z_internal-swe-bench-v1_ui-settings-flow_claude-opus-4.6_a9c944bd | ui-settings-flow | claude-opus-4.6 | completed | 0.0 | 0.00265 | 265 | top2_quality_gap_lte_2.0, auto_confidence_lt_threshold |
| 2026-03-02T12-44-57Z_internal-swe-bench-v1_ui-settings-flow_claude-sonnet-4-6_f242aec8 | ui-settings-flow | claude-sonnet-4-6 | completed | 0.0 | 0.00265 | 265 | top2_quality_gap_lte_2.0, auto_confidence_lt_threshold |
| 2026-03-02T12-44-57Z_internal-swe-bench-v1_ui-settings-flow_gpt-5.3-codex-thinking-mid_ea8f3ad7 | ui-settings-flow | gpt-5.3-codex-thinking-mid | completed | 0.0 | 0.00265 | 265 | top2_quality_gap_lte_2.0, auto_confidence_lt_threshold |