onBench Report

Generated: 2026-03-02T12:46:19.292681+00:00

Total runs: 6

Winners by task

Runs

run_id | task_id | model_id | status | final_score | cost_usd | total_tokens | review_reasons
2026-03-02T12-44-57Z_internal-swe-bench-v1_backend-auth-refresh_claude-opus-4.6_3f111e46 | backend-auth-refresh | claude-opus-4.6 | completed | 0.0 | 0.00267 | 267 | top2_quality_gap_lte_2.0, auto_confidence_lt_threshold
2026-03-02T12-44-57Z_internal-swe-bench-v1_backend-auth-refresh_claude-sonnet-4-6_e83cd447 | backend-auth-refresh | claude-sonnet-4-6 | completed | 0.0 | 0.00267 | 267 | top2_quality_gap_lte_2.0, auto_confidence_lt_threshold
2026-03-02T12-44-57Z_internal-swe-bench-v1_backend-auth-refresh_gpt-5.3-codex-thinking-mid_d3bf2754 | backend-auth-refresh | gpt-5.3-codex-thinking-mid | completed | 0.0 | 0.00267 | 267 | top2_quality_gap_lte_2.0, auto_confidence_lt_threshold
2026-03-02T12-44-57Z_internal-swe-bench-v1_ui-settings-flow_claude-opus-4.6_a9c944bd | ui-settings-flow | claude-opus-4.6 | completed | 0.0 | 0.00265 | 265 | top2_quality_gap_lte_2.0, auto_confidence_lt_threshold
2026-03-02T12-44-57Z_internal-swe-bench-v1_ui-settings-flow_claude-sonnet-4-6_f242aec8 | ui-settings-flow | claude-sonnet-4-6 | completed | 0.0 | 0.00265 | 265 | top2_quality_gap_lte_2.0, auto_confidence_lt_threshold
2026-03-02T12-44-57Z_internal-swe-bench-v1_ui-settings-flow_gpt-5.3-codex-thinking-mid_ea8f3ad7 | ui-settings-flow | gpt-5.3-codex-thinking-mid | completed | 0.0 | 0.00265 | 265 | top2_quality_gap_lte_2.0, auto_confidence_lt_threshold