
AI Model Benchmark

How do different AI models perform when paired with real developers on real coding tasks? Unlike static benchmarks, this measures human + AI collaboration with objective, code-level attribution.

Min Sessions for Ranking
10

Model Leaderboard


How We Calculate Rankings

Full transparency on every metric. No black-box scores — here's exactly how each number is computed.

Primary Metrics (used for ranking)
Code Survival Rate (40% weight)

What percentage of AI-generated code is still present in the final submission? Measured via line-level CodeSegment attribution — we track which model wrote each line, then compare at submit time. Code that was applied but later deleted, overwritten, or reverted counts against survival.

survival = lines_at_submit / lines_when_applied
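The ratio above can be sketched as a small helper; here it is expressed on the 0–100 scale the composite formula expects. The function name and zero-line handling are illustrative assumptions, not the actual implementation.

```python
def survival_rate(lines_when_applied: int, lines_at_submit: int) -> float:
    """Percentage of AI-applied lines still present at submit (0-100)."""
    if lines_when_applied == 0:
        return 0.0  # model contributed no code in this session
    return 100.0 * lines_at_submit / lines_when_applied
```

For example, a model whose 120 applied lines shrink to 90 by submit time scores 75.0.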
Test Pass Rate After Apply (35% weight)

What percentage of tests pass immediately after the user applies this model's code? Tests auto-run on every "Keep" action, giving a direct causal link between a specific prompt and test results. Only measured for questions with test cases.

pass_rate = tests_passed / tests_total (per prompt)
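Since the pass rate is measured per prompt, an aggregate for a model presumably averages the per-prompt rates. A minimal sketch, assuming results arrive as (tests_passed, tests_total) pairs; the skip rule for untested questions mirrors the note above:

```python
def avg_pass_rate(prompt_results: list[tuple[int, int]]) -> float:
    """Average per-prompt pass rate (%) over (tests_passed, tests_total) pairs.

    Prompts on questions without test cases (total == 0) are skipped,
    per the "only measured for questions with test cases" rule.
    """
    rates = [100.0 * passed / total for passed, total in prompt_results if total > 0]
    return sum(rates) / len(rates) if rates else 0.0
```

Note the average is over prompts, not over total tests, so a prompt with 2 tests weighs the same as one with 20.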
Final Score (25% weight)

The end-to-end session score across functionality (50%), code quality (20%), design (20%), and process efficiency (10%). Quality and design scores come from a 3-judge ensemble across different providers to eliminate single-model bias.

score = weighted_avg(functionality, quality, design, process)
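The weighted average spelled out with the stated weights, assuming each component score is on a 0–1 scale (the signature is a sketch, not the real API):

```python
def final_score(functionality: float, quality: float, design: float, process: float) -> float:
    """Weighted end-to-end session score; all inputs on a 0-1 scale."""
    return (0.50 * functionality
            + 0.20 * quality
            + 0.20 * design
            + 0.10 * process)
```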
Composite Rank Formula
composite = (code_survival × 0.40) + (test_pass_rate × 0.35) + (final_score × 100 × 0.25)

All values normalized to a 0-100 scale. Models are ranked by composite score descending. Minimum 10 sessions required before a model appears in rankings.
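Putting the composite formula and the eligibility rule together, a hedged sketch (field names like "sessions" and "survival" are assumptions about the data shape, not the real schema):

```python
MIN_SESSIONS = 10

def composite(code_survival: float, test_pass_rate: float, final_score: float) -> float:
    """Composite rank score; survival and pass_rate are 0-100, final_score is 0-1."""
    return 0.40 * code_survival + 0.35 * test_pass_rate + 0.25 * (final_score * 100)

def rank(models: list[dict]) -> list[dict]:
    """Drop models under the session minimum, then sort by composite descending."""
    eligible = [m for m in models if m["sessions"] >= MIN_SESSIONS]
    return sorted(
        eligible,
        key=lambda m: composite(m["survival"], m["pass_rate"], m["score"]),
        reverse=True,
    )
```

Multiplying final_score by 100 puts all three terms on the same 0–100 scale before weighting.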

Secondary Metrics (diagnostic, click row to view)
Acceptance Rate
How often users kept the model's code. Noisy signal — users often "keep" code just to test it, then revert.
Prompt Efficiency
Ratio of prompts that resulted in applied code vs. total code-mode prompts.
Avg Time to Solution
Average session duration in minutes. Varies heavily by problem difficulty.
Prompts Needed
Average number of prompts per session. Lower isn't always better — depends on task complexity.
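The secondary metrics are simple ratios and averages. As one example, prompt efficiency might look like this (a hypothetical sketch; the function name and arguments are illustrative):

```python
def prompt_efficiency(applied_prompts: int, code_prompts: int) -> float:
    """Fraction of code-mode prompts whose output the user actually applied."""
    return applied_prompts / code_prompts if code_prompts else 0.0
```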

What Makes This Different

Objective Code-Level Attribution

No LLM-as-judge. We track which model wrote which lines of code at the character level, then measure whether those lines survived to the final submission. The code is either there or it isn't.
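A toy version of the idea: tag each applied line with the model that wrote it, then count how many of those lines remain in the submission. Naive exact-line matching here stands in for the character-level CodeSegment tracking described above, and the data shapes are assumptions.

```python
from collections import Counter

def surviving_fraction(applied: dict[str, list[str]], submitted: list[str]) -> dict[str, float]:
    """Per-model fraction of applied lines still present in the final submission.

    Each submitted line can satisfy at most one applied line, so duplicated
    lines are not double-counted.
    """
    result = {}
    for model, lines in applied.items():
        pool = Counter(submitted)  # multiset of lines in the final submission
        survived = 0
        for line in lines:
            if pool[line] > 0:
                pool[line] -= 1
                survived += 1
        result[model] = survived / len(lines) if lines else 0.0
    return result
```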

Human-in-the-Loop

Unlike LMSYS Arena or LiveBench, which test models in isolation, this measures how effectively a real developer paired with an AI model can build working software. The human's prompting skill matters.

Real Multi-File Projects

85+ problems across 10 categories — web apps, APIs, data pipelines, games. Not isolated puzzles. This tests practical engineering ability.

Data is collected from AlgoArena OA coding assessments. Each data point represents a real candidate completing a timed assessment with their chosen AI model. Rankings refresh weekly.