How do different AI models perform when paired with real developers on real coding tasks? Unlike static benchmarks, this leaderboard measures human + AI collaboration with objective, code-level attribution.
Full transparency on every metric. No black-box scores — here's exactly how each number is computed.
What percentage of AI-generated code is still present in the final submission? Measured via line-level CodeSegment attribution — we track which model wrote each line, then compare at submit time. Code that was applied but later deleted, overwritten, or reverted counts against survival.
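For concreteness, here is a minimal sketch of how such a survival rate could be computed, assuming each applied line carries a stable ID and a model attribution. The names below (`AttributedLine`, `survival_rate`) are hypothetical illustrations, not AlgoArena's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AttributedLine:
    line_id: str   # stable ID assigned when the line is first applied
    model: str     # model that authored the line

def survival_rate(applied: list[AttributedLine],
                  final_line_ids: set[str], model: str) -> float:
    """Percentage of `model`'s applied lines still present at submit time."""
    authored = [ln for ln in applied if ln.model == model]
    if not authored:
        return 0.0
    surviving = sum(1 for ln in authored if ln.line_id in final_line_ids)
    return 100.0 * surviving / len(authored)

# Example: the model wrote 4 lines; one was later deleted -> 75% survival.
applied = [AttributedLine(f"L{i}", "model-a") for i in range(4)]
print(survival_rate(applied, {"L0", "L1", "L2"}, "model-a"))  # 75.0
```

Deleted, overwritten, and reverted lines all fall out of `final_line_ids`, so they count against survival without any special-casing.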
What percentage of tests pass immediately after the user applies this model's code? Tests auto-run on every "Keep" action, giving a direct causal link between a specific prompt and test results. Only measured for questions with test cases.
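As a rough sketch, the per-Keep rate reduces to a simple pass ratio over the auto-run's results. The function name and the boolean result representation here are illustrative assumptions:

```python
def first_pass_rate(results: list[bool]) -> float:
    """Share of tests passing on the run triggered by a single Keep action."""
    return 100.0 * sum(results) / len(results) if results else 0.0

# Example: the auto-run after one Keep passed 7 of 10 tests.
print(first_pass_rate([True] * 7 + [False] * 3))  # 70.0
```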
The end-to-end session score across functionality (50%), code quality (20%), design (20%), and process efficiency (10%). Quality and design scores come from a 3-judge ensemble across different providers to reduce single-model bias.
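A minimal sketch of the weighted composite, using the weights stated above. The text doesn't specify how the 3-judge scores are aggregated, so the mean used here is an assumption:

```python
from statistics import mean

WEIGHTS = {"functionality": 0.5, "quality": 0.2, "design": 0.2, "efficiency": 0.1}

def composite_score(functionality: float, quality_judges: list[float],
                    design_judges: list[float], efficiency: float) -> float:
    """Weighted session score on a 0-100 scale.
    Quality and design come from a 3-judge ensemble; averaging the three
    judges is an assumption, not a documented rule."""
    parts = {
        "functionality": functionality,
        "quality": mean(quality_judges),
        "design": mean(design_judges),
        "efficiency": efficiency,
    }
    return sum(WEIGHTS[k] * v for k, v in parts.items())

# Example: strong test results, middling judge scores.
print(composite_score(90, [70, 75, 65], [80, 85, 75], 60))  # 81.0
```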
All values normalized to a 0-100 scale. Models are ranked by composite score descending. Minimum 10 sessions required before a model appears in rankings.
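Illustratively, the ranking step might look like the sketch below. Averaging composite scores per model is an assumption; only the 0-100 scale, the descending order, and the 10-session floor come from the text:

```python
def rank_models(sessions: dict[str, list[float]],
                min_sessions: int = 10) -> list[tuple[str, float]]:
    """Rank models by mean composite score, descending.
    Models with fewer than `min_sessions` sessions are excluded."""
    eligible = {m: s for m, s in sessions.items() if len(s) >= min_sessions}
    return sorted(((m, sum(s) / len(s)) for m, s in eligible.items()),
                  key=lambda item: item[1], reverse=True)

# Example: model-b is excluded for having too few sessions.
sessions = {"model-a": [80.0] * 12, "model-b": [95.0] * 3}
print(rank_models(sessions))  # [('model-a', 80.0)]
```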
No LLM-as-judge. We track which model wrote each piece of code at the character level, then measure whether that code survived to the final submission. The code is either there or it isn't.
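As a toy illustration of what character-level tracking involves, attribution spans have to be adjusted every time the buffer is edited. This sketch handles only insertions and uses hypothetical names (`Span`, `apply_insert`); it is not AlgoArena's implementation:

```python
from dataclasses import dataclass

@dataclass
class Span:
    start: int  # character offset in the buffer
    end: int    # exclusive end offset
    model: str  # model attributed to these characters

def apply_insert(spans: list[Span], offset: int, length: int) -> list[Span]:
    """Adjust attribution spans after `length` chars are inserted at `offset`."""
    out = []
    for s in spans:
        if s.end <= offset:            # entirely before the insert: unchanged
            out.append(s)
        elif s.start >= offset:        # entirely after: shift right
            out.append(Span(s.start + length, s.end + length, s.model))
        else:                          # insert lands inside: split the span
            out.append(Span(s.start, offset, s.model))
            out.append(Span(offset + length, s.end + length, s.model))
    return out

# One span authored by model-a; a human inserts 5 chars at offset 3.
print(apply_insert([Span(0, 10, "model-a")], 3, 5))
# [Span(start=0, end=3, ...), Span(start=8, end=15, ...)]
```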
Unlike LMSYS Arena or LiveBench, which test models in isolation, this measures how effectively a real developer paired with an AI model can build working software. The human's prompting skill matters.
85+ problems across 10 categories, including web apps, APIs, data pipelines, and games. Not isolated puzzles. This tests practical engineering ability.
Data is collected from AlgoArena OA coding assessments. Each data point represents a real candidate completing a timed assessment with their chosen AI model. Rankings refresh weekly.