How do different AI models perform when paired with real developers on real coding tasks? Unlike static benchmarks, this leaderboard measures human + AI collaboration on multi-file projects.
Where LMSYS Arena and LiveBench evaluate models in isolation, this benchmark asks how effectively a real developer paired with an AI model can build working software. The human's prompting skill matters as much as the model's capability.
Problems are full-stack, multi-file projects — not isolated coding puzzles. Candidates build web apps, APIs, data pipelines, and games across 85+ problems spanning 10 categories. This tests practical engineering ability, not just algorithm knowledge.
Each session is scored across four dimensions: functionality (50%), code quality (20%), design (20%), and process efficiency (10%). The quality and design scores come from a three-judge ensemble spanning different providers to reduce single-model bias.
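As a rough sketch of how these weights could combine into one session score, assuming 0-100 rubric scores and hypothetical field names (the judge scores for quality and design are averaged before weighting):

```python
from statistics import mean

# Rubric weights from the scoring breakdown above.
WEIGHTS = {"functionality": 0.50, "quality": 0.20, "design": 0.20, "process": 0.10}

def composite_score(functionality: float, process: float,
                    quality_judges: list[float], design_judges: list[float]) -> float:
    """Combine the four rubric dimensions into a single 0-100 session score.

    Quality and design are each the mean of the judge ensemble; functionality
    and process efficiency are scored directly. (Illustrative sketch only;
    the field names and 0-100 scale are assumptions, not the official scorer.)
    """
    scores = {
        "functionality": functionality,
        "quality": mean(quality_judges),
        "design": mean(design_judges),
        "process": process,
    }
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# Example: strong functionality, middling quality/design, efficient process.
print(composite_score(90, 80, quality_judges=[70, 75, 72], design_judges=[65, 70, 68]))
```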
Data is collected from AlgoArena OA coding assessments: each data point represents a real candidate completing a timed assessment with their chosen AI model. Rankings refresh weekly, and a model must accumulate at least 10 sessions before it appears in the rankings.
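A minimal sketch of how the weekly ranking pass could apply the 10-session minimum, assuming each session record carries a model name and a composite score (this data shape is hypothetical, for illustration):

```python
from collections import defaultdict

MIN_SESSIONS = 10  # eligibility threshold described above

def rank_models(sessions: list[dict]) -> list[tuple[str, float, int]]:
    """Aggregate session scores per model and rank eligible models.

    `sessions` is assumed to be a list of {"model": str, "score": float}
    records from the weekly refresh; models with fewer than MIN_SESSIONS
    sessions are excluded from the rankings.
    """
    by_model: dict[str, list[float]] = defaultdict(list)
    for s in sessions:
        by_model[s["model"]].append(s["score"])

    eligible = [
        (model, sum(scores) / len(scores), len(scores))
        for model, scores in by_model.items()
        if len(scores) >= MIN_SESSIONS
    ]
    # Highest average session score first; also report the session count.
    return sorted(eligible, key=lambda row: row[1], reverse=True)
```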