Jun 22, 2026 | AlgoArena | 10 min read

The Five-Competency AI-Native Scoring Model

How assessment scoring moved from a raw score toward public competencies for problem solving, planning, AI direction, verification, and workflow autonomy.

Figure: The Five-Competency AI-Native Scoring Model

Competencies: 5 (public recruiter-facing dimensions) | Presets: 7 (role and policy scoring profiles) | Profile: v1 (five-competency scoring profile)

The flagship preset weights five public competencies; role presets can shift weights without changing what reviewers see.

Flagship AI-native preset dimensions: Problem solving, Planning, AI direction, Verification, Workflow autonomy.

Preset            AI dir.   Verify
AI-native build   18        22
Classic no-AI     0         24
Debugging         12        30
Frontend task     16        26

The assessment score used to carry too much meaning by itself. A single number can tell a reviewer that a candidate did well or poorly, but not what kind of work produced the outcome.


The current scoring profile makes the public model explicit. AlgoArena reports five AI-native competencies:


  • Problem Solving & Deliverable Quality
  • Planning & Decomposition
  • AI Direction & Communication
  • Verification & Iteration
  • Agentic Workflow Autonomy

Those names are intentionally product-facing. Raw telemetry can stay internal for calibration, while the candidate report gives recruiters a stable vocabulary.


    What changed


    The scoring config now separates question archetypes from scoring presets.


    Question archetypes decide the output-quality method:


  • algorithmic questions lean on tests
  • build tasks lean on browser validation
  • maintenance tasks lean on diff tests
  • judgment tasks lean on rubric review
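The mapping above can be sketched as a small lookup table. This is a minimal illustration, not the actual scoring config: the names `QuestionArchetype`, `QualityMethod`, `QUALITY_METHOD`, and `primaryQualityMethod` are assumptions introduced here.

```typescript
// Hypothetical names; the source describes this mapping only in prose.
type QuestionArchetype = "algorithmic" | "build" | "maintenance" | "judgment";
type QualityMethod = "tests" | "browser-validation" | "diff-tests" | "rubric-review";

// Primary output-quality method per archetype. The text says each archetype
// "leans on" a method, so a real config may blend several methods.
const QUALITY_METHOD: Record<QuestionArchetype, QualityMethod> = {
  algorithmic: "tests",
  build: "browser-validation",
  maintenance: "diff-tests",
  judgment: "rubric-review",
};

function primaryQualityMethod(archetype: QuestionArchetype): QualityMethod {
  return QUALITY_METHOD[archetype];
}
```

Typing the table as a `Record` over the archetype union means a newly added archetype fails to compile until it is given a method.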

    Scoring presets decide how much each competency should matter for a role. The flagship AI-native build preset weights the five dimensions, in the order listed above, as 28/16/18/22/16. A classic no-AI baseline can set AI direction to 0 while still keeping planning, verification, and deliverable quality visible.
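As a rough sketch of how a preset could turn five competency scores into one composite, the example below uses the flagship weights from the text. Everything else is an assumption made for illustration: the type and function names, the 0-100 score scale, and the weighted-average formula are not confirmed by the scoring config.

```typescript
// Hypothetical names; only the five competencies and the 28/16/18/22/16
// flagship weights come from the text.
const COMPETENCIES = [
  "problemSolving",
  "planning",
  "aiDirection",
  "verification",
  "workflowAutonomy",
] as const;

type Competency = (typeof COMPETENCIES)[number];
type PerCompetency = Record<Competency, number>;

// Flagship AI-native build preset, in the order the competencies are listed.
const AI_NATIVE_BUILD: PerCompetency = {
  problemSolving: 28,
  planning: 16,
  aiDirection: 18,
  verification: 22,
  workflowAutonomy: 16,
};

// Sum of all weights in a preset (100 for the flagship preset).
function totalWeight(weights: PerCompetency): number {
  return COMPETENCIES.reduce((sum, key) => sum + weights[key], 0);
}

// Assumption: each competency is scored 0-100 and the composite is a
// weighted average. A competency weighted 0 contributes nothing.
function compositeScore(scores: PerCompetency, weights: PerCompetency): number {
  const total = totalWeight(weights);
  if (total <= 0) throw new Error("weights must sum to a positive number");
  const weighted = COMPETENCIES.reduce(
    (sum, key) => sum + scores[key] * weights[key],
    0,
  );
  return weighted / total;
}
```

Under these assumptions, scores of 100, 50, 0, 100, and 50 (in list order) give a composite of 66 with the flagship preset. Dividing by the weight total rather than a hard-coded 100 keeps the sketch usable for a preset whose weights do not sum to exactly 100.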


    That separation matters because it keeps two different questions apart. The first question is "what kind of work did the candidate do?" The second is "how should this role value that work?" A UI build, an algorithmic question, and a debugging task should not all be reduced to the same evidence pipeline. But the public report should still use a consistent language.


    The model therefore gives each assessment a stable public shape while letting internal presets adjust the emphasis. This is closer to how teams actually hire: the same candidate behavior can be interpreted differently for a frontend role, an infrastructure role, or a junior generalist role.


    Why this is better than "AI used"


    "Did the candidate use AI?" is a weak question. The useful question is whether they used it with judgment.


    Good assessment evidence should show whether a candidate decomposed the task, gave the agent useful constraints, challenged the output, ran meaningful checks, recovered from failures, and shipped something that actually works.


    That is why the model keeps verification high across presets. Blind acceptance is a different signal from deliberate iteration, even when both candidates end with passing code.


    The product should make that distinction visible. A candidate who asks a model for code, runs no tests, and submits a lucky answer is not demonstrating the same competency as a candidate who uses AI to explore alternatives, tests the result, finds a failing edge case, and narrows the solution. Both may have "used AI." Only one is showing reliable engineering judgment.


    That is why AI direction is a competency rather than a binary compliance field. It can be strong, weak, or irrelevant depending on the assessment policy.


    How evidence reaches the report


    The public score is downstream of several kinds of evidence:


  • test results and output quality
  • candidate planning and decomposition
  • prompts and tool-directed work where AI is allowed
  • browser validation and rendered artifacts for UI tasks
  • iteration traces, reruns, and fixes
  • reviewer calibration and role preset weights

    Not every question uses every signal. The point of the model is to avoid forcing one scoring path onto every task. A classic algorithm problem can remain mostly test-driven. A project task can include browser evidence. A maintenance task can rely more heavily on diff quality and regression checks.
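One way to picture "not every question uses every signal" is an evidence record whose fields are all optional. The sketch below is illustrative only: the `AssessmentEvidence` shape and its field names are assumptions, not the internal schema.

```typescript
// Hypothetical shape; field names are illustrative, not the real schema.
interface AssessmentEvidence {
  testResults?: { passed: number; total: number };
  planningNotes?: string[];
  aiPrompts?: string[]; // only present where AI is allowed
  browserValidation?: { checksPassed: number; checksTotal: number };
  iterationTrace?: string[]; // reruns and fixes
}

type SignalName = keyof AssessmentEvidence;

const SIGNAL_NAMES: SignalName[] = [
  "testResults",
  "planningNotes",
  "aiPrompts",
  "browserValidation",
  "iterationTrace",
];

// Returns the signals actually recorded for a question, so a scorer can
// pick a path from what exists instead of assuming one fixed pipeline.
function availableSignals(evidence: AssessmentEvidence): SignalName[] {
  return SIGNAL_NAMES.filter((name) => evidence[name] !== undefined);
}
```

A classic algorithm problem might carry only `testResults`, while a UI build task might also carry `browserValidation` and `iterationTrace`.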


    What reviewers see


    The reviewer-facing report should be legible without exposing every raw event. That means the five competencies are public, the role preset is understandable, and the evidence examples point back to concrete behavior.


    The system can still keep raw telemetry internal for calibration. Public reports need enough detail to support a hiring conversation, not enough detail to recreate the entire event log.
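The split between internal telemetry and the public report can be sketched as a projection step. This is a hedged illustration: `InternalRecord`, `PublicReport`, `toPublicReport`, and the `maxExamples` cap are hypothetical names and choices, not the shipped report code.

```typescript
// Hypothetical types; the real report structure is not shown in this note.
interface InternalRecord {
  presetName: string;
  competencyScores: Record<string, number>;
  evidenceExamples: string[];
  rawEvents: { timestamp: number; kind: string }[]; // stays internal
}

interface PublicReport {
  presetName: string;
  competencyScores: Record<string, number>;
  evidenceExamples: string[];
}

// Copies only reviewer-facing fields. Raw events are never copied, and the
// number of evidence examples is capped (the default cap is an assumption).
function toPublicReport(record: InternalRecord, maxExamples = 3): PublicReport {
  return {
    presetName: record.presetName,
    competencyScores: { ...record.competencyScores },
    evidenceExamples: record.evidenceExamples.slice(0, maxExamples),
  };
}
```

Building the report from an explicit allowlist of fields, rather than deleting fields from the internal record, means new telemetry does not leak into the report by default.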


    Why the public dimensions stay fixed


    Role presets can change weights, but the report should not invent a new language for every assessment. A recruiter comparing candidates needs a stable shape. A candidate reading feedback needs to understand what was measured.


    The five dimensions are that shape. They let the product evolve under the hood without making every report feel like a bespoke internal rubric.


    Fixed dimensions also make longitudinal calibration possible. If every assessment invents new labels, the team cannot compare outcomes across cohorts or improve the weighting model cleanly. Stable labels let the product change the machinery while preserving a usable reporting surface.


    Boundary


    This note describes the shipped scoring structure, not a claim that scoring is final. AI-native work is still changing. The model is versioned because calibration will keep improving as we see more real assessment behavior.


    It also does not claim that automation replaces human review. The scoring model is a structure for evidence, not a final hiring decision. The product should help reviewers ask sharper questions and compare candidates more consistently; it should not pretend to know the role context better than the hiring team.


    Source trail

    lib/oa-scoring-profile.ts
    lib/fluency-dimension-labels.ts
    components/oa/ScoreBreakdownPanel.tsx
