The assessment score used to carry too much meaning by itself. A single number can tell a reviewer that a candidate did well or poorly, but not what kind of work produced the outcome.

The current scoring profile makes the public model explicit. AlgoArena reports five AI-native competencies:

Problem Solving & Deliverable Quality

Planning & Decomposition

AI Direction & Communication

Verification & Iteration

Agentic Workflow Autonomy

Those names are intentionally product-facing. Raw telemetry can stay internal for calibration, while the candidate report gives recruiters a stable vocabulary.

What changed

The scoring config now separates question archetypes from scoring presets.

Question archetypes decide the output-quality method:

algorithmic questions lean on tests

build tasks lean on browser validation

maintenance tasks lean on diff tests

judgment tasks lean on rubric review
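The archetype list above can be sketched as a small lookup table. This is an illustration only, not the shipped config: the names Archetype, OUTPUT_QUALITY_METHOD, and output_quality_method, along with the string labels, are hypothetical.

```python
from enum import Enum


class Archetype(Enum):
    """Question archetypes named in this note (identifiers are hypothetical)."""
    ALGORITHMIC = "algorithmic"
    BUILD = "build"
    MAINTENANCE = "maintenance"
    JUDGMENT = "judgment"


# Primary output-quality method each archetype leans on, as listed above.
# The string labels are illustrative, not real config keys.
OUTPUT_QUALITY_METHOD = {
    Archetype.ALGORITHMIC: "tests",
    Archetype.BUILD: "browser_validation",
    Archetype.MAINTENANCE: "diff_tests",
    Archetype.JUDGMENT: "rubric_review",
}


def output_quality_method(archetype: Archetype) -> str:
    """Return the output-quality method an archetype leans on."""
    return OUTPUT_QUALITY_METHOD[archetype]
```

Keeping this table separate from any role weighting mirrors the split the note describes: the archetype decides how output quality is measured and says nothing about how much it should count.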

Scoring presets decide how much each competency should matter for a role. The flagship AI-native build preset weights the five dimensions as 28/16/18/22/16. A classic no-AI baseline can set the AI direction weight to 0 while still keeping planning, verification, and deliverable quality visible.
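A minimal sketch of how a preset might combine per-competency scores into one number. It makes three assumptions: the 28/16/18/22/16 weights follow the order in which the competencies are listed above, the combination is a weighted average normalised by total weight, and a no-AI baseline is derived simply by zeroing the AI direction weight. The names COMPETENCIES, AI_NATIVE_BUILD, NO_AI_BASELINE, and weighted_score are hypothetical.

```python
from typing import Mapping

# The five public competencies, in the order listed in this note.
COMPETENCIES = (
    "problem_solving_deliverable_quality",
    "planning_decomposition",
    "ai_direction_communication",
    "verification_iteration",
    "agentic_workflow_autonomy",
)

# Assumption: 28/16/18/22/16 maps onto the competencies in listed order.
AI_NATIVE_BUILD = dict(zip(COMPETENCIES, (28, 16, 18, 22, 16)))

# Hypothetical no-AI baseline: same weights with AI direction set to 0.
# A real preset would likely choose its own numbers.
NO_AI_BASELINE = {**AI_NATIVE_BUILD, "ai_direction_communication": 0}


def weighted_score(scores: Mapping[str, float],
                   weights: Mapping[str, float]) -> float:
    """Weighted average of per-competency scores, normalised by total weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("preset must have a positive total weight")
    return sum(scores[name] * w for name, w in weights.items()) / total
```

For a candidate scoring 90, 70, 50, 80, and 60 on the five competencies, this sketch yields 72.6 under the flagship weights and about 77.6 under the zeroed-AI baseline. The per-competency scores are identical in both cases; only the role's valuation changes.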

That separation matters because it keeps two different questions apart. The first question is "what kind of work did the candidate do?" The second is "how should this role value that work?" A UI build, an algorithmic question, and a debugging task should not all be reduced to the same evidence pipeline. But the public report should still use a consistent language.

The model therefore gives each assessment a stable public shape while letting internal presets adjust the emphasis. This is closer to how teams actually hire: the same candidate behavior can be interpreted differently for a frontend role, an infrastructure role, or a junior generalist role.

Why this is better than "AI used"

"Did the candidate use AI?" is a weak question. The useful question is whether they used it with judgment.

Good assessment evidence should show whether a candidate decomposed the task, gave the agent useful constraints, challenged the output, ran meaningful checks, recovered from failures, and shipped something that actually works.

That is why the model keeps the verification weight high across presets. Blind acceptance is a different signal from deliberate iteration, even when both candidates end with passing code.

The product should make that distinction visible. A candidate who asks a model for code, runs no tests, and submits a lucky answer is not demonstrating the same competency as a candidate who uses AI to explore alternatives, tests the result, finds a failing edge case, and narrows the solution. Both may have "used AI." Only one is showing reliable engineering judgment.

That is why AI direction is a competency rather than a binary compliance field. It can be strong, weak, or irrelevant depending on the assessment policy.

How evidence reaches the report

The public score is downstream of several kinds of evidence:

test results and output quality

candidate planning and decomposition

prompts and tool-directed work where AI is allowed

browser validation and rendered artifacts for UI tasks

iteration traces, reruns, and fixes

reviewer calibration and role preset weights

Not every question uses every signal. The point of the model is to avoid forcing one scoring path onto every task. A classic algorithm problem can remain mostly test-driven. A project task can include browser evidence. A maintenance task can rely more heavily on diff quality and regression checks.
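One way to picture "not every question uses every signal" is a filter that keeps only the evidence kinds a task type consumes. This is a sketch under stated assumptions: the Evidence type, the kind labels, and the per-task signal sets in SIGNALS_BY_TASK are all hypothetical and chosen only to echo the examples in the paragraph above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Evidence:
    kind: str      # e.g. "test_results", "browser_validation"
    summary: str   # short human-readable description


# Hypothetical signal sets per task type; the real mapping is not
# specified in this note.
SIGNALS_BY_TASK = {
    "algorithmic": {"test_results", "planning", "iteration_traces"},
    "build": {"test_results", "planning", "browser_validation",
              "iteration_traces"},
    "maintenance": {"test_results", "planning", "iteration_traces"},
}


def applicable_evidence(task_type: str,
                        evidence: Iterable[Evidence],
                        ai_allowed: bool) -> List[Evidence]:
    """Keep only the evidence kinds this task type uses.

    Prompt evidence counts only where the assessment policy allows AI.
    """
    allowed = set(SIGNALS_BY_TASK[task_type])
    if ai_allowed:
        allowed.add("ai_prompts")
    return [item for item in evidence if item.kind in allowed]
```

Under these assumed sets, a browser screenshot attached to an algorithm problem is ignored rather than scored, and prompt evidence reaches the report only when the policy permits AI use.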

What reviewers see

The reviewer-facing report should be legible without exposing every raw event. That means the five competencies are public, the role preset is understandable, and the evidence examples point back to concrete behavior.

The system can still keep raw telemetry internal for calibration. Public reports need enough detail to support a hiring conversation, but not so much that a reader could recreate the entire event log.

Why the public dimensions stay fixed

Role presets can change weights, but the report should not invent a new language for every assessment. A recruiter comparing candidates needs a stable shape. A candidate reading feedback needs to understand what was measured.

The five dimensions are that shape. They let the product evolve under the hood without making every report feel like a bespoke internal rubric.

Fixed dimensions also make longitudinal calibration possible. If every assessment invents new labels, the team cannot compare outcomes across cohorts or improve the weighting model cleanly. Stable labels let the product change the machinery while preserving a usable reporting surface.

Boundary

This note describes the shipped scoring structure; it does not claim that scoring is final. AI-native work is still changing. The model is versioned because calibration will keep improving as we see more real assessment behavior.

It also does not claim that automation replaces human review. The scoring model is a structure for evidence, not a final hiring decision. The product should help reviewers ask sharper questions and compare candidates more consistently; it should not pretend to know the role context better than the hiring team.

The Five-Competency AI-Native Scoring Model