Traditional coding assessments are good at one thing: checking whether a final answer passes tests. Modern engineering work asks for more than that.

What changed

We pushed assessments toward project-style work, replayable process, terminal and workspace signals, similarity review, and benchmark-oriented reporting. The product direction is simple: if the candidate's process matters, the platform should preserve enough of it to review fairly.

This does not mean every signal deserves equal weight. It means the final score should not be the only artifact. A candidate's planning, iteration, verification, and use of assistance can all help explain what happened.

[Figure: Planning & decomposition; Iteration & reruns; Verification & tests; Use of assistance. Caption: Project-style work preserves the parts of engineering a final answer hides, reviewed together, not turned into a verdict.]

Why it matters

Two candidates can land on the same final answer for very different reasons. One reasoned through the problem, tested edge cases, and used tools carefully. Another got lucky or followed a brittle path that would collapse in a larger codebase.

Project-style tasks make that difference legible. A larger, multi-step problem has more surface area for judgment, or its absence, to show: where the candidate scoped the work, what they verified before moving on, how they recovered when something broke. A single algorithm answer rarely reveals any of that. A project does, because the path to it is longer and more decisions are visible along the way.

Assessments should help teams see the difference. That is especially important as AI becomes normal in software work. The point is not to punish assistance. It is to measure whether assistance was used with judgment.

That shift is not cosmetic. When a model can produce a plausible answer on demand, the answer stops being the scarce thing. Judgment about the answer becomes the scarce thing: knowing what to ask for, seeing when the output is wrong, deciding what to keep. A test result tells you the code runs. It says nothing about whether the person could tell good output from bad, which is most of what the job is now.

Two candidates, one green check

Picture the same take-home-sized task handed to two engineers. Both submit something that passes. Read only the final files and they look like near-siblings. The path each took to get there does not.

The first candidate opens the problem and spends the early minutes reading, not typing. She sketches the shape of the solution, names the pieces, and builds a small slice first. When a change breaks a test she reruns, reads the failure, and adjusts before moving on. She reaches for assistance twice, both times to get past something mechanical, and she reads what comes back before she keeps it. By the end there is a rhythm to the work: a decision, a check, a correction, another decision.

The second candidate starts typing almost at once. The early structure churns, whole approaches appearing and vanishing. Assistance arrives in large blocks that land unread and mostly survive untouched. Tests run once, near the end, in a single anxious burst. His answer is correct. Nothing about how it got there suggests he could rebuild it, extend it, or notice the day it quietly broke.

A final-answer grader cannot tell these two apart. Both earn the same green check. The whole difference lives in the path, and the path is exactly what a project-style task keeps.

[Figure: Candidate A: read, run, correct, repeat (decisions visible in the trail). Candidate B: paste, accept, one late run (same final answer, far less to read). Caption: Two candidates reach the same passing answer; only the project-style trail shows how many decisions each one actually made in the open.]

Give a reviewer both trails and the ranking almost writes itself. The first timeline reads as a series of small, defensible moves, each one checked before the next. The second is a flat run of accepted output with a verification step bolted on at the end. Neither trail is a score by itself, and that restraint matters, but together they answer the question a passing test cannot: which of these two would you trust with the next ambiguous ticket, the one with no tidy answer key at all? On the evidence, it is not a close call.

How the process is preserved

The goal is review, not surveillance. A replayable timeline shows how the solution formed; workspace and terminal signals show what the candidate actually ran; similarity review flags work worth a closer look. None of these is a verdict on its own. They are context a reviewer can weigh together, so a score arrives with the evidence behind it instead of standing alone. A candidate who can explain their path has nothing to fear from it being visible; a result that can't survive that scrutiny was never strong evidence to begin with.
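
One way to picture a score that "arrives with the evidence behind it" is a result record that keeps its notes attached. The sketch below is illustrative only: the class names, the source labels, and the example values are assumptions made for this example, not the platform's actual data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    source: str  # hypothetical labels: "timeline", "terminal", "similarity"
    note: str    # what a reviewer saw, in plain words


@dataclass
class AssessmentResult:
    score: float
    evidence: list[Evidence] = field(default_factory=list)

    def explain(self) -> str:
        """Render the score together with the evidence behind it."""
        lines = [f"score: {self.score:.2f}"]
        lines += [f"- [{e.source}] {e.note}" for e in self.evidence]
        return "\n".join(lines)


# Made-up example values, for illustration only.
result = AssessmentResult(
    score=0.72,
    evidence=[
        Evidence("timeline", "built a small slice first, then extended it"),
        Evidence("terminal", "reran tests after each failing change"),
    ],
)
print(result.explain())
```

The point of the shape is that the number never travels alone: anyone reading the result sees the same notes the reviewer weighed.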

When process signals become noise

Keeping the path has a cost the cheerful version of this story skips. More signal is also more to misread. A replay timeline can make a careful engineer look erratic if she thinks in long silences and commits in bursts. Similarity review can catch two people who simply studied the same well-known pattern. Terminal history quietly rewards the candidate who narrates every step and penalizes the one who holds most of the work in her head. Richer evidence does not automatically mean fairer evidence. Left undisciplined, it just means more ways to arrive at a confident wrong read.

So the work is not collecting more. It is deciding what a signal is allowed to mean. No single signal should move a decision on its own. A flag is a reason to look closer, never a conclusion. The reviewer weighs the trail as a whole and stays honest about its blind spots: a quiet worker is not a weak one, and a talkative one is not automatically strong. The moment a process signal starts behaving like a score, it has stopped being evidence and become a shortcut, which is the exact failure the approach was built to avoid.

Weighting is where that restraint gets concrete. A debugging task should lean on how someone iterates and verifies, so a burst of reruns reads as diligence rather than chaos. A greenfield build should lean on how they scope and structure, where that same rerun pattern means less. The signals do not change; what a given signal is worth changes with what the task is actually asking. Fix the weights to the work and a lot of the false reads take care of themselves.
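
As a rough sketch of task-dependent weighting, the same set of signals can be read against different weights depending on the task type. Everything here is assumed for illustration: the signal names, the task types, the weight values, and the `weigh` helper are hypothetical. The output is meant as context for a reviewer, not a decision rule.

```python
# Hypothetical weights per task type; each row sums to 1.0.
TASK_WEIGHTS = {
    "debugging": {
        "planning": 0.15, "iteration": 0.35,
        "verification": 0.35, "assistance": 0.15,
    },
    "greenfield": {
        "planning": 0.40, "iteration": 0.10,
        "verification": 0.25, "assistance": 0.25,
    },
}


def weigh(signals: dict[str, float], task_type: str) -> tuple[float, list[str]]:
    """Return a weighted read plus the contributions it was built from."""
    weights = TASK_WEIGHTS[task_type]
    contributions = {name: weights[name] * signals[name] for name in weights}
    total = sum(contributions.values())
    # Cite every contribution, largest first, so the read points back
    # at what it was built from.
    ranked = sorted(contributions, key=contributions.get, reverse=True)
    cited = [f"{name}: {signals[name]:.2f} x {weights[name]:.2f}" for name in ranked]
    return total, cited


# One made-up trail with heavy reruns, read against two task types.
same_trail = {"planning": 0.5, "iteration": 0.9, "verification": 0.8, "assistance": 0.6}
debug_total, debug_cited = weigh(same_trail, "debugging")
green_total, green_cited = weigh(same_trail, "greenfield")

print(f"debugging:  {debug_total:.2f}", debug_cited)
print(f"greenfield: {green_total:.2f}", green_cited)
```

With these assumed numbers the rerun-heavy trail reads stronger on the debugging task than on the greenfield one, even though the signals themselves are identical. Returning the contributions alongside the total keeps the figure open to inspection.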

[Figure: A flag opens a closer look; Read the whole trail; Weigh signals against the task; The score cites what it saw. Caption: Process signals stay useful only under a rule: nothing convicts on its own, and the score has to point back at the evidence it came from.]

A score you can argue with

There is a quieter consequence to all of this. When the only artifact is a number, that number has to be either trusted or thrown out, and both are bad. Trust it blindly and you hire on a proxy. Throw it out and the exercise was theater. A result carried by its evidence changes the transaction. The recruiter can see why the score landed where it did. The engineer on the panel can open the path and disagree with it on specifics. The candidate, if asked, can point at the same trail and defend a choice the raw number made look worse than it was.

There is a second reader most scoring conversations forget, which is the candidate. A report legible only to the people making the call is a verdict, not an assessment. If someone can open their own result and see why a clean submission still scored soft on verification, the scoring stays honest, because a reading you can explain to the person it judged is one you have actually reasoned through. It pulls behavior the right way too. When people can see that reading a diff and catching a failure is the rewarded thing, they read diffs and catch failures.

That is a healthier thing to hand a hiring team than a verdict. A verdict closes the conversation. Evidence opens one. A score you can actually argue with is worth more than a cleaner one nobody can inspect, because the argument is where the real judgment happens.

Where it points

This work connects to the benchmark direction: clearer standards, better comparison, and less hand-waving around what a score means. The long-term goal is an assessment artifact that a candidate, recruiter, and engineer can all understand. Get that right and the score stops being the last word. It becomes the first thing a hiring conversation can build on, a shared reference all three readers can trust even when they come to it for different reasons.

Measuring Project Work, Not Just Answers