How to Actually Hire for AI Capability — Without a Quiz

A quiz tests knowledge, and knowledge is the one thing AI hands every candidate for free. So a quiz measures nothing. Here is a process that measures the thing left over: judgment, watched live, when the model is confidently wrong.

To hire for AI capability, throw out the quiz and build one live moment where the model is confidently wrong, then watch whether the candidate catches it. That is the whole method. A quiz tests what a person knows, and what a person knows is the exact commodity AI now hands out for free at the desk they are sitting at. You would be grading them on a subject the tool answers for them. The signal you actually want has never lived in an answer. It lives in what they do the second the answer is wrong.

Every hiring manager reaching for a rubric wants a number that sorts the shortlist. The instinct is fine. The instrument is broken. A multiple-choice test on prompt patterns, model names, and "which tool for which task" tells you the candidate has read the same three blog posts as everyone else. It does not tell you whether they can steer the machine when it heads for a cliff with total confidence. That second thing is the entire job now.

The short version:

  • A quiz measures knowledge, and knowledge is what AI supplies. So a knowledge quiz on AI capability measures the tool, not the person sitting the test. It sorts nobody.
  • Demote every pre-collected artifact (the CV, the take-home, the portfolio) to routing. It tells you where to point the conversation, never who to hire.
  • Build one live moment where the model is confidently wrong, hand it over without warning, and say nothing about the error. The whole assessment happens in the next ninety seconds.
  • Watch for catch, check, rebuild. Then score the judgment, not the deliverable. Wharton researchers found people followed the AI's wrong answer around 80% of the time, so the trait you are hiring against is the majority behaviour.

Why the quiz measures nothing

Start with what a quiz is for. A test of knowledge works when the knowledge is scarce and the candidate carries it in their own head. That held for a very long time. If I asked a marketer to name the difference between a paid and an organic funnel, the ones who knew had learned it, and the learning was the signal. The answer was expensive to hold, so holding it meant something.

AI removed the expense. The knowledge is one tab away now, generated on demand, phrased better than the candidate would phrase it. When you set a quiz on "AI capability," you are testing recall of facts the candidate has a machine on hand to recall for them. That is a typing-speed test on a keyboard the machine also operates. Whatever number comes out the other end is measuring Claude, or GPT, or whichever model they opened. It is not measuring them.

This is the same trap as the take-home assignment, wearing a smarter costume. The take-home lied because it looked like real work and was really the tool's work. The quiz lies for the identical reason, except it does not even have the decency to look like the job. It is a driving-theory test for someone applying to be a rally driver.

There is a second, quieter failure. A quiz rewards fluency, the ability to make the tool talk, and fluency in 2026 is table stakes. A teenager has it by default. A course teaches it in an afternoon. What separates a hire is Direction: knowing where the output should go, catching it when it drifts, overriding it when it is confidently wrong. A quiz cannot get near Direction, because Direction only shows up when there is something to direct against. A static question with a fixed right answer never lets the model be wrong in front of you, and the wrongness is the whole test.

The framework underneath: Fluency vs Direction

Two axes, not one. Every candidate sits on both, and your process reads only the first.

Fluency is what you can make AI do. Prompts, tools, speed to a first draft. It is real, it is necessary, and it is everywhere. It is also the axis a quiz measures, which is why a quiz separates nobody. The whole shortlist scores high on it now.

Direction is whether you can tell where the output should go, notice when it drifts, and override it when it is confident and wrong. It does not show up on a CV, does not survive a take-home, and cannot be quizzed. It becomes visible under one condition only: the model has to actually be wrong, live, with the candidate holding the wheel. Create that condition and Direction announces itself in seconds. Skip it and you never see the thing you are paying for.

The reason this matters is not theoretical. It is measured. In a Wharton study titled Thinking, Fast, Slow, and Artificial, researchers Steven Shaw and Gideon Nave ran three experiments with 1,372 people on reasoning tasks. With no AI, accuracy sat at 46%. Hand people a correct AI answer and it climbed to 71%. Then they made the AI wrong, and accuracy did not slip back to the unaided baseline. It collapsed to 31.5%, below the 46% people had managed with no help at all. People followed the wrong answer around 80% of the time, growing more confident as they got it wrong. Shaw and Nave call it cognitive surrender. In hiring terms, the trait you are screening for, catching the confident error, is the exception. Four in five people will not. Your process has to find the fifth.
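The arithmetic behind those figures is worth laying out. What follows is a minimal sketch in Python that uses only the percentages quoted above; the variable names are illustrative and do not come from the study.

```python
# Accuracy figures as quoted in the paragraph above (percent).
baseline = 46.0         # no AI
with_correct_ai = 71.0  # the AI answer was correct
with_wrong_ai = 31.5    # the AI answer was wrong

# Percentage-point change relative to the unaided baseline.
gain_when_right = with_correct_ai - baseline
loss_when_wrong = with_wrong_ai - baseline

# Roughly 80% followed the wrong answer, so roughly 20% did not.
followed_wrong = 0.80
did_not_follow = 1.0 - followed_wrong

print(f"Gain with a correct AI answer: {gain_when_right:+.1f} points")
print(f"Change with a wrong AI answer: {loss_when_wrong:+.1f} points")
print(f"Share who did not follow the wrong answer: about {did_not_follow:.0%}")
```

The wrong answer does not merely cancel the 25-point gain. It leaves people 14.5 points below where they started with no help.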

The process: four steps you can run on Monday

Here is the method, stripped of theatre. Four steps. None needs a new tool, a psychometrician, or a licence. What each one needs is a hiring manager willing to stop grading the deliverable and start grading the person.

| Step | What it does | What to look for |
| --- | --- | --- |
| 1. Demote artifacts to routing | Strips the CV, take-home, and portfolio of verdict power; they only point the conversation now | Nothing you'd hire on. Treat every polished submission as the tool's work until proven otherwise |
| 2. Build one wrong-answer moment | Creates the single condition Direction becomes visible under: a live task with a confident, plausible, incorrect AI output attached | A load-bearing error, not an obvious one. A mistake that costs money three weeks later, not a typo |
| 3. Watch the catch | Reads the ninety seconds after you hand it over, saying nothing about the error | Do they get an itch, check the claim everyone assumed, find the crack, and rebuild the rotten part, or build on top of it? |
| 4. Score judgment, not the deliverable | Records what the person did with the wrongness, not how clean the final artifact looked | The story of the catch: what felt off, what they checked, what they rebuilt. Polish is noise; the catch is signal |

Step one: demote the artifacts to routing

The résumé is a routing document, not a verdict. So is the portfolio. So is the take-home. Every artifact a candidate hands you was produced somewhere you could not see, at a time you did not control, on an instrument that sits on every desk. It reports the quality of their tool, not their capability. Stop reading it for the answer. Read it only to decide where to point the live conversation. This person claims to run growth automations, so the wrong-answer task will be a growth automation with a poisoned number in it.

This is not soft. It is the hardest step, because it means giving up the comfort of a stack of documents that feel like evidence. They are not evidence. They are the setting on the dial that tells you which live test to run.

Step two: build one moment where the model is confidently wrong

This is the load-bearing step, so build it with care. Take a real task from the actual role. Attach an AI output to it: clean, fluent, confident, and quietly incorrect. The error must be the kind that matters. A wrong assumption baked into a plan. A number that does not reconcile. A recommendation that contradicts a constraint stated three lines up. Not a typo. A mistake with consequences.

Then hand it over and say nothing. Do not flag the error. Do not hint. Do not ask "notice anything?" The value of the moment is that you did not warn them, because the job will not warn them either. When a real model ships a real team a confident wrong answer at 4pm on a Thursday, nobody puts a red box around it. You are simulating the actual failure mode, not a classroom version of it.
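One of the error types named above, a number that does not reconcile, is easy to stage. The sketch below is a hypothetical fixture, not the author's actual task: the budget lines, the figures, the `AiSummary` class, and the `reconciles` function are all invented for illustration. It shows a confident-looking summary whose stated total disagrees with its own line items, and the one check a candidate would need to think of.

```python
from dataclasses import dataclass


@dataclass
class AiSummary:
    """A hypothetical AI-written budget summary handed to the candidate."""
    line_items: dict[str, int]  # monthly spend per channel, in dollars
    stated_total: int           # the total the summary confidently reports


def reconciles(summary: AiSummary) -> bool:
    """Return True when the stated total matches the sum of the line items."""
    return sum(summary.line_items.values()) == summary.stated_total


# Invented figures. The line items sum to 41,500; the summary claims 38,500.
planted = AiSummary(
    line_items={"paid_search": 18_000, "paid_social": 14_500, "email": 9_000},
    stated_total=38_500,
)

print(reconciles(planted))  # False
gap = sum(planted.line_items.values()) - planted.stated_total
print(gap)  # 3000
```

The interviewer never runs this in front of the candidate. It exists so the planted error is deliberate and written down before the session starts.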

Step three: watch for catch, check, rebuild

Now you watch. This is the assessment. Everything before it was setup.

The AI user takes the gift. The output looked right, the machine was confident, the task had a shape, so they build on top of it, and the crack goes into the foundation of everything they do next. Fast, clean, wrong. The AI Operator does something different, and it happens in a sequence you can actually see. First the catch: something does not sit right, and they slow down instead of shipping. Then the check: they go to the load-bearing claim, the thing everyone else assumed, and they interrogate it. Then the rebuild: they tear out the rotten part and put a sound one in its place.

Catch, check, rebuild. That is the behaviour, and it is watchable in real time. You are not scoring whether they eventually produced a correct artifact, because a clever user stumbles into that too. You are scoring whether they distrusted the confident output before they had a reason to, then went and manufactured the reason. That distrust is the capability. Asking a candidate about AI can never surface it, but watching them live always can, because the person who has it and the person who fakes it give the same answer to a question and behave completely differently in front of a live wrong answer.

The two questions underneath the whole exercise map onto the two axes. Can they build with it? is Fluency, and nearly everyone passes. Can they direct it? is Direction, and it sorts the room. If you want the short version of the interview, it is those two questions and nothing else. The wrong-answer moment is the honest way to make the second one impossible to bluff.

Step four: score the judgment, not the deliverable

Here is where most processes relapse. They run a good live test and then, at the end, grade the polish of the final document, because polish is easy to score and judgment is not. Do not do that. The deliverable is a trap. The candidate who caught the error, rebuilt the section, and handed you something slightly rougher has outperformed the one who shipped a beautiful artifact built on a lie.

So score the story of the catch. Ask them to walk you through it. What felt off? What did they check, and why that thing and not another? What did they rebuild, and what did they leave? An Operator describes their instrument. They can tell you the moment the output stopped feeling trustworthy and what they did with the suspicion. A user describes the view. Their story is really about how fast they moved and how good it looked. One of them you can hand a confident machine and trust the output. The other you cannot, and no CV on earth will tell you which is which.
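The walk-through questions above can be captured in a structured note so that two interviewers record the same things. This is a sketch under assumptions: the class `CatchStory`, its field names, and the `missing_stages` helper are hypothetical, and the structure deliberately carries no numeric score or weighting.

```python
from dataclasses import dataclass


@dataclass
class CatchStory:
    """Interviewer notes on one wrong-answer session (hypothetical structure)."""
    candidate: str
    what_felt_off: str = ""      # the catch: what made them slow down
    what_they_checked: str = ""  # the check: which claim they tested, and why
    what_they_rebuilt: str = ""  # the rebuild: what they replaced, what they left

    def missing_stages(self) -> list[str]:
        """List the stages for which the interviewer recorded nothing."""
        stages = {
            "catch": self.what_felt_off,
            "check": self.what_they_checked,
            "rebuild": self.what_they_rebuilt,
        }
        return [name for name, note in stages.items() if not note.strip()]


notes = CatchStory(
    candidate="Candidate A",
    what_felt_off="Paused at the total; said it looked low for three channels.",
    what_they_checked="Re-added the line items by hand.",
)
print(notes.missing_stages())  # ['rebuild']
```

An empty stage is a prompt for a follow-up question, not a verdict. The judgment stays with the person in the room.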

I am deliberately not handing you a numeric rubric. The internal scoring, how we weight the catch against the rebuild and grade the quality of the check itself, took real work to calibrate, and a scored grid in the wrong hands becomes a thing candidates learn to game, which puts you back where the quiz left you. What travels safely is the shape: watch the three behaviours, score the judgment, ignore the polish. Run that honestly and you will out-hire any team still marking a knowledge test.

What this replaces, and what it costs

It costs more than a quiz. A quiz auto-grades. A wrong-answer session needs a human in the room paying attention for twenty minutes. That is the trade, and it is worth it, because the quiz tells you nothing while the session tells you the one thing that matters.

It also runs against the current. "AI skills" now sits on a growing share of job postings. PwC's AI Jobs Barometer puts AI-related demand at roughly 2.5% of postings and climbing, and the pressure to slap a test on the requirement is real. Resist it. The requirement is right and the test is wrong. You do not want to know whether a candidate has AI skills in the abstract. Every candidate does. You want to know whether they surrender to the machine or supervise it, and that is a behaviour, not a fact. It cannot be quizzed. It must be watched.

Demote the artifacts. Stage the wrong answer. Watch the catch. Score the judgment. Four steps, no rubric leaked, and at the end of them you know something about the person that no quiz, no take-home, and no confident CV could ever have told you.


Frequently asked questions

How do you hire for AI capability without a test?
Replace the test with a live wrong-answer moment. Give the candidate a real task from the role with a confident, plausible, incorrect AI output attached, say nothing about the error, and watch whether they catch it, check the load-bearing claim, and rebuild. A quiz measures knowledge the AI supplies for free; the live moment measures judgment, which the tool cannot hand them.

Why doesn't a knowledge quiz work for AI skills?
Because the knowledge is no longer in the candidate's head. It is one tab away, generated on demand. A quiz on prompt patterns or model names grades the tool the candidate is holding, not the candidate. In 2026 nearly everyone scores high on that fluency, so a quiz separates nobody. Direction is what separates a hire, and a static question can never surface it.

What exactly should I look for in the wrong-answer moment?
Three behaviours in sequence: the catch (they slow down instead of shipping because something feels off), the check (they interrogate the load-bearing claim everyone else assumed), and the rebuild (they replace the rotten part). The AI user builds on the error. The AI Operator distrusts the confident output before they have a reason to, then goes and finds the reason.

Isn't a live test harder to run than a quiz?
Yes, and that is the trade. A quiz auto-grades and tells you nothing. A wrong-answer session needs a human paying attention for twenty minutes and tells you the one thing that predicts the hire. Wharton's Shaw and Nave found people followed the wrong AI answer around 80% of the time, so the trait you are screening for is the exception, worth twenty minutes to find.

Should I stop using take-homes and portfolios entirely?
Demote them, do not discard them. Treat the CV, take-home, and portfolio as routing information: they point the live conversation, they never decide the hire. Each was produced on an instrument on every desk, so each reports the quality of the tool, not the person. The verdict comes from the live moment.

Why won't you publish the actual scoring rubric?
Because a numeric grid in candidates' hands becomes a thing they learn to game, which lands you back where the quiz left you. The safe part to share is the shape: watch the catch, check, and rebuild, score the judgment, ignore the polish. The exact weightings matter far less than the discipline of grading the person, not the deliverable.

How does this connect to the AI Operator idea?
Directly. The candidate who catches the confident wrong answer is an AI Operator; the one who builds on it is an AI user. The whole method tells them apart when they look identical on paper. Ivanooo built the AI Operator Profile to make that measurement repeatable at scale: a candidate in front of a confidently wrong machine, and the one signal a quiz can never reach.