On the bench
Why Useful Agents Need Verifiers, Not Just Bigger Models
MOT test for model output
Every vehicle I repair gets tested before it goes back to the customer. Not because I doubt my work — because the cost of being wrong transfers to someone else the moment the car leaves. The test isn’t an insult to the mechanic. It’s the thing that makes the mechanic’s confidence sellable.
Model output needs the same treatment. Not a bigger model, but a checkpoint. In my systems the verifier is small, cheap and dumb on purpose: it checks ten things (among them: no leaked internals, claims grounded in the knowledge base, format correct, no invented part numbers) and it has veto power. The smart model proposes; the boring model disposes.
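To make the shape of that concrete, here is a minimal sketch in Python of a rule-based verifier with veto power. It is not my production code. The part-number format, the `KNOWN_PARTS` set, the `INTERNAL_MARKERS` strings and the `Diagnosis:` prefix rule are all made up for illustration, and it covers only three checks, with the part-number check standing in as a narrow version of grounding.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical knowledge base: the only part numbers an answer may cite.
KNOWN_PARTS = {"BP-1042", "OF-2210"}

# Hypothetical strings that would indicate leaked internals.
INTERNAL_MARKERS = ("SYSTEM PROMPT", "internal://")

# Assumed part-number format: two capitals, a dash, four digits.
PART_RE = re.compile(r"\b[A-Z]{2}-\d{4}\b")

# A check returns a reason string on failure, or None on pass.
Check = Callable[[str], Optional[str]]


def no_leaked_internals(text: str) -> Optional[str]:
    lowered = text.lower()
    for marker in INTERNAL_MARKERS:
        if marker.lower() in lowered:
            return f"leaked internal marker: {marker}"
    return None


def no_invented_part_numbers(text: str) -> Optional[str]:
    unknown = sorted(set(PART_RE.findall(text)) - KNOWN_PARTS)
    return f"unknown part numbers: {unknown}" if unknown else None


def format_correct(text: str) -> Optional[str]:
    # Assumed format rule: every answer starts with "Diagnosis:".
    return None if text.startswith("Diagnosis:") else "missing 'Diagnosis:' prefix"


CHECKS: List[Check] = [no_leaked_internals, no_invented_part_numbers, format_correct]


@dataclass
class Verdict:
    approved: bool
    reasons: List[str]


def verify(text: str) -> Verdict:
    """Run every check; any single failure vetoes the output."""
    reasons = [r for check in CHECKS if (r := check(text)) is not None]
    return Verdict(approved=not reasons, reasons=reasons)


if __name__ == "__main__":
    print(verify("Diagnosis: worn brake pads, replace with BP-1042."))
    print(verify("Diagnosis: fit part ZZ-9999."))
```

Each check is a plain function with no model call in it, which is the point: a failure comes with a readable reason, and adding an eleventh check means appending one function to a list.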
What’s queued for this draft:
- Worker + verifier beats hero-model: real numbers from my diagnosis pipeline
- Why verifiers should be boring, rule-based and slightly stupid
- Veto power: what happens when the verifier and worker disagree
- Where humans still sit in the loop, and why that’s a feature
On the bench. The numbers are being pulled from production logs.