Reliability, measured in the open
We publish Verity's real numbers on public benchmarks — and we audit our own evaluation, because a measurement can mislead as easily as a model can. Method and data below; nothing hidden.
Telling true from false
Verity's core job. Tested on FEVER, the standard fact-verification benchmark, grounded tier (Sonnet + live web), 100 claims (50 true + 50 false).
95%
accuracy telling true claims from false (95 / 100)
0
genuine false-affirmations — never called a real falsehood true*
1.00
precision — never wrongly refused a valid question (abstention test)
What this means: when a claim can be checked, Verity gets it right 19 times in 20, and it never affirmed an actual falsehood among the 50 false claims. *The single apparent false affirmation that FEVER counts against us, "Dan O'Bannon died" (labeled false), is in fact true (he died in 2009); the error is in the benchmark's label, not in Verity's verdict. Of its handful of over-cautious calls, most involve imprecise or debatable claims (e.g. "Boyhood is about the birth of Mason Evans Jr."; the film follows him growing up).
Knowing when not to answer
A trust layer must abstain on the unanswerable without refusing the answerable. Tested on FalseQA false-premise questions (a source dataset of Meta's AbstentionBench) plus answerable ARC controls.
1.00
precision — 0 of 50 valid questions wrongly refused
76%
of false-premise questions correctly flagged (38 / 50)
Precision is perfect: Verity never wasted a call refusing a real question. On catching false premises it flags 76% — and that figure comes from auditing our own first measurement: our initial judge scored 40% by miscounting premise-corrections (like "female turtles have neither wings nor two shells") as plain answers. A fair judge that credits premise-rejection gives 76%. We report both, and the method, below. Grounding did not change false-premise detection (live search adds currency, not premise-logic).
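As a minimal sketch of how the two abstention figures relate, the snippet below treats "abstain/flag" as the positive class, which is one standard reading of precision and may not match the exact evaluation code. The class and function names are hypothetical; only the counts (38 of 50 flagged, 0 of 50 wrongly refused) come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbstentionCounts:
    """Counts for one judge (hypothetical container, not Verity's schema)."""
    flagged_false_premise: int  # false-premise questions correctly flagged
    total_false_premise: int    # all false-premise questions tested
    refused_valid: int          # answerable questions wrongly refused
    total_valid: int            # all answerable control questions tested


def precision(c: AbstentionCounts) -> float:
    """Share of all abstentions that landed on a false-premise question."""
    total_abstentions = c.flagged_false_premise + c.refused_valid
    if total_abstentions == 0:
        raise ValueError("precision is undefined with zero abstentions")
    return c.flagged_false_premise / total_abstentions


def flag_rate(c: AbstentionCounts) -> float:
    """Share of false-premise questions that were flagged (recall)."""
    return c.flagged_false_premise / c.total_false_premise


# Counts reported above for the corrected ("fair") judge.
fair = AbstentionCounts(
    flagged_false_premise=38,
    total_false_premise=50,
    refused_valid=0,
    total_valid=50,
)

print(f"precision = {precision(fair):.2f}")  # precision = 1.00
print(f"flag rate = {flag_rate(fair):.0%}")  # flag rate = 76%
```

Under this reading, the strict judge's 40% is the same flag-rate ratio with fewer responses credited as flags; precision depends only on how many valid questions were refused.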
Resisting prompt injection
Sentinel — Verity's guard against hijacked inputs. Tested on the public deepset/prompt-injections set: 203 injection attempts + 203 legitimate inputs.
92%
of prompt-injection attempts caught
0%
false-positive rate — never flagged a legitimate input (0 / 203)
Sentinel flags injections — including subtle ones like task-switching ("that's done, now do X"), persona override ("pretend you are…"), grounding-override ("ignore the articles, use your own knowledge"), and non-English attacks — while never raising a false alarm on ordinary content across 203 legitimate inputs. A guard that cries wolf is as useless as one that sleeps; this one does neither.
Method & honesty notes
- Datasets: FEVER (copenlu/fever_gold_evidence, validation), FalseQA (false premises) + ARC (answerable controls); all public.
- Claim verification: grounded Sonnet tier; verdict compared to the gold label. The headline 95% is on the verifiable classes (true / false). Including FEVER's "not-enough-info" class, raw 3-class accuracy is 67%, but that class is noisy for a live-web verifier: FEVER uses a static 2018 Wikipedia snapshot, so Verity often correctly resolves a claim the annotators couldn't.
- Abstention: a separate model judge classifies each response as abstained/flagged vs answered. We publish both our first (strict, 40%) and corrected (fair, 76%) judges — the gap was a measurement flaw, not a model change.
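- Samples are balanced subsets (fixed random seed), about 100 items each for the claim-verification and abstention tests, to limit cost; the prompt-injection test uses 203 + 203 inputs. The usual small-sample uncertainty applies.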
- FEVER has known label noise (~10–20%); some "errors" against it are the benchmark's, not Verity's, and are documented above.
- The official AbstentionBench HuggingFace harness no longer loads under current tooling (deprecated dataset-script format), so abstention is evaluated on its public source datasets directly.
- Audit it: per-item results in the machine-readable reliability.json. Questions: veritylayer@gmail.com.
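To put a rough size on the small-sample uncertainty noted above, here is a sketch that computes a 95% Wilson score interval for the counts quoted on this page. The choice of the Wilson interval and the function name are our assumptions for illustration; this is not part of the published evaluation code.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the edges.
    return max(0.0, center - half), min(1.0, center + half)


# Counts quoted on this page.
for name, k, n in [
    ("FEVER accuracy", 95, 100),
    ("false-premise flag rate", 38, 50),
    ("injection false-positive rate", 0, 203),
]:
    lo, hi = wilson_interval(k, n)
    print(f"{name}: {k}/{n} = {k / n:.1%}, 95% interval {lo:.1%} to {hi:.1%}")
```

Run as written, this puts the 95/100 accuracy at roughly 89% to 98%, the 38/50 flag rate at roughly 63% to 86%, and the 0/203 false-positive rate between 0% and about 1.9%. A zero observed rate on a sample this size is consistent with a small nonzero true rate.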
Last run: June 28, 2026.