Reproducible results on public benchmarks.
Every number on this page can be reproduced through our public API. We publish our methodology, evaluation code, and raw results so researchers and buyers can verify our claims independently.
v51h — current production model
A Qwen3-32B backbone augmented with the CAEF architecture (FAH + CWMI + ESR + MLP probe). Trained on 10,416 labeled hallucination pairs drawn from HaluEval, TruthfulQA, HellaSwag, Winogrande, MNLI, and ARC-Challenge.
Detailed benchmark breakdown
| Benchmark | Samples | Recall | FPR | F1 |
|---|---|---|---|---|
| HaluEval QA (Li et al., 2023) | 10,000 | 100.0% | 2.2% | 98.9% |
| TruthfulQA (Lin et al., 2022) | 817 | 92.2% | 4.0% | 93.9% |
| FreshQA (Vu et al., 2023) | 19 | 68.4% | 0.0% | 81.3% |
| BullshitBench v2 (Gostev, 2026) | 24 | 73.3% | 22.2% | 83.3% |
How we benchmark.
Comparison accuracy
For paired benchmarks (HaluEval, TruthfulQA, FreshQA), we score both versions in each pair and count the pair as correct when the hallucinated version scores higher than the correct version. Only the relative order within a pair matters, so no threshold has to be tuned.
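The pairwise check described above can be sketched in a few lines. This is an illustration only, not our evaluation code: the function name `comparison_accuracy`, the tuple layout, and the choice to count ties as misses are all assumptions made for the example.

```python
from typing import Sequence, Tuple


def comparison_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the hallucinated text outscores the correct one.

    Each pair is (score_hallucinated, score_correct). Ties count as misses,
    which is an assumption of this sketch.
    """
    if not pairs:
        raise ValueError("at least one pair is required")
    wins = sum(1 for hallucinated, correct in pairs if hallucinated > correct)
    return wins / len(pairs)


# Made-up scores, purely for illustration.
example_pairs = [(0.91, 0.12), (0.40, 0.55), (0.73, 0.30), (0.50, 0.50)]
accuracy = comparison_accuracy(example_pairs)
print(f"comparison accuracy: {accuracy:.2f}")
```

Because the metric depends only on the ordering inside each pair, rescaling or shifting all scores leaves it unchanged.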
F1, Recall, FPR
For single-claim benchmarks, we use a fixed threshold (0.39) optimized on a held-out split of the training data. Recall measures the fraction of hallucinations caught; FPR (false positive rate) measures the fraction of true facts incorrectly flagged.
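As a rough sketch of how these three metrics follow from a fixed threshold, the snippet below computes them from per-claim scores and labels. The function name `threshold_metrics`, the input layout, and the rule that a score at or above the threshold counts as flagged are assumptions of this example; only the 0.39 value comes from the text above.

```python
from typing import Dict, Sequence

THRESHOLD = 0.39  # fixed threshold quoted above


def threshold_metrics(
    scores: Sequence[float],
    is_hallucination: Sequence[bool],
    threshold: float = THRESHOLD,
) -> Dict[str, float]:
    """Compute recall, FPR, and F1 for claims flagged at a fixed threshold."""
    if len(scores) != len(is_hallucination):
        raise ValueError("scores and labels must have the same length")
    tp = fp = fn = tn = 0
    for score, label in zip(scores, is_hallucination):
        flagged = score >= threshold  # assumption: ">=" rather than ">"
        if flagged and label:
            tp += 1
        elif flagged and not label:
            fp += 1
        elif not flagged and label:
            fn += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"recall": recall, "fpr": fpr, "f1": f1}


# Made-up scores: three hallucinations followed by three true facts.
metrics = threshold_metrics(
    [0.9, 0.5, 0.2, 0.6, 0.1, 0.3],
    [True, True, True, False, False, False],
)
print(metrics)
```

Recall and FPR are each computed within one class, while F1 also depends on the ratio of hallucinations to true facts in the benchmark.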
No test-set leakage
Training data excludes the test splits of all benchmark datasets. The HaluEval QA test set (10,000 samples) was never seen during training. TruthfulQA training used only 805 pairs from a disjoint set.
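One way a reader could check for overlap between a training set and a test set is to compare normalized text fingerprints. The sketch below is an assumption about how such a check might look, not a description of our pipeline; `normalize`, `fingerprint`, and `find_leaks` are hypothetical helpers, and exact matching after normalization will not catch paraphrases.

```python
import hashlib
import re
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def fingerprint(text: str) -> str:
    """Stable hash of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_leaks(train_texts: Iterable[str], test_texts: Iterable[str]) -> List[str]:
    """Return test items whose normalized form also appears in the training set."""
    train_prints = {fingerprint(t) for t in train_texts}
    return [t for t in test_texts if fingerprint(t) in train_prints]


# Made-up examples, purely for illustration.
train = ["What is the capital of France?", "Who wrote Hamlet?"]
test = ["what is the  capital of france", "Who painted the Mona Lisa?"]
leaks = find_leaks(train, test)
print(leaks)
```

An empty result only rules out near-verbatim duplicates; catching reworded items would need a fuzzier comparison such as n-gram overlap.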
Reproducibility
Every number above can be reproduced by calling our public API on the respective benchmark datasets. The evaluation notebook used for FreshQA and BullshitBench is available on request.
Where AgnosLogic struggles.
We publish failure modes because hiding them doesn't make them go away.
Novel fabrications about non-existent entities
When the model has zero training signal on an entity (e.g., "Grok-4 has 314B parameters"), there is no contradictory representation to detect. The model generates plausible text whose hidden states resemble those produced by genuine knowledge. This is a fundamental ceiling.
Adversarial paraphrases of training data
HaluEval's paired format may allow hallucinations to be detected partly from surface cues rather than from hidden-state semantics alone. Our 98.9% F1 on HaluEval should be interpreted in that context.
Model-specific training required
The current v51h model works only with Qwen3-32B. Porting to Gemma, LLaMA, or GPT-OSS requires retraining the probes. We are working on a model-agnostic approach for v52.
Test it yourself.
Sign up for a free API key and run these benchmarks on your own data.