Benchmarks

Reproducible results on public benchmarks.

Every number on this page is measurable via our public API. We publish our methodology, evaluation code, and raw results so researchers and buyers can verify our claims independently.

Headline results

v51h — current production model

A Qwen3-32B backbone augmented with the CAEF architecture (FAH + CWMI + ESR + MLP probe). Trained on 10,416 labeled hallucination pairs from HaluEval, TruthfulQA, HellaSwag, Winogrande, MNLI, and ARC-Challenge.

F1 Score: 98.9% (HaluEval QA · 10,000 samples)
F1 Score: 93.9% (TruthfulQA · 817 samples)
Comparison Accuracy: 100% (FreshQA · 19 pairs)
Accuracy: 83.3% (BullshitBench v2 · 24 tests)

Full results

Detailed benchmark breakdown

Benchmark                          Samples   Recall   FPR      F1
HaluEval QA (Li et al., 2023)      10,000    100.0%   2.2%     98.9%
TruthfulQA (Lin et al., 2022)      817       92.2%    4.0%     93.9%
FreshQA (Vu et al., 2023)          19        68.4%    0.0%     81.3%
BullshitBench v2 (Gostev, 2026)    24        73.3%    22.2%    83.3%

All benchmarks were evaluated against the deployed production API. Observed response times ranged from 71 to 123 ms on RunPod serverless A100 80GB. Benchmark scripts and raw results are available on request.

Methodology

How we benchmark.

Comparison accuracy

For paired benchmarks (HaluEval, TruthfulQA, FreshQA), we score both versions in each pair and check whether the hallucinated version scores higher than the correct version. Because only the relative ordering within a pair matters, no decision threshold needs to be tuned.
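
A minimal sketch of the pairwise check described above, assuming both versions in each pair have already been scored and that a higher score means "more likely hallucinated". The function name and the rule that ties count as misses are assumptions made for illustration, not the published evaluation code.

```python
from typing import Iterable, Tuple


def comparison_accuracy(pairs: Iterable[Tuple[float, float]]) -> float:
    """Fraction of pairs where the hallucinated version outscores the correct one.

    Each pair is (score_correct, score_hallucinated).
    Ties count as misses (an assumption of this sketch).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one scored pair")
    wins = sum(1 for correct, hallucinated in pairs if hallucinated > correct)
    return wins / len(pairs)


# Made-up scores, for illustration only.
example_pairs = [(0.10, 0.85), (0.30, 0.20), (0.05, 0.60), (0.40, 0.40)]
print(comparison_accuracy(example_pairs))  # prints 0.5 (2 of 4 pairs ordered correctly)
```

No threshold appears anywhere in the calculation, which is the property the paired setup relies on.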

F1, Recall, FPR

For single-claim benchmarks, we use a fixed threshold (0.39) optimized on a held-out split of the training data. Recall is the fraction of hallucinations caught; FPR (false positive rate) is the fraction of true facts incorrectly flagged.
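
The definitions above can be sketched as follows. The 0.39 threshold comes from the text; the function name, the label encoding (1 = hallucination, 0 = true fact), and flagging at `score >= threshold` rather than strictly above it are assumptions of this sketch.

```python
from typing import Dict, Sequence

THRESHOLD = 0.39  # fixed threshold quoted in the text


def detection_metrics(scores: Sequence[float], labels: Sequence[int],
                      threshold: float = THRESHOLD) -> Dict[str, float]:
    """Recall, FPR, precision and F1 for labels 1 = hallucination, 0 = true fact."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold  # assumed convention: flag at or above threshold
        if label == 1:
            tp += flagged
            fn += not flagged
        else:
            fp += flagged
            tn += not flagged
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"recall": recall, "fpr": fpr, "precision": precision, "f1": f1}


# Made-up scores and labels, for illustration only.
example_scores = [0.90, 0.50, 0.20, 0.10, 0.45, 0.30]
example_labels = [1, 1, 1, 0, 0, 0]
metrics = detection_metrics(example_scores, example_labels)
print(metrics)
```

In this toy data, two of three hallucinations are flagged and one of three true facts is flagged, so recall is 2/3 and FPR is 1/3.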

No test-set leakage

Training data excludes the test splits of all benchmark datasets. The HaluEval QA test set (10K) was never seen during training. TruthfulQA training used only 805 pairs from a disjoint set.
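
One way to spot-check a disjointness claim like this on your own data is to fingerprint every item and look for collisions. The sketch below is an illustrative check, not the procedure used to build the training set; the helper names are hypothetical, and it only catches matches that are identical after lowercasing and whitespace normalization, not paraphrases.

```python
import hashlib
from typing import Iterable, Set


def fingerprint(text: str) -> str:
    """SHA-256 of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_overlap(train_texts: Iterable[str], test_texts: Iterable[str]) -> Set[str]:
    """Return the test items whose normalized text also appears in the training set."""
    train_hashes = {fingerprint(t) for t in train_texts}
    return {t for t in test_texts if fingerprint(t) in train_hashes}


# Made-up items, for illustration only.
train = ["What is the capital of France?", "Who wrote Hamlet?"]
test = ["WHO WROTE HAMLET?", "What is the boiling point of water?"]
leaked = find_overlap(train, test)
print(leaked)  # an empty set would mean no normalized-exact overlap
```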

Reproducibility

Every number above can be reproduced by calling our public API on the respective benchmark datasets. The evaluation notebook used for FreshQA and BullshitBench is available on request.
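
A reproduction run could be structured roughly as below. This is a sketch under stated assumptions: `score_claim` stands in for whatever function you write to call the API (the endpoint and request format are not specified on this page, so none is shown), `dummy_scorer` is a placeholder, and the item fields `id`, `claim`, and `label` are hypothetical.

```python
import json
from typing import Callable, Dict, Iterable, List


def run_benchmark(items: Iterable[Dict],
                  score_claim: Callable[[str], float]) -> List[Dict]:
    """Score every item and keep one raw record per item for later auditing."""
    records = []
    for item in items:
        records.append({
            "id": item["id"],
            "label": item["label"],
            "score": score_claim(item["claim"]),
        })
    return records


def dummy_scorer(claim: str) -> float:
    # Placeholder only: replace with your own wrapper around the real API call.
    return 0.9 if "made of cheese" in claim.lower() else 0.1


# Made-up items, for illustration only.
items = [
    {"id": "ex-1", "claim": "The Moon is made of cheese.", "label": 1},
    {"id": "ex-2", "claim": "Water boils at 100 °C at sea level.", "label": 0},
]
records = run_benchmark(items, dummy_scorer)

# One JSON object per line, so raw results can be shared and re-scored later.
raw_jsonl = "\n".join(json.dumps(r, sort_keys=True) for r in records)
print(raw_jsonl)
```

Keeping the per-item scores, rather than only the aggregate metrics, lets anyone recompute recall, FPR, and F1 from the same raw file.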

Known limitations

Where AgnosLogic struggles.

We publish failure modes because hiding them doesn't make them go away.

Novel fabrications about non-existent entities

When the model has zero training signal on an entity (e.g., "Grok-4 has 314B parameters"), there is no contradictory representation to detect. The model generates plausible text with hidden states similar to genuine knowledge. This is a fundamental ceiling.

Adversarial paraphrases of training data

HaluEval's paired format may be detected partly by surface cues rather than hidden-state semantics alone. Our 98.9% F1 on HaluEval should be interpreted in that context.

Model-specific training required

The current v51h model works only with Qwen3-32B. Porting to Gemma, LLaMA, or GPT-OSS requires retraining the probes. We are working on a model-agnostic approach for v52.

Test it yourself.

Sign up for a free API key and run these benchmarks on your own data.
