Benchmarks

Reproducible results on public benchmarks.

Every number on this page can be reproduced through our public API. We document our methodology and make our evaluation code and raw results available so that researchers and buyers can verify our claims independently.

Headline results

v51h — current production model

A Qwen3-32B backbone augmented with the CAEF architecture (FAH + CWMI + ESR + MLP probe). Trained on 10,416 labeled hallucination pairs from HaluEval, TruthfulQA, HellaSwag, Winogrande, MNLI, and ARC-Challenge.

98.9%

F1 Score

HaluEval QA · 10,000 samples

93.9%

F1 Score

TruthfulQA · 817 samples

100%

Comparison Accuracy

FreshQA · 19 pairs

83.3%

Accuracy

BullshitBench v2 · 24 tests

Full results

Detailed benchmark breakdown

Benchmark	Samples	Recall	FPR	F1
HaluEval QA (Li et al., 2023)	10,000	100.0%	2.2%	98.9%
TruthfulQA (Lin et al., 2022)	817	92.2%	4.0%	93.9%
FreshQA (Vu et al., 2023)	19	68.4%	0.0%	81.3%
BullshitBench v2 (Gostev, 2026)	24	73.3%	22.2%	83.3%

All benchmarks were evaluated against the deployed production API. Observed response times ranged from 71 to 123 ms on a RunPod serverless A100 80GB. Benchmark scripts and raw results are available on request.
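The Recall, FPR, and F1 columns are linked through the underlying confusion counts, so one can be sanity-checked from the others. The sketch below is illustrative arithmetic, not our evaluation code: `f1_from_rates` is a hypothetical helper, and the class counts passed to it are an assumption (19 FreshQA pairs read as 19 hallucinated and 19 correct claims). With an FPR of 0, the number of negatives does not affect the result.

```python
def f1_from_rates(recall: float, fpr: float, n_pos: int, n_neg: int) -> float:
    """Recover F1 from recall and FPR, given the class counts.

    n_pos: number of hallucinated claims (positives)
    n_neg: number of correct claims (negatives)
    """
    tp = recall * n_pos   # hallucinations caught
    fn = n_pos - tp       # hallucinations missed
    fp = fpr * n_neg      # true facts incorrectly flagged
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# FreshQA row: recall 68.4% corresponds to 13 of 19, FPR is 0.0%.
# Class counts are an assumption based on "19 pairs".
freshqa_f1 = f1_from_rates(recall=13 / 19, fpr=0.0, n_pos=19, n_neg=19)
print(f"{freshqa_f1:.1%}")
```

Because the result depends on the class balance whenever FPR is nonzero, the same check on other rows needs the actual positive and negative counts for that benchmark.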

Methodology

How we benchmark.

Comparison accuracy

For paired benchmarks (HaluEval, TruthfulQA, FreshQA), we score each pair and check whether the hallucinated version scores higher than the correct version. This avoids threshold tuning.
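A minimal sketch of the comparison-accuracy check described above. The `score` callable stands in for whatever returns a hallucination score for a piece of text (for example, a wrapper around an API call); its name and signature are assumptions, and `toy_score` is a made-up stand-in used only to make the example runnable. Treating ties as misses is also an assumption.

```python
from typing import Callable, Iterable, Tuple


def comparison_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (correct, hallucinated) pairs in which the hallucinated
    text receives the strictly higher score. Ties count as misses."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(
        1 for correct, hallucinated in pairs
        if score(hallucinated) > score(correct)
    )
    return wins / len(pairs)


def toy_score(text: str) -> float:
    # Hypothetical scorer for illustration only: flags any text mentioning Mars.
    return 0.9 if "Mars" in text else 0.1


example_pairs = [
    ("The Moon orbits Earth.", "The Moon orbits Mars."),
    ("Paris is in France.", "Paris is on Mars."),
    ("Water is wet.", "Water is dry."),  # toy scorer ties here, so this is a miss
]
accuracy = comparison_accuracy(example_pairs, toy_score)
```

No threshold appears anywhere in this computation: only the relative order of the two scores within each pair matters.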

F1, Recall, FPR

For single-claim benchmarks, we use a fixed threshold (0.39) optimized on a held-out split of the training data. Recall measures the fraction of hallucinations caught; FPR measures the fraction of true facts incorrectly flagged.
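The definitions above can be sketched as follows. This is an illustration under stated assumptions rather than our evaluation script: the names `threshold_metrics` and `Metrics` are hypothetical, and flagging a claim when its score is greater than or equal to the threshold (rather than strictly greater) is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence

THRESHOLD = 0.39  # fixed threshold quoted in the text


@dataclass
class Metrics:
    recall: float
    fpr: float
    f1: float


def threshold_metrics(
    scores: Sequence[float],
    is_hallucination: Sequence[bool],
    threshold: float = THRESHOLD,
) -> Metrics:
    """Compute recall, FPR, and F1 for single-claim scoring."""
    if len(scores) != len(is_hallucination):
        raise ValueError("scores and labels must have the same length")
    tp = fp = fn = tn = 0
    for s, label in zip(scores, is_hallucination):
        flagged = s >= threshold  # assumption: inclusive comparison
        if label and flagged:
            tp += 1
        elif label:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # hallucinations caught
    fpr = fp / (fp + tn) if (fp + tn) else 0.0     # true facts wrongly flagged
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return Metrics(recall=recall, fpr=fpr, f1=f1)


# Made-up scores for illustration: three hallucinations, then three true facts.
demo = threshold_metrics(
    scores=[0.9, 0.5, 0.2, 0.6, 0.1, 0.3],
    is_hallucination=[True, True, True, False, False, False],
)
```

In the demo, two of three hallucinations are flagged and one of three true facts is incorrectly flagged, so both the recall and the F1 come out at 2/3 and the FPR at 1/3.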

No test-set leakage

Training data excludes the test splits of all benchmark datasets. The HaluEval QA test set (10K samples) was never seen during training. TruthfulQA training used only 805 pairs drawn from a disjoint set.
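One way a reader could check a disjointness claim like this on their own data is to fingerprint normalized text and look for collisions between splits. The sketch below is a generic technique, not a description of our pipeline; `normalize`, `fingerprint`, and `find_overlap` are hypothetical helper names, and the example strings are made up.

```python
import hashlib
import re
from typing import Iterable, Set


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def fingerprint(text: str) -> str:
    """Stable hash of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_overlap(train_texts: Iterable[str], test_texts: Iterable[str]) -> Set[str]:
    """Return the test items whose normalized form also appears in training."""
    train_hashes = {fingerprint(t) for t in train_texts}
    return {t for t in test_texts if fingerprint(t) in train_hashes}


train = ["What is the capital of France?", "Who wrote Hamlet?"]
test = ["what is the capital of  france", "Who painted the Mona Lisa?"]
leaked = find_overlap(train, test)
```

This only catches duplicates that match exactly after normalization. Paraphrased or partially overlapping items would need a fuzzier comparison such as n-gram overlap.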

Reproducibility

Every number above can be reproduced by calling our public API on the respective benchmark datasets. The evaluation notebook used for FreshQA and BullshitBench is available on request.

Known limitations

Where AgnosLogic struggles.

We publish failure modes because hiding them doesn't make them go away.

Novel fabrications about non-existent entities

When the model has zero training signal on an entity (e.g., "Grok-4 has 314B parameters"), there is no contradictory representation to detect. The model generates plausible text with hidden states similar to genuine knowledge. This is a fundamental ceiling.

Adversarial paraphrases of training data

Because HaluEval uses a paired format, the detector may be relying partly on surface cues rather than on hidden-state semantics alone. Our 98.9% F1 on HaluEval should be interpreted in that context.

Model-specific training required

The current v51h model works only with Qwen3-32B. Porting to Gemma, LLaMA, or GPT-OSS requires retraining the probes. We are working on a model-agnostic approach for v52.
