PHANTOM (NeurIPS 2025): Measuring LLM Hallucination Detection in Financial Documents
PHANTOM (NeurIPS 2025) asks the question I most wanted answered before trusting an LLM to touch a Beancount ledger: can a model actually tell when it's making things up about a financial document? The results are not reassuring, and the methodological choices are worth examining carefully.
The paper
Lanlan Ji, Dominic Seyler, Gunkirat Kaur, Manjunath Hegde, Koustuv Dasgupta, and Bing Xiang — most affiliated with IBM Research — constructed PHANTOM specifically to fill a gap that generic hallucination benchmarks leave open. Those benchmarks test short, clean contexts with well-formed queries. Financial documents are the opposite: a single 10-K filing routinely exceeds 100,000 tokens, numbers are precise to the cent, and the language is dense with domain-specific terms that have non-obvious meanings (EBITDA, deferred revenue, goodwill impairment). The core contribution is a dataset of query-answer-document triplets built from real SEC filings — 10-K annual reports, 497K mutual fund filings, and DEF 14A proxy statements — where each answer is either correct or deliberately hallucinated, validated by human annotators. The benchmark then expands that seed set to test context lengths from ~500 tokens up to 30,000 tokens, and systematically varies where the relevant information appears: at the beginning, middle, or end of the context.
Key ideas
- The task is hallucination detection, not hallucination generation: given a document chunk and an answer, classify whether the answer is grounded or fabricated. This is a simpler task than generating a grounded answer — yet models still struggle badly.
- Context length matters a lot. The seed set uses ~500-token chunks. As context grows to 10K, 20K, and 30K tokens, performance drops significantly across all models — consistent with the "Lost in the Middle" finding (arXiv:2307.03172) that LLMs degrade when relevant information is buried in the middle of a long context.
- Llama-3.3-70B-Instruct achieves the highest F1 score of 0.916 on the seed dataset — but the authors flag that this model was also used to generate the seed dataset, which is a circularity problem that inflates the number.
- Qwen3-30B-A3B-Thinking achieves F1 = 0.882, outperforming all tested closed-source models. Its non-thinking Instruct sibling scores 0.848, suggesting that test-time compute (chain-of-thought reasoning) adds real value here.
- Small models (Qwen-2.5-7B) score only slightly above random guessing on the benchmark. Hallucination detection over long financial documents appears to require substantial model capacity.
- Fine-tuning open-source models on PHANTOM data substantially improves their detection rates — the paper identifies this as the most promising direction for practitioners.
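The detection task in the first bullet reduces to a thin binary-classification wrapper around whatever model you're evaluating. A minimal sketch of that framing (the prompt wording and function names are my own, not the paper's):

```python
def build_detection_prompt(document: str, query: str, answer: str) -> str:
    """Format a PHANTOM-style query-answer-document triplet as a
    binary grounding check. The template is hypothetical; only the
    task shape (document + QA pair -> grounded/hallucinated label)
    comes from the paper."""
    return (
        "You are verifying an answer against a financial document.\n"
        f"Document:\n{document}\n\n"
        f"Question: {query}\n"
        f"Proposed answer: {answer}\n\n"
        "Reply with exactly one word: GROUNDED if every claim in the "
        "answer is supported by the document, HALLUCINATED otherwise."
    )

def parse_verdict(reply: str) -> bool:
    """Map the model's reply to True (hallucinated) / False (grounded).
    Only the first token is inspected, so a rambling reply that merely
    mentions 'hallucinated' later is not miscounted."""
    if not reply.strip():
        return False
    first = reply.strip().split()[0].upper().strip(".,:")
    return first == "HALLUCINATED"
```

Scoring is then just comparing `parse_verdict` output to the gold label over the triplet set — which is what makes the "still struggle badly" result striking: this is about as easy as an LLM eval task gets.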
What holds up — and what doesn't
The construction methodology is careful. Human annotation on the seed set, followed by systematic expansion across context lengths and placement positions, gives PHANTOM a structure that most financial NLP datasets lack. The placement variation in particular is useful: it lets you measure whether a model's failure is about total context length or about the specific U-shaped attention pattern (strong at beginning and end, weak in the middle) that has been documented across many LLM architectures.
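The placement manipulation is simple to reproduce in spirit: bury the evidence chunk among distractor chunks at a chosen position and hold total length fixed. This helper is illustrative only; it is not the paper's construction code:

```python
def place_evidence(evidence: str, fillers: list[str], position: str) -> str:
    """Assemble a long context with the evidence chunk at the
    beginning, middle, or end, padded by distractor chunks.
    position: 'begin' | 'middle' | 'end'."""
    chunks = list(fillers)  # copy so the caller's list is untouched
    idx = {"begin": 0, "middle": len(chunks) // 2, "end": len(chunks)}[position]
    chunks.insert(idx, evidence)
    return "\n\n".join(chunks)
```

Running the same detector over `begin`/`middle`/`end` variants of otherwise identical contexts is what separates a length effect from the U-shaped positional effect.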
The Llama-3.3-70B circularity is a real problem and the authors deserve credit for flagging it — but it also means the benchmark's top result is uninterpretable. For practitioners, the more useful numbers are probably the Qwen3 and Phi-4 results, where no such contamination exists.
What I wish the paper provided: the actual degradation curve as context length grows from 500 to 30,000 tokens. The paper establishes that degradation happens, and that placement matters, but I couldn't extract the specific percentage-point drops from the available materials. That granularity matters for deciding where to set a retrieval chunk size in a production system. It's also worth noting that the benchmark tests only whether a model detects a hallucination in a presented answer — it doesn't test whether the model will hallucinate when asked to produce an answer from scratch. Those are related but different failure modes, and a system that scores well on detection can still fail badly at generation.
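If per-example results were released, producing that degradation curve yourself would be straightforward. A sketch, assuming a record layout of `(context_tokens, predicted_hallucinated, is_hallucinated)` — my own layout, not PHANTOM's release format:

```python
from collections import defaultdict

def f1_by_context_length(records, buckets=(500, 10_000, 20_000, 30_000)):
    """Compute detection F1 per context-length bucket, yielding the
    degradation curve as {bucket_upper_bound: f1}. Records longer than
    the largest bucket are out of scope for this sketch."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for tokens, pred, gold in records:
        bucket = min(b for b in buckets if tokens <= b)
        c = counts[bucket]
        if pred and gold:
            c["tp"] += 1
        elif pred and not gold:
            c["fp"] += 1
        elif gold and not pred:
            c["fn"] += 1
    curve = {}
    for b, c in sorted(counts.items()):
        p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        r = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        curve[b] = 2 * p * r / (p + r) if p + r else 0.0
    return curve
```

The shape of that curve, not just its endpoints, is what tells you whether a 4K retrieval chunk is safe or whether you need to stay under 1K.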
Finally, the dataset covers three SEC filing types. That's a meaningful slice of financial document space, but it leaves out earnings call transcripts, audit reports, covenant clauses in loan agreements, and the kind of ad-hoc journal entry descriptions that fill a Beancount ledger. Generalization to those formats is an open question.
Why this matters for finance AI
Hallucination is the trust problem for every autonomous accounting agent I can imagine building on top of Beancount. The write-back scenario is the worst case: an agent that reads a bank statement, classifies a transaction, and posts a journal entry. If it hallucinates the payee, the amount, or the account code, the ledger is silently wrong. PHANTOM is the first benchmark I've seen that tries to measure whether models can catch this class of error in realistic document conditions.
The finding that small models (7B) perform near random on hallucination detection is directly relevant to Bean Labs: if we're running an on-device or low-latency agent, we can't rely on a 7B model to self-verify its own output. We need either a larger verifier model, an external retrieval check, or a constrained output format that makes hallucinations structurally impossible (e.g., forcing the model to cite a line number from the source document before posting an entry). The fine-tuning result is encouraging: domain-specific adaptation on PHANTOM-style data seems to recover much of the detection capability even for smaller models, which suggests that a fine-tuned verifier could be a practical component in a write-back pipeline.
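The "structurally impossible" option sketches out naturally as a hard gate in the write-back path: refuse to post any entry whose cited source line doesn't contain the claimed amount. The entry schema here is a hypothetical output contract of my own, not anything from PHANTOM or Beancount:

```python
def verify_citation(source_lines: list[str], entry: dict) -> bool:
    """Gate a proposed journal entry on its own citation: the agent
    must name a 1-based line number in the source document, and the
    claimed amount must appear verbatim on that line. Anything else
    is rejected before it can touch the ledger."""
    n = entry.get("cite_line")
    if not isinstance(n, int) or not (1 <= n <= len(source_lines)):
        return False  # missing or out-of-range citation: refuse to post
    return entry["amount"] in source_lines[n - 1]
```

This doesn't catch every hallucination (a wrong payee on a correctly cited line slips through), but it converts the most damaging failure — a fabricated amount — from a silent error into a hard rejection.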
What to read next
- SelfCheckGPT (Manakul et al., arXiv:2303.08896) — sample-based hallucination detection without a reference document; complements PHANTOM's reference-grounded approach and may generalize better to open-ended ledger annotations
- "Lost in the Middle" (Liu et al., arXiv:2307.03172) — the foundational paper on positional attention degradation in long contexts; the PHANTOM placement results are essentially an applied replication of this in the financial domain
- FinanceBench (Islam et al., 2023) — the QA benchmark over SEC filings that showed GPT-4 Turbo with retrieval failing on 81% of a 150-case sample; pairs well as a generation-side complement to PHANTOM's detection-side view
