FinAuditing: LLMs Score Under 14% on Real SEC XBRL Auditing Tasks
FinAuditing benchmarks LLMs against the structured complexity of real SEC XBRL filings—not the polished QA pairs that dominate financial NLP leaderboards. I'm reading it now because the Bean Labs audit agenda keeps circling back to a question existing benchmarks can't answer: can a model hold an entire structured filing in memory and verify its internal consistency?
The paper
Wang et al. introduce FinAuditing, a benchmark of 1,102 instances drawn from 218 XBRL filings on SEC EDGAR, covering error types catalogued by the XBRL US Data Quality Committee (DQC). XBRL is the machine-readable format the SEC requires for all public company filings; each filing bundles an instance document (reported numbers), a taxonomy schema (valid accounting concepts), and four linkbases—calculation, presentation, definition, and label—that specify how concepts relate to each other. The benchmark operationalizes three auditing subtasks: Financial Semantic Matching (FinSM, retrieve the correct taxonomy concept for a reported fact), Financial Relationship Extraction (FinRE, classify the relation between two taxonomy nodes), and Financial Mathematical Reasoning (FinMR, verify that reported figures satisfy taxonomy-defined calculation rules). Instances average 33,848 tokens—at or beyond the effective context limit of many open-source models—and all 13 models are tested zero-shot.
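To make the task shapes concrete, here is a minimal sketch of what a FinMR-style check looks like: an instance document's reported facts plus one calculation-linkbase rule, verified with exact arithmetic. The concept names and amounts are toy examples, not drawn from the benchmark; real filings use full US-GAAP tags and deep multi-level calculation trees.

```python
from decimal import Decimal

# Instance document: reported facts keyed by taxonomy concept
# (hypothetical values; real filings carry thousands of facts).
facts = {
    "us-gaap:Revenues": Decimal("1000"),
    "us-gaap:CostOfRevenue": Decimal("600"),
    "us-gaap:GrossProfit": Decimal("400"),
}

# Calculation linkbase arc: GrossProfit = (+1)*Revenues + (-1)*CostOfRevenue.
# XBRL calculation arcs attach a signed weight to each child concept.
calc_rule = {
    "total": "us-gaap:GrossProfit",
    "children": [
        ("us-gaap:Revenues", Decimal(1)),
        ("us-gaap:CostOfRevenue", Decimal(-1)),
    ],
}

def verify_calc(facts, rule):
    """Return True if the weighted children sum to the reported total."""
    expected = sum(f * w for c, w in rule["children"] for f in [facts[c]])
    return expected == facts[rule["total"]]

print(verify_calc(facts, calc_rule))  # -> True
```

FinMR asks the model to perform this verification itself, across rules with many more terms; FinSM and FinRE are the upstream steps of finding the right concept keys and classifying the arcs between them.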
Key ideas
- FinSM is essentially taxonomy retrieval: given a fact in the filing, find the right US-GAAP concept. DeepSeek-V3 tops the field at 12.42% Hit Rate@20, meaning the correct concept appears anywhere in its top-20 ranked candidates fewer than one time in eight. GPT-4o manages 9.09%.
- FinRE (classifying linkbase relations) is the easiest task: GPT-4o reaches 91.82% accuracy and 90.09 Macro F1. But Qwen3-32B and Fino1-14B—both marketed as finance-capable—score 0.00%, apparently collapsing on the CombinationErr relation type.
- FinMR is brutal: Fino1-14B leads at 13.86% accuracy; most models sit in single digits. Error analysis attributes 70–83% of failures to arithmetic mistakes across multi-step calculation rules, with structural formatting errors accounting for 9–71% depending on the model.
- The source data is 4,545 DQC error messages from real filings (2020–2024)—not synthetic adversarial examples. The benchmark selects the 9 most frequent error types, covering 60.33% of real-world DQC violations.
- Domain-specialized models (Fino1-14B, FinR1) do not systematically beat general-purpose large models; Fino1-14B leads only on FinMR, and even there its 13.86% is barely above noise.
What holds up—and what doesn't
The benchmark is valuable precisely because it escapes the QA-pair format: success requires understanding linkbase relationships, not just matching a question to a text span. Grounding instance construction in DQC violations makes it reproducible and directly tied to the real audit process.
That said, I have reservations. The FinRE results are puzzling: a spread from GPT-4o at 91.82% down to 0.00% for domain-capable models almost certainly reflects prompt sensitivity and output-format mismatch rather than genuine differences in reasoning ability. The paper tests all models zero-shot, without ablating prompt format or providing few-shot baselines, so there is no way to tell whether the 0.00% scores represent reasoning failures or simple parsing failures. The LLM-as-judge framework used for FinMR introduces another layer of evaluation noise.
The headline claim—"accuracy drops of 60–90% over hierarchical multi-document structures"—also needs a clearer anchor. It is not obvious whether this compares against human performance, single-document versions of the same tasks, or flattened (non-hierarchical) variants. The direction is right, but without that baseline the magnitude is hard to interpret.
Why this matters for finance AI
Beancount files are not XBRL, but they share the key structural properties: a hierarchical account namespace (analogous to the taxonomy schema), double-entry constraints that must balance (analogous to calculation linkbases), and typed entries that reference canonical categories (analogous to concept-to-instance matching). The FinMR failure mode—models making arithmetic mistakes across multi-step calculation rules—is exactly what matters for Beancount balance verification. If GPT-4o cannot reliably verify that US-GAAP addition trees sum correctly in an XBRL filing, it almost certainly cannot be trusted to verify complex account hierarchies in a ledger without offloading arithmetic to an external tool (PAL-style).
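What that offloading looks like in practice is simple: let the model propose the account structure, and let exact decimal arithmetic do the verification. A minimal sketch for a Beancount-style hierarchy, with made-up account names and balances:

```python
from collections import defaultdict
from decimal import Decimal

# Leaf balances as an agent might extract them from a ledger
# (hypothetical accounts and amounts, for illustration only).
balances = {
    "Assets:Cash": Decimal("1200.00"),
    "Assets:Brokerage": Decimal("8800.00"),
    "Liabilities:CreditCard": Decimal("-300.00"),
    "Equity:Opening": Decimal("-9700.00"),
}

def rollup(balances):
    """Sum each leaf balance into every ancestor of its colon-separated path."""
    totals = defaultdict(Decimal)
    for account, amount in balances.items():
        parts = account.split(":")
        for i in range(1, len(parts) + 1):
            totals[":".join(parts[:i])] += amount
    return totals

totals = rollup(balances)
print(totals["Assets"])  # -> Decimal('10000.00')

# The double-entry invariant: all balances sum to zero.
assert sum(balances.values()) == Decimal("0")
```

The point is that neither the rollup nor the zero-sum check ever passes through the model's arithmetic; the LLM's job shrinks to mapping entries into the hierarchy, which is exactly the FinSM-shaped subtask.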
The FinSM numbers are a direct warning for any Beancount agent that maps user-typed account names or transaction descriptions to a canonical chart of accounts. Even the best model surfaces the right concept in its top 20 candidates fewer than 13% of the time. Ranking-based retrieval is nowhere near production-ready without a specialized retriever or fine-tuning on the target taxonomy.
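The upside is that the same Hit Rate@k metric is cheap to compute over your own chart of accounts, so any candidate retriever can be checked before an LLM ever enters the loop. A sketch with `difflib` as a stand-in retriever; the chart entries and queries are hypothetical, and a real chart would have hundreds or thousands of accounts:

```python
import difflib

# Toy canonical chart of accounts (hypothetical entries).
chart = [
    "Expenses:Food:Restaurants",
    "Expenses:Food:Groceries",
    "Expenses:Transport:Fuel",
    "Assets:Bank:Checking",
    "Liabilities:CreditCard",
]

def retrieve(query, chart, k=20):
    """Rank chart entries by fuzzy string similarity to the query."""
    score = lambda c: difflib.SequenceMatcher(None, query.lower(), c.lower()).ratio()
    return sorted(chart, key=score, reverse=True)[:k]

def hit_rate_at_k(examples, chart, k=20):
    """Fraction of (query, gold_account) pairs where gold lands in the top k."""
    hits = sum(gold in retrieve(query, chart, k) for query, gold in examples)
    return hits / len(examples)

examples = [
    ("dinner at a restaurant", "Expenses:Food:Restaurants"),
    ("gas station fill-up", "Expenses:Transport:Fuel"),
]
print(hit_rate_at_k(examples, chart, k=2))
```

Pure string similarity will fail on paraphrased descriptions, which is precisely where a fine-tuned retriever or embedding model earns its keep; the metric harness stays the same either way.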
The non-result for domain-specialized models is instructive: raw scale and structured prompting still determine outcomes more than financial pretraining for this class of structured reasoning task.
What to read next
- From Local to Global: A Graph RAG Approach to Query-Focused Summarization (arXiv:2404.16130) — the hierarchical XBRL linkbase structure is exactly the kind of graph-over-documents that Microsoft's GraphRAG targets; worth reading as an architectural response to FinAuditing's retrieval failures
- FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information (arXiv:2505.20650) — from overlapping authors, focuses on mapping financial facts to taxonomy concepts (the upstream task before auditing); complements FinAuditing's scope
- Towards Verifiably Safe Tool Use for LLM Agents (arXiv:2601.08012) — if models can't verify calculations reliably at zero-shot, the answer may be formal verification tooling layered over agent actions rather than better prompting
