AuditCopilot: LLMs for Fraud Detection in Double-Entry Bookkeeping
The paper I am reading this week is AuditCopilot: Leveraging LLMs for Fraud Detection in Double-Entry Bookkeeping (arXiv:2512.02726), submitted in December 2025 by Kadir, Macharla Vasu, Nair, and Sonntag. It sits at the intersection of LLM agent research and financial compliance: using foundation models to detect fraudulent journal entries in real corporate ledgers. Of all the papers in the Bean Labs reading list so far, this is the one most directly concerned with the same raw data format we care about.
The paper
Every public company audit — mandated by PCAOB Auditing Standard AS 2401 — must include Journal Entry Testing (JET): systematic checks over the ledger for entries that trip rule-based heuristics. The rules are things like "entry posted after midnight," "round-number amount," "unusual account pair," or "entry posted by a rarely active user." These rules work, but they generate enormous volumes of false positives: auditors spend most of their time dismissing obvious noise.
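These heuristics are simple predicates over a journal entry. A minimal sketch, with hypothetical field names (the paper does not specify its exact schema):

```python
from datetime import datetime

# Hypothetical journal-entry record; field names are illustrative only.
entry = {
    "posting_time": datetime(2024, 3, 14, 1, 30),  # posted at 01:30
    "amount": 50000.00,
    "account_pair": ("Revenue", "Cash"),
    "user_id": "jsmith",
}

def jet_flags(entry, rare_users, usual_pairs):
    """Classic JET-style heuristics of the kind described above."""
    flags = []
    if not (6 <= entry["posting_time"].hour < 22):
        flags.append("posted outside business hours")
    if entry["amount"] % 1000 == 0:
        flags.append("round-number amount")
    if entry["account_pair"] not in usual_pairs:
        flags.append("unusual account pair")
    if entry["user_id"] in rare_users:
        flags.append("rarely active user")
    return flags

print(jet_flags(entry, rare_users={"jsmith"}, usual_pairs={("Revenue", "AR")}))
```

Each rule fires independently, which is exactly why the false-positive volume compounds: every benign round-number, after-hours entry generates a hit.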
AuditCopilot asks whether LLMs can replace or augment these rules. The system passes each journal entry — structured as a JSON-like text snippet with fields for posting date, debit/credit amounts, account IDs, tax rates, and a set of precomputed binary anomaly flags — to an LLM prompt that returns a binary anomaly label and a natural-language explanation. The authors benchmark Mistral-8B, Gemma-2B, Gemma-7B, and Llama-3.1-8B on both a synthetic enterprise ledger and a single real-world anonymized tax ledger, comparing against traditional JETs and an Isolation Forest baseline.
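The shape of that per-entry input can be illustrated as follows; the field names and prompt wording here are my own stand-ins, not the paper's exact format:

```python
import json

# Illustrative journal-entry snippet; the paper's schema may differ.
journal_entry = {
    "posting_id": "JE-004217",
    "posting_date": "2024-11-30",
    "debit": 125000.00,
    "credit": 125000.00,
    "account_ids": ["4000-REV", "1000-CASH"],
    "tax_rate": 0.19,
    "anomaly_flags": {"after_hours": 1, "round_amount": 0, "rare_user": 1},
}

prompt = (
    "You are an audit assistant. Given the journal entry below, answer with "
    "a binary label (ANOMALY / NORMAL) and a one-sentence justification.\n\n"
    + json.dumps(journal_entry, indent=2)
)
print(prompt)
```

The precomputed binary anomaly flags ride along inside the entry, so the model sees both the raw fields and the heuristic signals at once.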
Key ideas
- On the synthetic dataset (5,000 posting IDs, ~1% true anomaly rate), Mistral-8B with the full prompt achieves Precision 0.90, Recall 0.98, F1 0.94 — compared to the JET baseline's Precision 0.53, Recall 0.90, F1 0.50, and critically only 12 false positives versus JET's 942.
- The "full" AuditCopilot prompt includes not just the raw entry features but also global dataset statistics (mean, median, 95th and 99th-percentile amounts) and a pre-computed Isolation Forest score per row. This context engineering is load-bearing.
- On the real-world dataset, Gemma-7B with the full prompt reaches Precision 0.89, Recall 0.78, F1 0.83. When the Isolation Forest hint is removed, precision collapses to 0.14 — the LLM alone is not carrying the weight.
- The explanations are the system's most defensible contribution: unlike a numeric anomaly score, each flagged entry comes with a prose justification ("this amount exceeds the 99th percentile for this account cluster and is posted outside business hours"), which an auditor can quickly accept or dismiss.
- No fine-tuning anywhere. Everything runs zero-shot or with a brief system-role prompt, which is good for deployment cost but also means the results are very prompt-template dependent.
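The enriched context in the "full" prompt amounts to a few lines of preprocessing. A sketch assuming a single numeric amount feature and scikit-learn's IsolationForest (the paper's actual feature set is richer):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
amounts = rng.lognormal(mean=8, sigma=1, size=5000)  # stand-in ledger amounts

# Global dataset statistics of the kind the full prompt includes.
stats = {
    "mean": float(np.mean(amounts)),
    "median": float(np.median(amounts)),
    "p95": float(np.percentile(amounts, 95)),
    "p99": float(np.percentile(amounts, 99)),
}

# Per-row Isolation Forest score; lower means more anomalous.
X = amounts.reshape(-1, 1)
scores = IsolationForest(random_state=0).fit(X).score_samples(X)

# Context string prepended to the entry in the prompt (illustrative format).
row = 0
context = (f"dataset mean={stats['mean']:.2f}, median={stats['median']:.2f}, "
           f"p95={stats['p95']:.2f}, p99={stats['p99']:.2f}; "
           f"isolation_forest_score={scores[row]:.3f}")
print(context)
```

Note how little "LLM" there is in this step: the statistical context is computed entirely upstream, which foreshadows the ablation result below.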
What holds up — and what doesn't
The false-positive reduction result is striking and real. Going from 942 to 12 false positives on the same data is the kind of operational gain that actually changes whether a tool gets used in practice. I believe the direction.
But I have serious reservations about the evaluation design.
First, the ground-truth labels on the synthetic dataset are themselves constructed from JET rules. The anomalies that were injected are exactly the kinds of patterns JETs were designed to catch. So "the LLM outperforms JET" on a JET-labeled test set may partly reflect the LLM learning to mimic the same rules from the contextual statistics in the prompt, not generalizing beyond them.
Second, the Isolation Forest ablation on real data is damning in a way the paper under-discusses. F1 drops from 0.83 to 0.24 without IF scores. This tells me the LLM is primarily functioning as a flexible threshold on top of the IF signal, not as an independent anomaly detector. The system is closer to an ML ensemble with a natural-language skin than a "foundation model doing audit reasoning."
Third, the evaluation uses only one real-world dataset, drawn from a single industry partner. The authors acknowledge this, but it means we cannot assess generalization across company size, accounting system, or industry.
Fourth, the paper compares against JETs and a single ML baseline (Isolation Forest). Autoencoder-based anomaly detection, XGBoost with engineered features, and simple logistic regression on IF scores are all absent. The space of what counts as "classical ML" here is narrow.
Fifth, the hallucination question is not addressed. The authors call the explanations a key contribution, but there is no evaluation of whether the prose justifications are factually correct or even consistent with the binary prediction they accompany.
Why this matters for finance AI
This is the closest existing paper to what Bean Labs is building. Beancount ledgers are double-entry bookkeeping systems. Every transaction is a set of posting lines. Anomaly detection over those lines — unusual accounts, out-of-range amounts, implausible date patterns — is an obvious first feature for an autonomous finance assistant.
The AuditCopilot result suggests that the right approach for Beancount audit is probably not "prompt an LLM with a raw transaction and ask if it's suspicious," but rather "compute a lightweight statistical context (account-level baselines, temporal distribution, Isolation Forest scores) and give the LLM that enriched context." The LLM's value is in synthesis and explanation, not in raw anomaly scoring.
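For a Beancount ledger, that enriched context could start from account-level baselines over posting amounts. A stdlib-only sketch with hypothetical data (in practice the postings would come from beancount's loader, and the context would feed an LLM prompt):

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (account, amount) posting lines from a Beancount ledger.
postings = [
    ("Expenses:Groceries", 82.10), ("Expenses:Groceries", 75.40),
    ("Expenses:Groceries", 91.00), ("Expenses:Groceries", 68.30),
    ("Expenses:Groceries", 1450.00),  # candidate anomaly
    ("Expenses:Rent", 2000.00), ("Expenses:Rent", 2000.00),
]

# Account-level baselines: the lightweight statistical context argued for above.
by_account = defaultdict(list)
for account, amount in postings:
    by_account[account].append(amount)

def enriched_context(account, amount):
    """Assemble per-posting context for an LLM prompt (illustrative format)."""
    xs = by_account[account]
    mu = mean(xs)
    sd = stdev(xs) if len(xs) > 1 else 0.0
    z = (amount - mu) / sd if sd else 0.0
    return (f"account={account} amount={amount:.2f} "
            f"account_mean={mu:.2f} account_std={sd:.2f} z={z:.2f}")

print(enriched_context("Expenses:Groceries", 1450.00))
```

The LLM then gets "this grocery posting is several standard deviations above its account baseline" rather than a bare number, which is exactly the synthesis-and-explanation role the AuditCopilot results support.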
The false-positive reduction is also directly relevant. A Beancount audit tool that surfaces 942 candidate anomalies per run will be ignored. One that surfaces 12 high-confidence candidates with explanations will be used. That is not a performance metric — it is a product metric.
The write-back safety concern I care most about is not addressed in this paper. AuditCopilot only reads and flags; it does not propose corrections or modify the ledger. That is the right scope for a first paper, but the hard problem for Bean Labs remains: once you have a flagged anomaly, how do you safely decide what to do about it?
What to read next
- Understanding Structured Financial Data with LLMs: A Case Study on Fraud Detection (arXiv:2512.13040, ACL 2026) — introduces FinFRE-RAG, which adds retrieval-augmented in-context examples to the same fraud detection problem and benchmarks across four public fraud datasets; directly addresses the single-dataset limitation of AuditCopilot.
- Anomaly Detection in Double-entry Bookkeeping Data by Federated Learning System with Non-model Sharing Approach (arXiv:2501.12723) — addresses the privacy constraint that prevents pooling ledger data across firms; the federated approach is likely necessary for any production Beancount audit service that wants to train on client data without centralizing it.
- GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning (arXiv:2406.09187) — the safety enforcement problem AuditCopilot deliberately sidesteps: once anomalies are flagged, how do you make sure a write-back agent does not commit changes that violate accounting invariants?
