
AnoLLM: Fine-Tuning LLMs for Tabular Anomaly Detection in Financial Data

· 6 min read
Mike Thrift
Marketing Manager

The zero-shot LLM anomaly detection paper I read two days ago (arXiv:2406.16308) showed that GPT-4 could identify tabular outliers without any training, matching classical baselines like ECOD on the ODDS benchmark. But it had an obvious weakness: asking the model to output a list of anomalous row indices is fragile — open-source models routinely hallucinate indices, go out of bounds, or flag every row as suspicious. AnoLLM, published at ICLR 2025 by Che-Ping Tsai, Ganyu Teng, Phillip Wallis, and Wei Ding from Amazon, fixes that fragility while also pushing into mixed-type datasets where pure numerical baselines start to struggle.

The paper

AnoLLM reframes tabular anomaly detection as language model density estimation rather than prompted classification. Instead of asking the LLM to name which rows look suspicious, the authors fine-tune a pre-trained language model on serialized in-distribution (normal) training rows, then score each test row by its negative log-likelihood under that learned distribution. A row that looks nothing like the training distribution gets a high NLL — that is the anomaly score. No index format, no output parsing, no fragile regex extraction.

The serialization converts each table row to a natural-language string of feature names and values. For text-valued columns, the NLL is normalized per column to avoid length bias: without normalization, a longer description would mechanically accumulate more token-level cost than a short field, regardless of how anomalous it actually is. For numerical and categorical columns, the raw token-level NLL is summed across the field. The model is fine-tuned in a semi-supervised setting (only rows labelled normal enter training) for up to 2,000 steps using distributed GPU training.
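The serialization step is simple enough to sketch in a few lines. The exact template below ("column is value", comma-joined) and the field names are my assumptions, not the paper's:

```python
def serialize_row(row: dict) -> str:
    """Render one table row as a feature-name/value string for the LM.

    The "<col> is <val>" template is illustrative; the paper's exact
    serialization format may differ.
    """
    return ", ".join(f"{col} is {val}" for col, val in row.items())

row = {"amount": 2400, "payee": "AWS", "account": "Assets:Checking"}
print(serialize_row(row))
# → amount is 2400, payee is AWS, account is Assets:Checking
```

The fine-tuned model then scores each test row by the negative log-likelihood of its serialized string; no structured output needs to be parsed back out.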

Key ideas

  • The output format problem: prior index-prediction approaches require LLMs to reliably output anomalous row indices from a batch. Llama-family models frequently pair wrong indices with values, generate indices beyond the batch size, or simply list everything as anomalous. NLL bypasses this entirely.
  • AnoLLM achieves best performance on six benchmark datasets with mixed feature types, including vehicle insurance fraud detection and e-commerce fraud datasets from Kaggle.
  • On the 30 predominantly numerical ODDS benchmark datasets, AnoLLM performs on par with top classical baselines — not clearly better, just competitive.
  • The NLL-per-column normalization for text features is a small but load-bearing engineering decision: without it, a transaction description with thirty tokens would dominate the score over a two-digit amount, which is the wrong inductive bias.
  • The training baseline context: the zero-shot GPT-4 approach (arXiv:2406.16308) achieves an average AUROC of 74.1 on ODDS, comparable to ECOD (75.5) and KNN (70.7). AnoLLM's advantage shows up specifically on datasets where text and categorical features carry meaningful anomaly signal.
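The normalization point above can be made concrete. Here, averaging over tokens stands in for "normalized per column" — that choice of normalizer is my assumption; the paper may use a different one:

```python
def row_score(col_token_nlls: dict[str, list[float]],
              text_cols: set[str]) -> float:
    """Combine per-token NLLs into one anomaly score per row.

    Text columns are averaged over their tokens so a 30-token narration
    cannot dominate a 2-token amount; other columns are summed as-is.
    (Averaging is one plausible reading of "normalized per column".)
    """
    score = 0.0
    for col, nlls in col_token_nlls.items():
        score += sum(nlls) / len(nlls) if col in text_cols else sum(nlls)
    return score

# A long but individually unsurprising description (0.5 NLL per token)
# no longer outweighs a genuinely surprising two-token amount.
nlls = {"description": [0.5] * 30, "amount": [4.0, 4.0]}
print(row_score(nlls, text_cols={"description"}))  # 0.5 + 8.0 = 8.5
```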

What holds up — and what doesn't

The core NLL idea is sound. Using a fine-tuned language model as a density estimator over serialized rows is principled, and it naturally handles the joint distribution of all columns simultaneously — something that classical unsupervised detectors applied column-by-column cannot do cleanly. The fix to index prediction is genuinely useful and the comparison to the zero-shot baseline is fair.

What bothers me is the cost-benefit gap the paper underreports. AnoLLM requires fine-tuning and then serving an LLM at inference time, a substantial infrastructure commitment compared to fitting ECOD or IsolationForest on a CPU in seconds. On the ODDS benchmark (purely numerical), AnoLLM is only "on par," not better. So the case for AnoLLM rests entirely on the mixed-type regime, where the six evaluated datasets are Kaggle fraud-detection sets. Six datasets is a thin empirical foundation for a strong recommendation, especially since Kaggle benchmarks tend to have clean schemas, fixed column semantics, and known ground truth, all things that production ledger data often lacks.
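For a sense of that cost asymmetry, the classical side really is a few lines on a CPU. A minimal scikit-learn sketch with synthetic data (`score_samples` returns higher values for more normal points):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(500, 2))   # "normal" training rows

# Fitting takes well under a second on a laptop CPU.
forest = IsolationForest(random_state=0).fit(X_train)

inlier = np.array([[0.0, 0.0]])
outlier = np.array([[8.0, 8.0]])
# Lower score_samples = more anomalous, so the outlier scores lower.
print(forest.score_samples(outlier), forest.score_samples(inlier))
```

AnoLLM has to beat this bar by enough on mixed-type data to justify GPUs for both training and inference.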

The column ordering problem is also left open. CausalTAD (arXiv:2602.07798) immediately identified this gap: AnoLLM serializes columns in arbitrary order, ignoring the causal relationships between fields. For structured data with known causal chains — account type influences valid transaction ranges, which influence expected counterparty — this is a real limitation. CausalTAD frames reordering as a linear ordering problem and reports consistent improvement over AnoLLM across 30+ datasets. That the gap existed and was findable so quickly suggests AnoLLM's serialization design wasn't fully thought through.

There is also a scale question the paper doesn't address: at what volume of normal training examples does fine-tuning an LLM become worth it over, say, a tabular deep learning model trained directly on the numerical features? For personal Beancount ledgers with a few thousand entries, the compute cost may easily dwarf any accuracy gain.

Why this matters for finance AI

Beancount ledger entries are exactly the kind of mixed-type data AnoLLM targets: amounts (numerical), account names (structured text), payee/narration (free text), tags (categorical), dates (structured). A single entry like 2024-03-15 * "AWS" "Cloud invoice" with a posting Assets:Checking -2400.00 USD encodes information across all these types simultaneously. Classical anomaly detectors struggle here because they need separate handling for each column type, and they lose the correlations between columns: the joint pattern that "AWS" invoices should fall in a certain range and hit a specific account.

AnoLLM's NLL approach would, in principle, learn these joint patterns from normal historical entries and flag deviations across any column combination. That is potentially more useful than rule-based JETs or single-column statistical tests.
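A sketch of what that scoring loop could look like over ledger entries. The `nll` function here is a toy stand-in for the fine-tuned model, and all field names are hypothetical:

```python
def serialize_entry(entry: dict) -> str:
    # Same name-value template as the row serialization above (illustrative).
    return ", ".join(f"{k} is {v}" for k, v in entry.items())

def rank_by_score(entries: list[dict], nll) -> list[dict]:
    """Sort ledger entries most-anomalous first by NLL under the model."""
    return sorted(entries, key=lambda e: nll(serialize_entry(e)), reverse=True)

entries = [
    {"date": "2024-03-15", "payee": "AWS",
     "account": "Assets:Checking", "amount": "-2400.00 USD"},
    {"date": "2024-03-16", "payee": "AWS",
     "account": "Expenses:Travel", "amount": "-2400.00 USD"},
]

# Toy scorer: pretend the model finds an AWS charge hitting a travel
# account surprising relative to the training distribution.
toy_nll = lambda s: 9.0 if "Expenses:Travel" in s else 2.0
print(rank_by_score(entries, toy_nll)[0]["account"])  # Expenses:Travel
```

In a real deployment the `nll` callable would run the fine-tuned LM over the serialized string; everything else stays this simple.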

That said, the double-entry accounting constraint is structural knowledge AnoLLM cannot learn from serialized rows alone — debits must equal credits, account hierarchies must be respected. These domain invariants are hard constraints, not statistical regularities, and no amount of LLM fine-tuning on historical rows will enforce them reliably if the training data contains any exceptions or rounding artifacts. The right architecture probably combines AnoLLM's NLL scoring for semantic anomalies with explicit rule checks for structural ones.
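The structural half of that hybrid is cheap to enforce outside the model. A minimal sketch of the balance invariant, assuming a single currency:

```python
from decimal import Decimal

def postings_balance(postings: list[tuple[str, str]]) -> bool:
    """Hard double-entry invariant: a transaction's amounts sum to zero.

    Decimal arithmetic avoids float rounding artifacts that could make a
    balanced transaction look off. Single-currency sketch only; real
    ledgers also need per-currency grouping and cost-basis handling.
    """
    return sum(Decimal(amount) for _account, amount in postings) == 0

txn = [("Assets:Checking", "-2400.00"), ("Expenses:Cloud", "2400.00")]
print(postings_balance(txn))  # True
```

A violation here is a hard error, not a high anomaly score; the NLL model never needs to see it.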

Further reading

  • CausalTAD (arXiv:2602.07798) — directly improves on AnoLLM by injecting causal column ordering; the most immediate follow-up to evaluate
  • AD-LLM: Benchmarking Large Language Models for Anomaly Detection (arXiv:2412.11142, ACL Findings 2025) — provides the systematic multi-paradigm evaluation missing from individual method papers
  • "Language Models are Realistic Tabular Data Generators" (Borisov et al., arXiv:2210.06280, ICLR 2023) — the GReaT model that AnoLLM uses as a baseline; understanding it clarifies what AnoLLM actually improves over beyond index-prediction