
Zero-Shot Anomaly Detection with LLMs: How GPT-4 Performs on Tabular Data

· 6 min read
Mike Thrift
Marketing Manager

The AuditCopilot paper I read last month benchmarked LLMs on journal-entry fraud detection by fine-tuning on labeled anomaly data. I've been curious ever since whether zero-shot prompting could get you most of the way there — no labeled anomalies required, no domain-specific fine-tuning. That's exactly the promise of "Anomaly Detection of Tabular Data Using LLMs" by Li, Zhao, Qiu, Kloft, Smyth, Rudolph, and Mandt (arXiv:2406.16308), a workshop paper from mid-2024. The headline result — GPT-4 matching classical transductive methods like ECOD — sounded almost too good, so I read it carefully.

The paper


The core idea is what the authors call "batch-level" anomaly detection. Instead of fitting a model on training data and then scoring test points individually, you present the LLM with a batch of N rows at inference time and ask it to identify which rows are anomalous relative to the others in the same batch. Anomalies are sparse within any batch, so a capable enough model should implicitly recognize the majority pattern and flag the outliers. No retraining, no labeled examples — just the LLM's pretrained world knowledge and in-context reasoning.

They evaluate on the 32-dataset ODDS benchmark, a standard collection of real-world tabular anomaly detection problems. Due to context window limits, they cap each evaluation batch at 150 rows and 10 columns. Features are serialized one dimension at a time with the template "Data i is x_i.", and the LLM is prompted to name the anomalous indices for each dimension separately; a row's final anomaly score is the number of dimensions that flagged it.
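To make the scheme concrete, here is a minimal sketch of the per-dimension serialization and score aggregation. The "Data i is x_i." template is from the paper; the function names and the aggregation details are my own reading of it, not the authors' code.

```python
# Sketch of per-dimension serialization + score aggregation (assumed names).
from collections import Counter

def serialize_dimension(values):
    """Render one feature column using the paper's textual template."""
    return "\n".join(f"Data {i} is {v}." for i, v in enumerate(values))

def aggregate_scores(flagged_per_dim, n_rows):
    """A row's anomaly score = number of dimensions that flagged it."""
    counts = Counter(i for flagged in flagged_per_dim for i in flagged)
    return [counts.get(i, 0) for i in range(n_rows)]

batch = [[1.0, 10.0], [1.1, 9.8], [9.9, 10.1]]   # 3 rows, 2 columns
columns = list(zip(*batch))
prompts = [serialize_dimension(col) for col in columns]  # one prompt per dim
# Suppose the LLM flagged row 2 on dimension 0 and nothing on dimension 1:
scores = aggregate_scores([{2}, set()], n_rows=3)  # → [0, 0, 1]
```

Each prompt covers one column of the batch, so a 10-column dataset means 10 LLM calls per batch under this scheme.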

For proprietary models, they test zero-shot. For open-source models (Llama2-7B, Llama2-70B, Mistral-7B), zero-shot performance is poor, so they also propose fine-tuning on a synthetic dataset of 5,000 batches generated from Gaussian mixtures and categorical distributions — no real anomaly labels required. The fine-tuned variants are called Llama2-AD and Mistral-AD.
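The synthetic fine-tuning data can be pictured as follows: sample "normal" rows from a Gaussian mixture and inject a few out-of-distribution rows whose indices become the training targets. This is a one-dimensional toy under my own assumptions about batch size, anomaly rate, and mixture shape; the paper's exact generation recipe differs in its details.

```python
# Hypothetical sketch of synthetic batch generation for fine-tuning.
# No real anomaly labels are needed: the injected indices ARE the labels.
import random

def make_synthetic_batch(n_rows=150, anomaly_rate=0.05, seed=0):
    rng = random.Random(seed)
    centers = [rng.uniform(-5, 5) for _ in range(2)]  # 2-component mixture
    rows, labels = [], []
    for i in range(n_rows):
        if rng.random() < anomaly_rate:
            rows.append(rng.uniform(20, 30))          # far from both modes
            labels.append(i)
        else:
            rows.append(rng.gauss(rng.choice(centers), 1.0))
    return rows, labels

rows, anomaly_indices = make_synthetic_batch()
```

Batches like these, serialized the same way as real data, give the model supervised examples of "name the odd rows out" without ever touching a real anomaly annotation.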

Key ideas

  • GPT-4 zero-shot achieves 74.1 mean AUROC across 32 ODDS datasets, compared to ECOD's 75.5 (the best classical baseline) and KNN's 70.7. GPT-3.5 lags at 68.3.
  • Llama2-7B zero-shot scores only 51.1 — essentially random — but fine-tuning on synthetic data brings it to 60.0, a gain of +8.9 points. Mistral-7B improves from 62.4 to 69.1 (+6.7 points).
  • The "batch-level" framing is the interesting conceptual move: the LLM acts as an implicit density estimator over the batch rather than a discriminator trained to separate classes.
  • Fine-tuning uses LoRA on synthetic Gaussian and categorical data only — no real anomaly annotations needed. That's a meaningful practical advantage if it generalizes.
  • Output parsing is brittle for open-source models; the authors enforce grammar constraints and use regex patterns to extract anomaly indices.
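The parsing step in the last bullet is roughly this shape. The actual grammar constraints and patterns aren't published in detail, so treat this as an illustrative assumption:

```python
# Assumed flavor of the index-extraction step: pull integers out of
# free-form model output and drop anything outside the valid batch range.
import re

def extract_indices(text, n_rows):
    candidates = (int(m) for m in re.findall(r"\b\d+\b", text))
    return sorted({i for i in candidates if 0 <= i < n_rows})

extract_indices("Anomalous rows: 3, 17, and 512.", n_rows=150)  # → [3, 17]
```

Even with range filtering, this regime fails silently when the model hallucinates plausible-looking indices, which is part of why likelihood-based scoring (as in AnoLLM) is attractive.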

What holds up — and what doesn't

The benchmark coverage is the biggest problem. The paper compares against only two classical baselines: KNN and ECOD. Isolation Forest, LOF, One-Class SVM, and any deep learning anomaly detection method are completely absent. ECOD happens to be a strong baseline on ODDS — but GPT-4 doesn't clearly beat it (74.1 vs 75.5), and neither does Mistral-AD (69.1). Against a broader set of baselines, it's not obvious GPT-4 would hold its position.

The 150 rows / 10 columns cap is also a serious constraint that the paper doesn't adequately address. Real accounting ledgers have thousands of transactions and many more features. Whether the batch-level approach scales — or whether it degrades because anomalies become harder to distinguish in larger batches with more diverse patterns — isn't tested.

The variance numbers are troubling. GPT-3.5 on the breastw dataset scores 63.1 ± 34.4 AUROC. That's not a method you can deploy when a single run can plausibly score anywhere from 30 to 98. GPT-4 is tighter (98.7 ± 0.5 on breastw) but shows similar variance on other datasets.

The feature independence assumption is another hole. The LLM queries each feature dimension separately and aggregates scores. It cannot reason about joint feature patterns — a transaction with an unusual combination of amount, counterparty, and account code might look normal on any individual dimension. Multi-dimensional anomalies, which are arguably the most common and economically significant kind in accounting, won't be caught by this approach without significant redesign.
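A toy example makes the failure mode concrete: an outlier whose every coordinate sits inside its marginal range, but whose combination breaks the relationship between columns. The data here are synthetic and purely illustrative.

```python
# Why per-dimension scoring misses joint anomalies: each coordinate of
# the outlier is marginally normal, but the pair violates fee ≈ amount/100.
import statistics

normal = [(a, a / 100) for a in range(100, 200, 10)]  # correlated columns
outlier = (190, 1.0)  # amount in range, fee in range, pair is impossible

amounts = [r[0] for r in normal]
fees = [r[1] for r in normal]
# Marginal checks (what per-dimension prompting can see): both pass.
in_range = (min(amounts) <= outlier[0] <= max(amounts)
            and min(fees) <= outlier[1] <= max(fees))
# Joint check (what it cannot see): the residual is far from typical.
residual = abs(outlier[1] - outlier[0] / 100)          # 0.9
typical_residual = statistics.mean(abs(f - a / 100) for a, f in normal)  # 0.0
```

Any dimension-independent scorer, LLM or not, assigns this row a score of zero on both columns.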

The follow-up literature confirms these worries. AnoLLM (ICLR 2025) from Amazon Science takes a different approach: instead of prompting for anomaly indices, it fine-tunes an LLM to model the data distribution and uses negative log-likelihood as the anomaly score, avoiding the brittle output-parsing regime entirely. CausalTAD (arXiv:2602.07798, February 2026) identifies another gap shared by this paper and AnoLLM: column ordering during serialization is random, ignoring causal relationships between features. Reordering columns to respect causal structure improves average AUC-ROC from ~0.80 to 0.83 on six benchmarks.

Why this matters for finance AI

Despite its limitations, the zero-shot direction is genuinely interesting for Beancount ledger anomaly detection. The AuditCopilot paper required fine-tuning on labeled anomaly examples — hard to obtain in practice because real fraud cases are rare, sensitive, and labeling them requires expert accountants. The paper's synthetic fine-tuning approach (Llama2-AD, Mistral-AD) sidesteps this: you generate realistic-looking transaction batches with artificial anomalies and fine-tune without ever touching a real ledger.

The batch-level mechanism maps naturally to how accountants actually think: "in this month's transactions, which entries look unusual relative to the rest?" That's the intuition behind journal entry testing in auditing. The challenge is that real ledger anomalies are multi-dimensional — a payment that's normal in amount but unusual in timing, counterparty, and account code combination. Querying each feature independently, as this paper does, won't catch those.

What I want to see is a version of this approach where the full row is embedded and scored holistically — closer to what AnoLLM does with distribution modeling — applied to a realistic sample of Beancount transaction data. The synthetic fine-tuning idea deserves serious exploration; generating synthetic Beancount ledger batches with injected anomalies (wrong accounts, duplicated entries, implausible amounts) is straightforward, and fine-tuning a 7B model on those could produce a useful zero-shot auditor without requiring any real labeled data.
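The injection step is genuinely simple to prototype. Here is a hedged sketch with toy tuples standing in for Beancount postings; the account names, fault types, and rates are invented for illustration, not drawn from any real ledger or from the paper.

```python
# Sketch: corrupt a few clean ledger rows with the fault types named above
# (wrong account, duplicated entry, implausible amount) and record which
# output rows are anomalous. All specifics here are my own assumptions.
import random

def inject_anomalies(entries, rate=0.05, seed=1):
    rng = random.Random(seed)
    corrupted, labels = [], []
    for date, account, amount in entries:
        if rng.random() < rate:
            fault = rng.choice(["wrong_account", "duplicate", "big_amount"])
            if fault == "wrong_account":
                account = "Expenses:Misc"            # implausible account
            elif fault == "big_amount":
                amount = round(amount * 100, 2)      # implausible magnitude
            else:                                    # duplicate the entry
                corrupted.append((date, account, amount))
            labels.append(len(corrupted))            # index of anomalous row
        corrupted.append((date, account, amount))
    return corrupted, labels

entries = [(f"2024-01-{d:02d}", "Expenses:Food", 12.5) for d in range(1, 21)]
corrupted, labels = inject_anomalies(entries, rate=0.3, seed=1)
```

Pairs of (corrupted batch, anomaly indices) generated this way are exactly the supervision format the paper's synthetic fine-tuning uses, with no real labeled fraud in sight.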

  • AnoLLM: Large Language Models for Tabular Anomaly Detection — ICLR 2025, OpenReview ID 7VkHffT5X2; the most direct extension of this work, using likelihood-based scoring instead of prompted index prediction
  • CausalTAD: Injecting Causal Knowledge into Large Language Models for Tabular Anomaly Detection — arXiv:2602.07798; addresses the column-ordering gap by aligning serialization to causal structure
  • AD-LLM: Benchmarking Large Language Models for Anomaly Detection — arXiv:2412.11142, ACL Findings 2025; a broader benchmark covering NLP anomaly detection tasks, useful for understanding where LLMs are already reliable vs. unreliable as anomaly detectors