
LLM Anomaly Detection Survey (NAACL 2025): Strong Taxonomy, Absent Tabular Coverage

· 5 min read
Mike Thrift
Marketing Manager

The previous three entries in this thread covered AnoLLM, CausalTAD, and AD-LLM — each targeting tabular anomaly detection specifically. This survey by Ruiyao Xu and Kaize Ding, accepted to NAACL 2025 Findings, is supposed to tie those threads together into a unified landscape map. I expected a taxonomy that would clarify the design space; what I got is mostly a survey of image and video anomaly detection with a thin veneer of generality.

The paper


Xu and Ding's survey (arXiv:2409.01980) proposes organizing LLM-based anomaly and out-of-distribution (OOD) detection into two high-level classes: LLMs for Detection, where the model directly identifies anomalies, and LLMs for Generation, where the model augments training data or produces natural-language explanations that feed a downstream detector. Each class subdivides further. Detection splits into prompting-based methods (frozen or tuned LLMs queried with natural-language prompts) and contrasting-based methods (CLIP-family models that score anomalousness by comparing image patches to text descriptions). Generation splits into augmentation-centric methods (generating pseudo-OOD labels or synthetic minority samples) and explanation-centric methods (producing natural-language rationales for flagged events).
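The two-level taxonomy can be sketched as a small data structure. The category names paraphrase the survey; the helper function and its parameters are my own illustrative framing of the design-space fork, not anything the paper defines.

```python
# Hypothetical encoding of the survey's two-level taxonomy.
# Names paraphrase Xu and Ding's categories; structure is illustrative only.
TAXONOMY = {
    "LLMs for Detection": {
        "prompting-based": "frozen or tuned LLMs queried with natural-language prompts",
        "contrasting-based": "CLIP-family models scoring image patches against text descriptions",
    },
    "LLMs for Generation": {
        "augmentation-centric": "pseudo-OOD labels or synthetic minority samples for a downstream detector",
        "explanation-centric": "natural-language rationales for flagged events",
    },
}

def subcategory(llm_detects_directly: bool, uses_text_image_contrast: bool = False) -> str:
    """Place a hypothetical method in the survey's taxonomy."""
    if llm_detects_directly:
        return "contrasting-based" if uses_text_image_contrast else "prompting-based"
    return "LLMs for Generation (augmentation- or explanation-centric)"
```

The top-level split is the architectural fork discussed below: either the LLM is the detector, or it manufactures better training signal for one.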

The accompanying GitHub reading list covers 39 papers: 24 in detection, 10 in augmentation, and 5 in explanation.

Key ideas

  • Contrasting-based methods dominate image anomaly detection. WinCLIP achieves 91.8% AUROC on zero-shot anomaly classification and 85.1% on zero-shot segmentation on MVTec-AD without any dataset-specific tuning, competitive with supervised methods trained on that dataset.
  • Frozen LLMs hit a modality gap for non-text data. The survey explicitly notes that "directly prompting frozen LLMs for anomaly or OOD detection results across various data types often yields suboptimal performance due to inherent modality gap between text and other data modalities."
  • LoRA and adapter tuning recover much of that gap. Methods like AnomalyGPT and AnomalyCLIP fine-tune with parameter-efficient techniques and substantially outperform their frozen counterparts.
  • Generation as augmentation is underutilized. BLIP-2-generated caption-level pseudo-OOD labels outperform word-level and description-level alternatives in OOD detection, suggesting that richer text supervision matters even for visual tasks.
  • Explanation-centric generation is the newest subcategory. Systems like Holmes-VAD and VAD-LLaMA go beyond binary flags to generate natural-language rationales for anomalous events, mostly in surveillance video.
  • Tabular data is nearly absent. The survey cites one method — "Tabular" by Li et al. (2024) — that converts tabular rows into text prompts and fine-tunes with LoRA, but provides no comparative numbers.
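The contrasting-based scoring idea behind WinCLIP-style methods can be sketched without real CLIP weights: embed an image patch and two text prompts ("normal" vs "anomalous"), then softmax over cosine similarities. Everything here (the toy embeddings, the temperature, the two-prompt setup) is a minimal stand-in for intuition, not the published implementation.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def anomaly_score(patch_emb, normal_text_emb, anomalous_text_emb, temperature=0.07):
    """Softmax probability that the patch matches the 'anomalous' prompt.
    Real methods like WinCLIP use learned CLIP encoders and ensembles of
    prompt templates; this only mirrors the contrastive scoring step."""
    sims = np.array([
        cosine(patch_emb, normal_text_emb),
        cosine(patch_emb, anomalous_text_emb),
    ]) / temperature
    probs = np.exp(sims - sims.max())
    probs /= probs.sum()
    return float(probs[1])  # mass on the anomalous prompt

# Toy embeddings: a patch closer to the 'anomalous' direction scores higher.
rng = np.random.default_rng(0)
normal_t = rng.normal(size=16)
anomalous_t = rng.normal(size=16)
defect_patch = 0.2 * normal_t + 0.8 * anomalous_t
clean_patch = 0.9 * normal_t + 0.1 * anomalous_t
```

Scoring every patch this way and max-pooling yields an image-level score; keeping the per-patch map is what gives segmentation for free, which is why the classification and segmentation numbers above come from the same mechanism.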

What holds up — and what doesn't

The two-class taxonomy is genuinely clean and I'll probably use it to organize my own thinking. The detection-vs-generation distinction captures a real architectural fork: you either ask the LLM to classify directly or you use it to build better training signal for a traditional detector.

What I can't accept is the paper's framing as a survey of anomaly detection broadly. The coverage is overwhelmingly concentrated on industrial defect images (MVTec-AD, VisA) and surveillance video (UCF-Crime, XD-Violence). Of the 39 papers catalogued, almost none address tabular or financial data. Time series gets a few citations. Tabular gets one sentence. This is not a landscape map for Bean Labs — it is a landscape map for computer vision researchers who want to use CLIP for defect detection.

The authors acknowledge "space constraints prevent detailed metric summaries," which is a polite way of saying there are no comparison tables. For a survey paper, the absence of quantitative synthesis is a significant gap. Readers cannot use this paper to decide which paradigm is better for their use case without tracking down each cited paper individually.

The hallucination challenge is listed as an open problem, but the treatment is shallow — it names the risk without analyzing which detection paradigms are more or less susceptible, or how explanation-centric generation might make hallucinations more detectable through human review.

Why this matters for finance AI

Two subcategories are relevant despite the image-heavy coverage. First, the explanation-centric generation subcategory is exactly what Beancount audit agents need: not just a flag that a journal entry is anomalous, but a natural-language sentence explaining why. Financial auditors cannot act on a binary output. Second, the survey's near-total silence on tabular anomaly detection is itself informative — it confirms that the AnoLLM, CausalTAD, and AD-LLM thread I've been following is a frontier area rather than a well-trodden one, and that designing LLM-based audit tools for Beancount ledgers requires synthesizing insights from vision anomaly detection that have not yet been ported to tabular settings.

The prompting-vs-tuning trade-off is the most actionable finding: zero-shot prompting works as a first approximation but suffers from the modality gap; LoRA-based fine-tuning on representative labeled examples closes the gap. For a Beancount deployment with labeled anomaly examples from historical ledgers, the fine-tuning path appears more reliable than pure prompting.
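The row-to-text serialization step, which both the cited tabular method and any Beancount audit prompt would need, can be sketched as follows. The field names, prompt template, and account naming are my own assumptions for illustration, not taken from any cited paper.

```python
def row_to_prompt(row: dict) -> str:
    """Serialize a ledger row into a natural-language anomaly-check prompt.
    The template is illustrative; the survey's cited tabular method
    fine-tunes an LLM (with LoRA) on serializations of this general kind."""
    fields = "; ".join(f"{k} = {v}" for k, v in row.items())
    return (
        "Transaction record: " + fields + ". "
        "Is this transaction anomalous relative to this account's history? "
        "Answer 'normal' or 'anomalous' and give a one-sentence reason."
    )

# Hypothetical Beancount-style journal entry rendered as a prompt.
entry = {
    "date": "2025-03-14",
    "account": "Expenses:Office:Supplies",
    "amount": "4920.00 USD",
    "payee": "ACME Corp",
}
prompt = row_to_prompt(entry)
```

Zero-shot, this prompt goes straight to a frozen model; the fine-tuning path pairs the same serializations with labeled normal/anomalous answers from historical ledgers and trains LoRA adapters on them. The reason clause in the answer format is what connects this to the explanation-centric subcategory above.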

Further reading

  • "Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs" (arXiv:2406.03614) — uses LLM sentence-transformer embeddings on real general ledger journal entries; a direct bridge from this survey's framework to the Beancount tabular use case.
  • "Enhancing Anomaly Detection in Financial Markets with an LLM-based Multi-Agent Framework" (arXiv:2403.19735) — multi-agent pipeline for market data anomaly detection; the multi-agent coordination pattern may carry over to ledger audit.
  • AnomalyGPT (arXiv:2308.15366) — fine-tuned LVLM for industrial anomaly detection with pixel-level localization; reading this clarifies what "LLM tuning for detection" actually means architecturally, which the survey describes but does not explain.