AD-LLM Benchmark: GPT-4o Hits 0.93+ AUROC Zero-Shot for Text Anomaly Detection
AD-LLM benchmarks GPT-4o and Llama 3.1 8B in three anomaly detection roles (zero-shot detector, data augmenter, and model selector) on five NLP datasets. GPT-4o reaches AUROC 0.93–0.99 zero-shot, but LLM-based model selection remains unreliable, a finding with direct implications for financial audit AI.
