A critical reading of Xu and Ding's NAACL 2025 survey on LLM-based anomaly and OOD detection: the detection-vs-generation taxonomy holds up, but the near-total absence of tabular coverage means financial-AI practitioners must synthesize insights from vision models themselves.
Fin-RATE benchmarks 17 LLMs on 7,500 expert-curated QA pairs from 2,472 SEC filings, revealing an 18.60% accuracy collapse under longitudinal tracking and a 54-point drop for finance-specialized Fin-R1 on cross-entity tasks — with the retrieval pipeline, not the backbone model, as the binding bottleneck.
The TACL 2024 paper by Liu et al. shows LLMs perform up to 20 points worse on information buried in the middle of long contexts — a U-shaped degradation affecting every tested model including Claude-1.3-100K — with concrete implications for how RAG pipelines should order retrieved passages in finance and accounting applications.
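The U-shaped recall curve suggests a concrete mitigation for RAG pipelines: place the strongest passages at the context edges rather than in rank order. A minimal sketch, assuming a hypothetical retriever that returns passages sorted best-first (the function name and example strings are illustrative, not from the paper):

```python
def reorder_for_lost_in_the_middle(passages):
    """Alternate best-ranked passages between the front and the back
    of the context window, so the weakest passages land in the middle,
    where LLM recall is worst. `passages` is assumed sorted best-first.
    """
    front, back = [], []
    for i, passage in enumerate(passages):
        (front if i % 2 == 0 else back).append(passage)
    # reverse the back half so quality rises again toward the end
    return front + back[::-1]

# ranked best-to-worst by a hypothetical retriever
ranked = ["p1", "p2", "p3", "p4", "p5"]
print(reorder_for_lost_in_the_middle(ranked))
# ['p1', 'p3', 'p5', 'p4', 'p2'] -- worst passage sits in the middle
```

The design choice mirrors the paper's finding directly: accuracy is highest for information at the start and end of the prompt, so the two best passages occupy exactly those two slots.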
AD-LLM benchmarks GPT-4o and Llama 3.1 8B across three anomaly detection roles — zero-shot detector, data augmenter, and model selector — on five NLP datasets; GPT-4o reaches AUROC 0.93–0.99 zero-shot, but LLM-based model selection remains unreliable, with direct implications for financial audit AI.
τ-bench shows that top LLMs like Claude 3.5 Sonnet drop from pass^1 of 0.692 to pass^4 of 0.462 on retail customer-service tasks — a consistency cliff with direct implications for any write-back agent operating on a Beancount ledger.
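The pass^k metric behind that cliff asks whether *all* k independent trials of a task succeed, which is why it falls as k grows (unlike pass@k, which asks for at least one success). A minimal sketch of the standard unbiased estimator, with a hypothetical task's trial outcomes as input:

```python
from math import comb

def pass_hat_k(trial_results, k):
    """Unbiased pass^k estimate for a single task: the fraction of
    size-k subsets of the n recorded trials in which every trial
    succeeded, i.e. C(c, k) / C(n, k) with c successful trials.
    """
    n, c = len(trial_results), sum(trial_results)
    if k > n:
        raise ValueError("k cannot exceed the number of trials")
    return comb(c, k) / comb(n, k)

# hypothetical task: 3 successes out of 4 trials
trials = [True, True, False, True]
print(pass_hat_k(trials, 1))  # 0.75
print(pass_hat_k(trials, 4))  # 0.0 -- one failure zeroes out pass^4
```

A single flaky trial is enough to send pass^4 to zero for that task, which is exactly the reliability property a ledger-writing agent is held to.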
ConvFinQA (EMNLP 2022) extends FinQA into multi-turn conversation over S&P 500 earnings reports, finding that the best fine-tuned model achieves 68.9% execution accuracy versus 89.4% for human experts—and drops to 52.4% on hybrid multi-aspect conversations where models must carry numerical context across different financial topics.
FinanceBench evaluates 16 AI configurations against 10,231 questions from real SEC filings; shared-vector-store RAG answers correctly only 19% of the time, and even GPT-4-Turbo with the oracle passage reaches just 85% accuracy — showing that numerical reasoning, not retrieval, is the binding constraint for enterprise finance AI.
Self-consistency replaces greedy chain-of-thought decoding with majority voting over N sampled reasoning paths, boosting GPT-3's accuracy on GSM8K by 17.9 percentage points without additional training, and it applies directly to multi-step financial calculations where a model's single greedy decode is unreliable.
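The voting step itself is simple to sketch. Below, `sample_fn` is a hypothetical stand-in for a temperature-sampled LLM call and `extract_answer` reduces a chain-of-thought string to its final answer; the toy completions are illustrative, not from the paper:

```python
from collections import Counter

def self_consistency(sample_fn, extract_answer, n=20):
    """Self-consistency sketch: draw n chain-of-thought completions
    (sample_fn stands in for an LLM sampled at temperature > 0),
    reduce each to its final answer, and return the majority answer.
    """
    answers = [extract_answer(sample_fn()) for _ in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# toy deterministic stand-in for five sampled reasoning paths
samples = iter(["answer: 42", "answer: 41", "answer: 42",
                "answer: 42", "answer: 43"])
fake_llm = lambda: next(samples)
final = lambda text: text.split(": ")[-1]
print(self_consistency(fake_llm, final, n=5))  # '42'
```

The majority vote marginalizes over divergent intermediate reasoning: two of the five paths went astray, but the answer they disagree on loses the vote, which is the failure mode that matters most in multi-step financial arithmetic.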