AD-LLM benchmarks GPT-4o and Llama 3.1 8B across three anomaly detection roles — zero-shot detector, data augmenter, and model selector — on five NLP datasets; GPT-4o reaches AUROC 0.93–0.99 zero-shot, but LLM-based model selection remains unreliable, with direct implications for financial audit AI.
CausalTAD improves LLM-based tabular anomaly detection by reordering table columns to respect causal dependencies before serialization, lifting average AUC-ROC over the AnoLLM baseline from 0.803 to 0.834 on mixed-type benchmarks — with direct implications for detecting anomalies in structured ledger data.
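The reordering step can be sketched as a topological sort over a causal graph. The graph below is hand-specified and hypothetical (CausalTAD derives the ordering from the data), and the "column is value" serialization is a generic template, not necessarily the paper's exact one:

```python
from graphlib import TopologicalSorter

def causal_column_order(parents):
    """Topologically sort columns so every column follows its causal parents."""
    return list(TopologicalSorter(parents).static_order())

def serialize_row(row, order):
    """Serialize one row in causal order before handing it to the LLM."""
    return ", ".join(f"{col} is {row[col]}" for col in order)

# Hand-specified causal graph (hypothetical): the account determines the
# currency, and both constrain the amount.
parents = {"account": [], "currency": ["account"], "amount": ["account", "currency"]}
order = causal_column_order(parents)
row = {"amount": "42.00", "currency": "USD", "account": "Expenses:Food"}
print(serialize_row(row, order))  # account is Expenses:Food, currency is USD, amount is 42.00
```

The point of the sort is that by the time the model scores `amount`, it has already conditioned on `account` and `currency`, matching the causal direction.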
AnoLLM (ICLR 2025) reformulates tabular anomaly detection as LLM density estimation — fine-tuning on normal rows and scoring by negative log-likelihood. It outperforms classical methods on mixed-type fraud datasets but offers no edge on purely numerical data, with practical implications for detecting anomalies in Beancount ledger entries.
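The scoring recipe is easy to sketch. Below, a toy character-bigram density model stands in for the fine-tuned LLM (a deliberate simplification); the anomaly score is the same quantity AnoLLM uses, the average negative log-likelihood of the serialized row:

```python
import math
from collections import Counter

class BigramScorer:
    """Toy stand-in for AnoLLM's fine-tuned LLM: a smoothed character-bigram
    density model fit on serialized normal rows."""
    def __init__(self, texts, alpha=1.0):
        self.alpha = alpha
        self.bigrams = Counter()
        self.unigrams = Counter()
        self.vocab = set()
        for t in texts:
            padded = "^" + t  # "^" marks row start
            for a, b in zip(padded, padded[1:]):
                self.bigrams[(a, b)] += 1
                self.unigrams[a] += 1
                self.vocab.update((a, b))

    def score(self, text):
        """Average negative log-likelihood of the row; higher = more anomalous."""
        padded = "^" + text
        V = len(self.vocab)
        nll = 0.0
        for a, b in zip(padded, padded[1:]):
            p = (self.bigrams[(a, b)] + self.alpha) / (self.unigrams[a] + self.alpha * V)
            nll -= math.log(p)
        return nll / max(len(text), 1)

# "Fine-tune" on serialized normal rows, then score unseen rows.
normal_rows = ["amount is 12.50, currency is USD", "amount is 9.99, currency is USD"] * 5
scorer = BigramScorer(normal_rows)
typical = scorer.score("amount is 11.00, currency is USD")
weird = scorer.score("zzqx!! ~~##")
print(typical < weird)  # in-distribution rows get lower NLL than garbage
```

Swap the bigram table for a causal LLM's token log-probabilities and this is the whole method: the model never sees anomalies during fine-tuning, so anomalous rows simply look improbable.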
TableMaster is a prompting-only pipeline that reaches 78.13% on WikiTQ with GPT-4o-mini—13 points above Chain-of-Table—by combining table-of-focus extraction, semantic verbalization, and adaptive switching between text and symbolic reasoning. Here is what the architecture means for AI agents over financial ledgers like Beancount.
GPT-4 achieves 74.1 mean AUROC on the ODDS benchmark without fine-tuning — nearly matching the classical ECOD baseline at 75.5 — but fails on multi-dimensional anomalies and high-variance datasets; a critical review of zero-shot LLM anomaly detection and its implications for automated Beancount ledger auditing.
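For readers unfamiliar with the metric behind these numbers: AUROC is the probability that a randomly chosen anomaly is scored above a randomly chosen normal point. A minimal sketch, with made-up scores:

```python
def auroc(anomaly_scores, normal_scores):
    """Mann-Whitney formulation of AUROC: the probability that a random
    anomaly outscores a random normal point, counting ties as half."""
    wins = 0.0
    for a in anomaly_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(anomaly_scores) * len(normal_scores))

# Made-up scores: higher = flagged as more anomalous.
anomalies = [0.9, 0.8, 0.4]
normals = [0.3, 0.5, 0.2, 0.1]
print(round(auroc(anomalies, normals), 3))  # 0.917
```

On this scale, GPT-4's 74.1 versus ECOD's 75.5 means both rank a random anomaly above a random normal point roughly three times out of four, while 50 would be chance.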
DocFinQA replaces FinQA's curated 700-word passages with full 123,000-word SEC filings, exposing a 175× context increase that nearly halves GPT-4 accuracy on long documents. Retrieval pipelines fail to surface the right chunk 45% of the time at HR@3 — and long-context models are not a substitute.
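HR@3 (hit rate at rank 3) is worth making concrete: it is the fraction of questions whose gold evidence chunk appears in the retriever's top three results. A minimal sketch with made-up chunk IDs:

```python
def hit_rate_at_k(retrieved, gold, k):
    """Fraction of queries whose gold chunk appears in the top-k retrieval."""
    hits = sum(1 for ranked, g in zip(retrieved, gold) if g in ranked[:k])
    return hits / len(gold)

# Hypothetical ranked chunk IDs for four queries; the gold chunk lands in
# the top 3 for only two of them.
retrieved = [["c1", "c7", "c3"], ["c2", "c4", "c9"], ["c5", "c1", "c8"], ["c6", "c2", "c3"]]
gold = ["c3", "c9", "c7", "c4"]
print(hit_rate_at_k(retrieved, gold, k=3))  # 0.5
```

A 45% miss rate at HR@3 means the reader model never even sees the relevant passage for nearly half the questions — no amount of downstream reasoning recovers from that.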
GAIA benchmarks 466 real-world tasks across three difficulty levels; frontier agents reached 74.55% in mid-2026 versus 92% for humans, and the remaining Level 3 gap maps directly to the multi-step coordination challenges in automated Beancount ledger workflows.
OSWorld (NeurIPS 2024) benchmarks multimodal AI agents on 369 real desktop tasks across Ubuntu, Windows, and macOS — finding a 60-percentage-point gap between the best model (12.24%) and human performance (72.36%), with 75% of failures traced to visuomotor grounding errors rather than reasoning failures.
Chain-of-Table (ICLR 2024) improves LLM tabular reasoning by evolving the table itself as the intermediate state — achieving 67.31% on WikiTQ vs. 61.48% for prior baselines, with a +10.25 point advantage on tables exceeding 4,000 tokens and direct applicability to Beancount ledger query agents.
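The core idea, the table itself as the chain of thought, can be sketched with two of the paper's atomic operations (names follow the paper's f_select_row and f_select_column); the chain here is hard-coded where Chain-of-Table has an LLM plan it step by step:

```python
Table = list[dict]

def f_select_row(table: Table, keep) -> Table:
    """Keep only rows satisfying the predicate."""
    return [row for row in table if keep(row)]

def f_select_column(table: Table, cols: list[str]) -> Table:
    """Project the table onto the given columns."""
    return [{c: row[c] for c in cols} for row in table]

# Hypothetical ledger-like table.
table = [
    {"account": "Expenses:Food", "amount": 12.5, "payee": "Cafe"},
    {"account": "Expenses:Rent", "amount": 900.0, "payee": "Landlord"},
    {"account": "Expenses:Food", "amount": 8.0, "payee": "Bakery"},
]

# Planned chain for "total food spending": each step rewrites the table,
# so the shrinking intermediate tables are the reasoning trace.
state = f_select_row(table, lambda r: r["account"] == "Expenses:Food")
state = f_select_column(state, ["amount"])
total = sum(r["amount"] for r in state)
print(total)  # 20.5
```

Because each operation yields a smaller, more focused table, the approach degrades gracefully on long inputs — which is where the +10.25 point gain on 4,000-token-plus tables comes from.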
TAPAS (Google Research, ACL 2020) answers table questions by selecting cells and applying scalar aggregations — no SQL generated. This post analyzes the architecture, its 12-point SQA accuracy gain, and why the cell-selection paradigm fits small Beancount ledger queries but breaks down at scale.
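The cell-selection paradigm can be sketched in a few lines. In TAPAS the selected cell coordinates and the aggregation op come from the model's output heads; here they are supplied by hand, with the op set mirroring the paper's NONE / SUM / COUNT / AVERAGE:

```python
# Aggregation ops mirroring TAPAS's NONE / SUM / COUNT / AVERAGE heads.
AGGS = {
    "NONE": lambda vals: vals,
    "SUM": lambda vals: sum(vals),
    "COUNT": lambda vals: len(vals),
    "AVERAGE": lambda vals: sum(vals) / len(vals),
}

def tapas_answer(table, cells, op):
    """Compute an answer from selected (row, col) cells and one aggregation
    op; no SQL is generated at any point."""
    vals = [table[r][c] for r, c in cells]
    return AGGS[op](vals)

# Hypothetical one-column table of ledger amounts; a real model's heads
# would supply `cells` and `op`.
amounts = [[12.5], [900.0], [8.0]]
print(tapas_answer(amounts, [(0, 0), (2, 0)], "SUM"))  # 20.5
```

This also makes the failure mode obvious: anything not expressible as "pick cells, apply one scalar op" — joins, nested conditions, multi-step arithmetic — is out of reach, which is exactly where the paradigm breaks down at scale.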