ConvFinQA (EMNLP 2022) extends FinQA into multi-turn conversation over S&P 500 earnings reports, finding that the best fine-tuned model achieves 68.9% execution accuracy versus 89.4% for human experts—and drops to 52.4% on hybrid multi-aspect conversations where models must carry numerical context across different financial topics.
TAT-QA is a 16,552-question benchmark over hybrid table-plus-text financial report contexts that showed evidence grounding — not arithmetic — is the core bottleneck in finance AI; by 2024, fine-tuned 7B LLMs reached 83% F1, closing most of the gap against a 91% human ceiling.
FinanceBench evaluates 16 AI configurations against 10,231 questions from real SEC filings; shared-vector-store RAG answers correctly only 19% of the time, and even GPT-4-Turbo with the oracle passage reaches just 85% accuracy — showing that numerical reasoning, not retrieval, is the binding constraint for enterprise finance AI.
Self-consistency replaces greedy chain-of-thought decoding with majority voting over N sampled reasoning paths — raising GPT-3's accuracy on GSM8K by 17.9 percentage points with no additional training — and applies directly to multi-step financial calculations where a model's single decode is unreliable.
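The voting step is simple to sketch. A minimal, model-agnostic illustration, where `sample_fn` is a hypothetical callable standing in for one temperature-sampled model run that returns a final answer string:

```python
from collections import Counter

def self_consistency(prompt, sample_fn, n=20):
    """Sample n reasoning paths and return the majority-vote answer
    plus its vote share. sample_fn is a stand-in for one stochastic
    model call that extracts the final numeric answer from its path."""
    answers = [sample_fn(prompt) for _ in range(n)]
    (best, count), = Counter(answers).most_common(1)
    return best, count / n

# Deterministic toy stand-in for a sampled model: 8 of 10 reasoning
# paths converge on the same figure, two go astray (values invented).
paths = ["4,275.00"] * 8 + ["4,725.00", "427.50"]
it = iter(paths)
answer, share = self_consistency("Q3 net income?", lambda p: next(it), n=10)
# answer == "4,275.00", share == 0.8
```

The vote share doubles as a cheap confidence signal: a low share flags exactly the multi-step calculations where a single decode would have been untrustworthy.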
PAL (Program-Aided Language Models) achieves a +38pp accuracy gain over chain-of-thought on arithmetic-heavy tasks by delegating computation to a Python interpreter — a directly applicable architecture for reliable Beancount ledger queries and finance AI.
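The core PAL move is that the model writes code and never performs arithmetic in text; the interpreter does the math. A minimal sketch of the execution side, with a hard-coded stand-in for what a PAL-style model might emit (the account names and amounts are invented for illustration):

```python
def run_pal(program: str) -> float:
    """Execute a model-generated program in a restricted namespace and
    return its `answer` variable. In PAL the LLM emits the program; here
    it is hard-coded for illustration."""
    env = {}
    exec(program, {"__builtins__": {}}, env)  # no builtins: arithmetic only
    return env["answer"]

# Hypothetical model output for "total Q1 expenses across three
# Beancount expense accounts":
generated = """
rent = 2400.00
payroll = 18750.00
software = 640.00
answer = rent + payroll + software
"""
total = run_pal(generated)  # 21790.0
```

The division of labor is the point: the model only has to get the plan right, and the interpreter guarantees the sum, which is why the approach transfers cleanly to ledger queries.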
Four 2024–2025 benchmarks show GPT-4 scoring 42% on real-world table QA versus 86% for humans, with complex aggregations collapsing to 19.6%—and Beancount's native syntax sits at the worst-performing end of the serialization hierarchy for LLM input.
A close reading of Wei et al.'s 2022 Chain-of-Thought paper and what it means for finance AI — why CoT raises precision but may cut recall on rare-event detection, why the scale threshold matters for production agents, and what a finance team building on LLMs should watch out for.
PHANTOM (NeurIPS 2025) is the first benchmark to measure LLM hallucination detection on real SEC filings across context lengths up to 30,000 tokens. Qwen3-30B-A3B-Thinking leads with F1=0.882; 7B models score near random guessing — with direct implications for autonomous accounting agents.
A close reading of Toolformer (Meta AI, NeurIPS 2023): how perplexity-filtered self-supervised training teaches a 6.7B-parameter model to call external APIs, where it outperforms GPT-3 175B on arithmetic benchmarks, and why its single-step architecture cannot support the chained tool calls required for structured ledger operations.
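Toolformer's filtering criterion can be stated in a few lines. A simplified sketch of the rule (the loss values below are invented; the real method computes weighted cross-entropy over the tokens following the candidate call site):

```python
def keep_api_call(loss_with_result: float,
                  loss_call_no_result: float,
                  loss_no_call: float,
                  tau: float = 1.0) -> bool:
    """Toolformer's self-supervised filter, simplified: keep an inserted
    API call only if the call *with its result* lowers the LM's loss on
    the subsequent tokens by at least tau versus the better of (no call,
    call without result)."""
    return min(loss_no_call, loss_call_no_result) - loss_with_result >= tau

# Toy numbers: the API result makes the continuation much easier to
# predict, so the call survives filtering.
kept = keep_api_call(loss_with_result=2.1,
                     loss_call_no_result=4.0,
                     loss_no_call=3.8)  # True
```

Note that the filter judges each call in isolation against the following text, which is one way to see why the architecture is single-step: nothing in the criterion rewards a call whose value is to enable a *second* call.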
FinBen evaluates 15 LLMs across 36 financial datasets at NeurIPS 2024, finding GPT-4 reaches 0.63 Exact Match on numerical QA and 0.54 on stock movement forecasting — near chance. Here is what those numbers mean for building a reliable accounting agent on a Beancount ledger.