PAL (Program-Aided Language Models) achieves a +38pp accuracy gain over chain-of-thought on arithmetic-heavy tasks by delegating computation to a Python interpreter — a directly applicable architecture for reliable Beancount ledger queries and finance AI.
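The PAL pattern the teaser describes can be sketched in a few lines: the model emits a small Python program instead of prose arithmetic, and a real interpreter computes the final number. This is a minimal illustration, not the paper's prompt or code; `generate_program` is a hardcoded stand-in for the LLM call, and the posting amounts are invented:

```python
def generate_program(question: str) -> str:
    # Stand-in for an LLM completion. A real PAL prompt shows few-shot
    # examples and asks the model to emit a `solution()` function whose
    # return value is the answer to the question.
    return (
        "def solution():\n"
        "    # Opening balance, three expense postings, one deposit\n"
        "    balance = 2500.00\n"
        "    balance -= 39.95 + 120.50 + 8.25\n"
        "    balance += 1000.00\n"
        "    return round(balance, 2)\n"
    )

def run_pal(question: str) -> float:
    namespace: dict = {}
    # The interpreter, not the model, does the arithmetic.
    exec(generate_program(question), namespace)
    return namespace["solution"]()

print(run_pal("What is the checking balance after these postings?"))  # -> 3331.3
```

The key design choice is the division of labor: the model only has to get the *plan* right (which numbers, which operations), while the interpreter guarantees the computation itself is exact, which is what makes the pattern attractive for ledger queries.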
Four 2024–2025 benchmarks show GPT-4 scoring 42% on real-world table QA versus 86% for humans, with complex aggregations collapsing to 19.6%. Beancount's native syntax, meanwhile, sits at the worst-performing end of the serialization hierarchy for LLM input.
A close reading of Wei et al.'s 2022 Chain-of-Thought paper and what it means for finance AI — why CoT raises precision but may cut recall on rare-event detection, why the scale threshold matters for production agents, and what a finance team building on LLMs should watch out for.
PHANTOM (NeurIPS 2025) is the first benchmark to measure LLM hallucination detection on real SEC filings across context lengths up to 30,000 tokens. Qwen3-30B-A3B-Thinking leads with F1=0.882, while 7B models perform close to random guessing, with direct implications for autonomous accounting agents.
FinBen evaluates 15 LLMs across 36 financial datasets at NeurIPS 2024, finding GPT-4 reaches 0.63 Exact Match on numerical QA and only 0.54 on stock movement forecasting, barely above chance for a binary up/down call. Here is what those numbers mean for building a reliable accounting agent on a Beancount ledger.
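As a quick sanity check on the "near chance" framing: stock movement forecasting is a binary up/down task, so a coin flip already scores about 0.50. A sketch (illustrative only, not FinBen's evaluation code; `exact_match` is a hypothetical helper):

```python
import random

def exact_match(preds, golds):
    """Exact Match: the fraction of predictions identical to the gold label."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

random.seed(0)
# Simulated binary up/down labels and a pure coin-flip "model".
golds = [random.choice(["up", "down"]) for _ in range(10_000)]
coin = [random.choice(["up", "down"]) for _ in range(10_000)]

print(exact_match(coin, golds))  # hovers around 0.50, so 0.54 is barely better
```

Against that baseline, 0.54 means GPT-4 recovers only a few points of signal over guessing, which is why the teaser treats the number as a warning rather than a capability.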