CMU and NC State researchers propose using System-Theoretic Process Analysis (STPA) and a capability-enhanced Model Context Protocol to derive formal safety specifications for LLM agent tool use, with Alloy-based verification demonstrating the absence of unsafe flows in a calendar scheduling case study.
Microsoft's GraphRAG builds a Leiden-partitioned entity graph over a text corpus and precomputes community summaries to answer global sensemaking questions that standard vector RAG cannot handle. A 2025 bias audit, however, shows its reported 72–83% win rates collapse once position and length artifacts in the LLM-as-judge evaluation are corrected for.
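GraphRAG's global query path (partition the entity graph, precompute per-community summaries, then map-reduce an answer) can be sketched in a few lines. This is a minimal stand-in, not GraphRAG's implementation: connected components replace Leiden community detection, and `summarize`/`global_answer` replace the LLM map and reduce calls; all names are hypothetical.

```python
# Toy sketch of GraphRAG's global query path. Stand-ins:
# connected components instead of Leiden, fact filtering instead of
# LLM summarization, list concatenation instead of the LLM reduce step.
from collections import defaultdict

def communities(edges):
    """Partition entities into communities (stand-in: connected components)."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return list(groups.values())

def summarize(community, facts):
    # Stand-in for a precomputed LLM community summary:
    # keep only the facts whose entity belongs to this community.
    return [f for f in facts if f[0] in community]

def global_answer(query, edges, facts):
    # Map: each community contributes a partial answer from its summary
    # (a real implementation would condition each partial on `query`);
    # Reduce: collect the partials, which GraphRAG would fuse with an LLM.
    partials = []
    for c in communities(edges):
        summary = summarize(c, facts)
        if summary:
            partials.append(summary)
    return partials
```

The key property the sketch preserves is that a "global" question touches every community's summary, rather than the top-k chunks a vector index would return.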
FinAuditing tests 13 LLMs zero-shot on 1,102 real SEC XBRL filing instances; the top scores are just 13.86% on financial math verification and 12.42% on concept retrieval, results that directly bound what AI accounting tools can be trusted to automate without external tooling.
StructRAG (ICLR 2025) routes each query to a task-appropriate structure type — table, graph, catalogue, algorithm, or chunk — before reasoning, scoring 28 points higher than GraphRAG on the Loong benchmark while running 22× faster, with the DPO-trained router alone accounting for a 15-point accuracy gain.
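StructRAG's core move is a router that picks one structure type per query before any reasoning happens. The paper trains this router with DPO; the keyword heuristic below is purely a hypothetical stand-in to show the route → structurize → reason control flow, with `structurize` and `reason` left as caller-supplied functions.

```python
# Sketch of StructRAG-style hybrid-structure routing. The real router is
# a DPO-trained LLM; this keyword heuristic is a stand-in for illustration.
STRUCTURES = ("table", "graph", "algorithm", "catalogue", "chunk")

def route(query: str) -> str:
    q = query.lower()
    if any(w in q for w in ("compare", "per year", "statistics")):
        return "table"       # tabular aggregation queries
    if any(w in q for w in ("relationship", "depends on", "who knows")):
        return "graph"       # relational / multi-hop queries
    if any(w in q for w in ("steps", "procedure", "how to")):
        return "algorithm"   # procedural queries
    if any(w in q for w in ("overview", "outline")):
        return "catalogue"   # hierarchical summaries
    return "chunk"           # default: plain retrieved chunks

def answer(query, corpus, structurize, reason):
    kind = route(query)
    knowledge = structurize(corpus, kind)  # build the chosen structure once
    return reason(query, knowledge, kind)  # then reason over it
```

The speedup claim makes sense in this frame: only one structure is built per query, instead of a full entity graph for every corpus as in GraphRAG.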
Atlas (JMLR 2023) achieves 42.4% accuracy on Natural Questions with only 64 training examples, beating PaLM 540B by 3 points at 11B parameters, by jointly pre-training a Contriever-based dense retriever with a T5 Fusion-in-Decoder reader. The analysis covers retrieval accuracy limits, the infrastructure cost of the 587 GB index, and implications for Beancount ledger QA systems.
Izacard and Grave's FiD architecture encodes each retrieved passage independently, then fuses them in the decoder, outperforming RAG-Sequence by 4–11 points on NQ and TriviaQA. This post examines the design and its implications for Beancount ledger QA, where multi-entry synthesis across transactions is the norm.
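The fusion step is easiest to see at the level of tensor shapes: each (question, passage) pair gets its own encoder pass, and the per-passage outputs are concatenated along the sequence axis so the decoder's cross-attention sees all passages jointly. The `encode` function below is a toy stand-in for the T5 encoder (one 4-d vector per token), not the real model.

```python
# Shape-level sketch of Fusion-in-Decoder. encode() is a toy stand-in for
# the T5 encoder: it maps each token to a fake 4-dimensional vector.
def encode(question, passage):
    tokens = (question + " " + passage).split()
    return [[float(len(tok))] * 4 for tok in tokens]  # (seq_len, d=4)

def fuse(question, passages):
    fused = []
    for p in passages:                # N independent encoder passes
        fused.extend(encode(question, p))
    return fused                      # (sum of seq_lens, d): one joint memory

question = "when was the ledger opened"
passages = ["2021-01-01 open Assets:Cash", "2021-02-01 txn groceries"]
memory = fuse(question, passages)
# The decoder would cross-attend over `memory`. Encoding cost grows
# linearly in N, unlike concatenating all passages into one long input,
# where self-attention cost grows quadratically.
```

This is exactly the property that matters for multi-entry ledger questions: evidence from many transactions lands in one decoder-visible memory without any single context window having to hold it all at encode time.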
A close reading of Du et al.'s ICML 2024 multiagent debate paper, which reports 14.8-point accuracy gains on arithmetic, alongside 2025 rebuttals showing that equal-budget single agents match debate performance, and an analysis of why Collective Delusion (65% of debate failures) poses specific risks for AI-assisted ledger commits.
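The Collective Delusion failure mode is easy to reproduce in miniature. In the sketch below, `debate` is a stand-in for the real protocol: instead of LLM calls, every agent simply conforms to the current majority each round, which is the limiting behavior the critiques describe. When most agents start out wrong, the debate converges confidently on the wrong answer.

```python
# Toy model of the multiagent debate loop from Du et al. Agents are
# replaced by a conformity rule (adopt the round's majority answer),
# a deliberately degenerate stand-in that exhibits Collective Delusion.
from collections import Counter

def debate(initial_answers, rounds=2):
    answers = list(initial_answers)
    for _ in range(rounds):
        majority = Counter(answers).most_common(1)[0][0]
        answers = [majority for _ in answers]  # every agent conforms
    return answers

# Three agents, two of them wrong: consensus locks in the wrong answer.
final = debate(["42", "41", "41"])
```

For ledger commits this matters because agreement among agents is evidence of convergence, not correctness; a balance check is a cheaper and stronger gate than a second opinion from the same model family.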
A NeurIPS 2024 Spotlight paper ablates three LLM-based time series forecasting methods — OneFitsAll, Time-LLM, and CALF — and finds that removing the language model improves accuracy in most cases, with up to a 1,383× training speedup. For finance AI applications like Beancount balance prediction, lightweight purpose-built models consistently beat repurposed LLMs.
Empirical comparison of RAG vs. unsupervised fine-tuning across 7B-parameter LLMs shows RAG achieves 0.875+ accuracy on post-cutoff facts while fine-tuning plateaus at 0.504 — with direct implications for Beancount agent design and any system requiring frequent knowledge updates.
IRCoT interleaves BM25 retrieval with each step of a chain-of-thought reasoning loop, gaining +11.3 points of retrieval recall and +7.1 points of F1 on HotpotQA over one-step RAG, and shows that a 3B model can beat GPT-3 175B when the retrieval strategy is right.
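The interleaving loop can be sketched directly: each new chain-of-thought sentence becomes the next retrieval query, so evidence accumulates hop by hop. In this stand-in, `score` is a term-overlap toy in place of BM25, and `next_thought` is a caller-supplied function standing in for the LLM's reasoning step; all names are hypothetical.

```python
# Sketch of IRCoT's retrieve-reason interleaving. score() is a toy
# term-overlap stand-in for BM25; next_thought() stands in for the LLM.
def score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

def retrieve(query, corpus, k=1):
    return sorted(corpus, key=lambda doc: -score(query, doc))[:k]

def ircot(question, corpus, next_thought, steps=3):
    thoughts, evidence = [], []
    query = question
    for _ in range(steps):
        evidence += retrieve(query, corpus)          # retrieve with latest query
        thought = next_thought(question, thoughts, evidence)
        if thought is None:                          # reasoning terminated
            break
        thoughts.append(thought)
        query = thought                              # next hop queries this step
    return thoughts, evidence

corpus = ["paris is the capital of france", "the eiffel tower is in paris"]
def one_hop(question, thoughts, evidence):
    return "eiffel tower" if not thoughts else None  # fixed stand-in reasoning
thoughts, evidence = ircot("capital of france", corpus, one_hop)
```

The second document is only reachable because the first thought rewrote the query, which is the whole point: one-step RAG never gets a second chance to ask.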