AutoGen (Wu et al., 2023) introduces a multi-agent conversation framework where LLM-backed agents pass messages to complete tasks; a two-agent setup lifts MATH benchmark accuracy from 55% to 69%, and a dedicated SafeGuard agent improves unsafe-code detection by up to 35 F1 points — findings directly applicable to building safe, modular Beancount automation pipelines.
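The two-agent review pattern is easy to sketch. Below is a minimal, hypothetical version in plain Python (not the actual autogen API): a worker agent proposes a ledger entry, a safeguard agent approves or rejects it, and the conversation loop terminates on approval or a turn cap. Toy reply functions stand in for LLM calls.

```python
from dataclasses import dataclass

# Minimal sketch of AutoGen's two-agent conversation pattern.
# All names here are illustrative, not the real autogen API.

@dataclass
class Message:
    sender: str
    content: str

@dataclass
class Agent:
    name: str
    reply_fn: callable  # maps conversation history -> reply string

    def reply(self, history):
        return Message(self.name, self.reply_fn(history))

def converse(a, b, opening, max_turns=6,
             done=lambda m: "APPROVED" in m.content):
    """Alternate messages between a and b until done() or the turn cap."""
    history = [Message(a.name, opening)]
    current = b
    while len(history) < max_turns:
        msg = current.reply(history)
        history.append(msg)
        if done(msg):
            break
        current = a if current is b else b
    return history

# Toy reply functions standing in for LLM calls.
worker = Agent("worker",
               lambda h: "balanced entry: Assets:Cash -10 USD / Expenses:Food 10 USD")
guard = Agent("safeguard",
              lambda h: "APPROVED" if "USD" in h[-1].content else "REJECTED")

transcript = converse(worker, guard, "post lunch expense")
```

The safeguard's check here is trivial; in practice it would be an LLM prompted as a reviewer, which is where the paper's unsafe-code-detection gains come from.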
Gorilla (Patil et al., NeurIPS 2024) fine-tunes a 7B LLaMA model with Retriever-Aware Training on retrieved API documentation, cutting hallucination rates from 78% to 11% versus GPT-4 zero-shot — with direct implications for finance AI write-back agents where wrong account names or inverted signs are correctness failures, not annoyances.
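The retriever-aware idea can be sketched in a few lines. The account docs and overlap scorer below are illustrative stand-ins for Gorilla's real retriever and training setup; the point is that the prompt is built only from retrieved documentation, so the model is never free to invent an account name.

```python
# Sketch of retriever-aware prompting in the spirit of Gorilla.
# DOCS, retrieve(), and build_prompt() are hypothetical, not Gorilla's API.

DOCS = {
    "Assets:Bank:Checking": "primary checking account, USD",
    "Expenses:Food:Restaurants": "dining out, restaurants and cafes",
    "Liabilities:CreditCard": "Visa credit card, negative balances",
}

def retrieve(query, k=1):
    """Rank docs by naive token overlap with the query (toy retriever)."""
    q = set(query.lower().split())
    scored = sorted(DOCS.items(),
                    key=lambda kv: -len(q & set(kv[1].lower().split())))
    return scored[:k]

def build_prompt(query):
    """Prepend retrieved docs so generation is grounded on real accounts."""
    docs = "\n".join(f"- {name}: {desc}" for name, desc in retrieve(query))
    return f"Use ONLY these accounts:\n{docs}\n\nTask: {query}"

prompt = build_prompt("categorize a restaurants dinner charge")
```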
MemGPT (Packer et al., 2023) applies OS-style virtual memory paging to LLMs, using three-tier storage — working memory, recall, and archival — to give agents persistent recall across sessions; on multi-session chat benchmarks, MemGPT with GPT-4 achieves 92.5% accuracy versus a 32.1% fixed-context baseline.
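The three-tier idea reduces to a paging policy. Here is a toy version (class and method names are hypothetical, not MemGPT's API): a small fixed working context, a recall log of everything observed, and an archival store that is searched on demand when the needed fact has been paged out.

```python
from collections import deque

# Toy sketch of MemGPT-style tiered memory; all names are illustrative.

class TieredMemory:
    def __init__(self, working_size=3):
        self.working = deque(maxlen=working_size)  # fits in the context window
        self.recall = []                           # full conversation log
        self.archival = []                         # facts paged out of working

    def observe(self, text):
        self.recall.append(text)
        if len(self.working) == self.working.maxlen:
            self.archival.append(self.working[0])  # page out the oldest item
        self.working.append(text)

    def context(self, query=None):
        """Working memory plus any archival hits for the query."""
        hits = [t for t in self.archival
                if query and query.lower() in t.lower()]
        return hits + list(self.working)

mem = TieredMemory()
for turn in ["user is named Ada", "opened ledger 2024.beancount",
             "balance check passed", "reconciled January"]:
    mem.observe(turn)
```

After four turns the user's name has been evicted from working memory, but `context("Ada")` recovers it from archival, which is the mechanism behind the multi-session accuracy gap.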
SWE-agent (NeurIPS 2024) introduces Agent-Computer Interfaces (ACIs) — purpose-built layers between LLMs and software environments — showing a 10.7-percentage-point improvement over raw shell access and 12.47% resolution on SWE-bench with GPT-4 Turbo. The paper's takeaway: interface design, not model capability, is the primary bottleneck for autonomous coding agents.
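An ACI for a ledger might look like the hypothetical `LedgerACI` below (commands and feedback format are my invention, not SWE-agent's). It illustrates the two design principles the paper emphasizes: a small structured command set instead of raw shell access, and concise, immediate feedback after every action.

```python
# Sketch of an Agent-Computer Interface for a Beancount file.
# LedgerACI and its methods are illustrative, not from SWE-agent.

class LedgerACI:
    def __init__(self, lines):
        self.lines = lines

    def open_window(self, start, size=3):
        """Show a small numbered window, not the whole file (context economy)."""
        window = self.lines[start:start + size]
        return [f"{start + i}: {line}" for i, line in enumerate(window)]

    def edit(self, lineno, new_text):
        """Apply an edit and immediately report a trivial validity check."""
        old = self.lines[lineno]
        self.lines[lineno] = new_text
        return {"replaced": old, "valid": new_text.strip() != ""}

aci = LedgerACI([
    '2024-01-05 * "Cafe"',
    "  Expenses:Food  10.00 USD",
    "  Assets:Cash   -1.00 USD",   # wrong amount, should balance the entry
])
report = aci.edit(2, "  Assets:Cash  -10.00 USD")
```

A real implementation would replace the `valid` check with a `beancount` parse of the edited entry, so the agent learns about syntax errors in the same turn it makes them.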
SWE-bench (Jimenez et al., ICLR 2024) evaluates language models on 2,294 real GitHub issues across 12 Python repositories using execution-based tests; at publication, Claude 2 resolved only 1.96% of issues with realistic retrieval, establishing the de facto benchmark for coding agents and revealing retrieval and patch-length failure modes directly relevant to Beancount write-back agents.
CodeAct (ICML 2024) replaces JSON tool-calling with executable Python code, improving GPT-4 agent success rates by ~20 percentage points on multi-tool tasks and reducing interaction turns by 30% — with direct implications for building reliable Beancount reconciliation agents.
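The core mechanic is that the agent's action is a Python snippet executed against exposed tools, so it can loop and compose in one turn instead of emitting one JSON call per tool. The sketch below is illustrative only, and its `exec`-based sandbox is NOT secure; the tool and harness names are hypothetical.

```python
# Sketch of the CodeAct idea: code as the action space.
# balance() and run_action() are toy stand-ins; do not use this sandbox in prod.

def balance(account):
    """Toy tool: look up an account balance."""
    return {"Assets:Checking": 120.0, "Assets:Savings": 80.0}[account]

def run_action(code):
    """Execute the agent-emitted snippet with only the exposed tools in scope."""
    env = {"balance": balance, "results": {}}
    exec(code, {"__builtins__": {}}, env)   # insecure toy sandbox
    return env["results"]

# A multi-tool task done in ONE turn: a JSON-calling agent would need
# one round-trip per balance() call plus a final aggregation step.
action = """
total = 0
for acct in ["Assets:Checking", "Assets:Savings"]:
    total = total + balance(acct)
results["total"] = total
"""
out = run_action(action)
```

The 30% reduction in interaction turns falls straight out of this structure: control flow moves from the conversation loop into the action itself.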
Huang et al. (ICLR 2024) show that LLMs asked to review their own reasoning without external feedback consistently degrade accuracy — GPT-4 drops from 95.5% to 91.5% on GSM8K — a result with direct consequences for designing reliable Beancount journal entry agents: self-correction loops need an external verifier, not just another pass of the same model.
Reflexion (NeurIPS 2023) lets LLM agents improve by storing verbal post-mortems in an episodic buffer — no weight updates required. It reaches 91% on HumanEval with GPT-4 but fails on WebShop, revealing a structural constraint: verbal reinforcement only works when the evaluator produces a crisp, actionable signal. Here is what that means for building a self-correcting Beancount ledger agent.
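The Reflexion loop fits in a dozen lines once the evaluator exists. In the sketch below, the "policy" and evaluator are toy stand-ins for an LLM and a unit-test harness; what matters is the shape: a failed trial produces a verbal post-mortem that is stored in an episodic buffer and fed into the next attempt, with no weight updates.

```python
# Minimal sketch of the Reflexion loop; attempt() and evaluate() are toy
# stand-ins for an LLM policy and an external test-based evaluator.

def attempt(task, reflections):
    """Toy policy: succeeds only once a reflection names the sign convention."""
    if any("negate" in r for r in reflections):
        return -abs(task["amount"])      # correct: cash side posted negative
    return abs(task["amount"])

def evaluate(result):
    """External evaluator returning a crisp, actionable verbal signal."""
    ok = result < 0
    feedback = "" if ok else "wrong sign: negate cash-side amounts"
    return ok, feedback

def reflexion_loop(task, max_trials=3):
    reflections = []                     # episodic memory across trials
    for trial in range(max_trials):
        ok, feedback = evaluate(attempt(task, reflections))
        if ok:
            return trial + 1, reflections
        reflections.append(feedback)
    return max_trials, reflections

trials, memory = reflexion_loop({"amount": 10.0})
```

The WebShop failure maps onto `evaluate`: if feedback were just "wrong answer" rather than an actionable instruction, the second trial would fail exactly like the first.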
PAL (Program-Aided Language Models; Gao et al., ICML 2023) achieves a +38-percentage-point accuracy gain over chain-of-thought on arithmetic-heavy tasks by delegating computation to a Python interpreter — a directly applicable architecture for reliable Beancount ledger queries and finance AI.
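The pattern separates reasoning from arithmetic: the model writes a program expressing the reasoning, and the interpreter produces the number. In this sketch the "generated" program is hard-coded; in PAL it is the LLM's output.

```python
# Sketch of the PAL pattern. The program string below stands in for what
# the LLM would generate for a ledger question; exec runs it exactly.

generated_program = """
# reasoning as code, as an LLM might produce for "what is the closing balance?"
opening = 1250.00
deposits = [200.00, 75.50]
withdrawals = [40.25, 310.00]
answer = opening + sum(deposits) - sum(withdrawals)
"""

def run_pal(program):
    scope = {}
    exec(program, scope)        # the interpreter, not the model, does the math
    return scope["answer"]

answer = run_pal(generated_program)
```

The model never has to add `1250.00 + 275.50 - 350.25` in its head, which is exactly the step where chain-of-thought loses its 38 points.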
Four 2024–2025 benchmarks show GPT-4 scoring 42% on real-world table QA versus 86% for humans, with complex aggregations collapsing to 19.6% — and Beancount's native syntax sits at the worst-performing end of the serialization hierarchy for LLM input.
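The practical response is to re-serialize ledger data before handing it to a model. The converters below are hypothetical helpers (not from any of the benchmarks), showing the same postings rendered as CSV and as a markdown table, two formats that generally sit higher on the serialization hierarchy than Beancount's native syntax.

```python
# Illustrative re-serialization of Beancount postings for LLM table QA.
# to_csv() and to_markdown() are hypothetical helper names.

rows = [
    ("2024-01-05", "Expenses:Food", 10.00, "USD"),
    ("2024-01-05", "Assets:Cash", -10.00, "USD"),
]

def to_csv(rows):
    header = "date,account,amount,currency"
    body = "\n".join(f"{d},{a},{amt:.2f},{c}" for d, a, amt, c in rows)
    return header + "\n" + body

def to_markdown(rows):
    lines = ["| date | account | amount | currency |",
             "| --- | --- | --- | --- |"]
    lines += [f"| {d} | {a} | {amt:.2f} | {c} |" for d, a, amt, c in rows]
    return "\n".join(lines)

csv_view = to_csv(rows)
md_view = to_markdown(rows)
```

Either view makes the column structure explicit, where native Beancount syntax interleaves metadata, indentation, and amounts in a way the benchmarks suggest models parse poorly.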