Voyager, a GPT-4-powered Minecraft agent from NVIDIA and Caltech, demonstrates that a persistent code skill library enables genuine lifelong learning without fine-tuning — discovering 3.3× more items than prior state-of-the-art. The pattern maps directly onto long-horizon Beancount ledger automation, though financial correctness demands staging layers that game sandboxes never require.
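The core mechanism can be sketched in a few lines: verified skills persist as code on disk and are retrieved by relevance when a new task arrives. A minimal sketch, with hypothetical names and naive keyword overlap standing in for the embedding retrieval Voyager actually uses:

```python
# Sketch of a Voyager-style persistent skill library (class and method names
# are hypothetical). Skills are stored as source strings and retrieved by
# keyword overlap; Voyager itself uses embedding similarity, omitted here
# to keep the sketch dependency-free.
import json
from pathlib import Path

class SkillLibrary:
    def __init__(self, path="skills.json"):
        self.path = Path(path)
        self.skills = json.loads(self.path.read_text()) if self.path.exists() else {}

    def add(self, name, description, code):
        """Store a skill only after it has passed verification elsewhere."""
        self.skills[name] = {"description": description, "code": code}
        self.path.write_text(json.dumps(self.skills, indent=2))

    def retrieve(self, query, k=3):
        """Rank stored skills by naive keyword overlap with the query."""
        q = set(query.lower().split())
        scored = sorted(
            self.skills.items(),
            key=lambda kv: -len(q & set(kv[1]["description"].lower().split())),
        )
        return [name for name, _ in scored[:k]]

lib = SkillLibrary()
lib.add("dedupe_txns", "remove duplicate imported transactions", "def dedupe(txns): ...")
lib.add("fx_convert", "convert foreign currency postings", "def convert(p): ...")
print(lib.retrieve("duplicate transactions from import"))  # dedupe_txns ranks first
```

For a ledger agent, the verification gate before `add` is the staging layer the article argues for: only skills that pass `bean-check` on a scratch ledger should ever enter the library.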
HippoRAG (NeurIPS 2024) builds a knowledge graph from OpenIE triples and applies Personalized PageRank at query time, reaching 89.1% Recall@5 on 2WikiMultiHopQA versus 68.2% for ColBERTv2—with direct implications for querying complex financial ledgers across multi-year transaction histories.
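The retrieval step is a random walk with restarts biased toward entities found in the query. A pure-Python power-iteration sketch in the spirit of HippoRAG, with illustrative ledger-flavored triples (not from the paper):

```python
# Personalized PageRank over a graph built from OpenIE-style triples.
# Triples and entity names are illustrative; HippoRAG extracts them with an LLM.
from collections import defaultdict

triples = [
    ("Acme Corp", "paid", "Invoice 42"),
    ("Invoice 42", "booked_to", "Expenses:Consulting"),
    ("Beta LLC", "paid", "Invoice 7"),
    ("Invoice 7", "booked_to", "Expenses:Consulting"),
]

# Undirected adjacency from subject/object pairs (relation labels kept aside).
adj = defaultdict(set)
for s, _, o in triples:
    adj[s].add(o)
    adj[o].add(s)

def personalized_pagerank(adj, seeds, alpha=0.85, iters=50):
    """Power iteration; teleport mass goes only to the seed entities."""
    nodes = list(adj)
    rank = {n: seeds.get(n, 0.0) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - alpha) * seeds.get(n, 0.0) for n in nodes}
        for n in nodes:
            share = alpha * rank[n] / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        rank = nxt
    return rank

# Seed at entities extracted from the query, e.g. "Which account got Acme's payment?"
scores = personalized_pagerank(adj, seeds={"Acme Corp": 1.0})
print(sorted(scores, key=scores.get, reverse=True)[:3])
```

Nodes near the seed ("Invoice 42") outrank symmetric but unseeded ones ("Invoice 7"), which is exactly the multi-hop behavior a multi-year ledger query needs.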
AgentBench (Liu et al., ICLR 2024) benchmarks 27 LLMs across 8 interactive environments: GPT-4 scores 4.01 overall, while the best open-source model manages only 0.96. Its three dominant failure modes (task-limit exceeded in 67.9% of knowledge-graph failures, malformed output in 53.3% of database failures, and invalid actions) map directly onto the risks of deploying a Beancount write-back agent against a live ledger.
Bloomberg trained a 50B-parameter LLM on 569B tokens of financial data and beat general models on sentiment and table-reasoning benchmarks — then GPT-4 matched it without any finance-specific pretraining. What the $10M experiment reveals about domain pretraining trade-offs, tokenization of numbers, and why tool-use is more reliable than model internals for accounting agents.
MemGPT applies OS-style virtual memory paging to LLMs, using three-tier storage — working memory, recall, and archival — to give agents persistent recall across sessions; on multi-session chat benchmarks, MemGPT with GPT-4 achieves 92.5% accuracy versus a 32.1% fixed-context baseline.
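The paging idea reduces to a small amount of bookkeeping: a bounded in-context tier, and retrieval tools over what was paged out. A toy sketch, with hypothetical class and tier names:

```python
# Toy sketch of MemGPT's three tiers: a bounded working context, a recall
# store of evicted messages, and an archival store of explicitly saved facts.
# Names are hypothetical; MemGPT drives eviction and search via function calls.
from collections import deque

class TieredMemory:
    def __init__(self, context_limit=4):
        self.working = deque()   # in-context messages (the "main memory")
        self.recall = []         # evicted conversation history
        self.archival = []       # long-term facts written explicitly
        self.context_limit = context_limit

    def append(self, message):
        self.working.append(message)
        while len(self.working) > self.context_limit:
            # Page the oldest message out of context into recall storage.
            self.recall.append(self.working.popleft())

    def archive(self, fact):
        self.archival.append(fact)

    def search(self, query):
        """Substring search over the out-of-context tiers, as a retrieval tool would."""
        return [m for m in self.recall + self.archival if query.lower() in m.lower()]

mem = TieredMemory(context_limit=2)
for msg in ["user: hi", "agent: hello", "user: my ledger is in EUR", "agent: noted"]:
    mem.append(msg)
mem.archive("Ledger base currency: EUR")
print(list(mem.working))   # only the 2 most recent messages stay in context
print(mem.search("eur"))   # older context is recoverable via search
```

For a ledger agent, archival memory is where durable facts (account conventions, recurring payees) belong, so they survive across import sessions.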
SWE-agent (NeurIPS 2024) introduces Agent-Computer Interfaces (ACIs) — purpose-built layers between LLMs and software environments — showing a 10.7-percentage-point improvement over raw shell access and 12.47% resolution on SWE-bench with GPT-4 Turbo. Interface design, not model capability, is the primary bottleneck for autonomous coding agents.
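What an ACI looks like in practice: instead of raw shell output, the agent gets compact, bounded views with explicit feedback. A minimal sketch of a windowed file viewer in the spirit of SWE-agent's interface (command name and window size are illustrative):

```python
# ACI-style file viewer: rather than `cat`-ing a whole file into context, the
# agent sees a fixed-size, line-numbered window with clear bounds feedback,
# and out-of-range requests are clamped instead of erroring.
WINDOW = 5

def open_window(lines, start):
    """Return a compact, numbered view of `lines` beginning at `start` (1-based)."""
    start = max(1, min(start, max(1, len(lines) - WINDOW + 1)))
    end = min(len(lines), start + WINDOW - 1)
    header = f"[showing lines {start}-{end} of {len(lines)}]"
    body = [f"{i}: {lines[i - 1]}" for i in range(start, end + 1)]
    return "\n".join([header] + body)

src = [f"line {i}" for i in range(1, 12)]
view = open_window(src, start=100)  # out-of-range start is clamped, not an error
print(view)
```

The same principle applies to a Beancount agent: a `show_transaction`-style view that returns one bounded, well-labeled entry will beat dumping the whole ledger into context.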
SWE-bench evaluates language models on 2,294 real GitHub issues across 12 Python repositories using execution-based tests; at publication, Claude 2 resolved only 1.96% of issues with realistic retrieval, establishing the de facto benchmark for coding agents and revealing retrieval and patch-length failure modes directly relevant to Beancount write-back agents.
CodeAct (ICML 2024) replaces JSON tool-calling with executable Python code, improving GPT-4 agent success rates by ~20 percentage points on multi-tool tasks and reducing interaction turns by 30% — with direct implications for building reliable Beancount reconciliation agents.
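The shift is easy to see in miniature: one code action can compose several tool calls plus control flow that JSON tool-calling would spread across turns. A sketch with hypothetical tool names; a real deployment must sandbox the `exec`:

```python
# Minimal sketch of the CodeAct idea: the agent's action is a Python snippet
# executed in a namespace of tools. Tool names are illustrative stand-ins.
def fetch_bank_rows():
    return [{"payee": "ACME", "amount": -120.0}, {"payee": "Cafe", "amount": -4.5}]

def categorize(payee):
    return "Expenses:Consulting" if payee == "ACME" else "Expenses:Food"

tools = {"fetch_bank_rows": fetch_bank_rows, "categorize": categorize}

# A single emitted code action: fetch, filter, and classify in one turn.
action = """
result = [(r["payee"], categorize(r["payee"]))
          for r in fetch_bank_rows() if r["amount"] < 0]
"""

namespace = dict(tools)
exec(action, namespace)   # CodeAct executes the emitted code as the action
print(namespace["result"])
```

With JSON tool-calling, the filter and the loop would each cost a round trip through the model; here they run locally, which is where the turn-count reduction comes from.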
Tree of Thoughts (ToT) achieves 74% on Game of 24 vs 4% for standard GPT-4 CoT by organizing LLM reasoning into a branching search tree with pruning and backtracking — with direct implications for multi-step financial classification and tax optimization in Beancount workflows.
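The search skeleton behind the Game of 24 result is ordinary tree search with backtracking: branch on which two numbers to combine and with which operator, recurse on the smaller state, and back out of dead branches. In ToT the proposer and evaluator are LLM calls; this sketch replaces both with exhaustive expansion so it stays runnable:

```python
# Tree search with backtracking on Game of 24: each node is a multiset of
# remaining numbers; a branch combines two of them with one operator.
from itertools import permutations

def solvable(nums, target=24.0, eps=1e-6):
    if len(nums) == 1:
        return abs(nums[0] - target) < eps
    # Branch: pick an ordered pair, combine, recurse; failure backtracks.
    for i, j in permutations(range(len(nums)), 2):
        rest = [nums[k] for k in range(len(nums)) if k not in (i, j)]
        x, y = nums[i], nums[j]
        branches = [x + y, x - y, x * y] + ([x / y] if abs(y) > eps else [])
        for v in branches:
            if solvable(rest + [v], target, eps):
                return True
    return False

print(solvable([4, 9, 10, 13]))  # True: e.g. (10 - 4) * (13 - 9)
print(solvable([1, 1, 1, 1]))    # False: no expression reaches 24
```

Standard chain-of-thought corresponds to committing to a single root-to-leaf path; the 74%-vs-4% gap is the value of being able to abandon a bad branch, which is exactly what multi-step classification and tax-optimization searches over a ledger need.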
Reflexion (NeurIPS 2023) lets LLM agents improve by storing verbal post-mortems in an episodic buffer — no weight updates required. It reaches 91% on HumanEval with GPT-4 but fails on WebShop, revealing a structural constraint: verbal reinforcement only works when the evaluator produces a crisp, actionable signal. Here is what that means for building a self-correcting Beancount ledger agent.
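The loop itself is small: attempt, verify, store a verbal post-mortem, retry with the reflections in context. A sketch with a canned actor standing in for the LLM and a Beancount-flavored balance check as the crisp evaluator the paper's result depends on:

```python
# Reflexion-style loop: the episodic buffer holds verbal feedback, not weights.
# The actor is a hand-coded stand-in for an LLM conditioned on `reflections`;
# the verifier is a double-entry balance check (postings must sum to zero).
def actor(task, reflections):
    if any("signs" in r for r in reflections):
        return [("Assets:Checking", -50.0), ("Expenses:Food", 50.0)]
    return [("Assets:Checking", -50.0), ("Expenses:Food", -50.0)]  # wrong sign

def verifier(postings):
    total = sum(amount for _, amount in postings)
    return abs(total) < 1e-9, f"postings sum to {total}, expected 0"

def reflexion_loop(task, max_trials=3):
    reflections = []   # episodic memory across trials
    for trial in range(max_trials):
        attempt = actor(task, reflections)
        ok, feedback = verifier(attempt)
        if ok:
            return attempt, trial + 1
        reflections.append(f"trial {trial}: {feedback}; check posting signs")
    return None, max_trials

result, trials = reflexion_loop("book lunch expense")
print(trials)  # 2: succeeds on the second trial after one reflection
```

The balance check plays the role HumanEval's unit tests play in the paper: a precise, actionable signal. Replace it with a fuzzy reward like WebShop's and the same loop stalls, which is the structural constraint the entry describes.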