SWE-bench evaluates language models on 2,294 real GitHub issues across 12 Python repositories using execution-based tests; at publication, Claude 2 resolved only 1.96% of issues with realistic retrieval, establishing the de facto benchmark for coding agents and revealing retrieval and patch-length failure modes directly relevant to Beancount write-back agents.
CodeAct (ICML 2024) replaces JSON tool-calling with executable Python code, improving GPT-4 agent success rates by ~20 percentage points on multi-tool tasks and reducing interaction turns by 30% — with direct implications for building reliable Beancount reconciliation agents.
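The core CodeAct move can be sketched in a few lines: the agent's "action" each turn is Python source executed in a persistent namespace, so later turns reuse earlier variables instead of re-serializing state through JSON tool calls. This is a minimal stand-in, not the paper's harness; the turn strings below play the role of LLM output, and the `observation` variable name is an assumption of this sketch.

```python
def run_codeact_turns(turns):
    """Execute each turn's code in one shared namespace; whatever the code
    binds to `observation` is fed back to the agent as the turn's result."""
    ns = {}
    observations = []
    for code in turns:
        exec(code, ns)                      # action = executable code
        observations.append(ns.get("observation"))
    return observations

# Turn 2 reuses `txns` defined in turn 1 -- state persists across turns.
obs = run_codeact_turns([
    "txns = [120.50, 99.99, 5.00]; observation = len(txns)",
    "observation = round(sum(txns), 2)",
])
print(obs)  # → [3, 225.49]
```

The persistent namespace is what cuts interaction turns: a JSON tool-calling agent would have to pass the transaction list back through the model between steps.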
Huang et al. (ICLR 2024) show that LLMs asked to review their own reasoning without external feedback consistently degrade accuracy — GPT-4 drops from 95.5% to 91.5% on GSM8K — and what this means for designing reliable Beancount journal entry agents.
Tree of Thoughts (ToT) achieves 74% on Game of 24 vs 4% for standard GPT-4 CoT by organizing LLM reasoning into a branching search tree with pruning and backtracking — with direct implications for multi-step financial classification and tax optimization in Beancount workflows.
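ToT's breadth-first variant reduces to a beam search over partial "thoughts": expand every frontier state, score the candidates, prune to the top few, repeat. A minimal sketch, with deterministic `expand`/`score` stand-ins where the paper uses LLM calls:

```python
import heapq

def tree_of_thoughts(root, expand, score, beam=2, depth=3):
    """BFS-style ToT: branch each frontier state into candidate thoughts,
    keep only the top `beam` by score (pruning), repeat `depth` levels."""
    frontier = [root]
    for _ in range(depth):
        candidates = [c for state in frontier for c in expand(state)]
        if not candidates:
            break
        frontier = heapq.nlargest(beam, candidates, key=score)
    return max(frontier, key=score)

# Toy search: grow digit strings, score by digit sum. Pruning to beam=2
# still reaches the global best ("333") without visiting all 27 leaves.
best = tree_of_thoughts(
    root="",
    expand=lambda s: [s + d for d in "123"],
    score=lambda s: sum(int(c) for c in s),
)
print(best)  # → 333
```

In a Beancount setting the states would be partial classification decisions and `score` a model-rated plausibility, but the search skeleton is unchanged.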
CRITIC (ICLR 2024) achieves 7.7 F1 gains on open-domain QA and a 79.2% toxicity reduction by grounding LLM revision in external tool signals — a verify-then-correct loop that maps directly onto write-back safety for Beancount finance agents.
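The verify-then-correct loop maps cleanly onto write-back safety: an external check gates any ledger mutation, and on failure the reviser sees the tool's error signal rather than its own opinion. A hedged sketch, where `propose` and `revise` stand in for LLM calls and the verifier is the double-entry invariant that postings sum to zero:

```python
def verify(postings):
    """External tool signal: a Beancount entry's postings must balance."""
    total = round(sum(postings), 2)
    return total == 0.0, f"postings sum to {total}, expected 0"

def critic_loop(propose, revise, max_rounds=3):
    """CRITIC-style loop: draft, verify with a tool, revise on the tool's
    feedback; never write back an unverified entry."""
    draft = propose()
    for _ in range(max_rounds):
        ok, feedback = verify(draft)
        if ok:
            return draft
        draft = revise(draft, feedback)
    raise ValueError("no verified entry produced")

entry = critic_loop(
    propose=lambda: [-42.00, 41.00],        # unbalanced first draft
    revise=lambda d, fb: [-42.00, 42.00],   # correction conditioned on fb
)
print(entry)  # → [-42.0, 42.0]
```

The key design choice is that the loop raises rather than writing back when verification never passes, which is the safety property the blurb points at.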
Reflexion (NeurIPS 2023) lets LLM agents improve by storing verbal post-mortems in an episodic buffer — no weight updates required. It reaches 91% on HumanEval with GPT-4 but fails on WebShop, revealing a structural constraint: verbal reinforcement only works when the evaluator produces a crisp, actionable signal. Here is what that means for building a self-correcting Beancount ledger agent.
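The Reflexion loop itself is small: try, evaluate, store a verbal lesson in an episodic buffer, and feed that buffer to the next attempt. A minimal sketch, with `attempt` and `evaluate` as stand-ins for the LLM actor and the evaluator; the crisp pass/fail signal from `evaluate` is exactly the structural constraint the paper exposes:

```python
def reflexion(attempt, evaluate, max_trials=3):
    """Reflexion-style loop: no weight updates, only an episodic buffer of
    verbal post-mortems that conditions each subsequent attempt."""
    memory = []                         # episodic buffer of lessons
    for _ in range(max_trials):
        out = attempt(memory)
        ok, lesson = evaluate(out)
        if ok:
            return out, memory
        memory.append(lesson)           # store the post-mortem, retry
    return None, memory

# Toy demo: the "actor" improves only because the buffer grows each trial.
out, mem = reflexion(
    attempt=lambda m: len(m),
    evaluate=lambda o: (o == 2, f"got {o}, need 2"),
)
print(out, mem)  # → 2 ['got 0, need 2', 'got 1, need 2']
```

If `evaluate` returned only a vague score instead of an actionable lesson, the buffer would carry no usable signal, which is the WebShop failure mode in miniature.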
Self-consistency replaces greedy chain-of-thought decoding with majority voting over N sampled reasoning paths — raising GPT-3's accuracy on GSM8K by 17.9 percentage points with no additional training — and applies directly to multi-step financial calculations where a single decoded path from the model is unreliable.
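Mechanically, self-consistency is a majority vote over the final answers of N independently sampled chains. A minimal sketch; the replayed answer list stands in for an LLM sampled at temperature > 0 with the numeric answer parsed from each chain:

```python
from collections import Counter

def self_consistent_answer(sample_fn, n=5):
    """Sample n reasoning paths and return the majority-vote final answer
    instead of trusting a single greedy decode."""
    answers = [sample_fn() for _ in range(n)]
    winner, _votes = Counter(answers).most_common(1)[0]
    return winner

# Stand-in sampler replaying pre-recorded final answers: two chains
# miscompute, but the vote recovers the consensus value.
paths = iter([42, 41, 42, 42, 43])
print(self_consistent_answer(lambda: next(paths), n=5))  # → 42
```

For multi-step financial calculations, voting on the final figure (not the intermediate reasoning) is what makes single-chain arithmetic slips wash out.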
PAL (Program-Aided Language Models) achieves a +38pp accuracy gain over chain-of-thought on arithmetic-heavy tasks by delegating computation to a Python interpreter — a directly applicable architecture for reliable Beancount ledger queries and finance AI.
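The PAL split is: the model writes a short program, the interpreter produces the number. A hedged sketch in which the `generated` string stands in for LLM output answering a made-up query ("total cost of three $120.50 transactions plus a 7% card fee"), computed in integer cents so the arithmetic is exact:

```python
# The program below plays the role of model-generated code; executing it,
# rather than asking the model for the number, is the whole PAL trick.
generated = """
subtotal = 3 * 12050        # three transactions of 12050 cents each
fee = subtotal * 7 // 100   # 7% card fee, in integer cents
answer = subtotal + fee
"""

scope = {}
exec(generated, scope)       # the interpreter, not the LLM, does the math
print(scope["answer"])       # → 38680
```

Working in cents sidesteps float rounding, a habit worth keeping when the same pattern backs Beancount ledger queries.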
Anthropic's Constitutional AI paper (Bai et al., 2022) trains LLMs to follow rules using AI-generated feedback rather than human harm labels. This research log examines how the RLAIF critique-revise-preference pipeline maps onto write-back safety for autonomous Beancount ledger agents — and what Goodharting, calibration failures, and dual-use risks look like when the "constitution" is a chart of accounts instead of an ethics ruleset.
A close reading of Wei et al.'s 2022 Chain-of-Thought paper and what it means for finance AI — why CoT raises precision but may cut recall on rare-event detection, why the scale threshold matters for production agents, and what a finance team building on LLMs should watch out for.