TAPAS (Google Research, ACL 2020) answers table questions by selecting cells and applying scalar aggregations — no SQL generated. This post analyzes the architecture, its 12-point SQA accuracy gain, and why the cell-selection paradigm fits small Beancount ledger queries but breaks down at scale.
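The cell-selection-plus-aggregation idea can be sketched in a few lines. This is a toy illustration, not TAPAS itself: `cell_scores` stands in for the model's per-cell selection probabilities and `op` for its aggregation-classifier output, both hypothetical values here.

```python
# Toy sketch of TAPAS-style answering: select cells above a probability
# threshold, then apply a scalar aggregation op to the selected values.

def answer(table, cell_scores, op, threshold=0.5):
    """table: 2D list of numbers; cell_scores: same shape, values in [0, 1]."""
    selected = [
        table[i][j]
        for i in range(len(table))
        for j in range(len(table[0]))
        if cell_scores[i][j] > threshold
    ]
    if op == "NONE":        # the answer is the selected cells themselves
        return selected
    if op == "COUNT":
        return len(selected)
    if op == "SUM":
        return sum(selected)
    if op == "AVERAGE":
        return sum(selected) / len(selected)
    raise ValueError(f"unknown op: {op}")

table = [[120.0, 3.0], [80.0, 1.0]]   # e.g. two ledger rows: amount, count
scores = [[0.9, 0.1], [0.8, 0.2]]     # model "selects" the amount column
print(answer(table, scores, "SUM"))   # 200.0
```

The appeal for small ledgers is clear: no SQL parser, no execution engine, just a differentiable selection over visible cells, which is also why it breaks down once the table no longer fits in the model's context.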
MAC-SQL (COLING 2025) uses three specialized agents — Selector for schema reduction, Decomposer for question decomposition, and Refiner for execution-guided SQL correction — to reach 59.59% execution accuracy on the BIRD benchmark; ablation shows the Refiner contributes the most (+4.63 points), with direct implications for Beancount ledger query generation.
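The Refiner's execution-guided loop is the part that transfers most directly to ledger SQL. A minimal sketch, assuming an in-memory SQLite database: `propose_fix` is a hypothetical stand-in for the LLM call that MAC-SQL's Refiner agent makes with the error message in its prompt.

```python
# Sketch of the Refiner idea: execute candidate SQL and, on error, feed
# the error message back for correction, up to a fixed number of rounds.
import sqlite3

def refine_loop(conn, sql, propose_fix, max_rounds=3):
    for _ in range(max_rounds):
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as err:
            sql = propose_fix(sql, str(err))   # execution-guided repair
    raise RuntimeError("could not repair query")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postings (account TEXT, amount REAL)")
conn.execute("INSERT INTO postings VALUES ('Expenses:Food', 12.5)")

# Stub refiner: fixes a misspelled table name (illustrative only).
fix = lambda sql, err: sql.replace("posting ", "postings ")
rows = refine_loop(conn, "SELECT SUM(amount) FROM posting ", fix)
print(rows)  # [(12.5,)]
```

The +4.63-point ablation result suggests that most of the value lies in this repair loop rather than in generating a perfect query on the first attempt.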
CMU and NC State researchers propose using System-Theoretic Process Analysis (STPA) and a capability-enhanced Model Context Protocol to derive formal safety specifications for LLM agent tool use, with Alloy-based verification demonstrating absence of unsafe flows in a calendar scheduling case study.
A 2026 Stanford preprint equalizes thinking-token budgets across five multi-agent architectures and finds single-agent LLMs match or beat multi-agent systems on multi-hop reasoning — with theoretical grounding in the Data Processing Inequality and implications for finance AI agent design.
M3MAD-Bench stress-tests Multi-Agent Debate across 9 models, 5 domains, and vision-language settings, finding that Collective Delusion causes 65% of failures, adversarial debate cuts accuracy by up to 12.8%, and Self-Consistency typically matches debate accuracy at lower token cost.
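The Self-Consistency baseline that matches debate is trivially simple: sample several answers independently and majority-vote, with no inter-agent rounds at all. A minimal sketch; the `samples` list stands in for independent LLM completions (hypothetical values).

```python
# Self-Consistency in one function: majority vote over sampled answers,
# with no debate rounds and no cross-agent message passing.
from collections import Counter

def self_consistency(samples):
    return Counter(samples).most_common(1)[0][0]

samples = ["42", "42", "41", "42", "40"]   # five independent samples
print(self_consistency(samples))           # 42
```

Because samples are independent, a wrong answer cannot propagate between them, which is exactly the Collective Delusion failure mode that debate introduces.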
AGrail (ACL 2025) introduces a two-LLM cooperative guardrail that adapts its safety checks via test-time adaptation, achieving 0% prompt injection attack success and 95.6% benign action preservation on Safe-OS — versus GuardAgent and LLaMA-Guard, which block up to 49.2% of legitimate actions.
ShieldAgent (ICML 2025) replaces LLM-based guardrails with probabilistic rule circuits built on Markov Logic Networks, achieving 90.4% accuracy on agent attacks with 64.7% fewer API calls; this post examines what rule-based verification means for verifiable safety in financial AI systems.
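The MLN scoring that underlies this can be illustrated in toy form: a world's unnormalized probability is exp of the summed weights of the rules it satisfies. The rules and weights below are hypothetical illustrations, not ShieldAgent's actual policy circuits.

```python
# Toy flavor of MLN-style scoring: P(world) is proportional to
# exp(sum of weights of satisfied rules). Higher score = more
# consistent with the weighted safety policy.
import math

rules = [
    (2.0, lambda w: not w["delete_data"] or w["has_approval"]),
    (1.5, lambda w: not w["external_send"]),
]

def score(world):
    return math.exp(sum(wt for wt, rule in rules if rule(world)))

safe   = {"delete_data": False, "has_approval": False, "external_send": False}
unsafe = {"delete_data": True,  "has_approval": False, "external_send": True}

print(score(safe) > score(unsafe))  # True
```

The attraction for finance is that the weighted rules are inspectable and fixed at inference time, unlike a guardrail LLM whose judgment can itself be prompted astray.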
Atlas (JMLR 2023) achieves 42.4% accuracy on Natural Questions with only 64 training examples — beating PaLM 540B by 3 points using 11B parameters — by jointly pre-training a Contriever-based dense retriever with a T5 Fusion-in-Decoder reader. Analysis covers retrieval accuracy limits, 587GB index infrastructure costs, and implications for Beancount ledger QA systems.
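The dense-retrieval step Atlas builds on reduces to scoring passages by inner product with the query embedding and keeping the top k for the reader. A toy sketch with hypothetical two-dimensional embeddings; Atlas uses Contriever vectors and a FAISS-style index at the scale behind the 587GB figure.

```python
# Toy dense retrieval: rank passages by dot product with the query
# embedding and return the indices of the top-k matches.

def top_k(query_vec, passage_vecs, k=2):
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    scored = sorted(enumerate(passage_vecs),
                    key=lambda p: dot(query_vec, p[1]), reverse=True)
    return [i for i, _ in scored[:k]]

q = [1.0, 0.0]
passages = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]]
print(top_k(q, passages))  # [0, 2]
```

The retrieved passages are then each concatenated with the question and encoded separately, with the decoder attending over all of them jointly (the Fusion-in-Decoder scheme).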
GuardAgent (ICML 2025) places a separate LLM agent between a target agent and its environment, verifying every proposed action by generating and running Python code — achieving 98.7% policy enforcement accuracy while preserving 100% task completion, versus 81% accuracy and 29–71% task failure for prompt-embedded safety rules.
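The interposition pattern is simple to sketch: every proposed action must pass programmatic checks before it reaches the environment. In GuardAgent the checks are Python code generated by the guard LLM from a policy; here they are hand-written predicates, and the action schema and $1000 limit are hypothetical.

```python
# Sketch of the guardrail-as-interposition pattern: an action executes
# only if every check passes; otherwise it is denied with a reason.

def guard(action, checks):
    for check in checks:
        ok, reason = check(action)
        if not ok:
            return f"DENIED: {reason}"
    return f"ALLOWED: {action['name']}"

def no_large_transfers(action):
    if action["name"] == "transfer" and action.get("amount", 0) > 1000:
        return False, "transfer exceeds $1000 policy limit"
    return True, ""

print(guard({"name": "transfer", "amount": 5000}, [no_large_transfers]))
print(guard({"name": "transfer", "amount": 50}, [no_large_transfers]))
```

Running the check as code, rather than embedding the rule in the target agent's prompt, is what separates the 98.7% enforcement figure from the 81% prompt-embedded baseline: the target agent cannot talk its way past a predicate.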
A close reading of Du et al.'s ICML 2024 multiagent debate paper — which reports 14.8-point accuracy gains on arithmetic — alongside 2025 rebuttals showing equal-budget single agents match debate performance, and an analysis of why Collective Delusion (65% of debate failures) poses specific risks for AI-assisted ledger commits.