InvestorBench (ACL 2025) tests 13 LLM backbones on backtested stock, crypto, and ETF trading using cumulative return (CR) and Sharpe ratio — not QA accuracy. Qwen2.5-72B tops the stock leaderboard at 46.15% CR; finance-tuned models backfire on equities. Model size predicts performance more reliably than domain fine-tuning.
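Both headline metrics are easy to compute from a daily return series. A minimal sketch — the 252-trading-day annualization and zero risk-free rate are common conventions assumed here, not InvestorBench's documented protocol:

```python
import math

def cumulative_return(daily_returns):
    """Compound a series of simple daily returns into one total return."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1.0 + r
    return growth - 1.0

def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return over its volatility."""
    excess = [r - risk_free_daily for r in daily_returns]
    n = len(excess)
    mean = sum(excess) / n
    var = sum((x - mean) ** 2 for x in excess) / (n - 1)  # sample variance
    return (mean / math.sqrt(var)) * math.sqrt(periods_per_year)
```

Two 10% days compound to a 21% cumulative return, not 20% — which is why CR and Sharpe can rank the same strategies differently.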
StructRAG (ICLR 2025) routes each query to a task-appropriate structure type — table, graph, catalogue, algorithm, or chunk — before reasoning, scoring 28 points higher than GraphRAG on the Loong benchmark while running 22× faster, with the DPO-trained router alone accounting for a 15-point accuracy gain.
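StructRAG's router is a DPO-trained LLM; a keyword heuristic can stand in for it to make the routing interface concrete. The five structure names follow the paper, but the rules below are illustrative assumptions, not the learned policy:

```python
STRUCTURES = ("table", "graph", "catalogue", "algorithm", "chunk")

def route(query: str) -> str:
    """Pick a structure type for a query. StructRAG learns this mapping
    with a DPO-trained LLM router; these keyword rules are a toy proxy."""
    q = query.lower()
    if any(k in q for k in ("compare", "versus", "per quarter", "statistic")):
        return "table"       # statistical comparison -> tabular structure
    if any(k in q for k in ("relationship", "who owns", "connected")):
        return "graph"       # relational, multi-hop -> graph structure
    if any(k in q for k in ("list all", "enumerate", "summary of each")):
        return "catalogue"   # long-document coverage -> catalogue
    if any(k in q for k in ("step", "procedure", "how to")):
        return "algorithm"   # procedural -> algorithm structure
    return "chunk"           # default: plain retrieval chunks
```

The point of the design is that structurization happens once per query, before reasoning, so the downstream reader always sees evidence in the form the task needs.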
A 2026 Stanford preprint equalizes thinking-token budgets across five multi-agent architectures and finds single-agent LLMs match or beat multi-agent systems on multi-hop reasoning — with theoretical grounding in the Data Processing Inequality and implications for finance AI agent design.
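The Data Processing Inequality argument can be stated compactly (this is the standard information-theoretic fact, not necessarily the preprint's exact notation): if each agent's message depends on the input $X$ only through the previous agent's message, the pipeline forms a Markov chain and

```latex
% Each hand-off depends on X only through the previous message:
X \to M_1 \to M_2 \to \cdots \to M_k
\quad\Longrightarrow\quad
I(X; M_k) \,\le\, I(X; M_{k-1}) \,\le\, \cdots \,\le\, I(X; M_1)
```

so every hand-off can only preserve or lose information about the input, while a single agent conditions on $X$ directly and faces no such bottleneck.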
M3MAD-Bench stress-tests Multi-Agent Debate across 9 models, 5 domains, and vision-language settings, finding that Collective Delusion causes 65% of failures, adversarial debate cuts accuracy by up to 12.8%, and Self-Consistency typically matches debate accuracy at lower token cost.
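Self-Consistency, the cheaper baseline that typically matches debate, is just majority voting over independent samples. A minimal sketch — the `sample_answer` callable standing in for an LLM call is an assumption:

```python
from collections import Counter

def self_consistency(sample_answer, k=5):
    """Draw k independent answers from the model and return the mode.
    Ties break toward the first-seen answer (Counter insertion order)."""
    answers = [sample_answer() for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```

Unlike debate, the k samples never see each other's outputs, so one confidently wrong agent cannot drag the rest toward a shared error — which is exactly the Collective Delusion failure mode.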
AGrail (ACL 2025) introduces a two-LLM cooperative guardrail that adapts its safety checks via test-time adaptation, achieving 0% prompt-injection attack success and 95.6% benign-action preservation on Safe-OS — versus GuardAgent and LLaMA-Guard, which block up to 49.2% of legitimate actions.
ShieldAgent (ICML 2025) replaces LLM-based guardrails with probabilistic rule circuits built on Markov Logic Networks, achieving 90.4% accuracy on agent attacks with 64.7% fewer API calls. The post examines what this means for verifiable safety in financial AI systems.
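A Markov Logic Network attaches a weight to each rule and scores a world in proportion to the exponentiated weighted count of satisfied rules. A toy propositional sketch — the rules, weights, and action schema are invented, and normalizing against an empty baseline collapses the two-class case to a logistic function, far simpler than ShieldAgent's verified rule circuits:

```python
import math

# Each rule: (weight, predicate over an action dict). More satisfied
# weight -> higher probability the action is safe.
RULES = [
    (2.0, lambda a: a["authenticated"]),               # actor is logged in
    (1.5, lambda a: a["amount"] <= a["limit"]),        # within transfer limit
    (3.0, lambda a: not a["modifies_closed_period"]),  # no closed-books edits
]

def safety_probability(action):
    """Log-linear score: P(safe) proportional to exp(sum of weights of
    satisfied rules), normalized against the satisfy-nothing baseline."""
    score = sum(w for w, rule in RULES if rule(action))
    return math.exp(score) / (math.exp(score) + 1.0)
```

The appeal over an LLM judge is that the score is a deterministic function of explicit, auditable rules — the probabilistic machinery handles rule conflicts rather than free-form generation.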
Atlas (JMLR 2023) achieves 42.4% accuracy on Natural Questions with only 64 training examples—beating PaLM 540B by 3 points using 11B parameters—by jointly pre-training a Contriever-based dense retriever with a T5 Fusion-in-Decoder reader. Analysis covers retrieval accuracy limits, 587GB index infrastructure costs, and implications for Beancount ledger QA systems.
Izacard and Grave's FiD architecture independently encodes retrieved passages then fuses them in the decoder, outperforming RAG-Sequence by 4–11 points on NQ and TriviaQA. This post examines the design and its implications for Beancount ledger QA, where multi-entry synthesis across transactions is the norm.
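FiD's core move: encode each (question, passage) pair independently, then let the decoder cross-attend over the concatenation of all encoder outputs, so encoding cost grows linearly in passages and only the decoder sees everything at once. A shape-level sketch with a toy stand-in encoder (real FiD uses T5; the hash-seeded "encoder" below is purely illustrative):

```python
import numpy as np

d_model, seq_len, n_passages = 16, 8, 4

def encode(question: str, passage: str) -> np.ndarray:
    """Stand-in for a T5 encoder: returns (seq_len, d_model) hidden states
    for one 'question: ... context: ...' input. Toy random features."""
    seed = abs(hash((question, passage))) % (2**32)
    return np.random.default_rng(seed).standard_normal((seq_len, d_model))

def fid_fuse(question, passages):
    """Encode passages independently, then concatenate along the sequence
    axis; the decoder cross-attends over this fused memory."""
    states = [encode(question, p) for p in passages]
    return np.concatenate(states, axis=0)  # (n_passages * seq_len, d_model)

memory = fid_fuse("who wrote Hamlet?", [f"passage {i}" for i in range(n_passages)])
```

Because passages never attend to each other in the encoder, adding a passage never changes how the others are encoded — the synthesis across entries happens entirely in the decoder, which is the property that matters for multi-transaction ledger questions.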
GuardAgent (ICML 2025) places a separate LLM agent between a target agent and its environment, verifying every proposed action by generating and running Python code — achieving 98.7% policy enforcement accuracy while preserving 100% task completion, versus 81% accuracy and 29–71% task failure for prompt-embedded safety rules.
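The pattern, reduced to its skeleton: the guard LLM emits a Python predicate for the policy, the harness executes it against each proposed action, and only passing actions reach the environment. A sketch in which the "generated" code is a fixed string — in GuardAgent it comes from the guard LLM, and the policy and action schema here are invented:

```python
# In GuardAgent this source is generated by the guard LLM from the
# natural-language policy; it is hard-coded here for illustration.
GENERATED_CHECK = """
def check(action):
    # Policy: reads are always allowed; writes only under the amount limit.
    if action["type"] == "read":
        return True
    return action["type"] == "write" and action["amount"] < 1000
"""

def verify(action: dict) -> bool:
    """Execute the generated checker in a fresh namespace and run it.
    (A production harness would sandbox this execution.)"""
    namespace: dict = {}
    exec(GENERATED_CHECK, namespace)
    return bool(namespace["check"](action))
```

Running the rule as code rather than re-asking an LLM is what makes enforcement deterministic: the same action against the same policy always gets the same verdict.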
A close reading of Du et al.'s ICML 2024 multiagent debate paper — which reports 14.8-point accuracy gains on arithmetic — alongside 2025 rebuttals showing equal-budget single agents match debate performance, and an analysis of why Collective Delusion (65% of debate failures) poses specific risks for AI-assisted ledger commits.