FinRAGBench-V (EMNLP 2025) is the first large-scale benchmark for multimodal RAG with visual citations in finance, covering 112K+ document pages and 1,394 human-annotated QA pairs. Top models achieve only 20–61% block-level citation recall, and multimodal retrieval outperforms text-only by nearly 50 percentage points.
WildToolBench (ICLR 2026) evaluates 57 LLMs on 1,024 tasks drawn from real user behavior — no model exceeds 15% session accuracy, with compositional orchestration, hidden intent, and instruction transitions as the three sharpest failure modes.
A systematic survey of LLM confidence estimation and calibration methods—white-box logit approaches, consistency-based SelfCheckGPT, and semantic entropy—reveals that verbalized confidence scores from GPT-4 achieve only ~62.7% AUROC, barely above chance, with direct implications for deploying uncertainty-aware agents in finance and accounting.
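The ~62.7% AUROC figure is easier to interpret with a small self-contained implementation (the function and example numbers below are illustrative, not from the survey): AUROC is the probability that a randomly chosen correct answer received a higher confidence score than a randomly chosen incorrect one, so 0.5 is chance.

```python
def auroc(scores, labels):
    """AUROC as a rank statistic: P(confidence of a correct answer >
    confidence of an incorrect answer), counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfectly separating confidences score 1.0; chance is 0.5.
print(auroc([0.9, 0.8, 0.3, 0.7], [1, 1, 0, 0]))  # → 1.0
```

On this scale, 0.627 means verbalized confidence orders a correct/incorrect pair properly only about 63% of the time — weak ground for gating agent actions.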
JSONSchemaBench tests 9,558 real-world JSON schemas against six constrained decoding frameworks and finds that schema complexity causes coverage to collapse from 86% on simple schemas to 3% on complex ones, with XGrammar silently emitting 38 non-compliant outputs and no framework covering all 45 JSON Schema feature categories.
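To make "coverage collapse" concrete, here is a toy validator for a tiny subset of JSON Schema (`type`, `required`, nested `properties`) — an illustrative sketch, not any benchmarked framework's code. Real schemas in the benchmark exercise far more of the spec's 45 feature categories, which is exactly where frameworks break down.

```python
def validate(instance, schema):
    """Check a parsed JSON value against a toy subset of JSON Schema:
    'type', 'required', and nested 'properties'. Returns True/False."""
    type_map = {"object": dict, "array": list, "string": str,
                "number": (int, float), "integer": int, "boolean": bool}
    t = schema.get("type")
    if t and not isinstance(instance, type_map[t]):
        return False
    if isinstance(instance, dict):
        # Every required key must be present.
        for key in schema.get("required", []):
            if key not in instance:
                return False
        # Present keys must recursively satisfy their subschemas.
        for key, sub in schema.get("properties", {}).items():
            if key in instance and not validate(instance[key], sub):
                return False
    return True

schema = {"type": "object", "required": ["ticker"],
          "properties": {"ticker": {"type": "string"}}}
print(validate({"ticker": "AAPL"}, schema))  # → True
print(validate({"ticker": 42}, schema))      # → False
```

Even this stub is a useful safety net: XGrammar's 38 silent non-compliant outputs would be caught by re-validating generated JSON before use.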
FinMCP-Bench evaluates six LLMs on 613 real-world financial tool-use tasks backed by 65 MCP servers — the best model scores 3.08% exact match on multi-turn tasks, revealing a 20× performance collapse from single-tool to multi-turn scenarios.
FinTrace benchmarks 13 LLMs on 800 expert-annotated financial task trajectories across 9 metrics, finding that frontier models achieve strong tool selection (F1 ~0.9) but score only 3.23/5 on information utilization — the step where agents reason over what tools return.
FinToolBench pairs 760 live financial API tools with 295 executable queries to test LLM agents on real financial tasks, finding that GPT-4o's conservative 22.7% tool-call rate yields higher answer quality (CSS 0.670) than Qwen3-8B's aggressive 87.1% TIR, while intent mismatch exceeds 50% across all tested models.
OmniEval (EMNLP 2025) benchmarks RAG systems across 5 task types × 16 financial topics using 11.4k auto-generated test cases. The best systems achieve only 36% numerical accuracy — concrete evidence that RAG pipelines need validation layers before writing to structured financial ledgers.
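One shape such a validation layer could take (a hypothetical sketch, not OmniEval's tooling): before posting a RAG-extracted figure to a ledger, require that it matches an amount actually present in the cited source passage, within a small relative tolerance.

```python
import re

def extract_amounts(text):
    """Pull numeric amounts (e.g. '1,234.56') out of source text."""
    return {float(m.replace(",", "")) for m in re.findall(r"\d[\d,]*\.?\d*", text)}

def safe_to_post(value, source_text, tol=0.005):
    """Allow a RAG-extracted figure into the ledger only if it appears,
    within a relative tolerance, in the cited source passage."""
    amounts = extract_amounts(source_text)
    return any(abs(value - a) <= tol * max(abs(a), 1.0) for a in amounts)

src = "Q3 revenue was 1,234.56 million, up from 1,100.00 million."
print(safe_to_post(1234.56, src))  # → True
print(safe_to_post(1534.56, src))  # → False: not in the cited source
```

With only 36% numerical accuracy from the best systems, a cheap grounding check like this rejects most hallucinated figures before they contaminate structured records.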
A critical reading of Xu and Ding's NAACL 2025 survey on LLM-based anomaly and OOD detection: the detection-vs-generation taxonomy holds up, but near-total absence of tabular coverage means financial AI practitioners must synthesize insights from vision models themselves.
A training-free, inference-time calibration method subtracts positional bias from LLM attention weights, recovering up to 15 percentage points of RAG accuracy when retrieved documents are buried mid-context — plus what that means for finance-specific agent pipelines.
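A schematic of the calibration idea, under stated assumptions (the bias profile and all numbers below are illustrative, and the paper's actual bias-estimation procedure may differ): subtract a per-position bias from a row of attention weights, clip at zero, and renormalize, which redistributes mass toward mid-context positions.

```python
import numpy as np

def calibrate_attention(attn, bias, eps=1e-9):
    """Subtract an estimated positional-bias profile from a 1-D row of
    attention weights, clip at zero, and renormalize to sum to 1."""
    adjusted = np.clip(attn - bias, 0.0, None)
    total = adjusted.sum()
    return adjusted / total if total > eps else attn

# Hypothetical bias profile: models over-attend to the start and end
# of the context ("lost in the middle") regardless of content.
bias = np.array([0.30, 0.10, 0.05, 0.10, 0.25])
attn = np.array([0.35, 0.15, 0.20, 0.12, 0.18])
print(calibrate_attention(attn, bias))
```

After calibration, the mid-context position (index 2) becomes the strongest — the mechanism by which a buried-but-relevant retrieved document regains influence.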