FinRAGBench-V (EMNLP 2025) is the first large-scale benchmark for multimodal RAG with visual citations in finance, covering 112K+ document pages and 1,394 human-annotated QA pairs. Top models achieve only 20–61% block-level citation recall, and multimodal retrieval outperforms text-only by nearly 50 percentage points.
WildToolBench (ICLR 2026) evaluates 57 LLMs on 1,024 tasks drawn from real user behavior — no model exceeds 15% session accuracy, with compositional orchestration, hidden intent, and instruction transitions as the three sharpest failure modes.
A systematic survey of LLM confidence estimation and calibration methods—white-box logit approaches, consistency-based SelfCheckGPT, and semantic entropy—reveals that verbalized confidence scores from GPT-4 achieve only ~62.7% AUROC, modestly above the 50% chance level, with direct implications for deploying uncertainty-aware agents in finance and accounting.
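AUROC here measures how often a correct answer receives higher verbalized confidence than an incorrect one (0.5 is chance). As a minimal illustration — not the survey's evaluation code — it can be computed from (confidence, correctness) pairs with the rank-based Mann-Whitney formula:

```python
def auroc(scores, labels):
    """AUROC via the pairwise (Mann-Whitney U) formula.

    scores: model-verbalized confidences in [0, 1]
    labels: 1 if the answer was correct, else 0
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both correct and incorrect examples")
    # Fraction of (correct, incorrect) pairs where the correct answer
    # got the higher confidence; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A scorer whose confidence perfectly separates correct from incorrect
# answers reaches 1.0; constant confidence gives exactly 0.5.
print(auroc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # 1.0
```

On this scale, ~0.627 means roughly 63 of every 100 correct/incorrect pairs are ranked the right way — weak separation for gating agent actions.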
FinToolBench pairs 760 live financial API tools with 295 executable queries to test LLM agents on real financial tasks — finding that GPT-4o's conservative 22.7% tool-call rate yields higher answer quality (CSS 0.670) than Qwen3-8B's aggressive 87.1% TIR, while intent mismatch exceeds 50% across all tested models.
OmniEval (EMNLP 2025) benchmarks RAG systems across 5 task types × 16 financial topics using 11.4k auto-generated test cases. The best systems achieve only 36% numerical accuracy — concrete evidence that RAG pipelines need validation layers before writing to structured financial ledgers.
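A validation layer in this setting can be as simple as an invariant check that gates RAG output before it is committed. The sketch below is hypothetical — the entry schema (`debits`/`credits` lists) and tolerance are assumptions, not part of OmniEval:

```python
def validate_before_post(entry):
    """Gate a RAG-extracted journal entry before it touches the ledger:
    every amount must parse as a number, and total debits must equal
    total credits. Hypothetical schema, illustrative only."""
    try:
        debits = [float(x) for x in entry["debits"]]
        credits = [float(x) for x in entry["credits"]]
    except (KeyError, TypeError, ValueError):
        return False  # malformed extraction: reject, never post
    # Double-entry invariant, with a small tolerance for rounding.
    return abs(sum(debits) - sum(credits)) < 0.01

print(validate_before_post({"debits": ["120.00"],
                            "credits": ["100.00", "20.00"]}))  # True
```

With only ~36% numerical accuracy from the best systems, a rejected entry that falls back to human review is far cheaper than a silently corrupted ledger.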
A critical reading of Xu and Ding's NAACL 2025 survey on LLM-based anomaly and OOD detection: the detection-vs-generation taxonomy holds up, but near-total absence of tabular coverage means financial AI practitioners must synthesize insights from vision models themselves.
A training-free, inference-time calibration that subtracts positional bias from LLM attention weights recovers up to 15 percentage points of RAG accuracy when retrieved documents are buried mid-context — plus what the technique means for finance-specific agent pipelines.
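The general shape of the idea — not the paper's exact procedure — can be sketched in a few lines: estimate a per-position attention prior offline (e.g. by averaging attention over content-free dummy contexts, an assumption here), subtract it from the observed weights, and renormalize:

```python
def calibrate_attention(attn, bias):
    """Subtract an estimated positional bias from attention weights
    and renormalize so they still sum to 1.

    attn: observed attention over context positions (sums to 1)
    bias: per-position prior estimated offline on dummy inputs
          (assumed estimation step, for illustration only)
    """
    debiased = [max(a - b, 0.0) for a, b in zip(attn, bias)]
    total = sum(debiased)
    if total == 0.0:
        return list(attn)  # degenerate case: fall back to raw weights
    return [d / total for d in debiased]

# Raw attention is U-shaped (edges dominate); after removing the
# positional prior, the mid-context document's share increases.
raw  = [0.40, 0.15, 0.10, 0.35]
bias = [0.30, 0.10, 0.05, 0.25]
print(calibrate_attention(raw, bias))
```

The appeal for finance pipelines is that this touches only inference-time weights: no retraining, and no change to the retriever.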
Fin-RATE benchmarks 17 LLMs on 7,500 expert-curated QA pairs from 2,472 SEC filings, revealing an 18.60% accuracy collapse under longitudinal tracking and a 54-point drop for finance-specialized Fin-R1 on cross-entity tasks — with the retrieval pipeline, not the backbone model, as the binding bottleneck.
FinDER benchmarks RAG on 5,703 real hedge fund analyst queries against S&P 500 10-K filings; E5-Mistral achieves only 25.95% context recall, and abbreviation-heavy queries cost 8.2 precision points — evidence that query normalization, not better embeddings, is the first fix for finance AI pipelines.
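Query normalization of the kind implied here can start as simple token-level expansion of finance abbreviations before embedding, so the query wording matches the long forms used in 10-K filings. The abbreviation map below is hypothetical — FinDER's own normalization is not reproduced:

```python
# Hypothetical abbreviation map, for illustration only.
FIN_ABBREVS = {
    "fcf": "free cash flow",
    "capex": "capital expenditures",
    "yoy": "year over year",
    "ebitda": ("earnings before interest, taxes, "
               "depreciation, and amortization"),
}

def normalize_query(query: str) -> str:
    """Expand finance abbreviations token-by-token before retrieval,
    so analyst shorthand matches filing language at embedding time."""
    tokens = query.lower().split()
    return " ".join(FIN_ABBREVS.get(t, t) for t in tokens)

print(normalize_query("AAPL FCF YoY trend"))
# aapl free cash flow year over year trend
```

A real pipeline would also handle punctuation and multi-word shorthand, but even this level of expansion targets exactly the abbreviation-heavy queries where the benchmark measures the 8.2-point precision loss.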
The TACL 2024 paper by Liu et al. shows LLMs perform up to 20 points worse on information buried in the middle of long contexts — a U-shaped degradation affecting every tested model including Claude-1.3-100K — with concrete implications for how RAG pipelines should order retrieved passages in finance and accounting applications.
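One common mitigation (not the paper's own method) follows directly from the U-shape: interleave the retriever's ranking so the strongest passages sit at the start and end of the prompt and the weakest land in the middle:

```python
def order_for_long_context(passages):
    """Reorder retriever output (best-first) so top-ranked passages
    occupy the start and end of the prompt, pushing the weakest into
    the middle — countering U-shaped 'lost in the middle' degradation."""
    front, back = [], []
    for i, passage in enumerate(passages):
        # Alternate: even ranks fill the front, odd ranks the back.
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

print(order_for_long_context(["p1", "p2", "p3", "p4", "p5"]))
# ['p1', 'p3', 'p5', 'p4', 'p2']
```

Here the top two passages end up at the two positions the paper finds most reliable, which matters in finance RAG where the decisive filing excerpt is often rank 1 or 2.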