2026
- April 15 - FinBen: Benchmarking LLMs Across 36 Financial Tasks — Implications for Accounting AI
- April 16 - Toolformer: Self-Supervised Tool Use and Its Limits for Finance AI
- April 17 - ReAct: Synergizing Reasoning and Acting in Language Models
- April 18 - FinMaster Benchmark: Why LLMs Score 96% on Financial Literacy but 3% on Statement Generation
- April 19 - PHANTOM (NeurIPS 2025): Measuring LLM Hallucination Detection in Financial Documents
- April 20 - Chain-of-Thought Prompting: Precision-Recall Trade-offs for Finance AI
- April 21 - Constitutional AI for Accounting Agents: RLAIF, Policy Rules, and Goodharting Risks
- April 22 - Can LLMs Reason Over Tabular Data? What Four Benchmarks Tell Us About Finance AI
- April 23 - PAL: Program-Aided Language Models for Reliable Financial Arithmetic
- April 24 - Self-Consistency: Majority Voting Improves Chain-of-Thought Accuracy
- April 25 - Reflexion: Language Agents That Learn from Mistakes Without Retraining
- April 26 - CRITIC: Why LLM Self-Correction Requires External Tool Feedback
- April 27 - Tree of Thoughts: Deliberate Problem Solving with LLM Search
- April 28 - LLMs Cannot Self-Correct Reasoning Yet — ICLR 2024 Findings and Finance AI Implications
- April 29 - CodeAct: Why Executable Python Code Makes LLM Agents 20% More Accurate
- April 30 - SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- May 1 - SWE-agent: How Interface Design Unlocks Automated Software Engineering
- May 2 - MemGPT: Virtual Context Management for LLM Agents
- May 3 - Gorilla: How Retrieval-Aware Training Reduces LLM API Hallucinations from 78% to 11%
- May 4 - AutoGen: Multi-Agent Conversation Frameworks for Finance AI
- May 5 - BloombergGPT and the Limits of Domain-Specific LLMs in Finance
- May 6 - AgentBench: Evaluating LLMs as Agents — Implications for Finance AI Reliability
- May 7 - HippoRAG: Neurobiologically Inspired Long-Term Memory for LLMs
- May 8 - Voyager: Skill Libraries as the Foundation for Lifelong AI Agent Learning
- May 9 - Self-RAG: Adaptive Retrieval and Self-Critique for LLMs
- May 10 - LATS: Language Agent Tree Search — Unifying Reasoning, Acting, and Planning in One Framework
- May 11 - DSPy: Replacing Brittle Prompt Engineering with Compiled LLM Pipelines
- May 12 - FinanceBench: Why Vector-Store RAG Fails on Real Financial Documents
- May 13 - FinQA: The Benchmark Measuring AI Numerical Reasoning on Financial Reports
- May 14 - TAT-QA: Hybrid Table-Text QA Benchmark for Financial Annual Report Reasoning
- May 15 - ConvFinQA: Multi-Turn Financial QA and the 21-Point Gap Between Models and Human Experts
- May 16 - MultiHiertt: Benchmarking Numerical Reasoning Over Multi-Hierarchical Financial Tables
- May 17 - Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- May 18 - FLARE: Active Retrieval Augmented Generation
- May 19 - IRCoT: Interleaving Retrieval with Chain-of-Thought for Multi-Step QA
- May 20 - Fine-Tuning vs. RAG: Why Retrieval Wins for Injecting New Knowledge into LLMs
- May 21 - TAT-LLM: Fine-Tuned LLaMA 2 for Discrete Reasoning over Financial Tables and Text
- May 22 - AuditCopilot: LLMs for Fraud Detection in Double-Entry Bookkeeping
- May 23 - LLMs Are Not Useful for Time Series Forecasting: What NeurIPS 2024 Means for Finance AI
- May 24 - Multi-Agent LLM Debate: Real Accuracy Gains, Uncontrolled Compute, and Collective Delusion
- May 25 - GuardAgent: Deterministic Safety Enforcement for LLM Agents via Code Execution
- May 26 - Fusion-in-Decoder: How Multi-Passage Retrieval Improves Generative QA
- May 27 - Atlas: Joint Retriever-Reader Pre-Training Beats 540B-Parameter LLMs with 11B Parameters
- May 28 - ShieldAgent: Verifiable Safety Policy Reasoning for LLM Agents
- May 29 - AGrail: Adaptive Safety Guardrails for LLM Agents That Learn Across Tasks
- May 30 - M3MAD-Bench: Are Multi-Agent Debates Really Effective Across Domains and Modalities?
- May 31 - Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
- June 1 - StructRAG (ICLR 2025): Picking the Right Document Structure Beats GraphRAG by 28 Points
- June 2 - InvestorBench: Benchmarking LLM Agents on Financial Trading Decisions
- June 3 - FinAuditing: LLMs Score Under 14% on Real SEC XBRL Auditing Tasks
- June 4 - GraphRAG: From Local to Global Query-Focused Summarization
- June 5 - Verifiably Safe Tool Use for LLM Agents: STPA Meets MCP
- June 6 - BIRD Benchmark: The Real-Database Gap in LLM Text-to-SQL
- June 7 - DIN-SQL: Decomposed In-Context Learning for Text-to-SQL
- June 8 - MAC-SQL: Multi-Agent Collaborative Text-to-SQL
- June 9 - TAPAS: Weakly Supervised Table QA Without SQL, and What It Means for Beancount
- June 10 - TableLlama: Can a 7B Open Model Match GPT-4 on Table Understanding?
- June 11 - Chain-of-Table: Evolving Tables in the LLM Reasoning Chain
- June 12 - τ-bench: Measuring AI Agent Reliability in Real-World Tool-Use Domains
- June 13 - WorkArena: How LLM Web Agents Perform on Real Enterprise Knowledge Work
- June 14 - WebArena: The 812-Task Benchmark That Measures What Web Agents Actually Can and Cannot Do
- June 15 - OSWorld: Desktop AI Agents Succeed on 12% of Tasks Where Humans Succeed on 72%
- June 16 - GAIA Benchmark: Measuring What Frontier AI Agents Can Actually Do
- June 17 - WorkArena++: The 93% Gap Between Human and AI Agent Performance on Compositional Enterprise Tasks
- June 18 - τ²-bench: Measuring the Cost of Dual-Control in Conversational AI Agents
- June 19 - TheAgentCompany: Benchmarking LLM Agents on Real-World Enterprise Tasks
- June 20 - DocFinQA: Long-Context Financial Reasoning on Full SEC Filings
- June 21 - Zero-Shot Anomaly Detection with LLMs: How GPT-4 Performs on Tabular Data
- June 22 - TableMaster: Adaptive Reasoning for Table Understanding with LLMs
- June 23 - LLMs Score 2.3% on Beancount DSL Generation: The LLMFinLiteracy Benchmark
- June 24 - AnoLLM: Fine-Tuning LLMs for Tabular Anomaly Detection in Financial Data
- June 25 - CausalTAD: Causal Column Ordering for LLM Tabular Anomaly Detection
- June 26 - AD-LLM Benchmark: GPT-4o Hits 0.93+ AUROC Zero-Shot for Text Anomaly Detection
- June 27 - Lost in the Middle: Position Bias in LLMs and Its Impact on Finance AI
- June 28 - FinDER: Real Analyst Queries Expose a 74% Recall Gap in Financial RAG
- June 29 - Fin-RATE: How LLMs Fail at Cross-Period and Cross-Entity Financial Analysis
- June 30 - OpenHands: Open Platform for AI Software Agents and What It Means for Finance Automation
- July 1 - Uncertainty-Aware Deferral for LLM Agents: When to Escalate from Small to Large Models
- July 2 - Found in the Middle: Calibrating Positional Attention Bias Improves Long-Context RAG
- July 3 - LLM Anomaly Detection Survey (NAACL 2025): Strong Taxonomy, Absent Tabular Coverage
- July 4 - OmniEval: Omnidirectional RAG Evaluation Benchmark for the Financial Domain
- July 5 - FinToolBench: Evaluating LLM Agents on Real-World Financial Tool Use
- July 6 - FinTrace: Trajectory-Level Evaluation of LLM Tool Calling for Financial Tasks
- July 7 - FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under MCP
- July 8 - JSONSchemaBench: Real-World Schema Complexity Breaks LLM Structured Output Guarantees
- July 9 - LLM Confidence and Calibration: A Survey of What the Research Actually Shows
- July 10 - WildToolBench: Why No LLM Exceeds 15% Session Accuracy in Real-World Tool Use
- July 11 - Can LLM Agents Be CFOs? EnterpriseArena's 132-Month Simulation Reveals a Wide Gap
- July 12 - FinRAGBench-V: Multimodal RAG with Visual Citations in the Financial Domain
