OmniEval: Omnidirectional RAG Evaluation Benchmark for the Financial Domain
Most RAG benchmarks in finance ask whether a system can retrieve and answer — full stop. OmniEval (EMNLP 2025, arXiv:2412.13018) from Shuting Wang et al. at RUC asks a harder question: does performance hold across the full matrix of task types and financial topics? I'm reading it now because it's the most structured attempt to map the shape of RAG failure in finance before we try to build reliable Beancount ledger agents on top of RAG pipelines.
The paper
OmniEval constructs a two-dimensional evaluation grid: five task classes (extractive QA, multi-hop reasoning, contrast QA, long-form QA, and conversational QA) crossed with 16 financial topics (stock markets, investment banking, funds, property insurance, and others). The result is a structured benchmark with 11.4k automatically generated test examples, 1.7k human-annotated examples, and a 362k-document retrieval corpus assembled from six Chinese financial data sources (BSCF-DB at 193k documents, FinGLM at 55k, BAAI-Fin at 48k, official web crawls, PDFs, and Wikipedia financial content). The benchmark also includes a fine-tuned LLM evaluator (Qwen2.5-7B-Instruct trained on 910 human-labeled instances) that scores generation quality across five dimensions: accuracy, hallucination, completeness, utilization, and numerical accuracy.
Key ideas
- The auto-generated test cases passed a human acceptance check at 87.47%, meaning roughly 1 in 8 generated instances was discarded — not a trivial noise rate for a benchmark.
- The best retriever tested (GTE-Qwen2-1.5B) achieved MAP 0.4370 and MRR 0.4491 on the auto-generated set; since MRR upper-bounds precision@1, even the strongest retriever places a relevant passage at rank 1 less than half the time.
- Generation accuracy (ACC) across all retriever-LLM combinations ranged from 0.3238 to 0.4476 — the best configuration gets fewer than half the questions right.
- Numerical accuracy (NAC) is the sharpest finding: 0.0659 to 0.3595. The best system gets financial numbers right about 36% of the time; the worst is near-zero.
- The fine-tuned evaluator reached 74.4% agreement with human annotation (κ = 0.6486), substantially outperforming prompting-only baselines at 55–71% — but still leaving one in four evaluations misaligned with human judgment.
- Multi-hop reasoning and conversational QA were consistently the hardest task classes.
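For readers less familiar with the retrieval metrics above, here is a minimal sketch of how MRR and MAP are typically computed over ranked retrieval results. The ranked lists and relevance labels below are toy data, not OmniEval's:

```python
# Minimal sketch of MRR and MAP over ranked retrieval results.
# Each query pairs a ranked list of document IDs with a set of relevant IDs.
# The data here is made up; OmniEval reports MRR 0.4491 / MAP 0.4370
# for its best retriever on the auto-generated set.

def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant document, or 0 if none is retrieved."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def average_precision(ranked, relevant):
    """Mean of precision@k at each rank k where a relevant doc appears."""
    hits, precisions = 0, []
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

queries = [
    (["d3", "d1", "d7"], {"d1"}),         # relevant doc at rank 2
    (["d2", "d5", "d4"], {"d9"}),         # relevant doc never retrieved
    (["d6", "d8", "d0"], {"d6", "d8"}),   # relevant docs at ranks 1 and 2
]

mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
map_ = sum(average_precision(r, rel) for r, rel in queries) / len(queries)
print(f"MRR={mrr:.4f}  MAP={map_:.4f}")
```

An MRR below 0.5, as in the paper, means that on average the first relevant passage sits below rank 2 across the query set.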
What holds up — and what doesn't
The matrix evaluation design is genuinely useful. Previous finance benchmarks (FinanceBench, FinQA, DocFinQA) treat evaluation as a single axis — usually answer accuracy — and miss the structural variation in how RAG fails. Knowing that a system scores well on extractive QA but poorly on multi-hop reasoning is actionable; knowing it averages some overall score is not. The OmniEval grid makes that variation visible, and the finding that performance is inconsistent across topics is exactly the kind of result practitioners need to see before deploying.
That said, there are real limits I want to be direct about. The corpus is overwhelmingly Chinese: five of six data sources are Chinese financial data (BSCF, FinGLM, BAAI-Fin), and the sixth is Chinese Wikipedia. The paper does not report results broken out by language — it reports only aggregate numbers. This makes every score in the paper suspect as a claim about financial RAG in general, as opposed to financial RAG over Chinese text with Chinese-specialized retrievers and LLMs (GTE-Qwen2-1.5B, Qwen2.5-72B, Yi-1.5-34B). English-language financial users cannot transfer these numbers directly.
The LLM evaluator is trained on 910 labeled instances. That is thin. The 74.4% human agreement at κ = 0.6486 is defensible as a starting point but means the eval framework itself introduces substantial noise. If the benchmark is used to compare systems that differ by a few percentage points, the evaluator variance will swamp the signal.
The automatic generation pipeline — GPT-4 produces test questions, humans filter at 87.47% acceptance — also raises a contamination question the paper does not address: GPT-4-generated questions may play to the strengths of GPT-4-class models in ways that systematically disadvantage older or smaller models.
Why this matters for finance AI
The numerical accuracy range is the number I keep coming back to: 0.0659–0.3595. If the best tested RAG system gets financial numbers right only 36% of the time in a benchmarked evaluation, any Beancount write-back agent built on top of a naive RAG pipeline is going to corrupt ledger data. Beancount's format is unforgiving: an incorrect amount, date, or account name produces either a parse error or a silent accounting error that can propagate across fiscal years. This benchmark gives us concrete evidence that RAG retrieval and LLM generation are not yet reliable enough for direct ledger write-back without a validation layer.
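One concrete shape for such a validation layer, as a sketch: `validate_entry` below is a hypothetical helper (not part of Beancount or its API) that enforces the two invariants most relevant to NAC-style digit errors, namely a well-formed transaction and postings that balance to zero.

```python
import re
from decimal import Decimal

# Sketch of a pre-write-back gate for LLM-proposed Beancount transactions.
# `validate_entry` is a hypothetical helper, not Beancount's own parser:
# it checks only a well-formed header/postings and zero-sum balancing,
# which catch most of the digit slips that NAC-style metrics measure.

POSTING = re.compile(
    r"^\s+(?P<account>[A-Z][A-Za-z0-9]*(?::[A-Z][A-Za-z0-9-]*)+)"
    r"(?:\s+(?P<amount>-?\d+(?:\.\d+)?)\s+(?P<ccy>[A-Z]{3}))?\s*$"
)
HEADER = re.compile(r'^\d{4}-\d{2}-\d{2}\s+[*!]\s+".*"')

def validate_entry(text):
    """Return (ok, reason). Single currency, at most one elided amount."""
    lines = [l for l in text.strip().splitlines() if l.strip()]
    if not lines or not HEADER.match(lines[0]):
        return False, "malformed transaction header"
    total, elided, currencies = Decimal(0), 0, set()
    for line in lines[1:]:
        m = POSTING.match(line)
        if not m:
            return False, f"malformed posting: {line.strip()}"
        if m["amount"] is None:
            elided += 1
        else:
            total += Decimal(m["amount"])
            currencies.add(m["ccy"])
    if elided > 1:
        return False, "more than one elided amount"
    if len(currencies) > 1:
        return False, "mixed currencies not handled by this sketch"
    if elided == 0 and total != 0:
        return False, f"postings do not balance (off by {total})"
    return True, "ok"

good = """\
2024-03-15 * "Coffee"
  Expenses:Food:Coffee   4.50 USD
  Assets:Checking       -4.50 USD
"""
bad = good.replace("-4.50", "-4.05")  # the kind of digit slip NAC measures
print(validate_entry(good))  # balanced entry passes
print(validate_entry(bad))   # off-by-0.45 entry is rejected
```

A real gate would hand accepted entries to Beancount's own loader for a second parse; the point of the sketch is that the zero-sum check rejects transposed digits that a surface-level format check would let through.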
The task-class structure also maps cleanly to Beancount use cases. Extractive QA corresponds to simple balance lookups. Multi-hop reasoning corresponds to questions like "what is my net income after tax across Q1–Q3?" Conversational QA corresponds to a user iteratively refining a reconciliation request across a session. OmniEval's finding that multi-hop and conversational tasks are hardest is exactly the bad news for the Beancount agent design: the easy cases are almost fine; the realistic cases are where the system falls apart.
What to read next
- ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems (arXiv:2311.09476, NAACL 2024) — the closest general-domain analog to OmniEval's evaluator fine-tuning approach; comparing ARES methodology to OmniEval's would clarify whether the LLM-evaluator design choices are principled or ad hoc.
- RAGEval: Scenario-Specific RAG Evaluation Dataset Generation Framework (ACL 2025, aclanthology.org/2025.acl-long.418) — automated scenario generation for RAG evaluation; extends the auto-generation methodology OmniEval uses and may address the contamination concern.
- FinRAGBench-V: A Benchmark for Multimodal RAG with Visual Citation in the Financial Domain (arXiv:2505.17471) — extends RAG evaluation to multimodal financial documents (tables, charts); relevant as Beancount users increasingly have receipt images and PDF statements alongside plain-text ledgers.
