FinRAGBench-V: Multimodal RAG with Visual Citations in the Financial Domain
Financial AI has been dominated by text-only RAG, but real financial documents are full of charts, tables, and figures that OCR cannot fully capture. FinRAGBench-V (EMNLP 2025) is the first large-scale benchmark to evaluate multimodal RAG with visual citations in the financial domain, and its results are a sobering reminder of how far production systems still have to go.
The paper
Zhao, Jin, Li, and Gao from Peking University introduce FinRAGBench-V, a bilingual benchmark constructed from real financial documents: research reports, financial statements, prospectuses, academic papers, magazines, and news articles. The retrieval corpus is substantial—60,780 Chinese pages and 51,219 English pages across roughly 1,100 documents per language—paired with 1,394 human-annotated QA pairs spanning seven question categories, including text inference, chart and table extraction, numerical calculation, time-sensitive queries, and multi-page reasoning. Beyond the dataset, the paper's central contribution is RGenCite, a baseline system that generates answers alongside pixel-level visual citations in the form of bounding-box coordinates marking the specific document regions that support each claim.
Key ideas
- Multimodal retrieval dominates text-only by a crushing margin: ColQwen2, a vision-language retriever built on page-image embeddings, achieves Recall@10 of 90.13% (Chinese) and 85.86% (English). The best text-based retrievers, BM25 and BGE-M3, top out at 42.71%. This gap is not a rounding error.
- Generation accuracy is low even for frontier models: GPT-4o on English reaches 43.41% accuracy (ROUGE 24.66); o4-mini on Chinese reaches 58.13% (ROUGE 38.55). These are top proprietary models with strong retrieval in place.
- Page-level citation works; block-level does not: Page-level recall sits at 75–93% for the best models. Block-level recall—knowing which specific table cell or chart region grounds a claim—drops to 20–61%. This is the key gap for auditability.
- Numerical reasoning and multi-page inference break models first: Questions requiring calculations across pages or temporal spans are where accuracy falls most steeply across all tested systems.
- Proprietary models substantially outperform open-source alternatives: The closed-API vs. open-source gap is larger here than on most NLP benchmarks, suggesting visual financial reasoning remains unsolved for open models.
- Auto-evaluation for citations is imperfect: The image-cropping citation evaluator achieves Pearson r = 0.68 with human judgments—reasonable but not reliable enough to trust fully without sampling.
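To make the headline retrieval numbers concrete, Recall@k simply asks: for what fraction of queries does at least one gold evidence page appear in the retriever's top k results? A minimal sketch, using hypothetical page IDs rather than the benchmark's actual corpus:

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of queries with at least one gold page in the top-k results.

    retrieved: list of ranked page-ID lists, one per query.
    relevant:  list of gold page-ID sets, one per query.
    """
    hits = sum(
        1
        for pages, gold in zip(retrieved, relevant)
        if any(p in gold for p in pages[:k])
    )
    return hits / len(relevant)

# Toy example (hypothetical page IDs): query 1 hits its gold page, query 2 misses.
retrieved = [["p12", "p7", "p3"], ["p5", "p9", "p1"]]
relevant = [{"p7"}, {"p2"}]
print(recall_at_k(retrieved, relevant, k=3))  # 0.5
```

At k=10 over tens of thousands of candidate pages, the ~47-point spread between ColQwen2 and the text-based retrievers on this metric is what the paper's retrieval claim rests on.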
What holds up — and what doesn't
The retrieval finding is the most credible result in the paper. A gap of nearly 50 percentage points between multimodal and text-only retrievers at 60k+ pages is too large to dismiss. When you OCR a financial document before indexing, you destroy structural layout signals—which column a number appears in, whether a figure caption modifies a table's interpretation—that turn out to matter enormously for retrieval.
The generation numbers are honest but hard to interpret in isolation. The authors do not ablate how much of the accuracy gap is attributable to retrieval errors versus generation failures. Given that Recall@10 is already 85.86% for English, a meaningful fraction of failures must be generation-side rather than retrieval-side. Knowing that breakdown would clarify whether the bottleneck is multimodal reasoning or something more fundamental about how MLLMs handle financial language.
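The missing ablation is straightforward to run if per-question retrieval results are logged: any wrong answer whose gold page was retrieved is a generation-side failure, and any wrong answer whose gold page was missed is (at least partly) retrieval-side. A sketch of that attribution, over a hypothetical per-question log:

```python
def attribute_failures(results):
    """Split wrong answers into retrieval-miss vs generation-side buckets.

    results: list of dicts with keys 'correct' (bool) and
    'gold_in_topk' (bool) -- hypothetical field names.
    """
    retrieval_miss, generation_failure = 0, 0
    for r in results:
        if r["correct"]:
            continue
        if r["gold_in_topk"]:
            generation_failure += 1  # evidence was retrieved; the model still failed
        else:
            retrieval_miss += 1      # the model never saw the gold page
    return {"retrieval_miss": retrieval_miss,
            "generation_failure": generation_failure}

# Hypothetical outcomes for three questions.
toy = [
    {"correct": False, "gold_in_topk": True},
    {"correct": False, "gold_in_topk": False},
    {"correct": True, "gold_in_topk": True},
]
print(attribute_failures(toy))  # {'retrieval_miss': 1, 'generation_failure': 1}
```

With English Recall@10 at 85.86% and accuracy at 43.41%, this breakdown would have to assign a large share of failures to the generation bucket.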
The evaluation set of 1,394 QA pairs is small for the scope of the benchmark. Split across seven categories and two languages, some slices have well under 200 examples. The statistical significance of category-level findings is left implicit. This is not unusual for a benchmark paper, but it does mean cherry-picked comparisons would be easy to construct.
The citation evaluation protocol is an interesting contribution, but Pearson r = 0.68 with human ratings is not strong enough to treat auto-evaluation as ground truth for block-level grounding. The authors acknowledge this; future work on better citation metrics is explicitly flagged.
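Since RGenCite emits citations as bounding boxes, one natural way to score a predicted block citation against a human-annotated gold region is intersection-over-union. This is a generic sketch of that idea, not the paper's image-cropping protocol; the 0.5 threshold and coordinate convention are assumptions:

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) pixel form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def block_citation_hit(pred_boxes, gold_boxes, thresh=0.5):
    """Every gold block must be covered by some predicted box above thresh."""
    return all(any(iou(p, g) >= thresh for p in pred_boxes) for g in gold_boxes)

# Partially overlapping boxes: intersection 25, union 175.
print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 3))  # 0.143
```

A metric like this is brittle at table-cell granularity, where gold regions are tiny and a few pixels of offset can flip a hit to a miss, which is consistent with the gap between the auto-evaluator and human judgments.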
Why this matters for finance AI
Beancount operates over plain-text ledger files, which makes text-only RAG defensible for querying past transactions. But the broader accounting task involves documents that are emphatically not plain text: bank statement PDFs, scanned invoices, receipt images, annual reports with embedded tables and charts. The moment a Beancount agent needs to reconcile a ledger entry against a source document—verify that a particular charge matches the invoice on file—it is doing exactly the task FinRAGBench-V benchmarks.
The block-level citation finding matters most for this use case. If an agent must justify a ledger entry by pointing to a specific line item in a PDF, and the best available system achieves only 20–61% block-level recall, that is not audit-ready. Any Beancount pipeline that touches scanned source documents needs human-in-the-loop review until this number improves substantially.
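In practice, human-in-the-loop review means gating each automated reconciliation on both an amount match and the citation's grounding quality. A minimal sketch of such a gate; all field names, the confidence floor, and the tolerance are illustrative assumptions, not part of any Beancount API:

```python
from dataclasses import dataclass

@dataclass
class Reconciliation:
    ledger_amount: float        # amount recorded in the Beancount ledger
    extracted_amount: float     # amount the RAG pipeline read from the document
    citation_confidence: float  # grounding score in [0, 1] (hypothetical)

def needs_review(r: Reconciliation, conf_floor=0.9, tol=0.005):
    """Route to a human unless amounts match AND the citation is well-grounded."""
    amounts_match = abs(r.ledger_amount - r.extracted_amount) <= tol
    return not (amounts_match and r.citation_confidence >= conf_floor)

# Amounts agree, but the citation grounding is weak -> escalate to a human.
print(needs_review(Reconciliation(42.00, 42.00, 0.55)))  # True
```

With block-level recall at 20–61%, a conservative gate like this will escalate often; that is the correct failure mode for an auditable pipeline.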
The retrieval modality gap also argues strongly against pure-text pipelines for document ingestion. A receipt image carries layout information—amount fields, vendor names, line-item positions—that OCR destroys. That layout information is precisely what distinguishes a line total from a tax amount, and FinRAGBench-V shows that multimodal retrievers exploit it in ways text retrievers cannot.
What to read next
- ColPali: Efficient Document Retrieval with Vision Language Models — the predecessor to ColQwen2 that established the visual page-embedding approach FinRAGBench-V's best retriever is built on [arXiv:2407.01449, ECCV 2024]
- M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding — tackles multi-document visual QA with a flexible framework that handles single and multi-hop visual reasoning across pages [arXiv:2411.04952]
- Benchmarking Temporal-Aware Multi-Modal RAG in Finance — a companion benchmark from 2025 evaluating time-sensitivity in financial multimodal RAG, directly complementary to FinRAGBench-V's time-sensitive question category [arXiv:2503.05185]
