MultiHiertt (ACL 2022) introduces 10,440 QA pairs from real financial reports averaging 3.89 hierarchical tables each; state-of-the-art models score 38% F1 against a human baseline of 87%, and questions spanning multiple tables cost a further 15 points, quantifying the retrieval gap finance AI must close.
FinanceBench evaluates 16 AI configurations on 10,231 questions drawn from real SEC filings; shared-vector-store RAG answers correctly only 19% of the time, and even GPT-4-Turbo given the oracle passage reaches just 85% accuracy, showing that numerical reasoning, not retrieval, is the binding constraint for enterprise finance AI.
FinMaster (arXiv:2505.13533) benchmarks o3-mini, Claude 3.7 Sonnet, and DeepSeek-V3 across 183 financial tasks; models score 96% on financial literacy but collapse to 3% on financial-statement generation, and multi-step consulting tasks lose 21 accuracy points to error propagation.