InvestorBench: Benchmarking LLM Agents on Financial Trading Decisions
Most finance AI benchmarks test whether LLMs can answer questions about financial data. InvestorBench asks a harder question: can an LLM agent make money? It's the first benchmark I've seen that runs 13 different backbone models through realistic, backtested trading tasks across stocks, crypto, and ETFs, measuring cumulative return and Sharpe ratio rather than QA accuracy. That shift from comprehension to decision-making is the right framing for Bean Labs.
The paper
InvestorBench (Li et al., arXiv:2412.18174, ACL 2025) introduces a benchmark and accompanying agent framework for evaluating LLMs on financial trading. The agent architecture is modular — a Brain (the LLM backbone), a Perception layer that converts market data into text, and a layered Memory system with three decay windows: 14 days for daily news, 90 days for quarterly reports, and 365 days for annual filings. At decision time, the agent retrieves across all three layers and reasons toward a buy/sell/hold action.
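To make the layered-memory idea concrete, here is a minimal sketch of decay-windowed retrieval. The layer names, the exponential decay scoring, and the per-layer top-k selection are my assumptions for illustration; the paper's exact retrieval mechanics may differ.

```python
import math
from dataclasses import dataclass

@dataclass
class MemoryItem:
    text: str
    age_days: float
    relevance: float  # similarity score from a retriever, in [0, 1]

# Window sizes mirror the paper's 14/90/365-day layers;
# the layer names themselves are my invention.
LAYER_WINDOWS = {"news": 14, "quarterly": 90, "annual": 365}

def score(item: MemoryItem, window_days: float) -> float:
    # Exponential recency decay scaled by the layer's window,
    # multiplied by retrieval relevance: fresh items in short-window
    # layers fade fast, annual context fades slowly.
    recency = math.exp(-item.age_days / window_days)
    return item.relevance * recency

def retrieve(layers: dict[str, list[MemoryItem]], k: int = 2) -> list[str]:
    # Take the top-k items from each layer independently so that
    # slow-moving annual context is not crowded out by fresh news.
    picked = []
    for name, items in layers.items():
        window = LAYER_WINDOWS[name]
        ranked = sorted(items, key=lambda it: score(it, window), reverse=True)
        picked.extend(it.text for it in ranked[:k])
    return picked
```

The key design choice is retrieving per layer rather than over a single pooled index, which guarantees each temporal scale contributes context to the buy/sell/hold decision.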
The benchmark covers three single-asset task families. Stock trading uses seven equities (MSFT, JNJ, TSLA, AAPL, etc.) tested from October 2020 through May 2021. Crypto covers Bitcoin and Ethereum from April through November 2023. ETF trading uses the NIFTY dataset from January through September 2020. Each task provides OHLCV data, news articles with sentiment labels, and SEC filings or equivalent. The primary metrics are cumulative return (CR) and Sharpe ratio (SR).
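For reference, the two headline metrics are straightforward to compute from a price or return series. The conventions below (252 trading periods per year, sample standard deviation, zero risk-free rate by default) are standard but assumed; the paper may use slightly different ones.

```python
import math

def cumulative_return(prices: list[float]) -> float:
    # CR over the test window: total growth, last price over first, minus 1.
    return prices[-1] / prices[0] - 1.0

def sharpe_ratio(daily_returns: list[float], rf_daily: float = 0.0,
                 periods_per_year: int = 252) -> float:
    # Annualized Sharpe: mean excess return over its volatility,
    # scaled by the square root of trading periods per year.
    excess = [r - rf_daily for r in daily_returns]
    mean = sum(excess) / len(excess)
    var = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)
```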
Key ideas
- The tiered memory design (14/90/365-day decay windows) mirrors how professional analysts actually treat information: daily price action, quarterly earnings, and annual strategic context carry different temporal weights.
- Model size is the strongest predictor of performance. Open-source models above 67B parameters match proprietary models on stock CR and SR, while smaller models trail significantly. Qwen2.5-72B tops the stock leaderboard at 46.15% CR and SR 1.276 against a buy-and-hold baseline of 34.10% CR / 0.732 SR.
- Domain-specific fine-tuning backfires on stocks. Palmyra-Fin-70B — a finance-pretrained model — averaged −0.45% CR and SR 0.031 on stock trading, worse than every general-purpose model tested. Palmyra-Fin-70B did well on ETFs (24.76% CR, 1.152 SR), which the authors attribute to ETF tasks requiring longer-horizon reasoning aligned with its training.
- Proprietary models (GPT-4, GPT-4o, GPT-o1-preview) averaged 36.14% CR and SR 0.82 on stocks, reliably above buy-and-hold but not dramatically so. Their bigger edge shows in crypto, where they hit 23.60% BTC CR vs. 21.82% for buy-and-hold while open-source models averaged 14.14%.
- The benchmark is open-source and includes evaluation tooling — a practically useful contribution given how hard it is to reproduce trading experiments.
What holds up — and what doesn't
The layered memory architecture is the most principled design choice in the paper, and the empirical finding that it outperforms pure similarity-based retrieval is plausible and useful. The size-versus-performance correlation is also a clean result.
The main weakness is that the test periods are short historical backtests, not live trading. The stock period (October 2020–May 2021) coincides with one of the most unusual bull markets on record: post-COVID stimulus, meme stock frenzy, and near-zero rates drove broad equity appreciation. Buy-and-hold earned 34.10% in about seven months on a seven-stock basket. Whether LLM agent improvements on top of that number reflect genuine alpha or just more aggressive position-taking in a rising market cannot be determined from the data given. Similarly, the ETF period spans the COVID crash and recovery — a regime so abnormal that any model that happened to go defensive in March 2020 would look prescient.
The Palmyra-Fin-70B anomaly — catastrophic on stocks, strong on ETFs — is not satisfactorily explained. If domain fine-tuning realigns a model toward longer time horizons, that should show up in the stock results too. The fact that it doesn't suggests the result may be noise in a short backtesting window rather than a principled finding.
There is also no comparison against traditional algorithmic baselines (momentum, mean-reversion, factor models). Using only buy-and-hold as the passive baseline sets a low bar. If a simple moving-average crossover beats buy-and-hold over these periods — which it often does in trending markets — the agent comparison looks much less impressive.
Finally, the benchmark tests only single-asset decisions. Real portfolio management requires correlated position sizing, rebalancing, and risk aggregation that single-asset tasks do not capture.
Why this matters for finance AI
The tiered memory architecture translates directly to Beancount. A ledger agent needs to reason at different temporal scales simultaneously: what happened in today's import session (shallow), what a quarter of transactions reveals about a budget (intermediate), and what multi-year patterns say about account health (deep). InvestorBench's 14/90/365-day layering provides a concrete design template worth borrowing, even if the trading context differs from bookkeeping.
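A first cut at borrowing that template for a ledger agent could be as simple as routing transactions into layers by age. The layer names ("session", "budget", "history") and the archive fallback are hypothetical Bean Labs constructs; only the window sizes come from InvestorBench.

```python
from datetime import date

# Window sizes mirror InvestorBench's 14/90/365-day scheme.
LAYERS = [("session", 14), ("budget", 90), ("history", 365)]

def layer_for(txn_date: date, today: date) -> str:
    # Assign a transaction to the shallowest layer whose window covers it;
    # anything older than a year becomes long-term account-health context.
    age = (today - txn_date).days
    for name, window in LAYERS:
        if age <= window:
            return name
    return "archive"
```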
The Palmyra-Fin-70B finding also carries a warning for Beancount fine-tuning efforts. A model trained extensively on financial text doesn't automatically make better agent decisions — the gap between financial language fluency and financial reasoning competence is real. If Bean Labs ever fine-tunes a model on Beancount syntax and accounting rules, the agent evaluation must test decision quality, not just output format.
The benchmark's absence of write-back safety evaluation is a clean gap for Bean Labs to fill. InvestorBench agents can only lose money; Beancount agents can corrupt a ledger. The evaluation framework needs an irreversibility dimension that trading benchmarks have no reason to include.
What to read next
- FinMem: A Performance-Enhanced LLM Trading Agent with Layered Memory and Character Design (Yu et al., arXiv:2311.07743) — the layered memory architecture that InvestorBench extends; reading the original design clarifies what InvestorBench actually adds.
- TradingAgents: Multi-Agents LLM Financial Trading Framework (OpenReview 2024) — explores debate-based multi-agent trading, a direct contrast to the single-agent result from last week's log.
- StockBench: Can LLM Agents Trade Stocks Profitably in Real-world Markets? (arXiv:2510.02209) — reportedly evaluates agents on forward-looking live market data rather than historical backtests; addresses the backtest-regime concern I raised here.
