
LLMs Are Not Useful for Time Series Forecasting: What NeurIPS 2024 Means for Finance AI

· 5 min read
Mike Thrift
Marketing Manager

This paper showed up on my reading list because it directly challenges the wave of LLM-based time series forecasting work from 2023–2024. As Bean Labs thinks about forecasting account balances and cash flows from Beancount ledgers, the question of whether to use general LLMs or purpose-built numerical models is not academic. Tan et al.'s NeurIPS 2024 Spotlight result is a bucket of cold water.

The paper


"Are Language Models Actually Useful for Time Series Forecasting?" by Mingtian Tan, Mike Merrill, Vinayak Gupta, Tim Althoff, and Thomas Hartvigsen (arXiv:2406.16964, NeurIPS 2024 Spotlight) ablates three popular LLM-based forecasting methods: OneFitsAll (GPT-2 with frozen attention and patching), Time-LLM (LLaMA with patch reprogramming), and CALF (GPT-2 with LoRA adapters and cross-modal alignment). The question is whether removing or replacing the LLM component degrades performance. Across 13 benchmarks, the answer is almost always no — and often the ablations are better.

Key ideas

  • Ablations outperform Time-LLM in 26/26 metric cases across 13 datasets, CALF in 22/26, and OneFitsAll in 19/26 — the LLM is a drag more often than it helps.
  • Time-LLM has 6,642M parameters and requires 3,003 training minutes on the Weather dataset; a 0.245M-parameter attention-only ablation trains in 2.17 minutes — roughly a 1,383× speedup with equal or better accuracy.
  • Randomly initialized LLMs outperform pretrained ones in 8 of 11 dataset comparisons, meaning the text-pretrained weights contribute negatively on balance.
  • In few-shot settings (10% of training data), Time-LLM and the no-LLM ablation each win 8 of 16 cases, a statistical tie that undercuts the few-shot argument commonly used to justify including the LLM.
  • Shuffling entire time series sequences degrades both LLM-based and attention-only models comparably, suggesting neither architecture reliably captures sequential temporal structure.
  • A simple PAttn baseline (patching plus a single attention layer) matches full LLM methods across datasets while being orders of magnitude cheaper at inference.
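The PAttn baseline pairs patching with a single attention layer. A minimal numpy sketch of that shape follows; the weights here are untrained random matrices, and all names, dimensions, and the flatten-then-project head are my own illustration, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(series, patch_len):
    """Split a 1-D series into non-overlapping patches of shape (n_patches, patch_len)."""
    n = len(series) // patch_len
    return series[: n * patch_len].reshape(n, patch_len)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pattn_forecast(series, patch_len=16, d_model=32, horizon=8):
    """Embed patches, run one self-attention layer over them, project to a forecast.
    Weights are random here; in practice they would be trained end to end."""
    patches = patchify(series, patch_len)                   # (P, patch_len)
    W_embed = rng.normal(0, 0.1, (patch_len, d_model))
    W_q, W_k, W_v = (rng.normal(0, 0.1, (d_model, d_model)) for _ in range(3))
    x = patches @ W_embed                                   # (P, d_model)
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    attn = softmax(q @ k.T / np.sqrt(d_model))              # (P, P) patch-to-patch weights
    z = (attn @ v).reshape(-1)                              # flatten attended patches
    W_head = rng.normal(0, 0.1, (z.size, horizon))
    return z @ W_head                                       # (horizon,)

series = np.sin(np.linspace(0, 12 * np.pi, 96)) + rng.normal(0, 0.1, 96)
forecast = pattn_forecast(series)
print(forecast.shape)  # (8,)
```

The point of the sketch is how little machinery is involved: patch, attend once, project. Everything an LLM wrapper adds sits on top of this, and the ablations suggest it adds little.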

What holds up — and what doesn't

The ablation design is principled: the authors replace only the LLM component while keeping everything else (patching, normalization, heads) fixed, so the comparison is clean. The code is public. The compute finding alone — 1,383× speedup, no accuracy loss — is hard to argue against for any production use case.

What the paper leaves open is why LLMs fail to help. The shuffling experiment shows that models can't distinguish temporally ordered from scrambled series — but this pathology holds for the ablations too, not just the LLMs. The failure might be a deeper property of how patch-based transformers process time series rather than a language model flaw specifically. The authors hint at this but don't pursue it.

The scope is also bounded. All three methods use frozen or lightly adapted LLMs from 2022–2023 (GPT-2, LLaMA-7B). Models purpose-built for time series — Chronos, TimesFM — tokenize numerical data differently and are not covered. A skeptic can reasonably argue the critique lands on a specific design pattern (repurposing NLP architectures without modification) rather than on LLMs for numerical data generally.
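To make the "tokenize numerical data differently" point concrete: Chronos maps scaled values into a discrete bin vocabulary rather than feeding text tokens. The sketch below is my loose approximation of that idea; the bin count, scaling rule, and range are assumptions for illustration, not Chronos's actual parameters:

```python
import numpy as np

def tokenize(series, n_bins=4096, low=-15.0, high=15.0):
    """Mean-scale the series, then map each value to a discrete bin id,
    loosely mimicking a Chronos-style fixed numeric vocabulary."""
    series = np.asarray(series, dtype=float)
    scale = np.abs(series).mean() or 1.0   # avoid dividing by zero on an all-zero series
    scaled = series / scale
    edges = np.linspace(low, high, n_bins - 1)
    return np.digitize(scaled, edges), scale

tokens, scale = tokenize([10.0, 20.0, -5.0, 40.0])
print(tokens)
```

Because the vocabulary covers numeric magnitudes directly, ordering is preserved: a larger value always gets a bin id at least as large. That is a very different inductive bias from reusing a text tokenizer.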

Why this matters for finance AI

For Beancount forecasting tasks — predicting next month's balance, estimating annual tax liability, flagging cash flow gaps — this paper pushes firmly toward lightweight purpose-built numerical models. The compute gap is not theoretical: an agent running rolling forecasts over a personal ledger can't afford Time-LLM's inference overhead.
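For a sense of how cheap the lightweight end of the spectrum can be: a rolling forecast over monthly balances can be as simple as a seasonal-naive rule. This is a toy illustration of the cost argument, not Bean Labs code and not the paper's baseline:

```python
from statistics import mean

def rolling_seasonal_naive(balances, season=12):
    """Predict each month as the same month one season earlier;
    fall back to the running mean before a full season is available."""
    preds = []
    for t in range(len(balances)):
        if t >= season:
            preds.append(balances[t - season])
        elif t > 0:
            preds.append(mean(balances[:t]))
        else:
            preds.append(balances[0])
    return preds

monthly = [1200, 1100, 1300, 1250, 1400, 1350, 1500, 1450,
           1600, 1550, 1700, 1650, 1250, 1150]
preds = rolling_seasonal_naive(monthly)
print(preds[12], preds[13])  # 1200 1100
```

A baseline like this runs in microseconds per ledger; any learned model has to beat it by enough to justify its inference cost, which is exactly the bar Time-LLM fails to clear.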

There's a sharper implication too. The sequential-structure finding suggests that any agent treating ledger entries as tokens and expecting the model to reason about temporal ordering from context alone is on shaky ground. If the model can't tell shuffled from ordered, temporal pattern matching needs to be engineered explicitly — through positional encoding, trend-seasonal decomposition, or a purpose-built architecture — not assumed to emerge from pretraining.
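One way to engineer temporal structure explicitly, as suggested above, is a classical additive trend-seasonal decomposition. A minimal sketch, using a simple centered moving average rather than a production method like STL (edge effects are ignored for brevity):

```python
import numpy as np

def decompose(series, period):
    """Classical additive decomposition: moving-average trend,
    per-phase mean seasonal component, residual remainder."""
    series = np.asarray(series, dtype=float)
    kernel = np.ones(period) / period
    trend = np.convolve(series, kernel, mode="same")        # crude trend estimate
    detrended = series - trend
    # average all observations sharing the same phase within the period
    seasonal = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal = np.tile(seasonal, len(series) // period + 1)[: len(series)]
    resid = series - trend - seasonal
    return trend, seasonal, resid

t = np.arange(120)
y = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12)  # linear trend + annual cycle
trend, seasonal, resid = decompose(y, period=12)
print(len(trend), len(seasonal), len(resid))
```

Handing a model the seasonal and trend components as separate inputs bakes temporal ordering into the features, instead of hoping the model infers it from raw token order.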

The risk is over-generalizing. Tan et al.'s critique is narrowly about numerical extrapolation. LLMs still bring genuine value when the task involves natural language — explaining anomalies, answering "why did my grocery spending spike in March," auditing narrative notes in a ledger. The mistake is conflating "LLMs can't extrapolate time series" with "LLMs can't reason about finances." These are different claims, and Bean Labs needs both capabilities.

Further reading

  • TimesFM: "A decoder-only foundation model for time-series forecasting" (Das et al., ICML 2024, arXiv:2310.10688) — Google's 200M-parameter model pretrained on 100B real-world time points; purpose-built for forecasting rather than repurposed from NLP, and a direct test of whether the problem is LLMs or the repurposing pattern.
  • Chronos: "Learning the Language of Time Series" (Ansari et al., TMLR 2024, arXiv:2403.07815) — Amazon's approach of tokenizing numeric values into a discrete vocabulary and training T5-based models from scratch on time series; closer in spirit to PatchTST than GPT-based forecasters and achieves strong zero-shot results on 42 benchmarks.
  • PatchTST: "A Time Series is Worth 64 Words" (Nie et al., ICLR 2023, arXiv:2211.14730) — the patching + channel-independence design that underlies most of the LLM wrappers ablated in this paper; understanding it clarifies exactly which component is doing the real work in OneFitsAll and Time-LLM.