Chain-of-Thought Prompting: Precision-Recall Trade-offs for Finance AI
I'm re-reading Wei et al.'s 2022 Chain-of-Thought paper (arXiv:2201.11903) with a specific question in mind: in my earlier experiments, CoT prompting improved precision but hurt recall on financial anomaly detection. The paper should explain why — or at least give me enough mechanistic intuition to form a hypothesis.
The paper
"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" by Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, and colleagues (Google Brain) is the paper that put CoT on the map. The idea is simple: instead of asking a model to jump straight to an answer, you show it a few examples where the answer is preceded by a written-out reasoning trace. The model then produces its own reasoning trace before answering.
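The mechanics are easy to show in code. Here's a minimal sketch of the two prompt formats (the exemplar is paraphrased from the paper's well-known tennis-ball example; the helper functions are my own, not anything from the paper):

```python
# Minimal sketch of standard vs. chain-of-thought few-shot prompting.
# Exemplar paraphrased from Wei et al.'s Figure 1; helpers are illustrative.

EXEMPLAR = {
    "question": "Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls "
                "each. How many tennis balls does he have now?",
    "rationale": "Roger started with 5 balls. 2 cans of 3 tennis balls each "
                 "is 6 tennis balls. 5 + 6 = 11.",
    "answer": "11",
}

def standard_prompt(exemplars, query):
    """Standard few-shot: each exemplar maps a question straight to its answer."""
    shots = "\n\n".join(
        f"Q: {ex['question']}\nA: {ex['answer']}" for ex in exemplars
    )
    return f"{shots}\n\nQ: {query}\nA:"

def cot_prompt(exemplars, query):
    """CoT few-shot: the written-out rationale precedes each answer, so the
    model imitates the reason-then-answer pattern on the final query."""
    shots = "\n\n".join(
        f"Q: {ex['question']}\nA: {ex['rationale']} The answer is {ex['answer']}."
        for ex in exemplars
    )
    return f"{shots}\n\nQ: {query}\nA:"

query = "A baker made 24 rolls and sold 15. How many are left?"
print(cot_prompt([EXEMPLAR], query))
```

The only difference between the two formats is the rationale string; everything else about the prompt is held fixed, which is what makes the paper's ablations clean.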
The paper tests this on arithmetic (GSM8K, SVAMP, AQuA), commonsense (CommonsenseQA, StrategyQA), and symbolic reasoning (letter concatenation, coin flip) tasks across three large language models — PaLM 540B, GPT-3 175B, and LaMDA 137B — and compares against standard few-shot prompting.
Key ideas
- GSM8K (math word problems): standard prompting with PaLM 540B gets 17.9%; CoT gets 56.9%, a 39-point jump. This is a stunning gain on a hard benchmark, and it's the headline result the paper is rightly known for.
- Letter concatenation: standard 7.6%, CoT 99.4%. For pure symbolic manipulation, CoT essentially solves the task at large scale.
- CommonsenseQA: standard 78.1%, CoT 79.9%. Minimal gain. Tasks that don't require multi-step inference don't benefit much.
- Scale cliff: CoT only reliably helps at roughly 100B+ parameters. Below ~10B, adding a reasoning trace often hurts — the model produces "fluent but illogical chains of thought," which actively misleads it.
- Easy tasks show no benefit: On MAWPS SingleOp (single-step arithmetic), PaLM 540B scored 94.1% with both standard and CoT prompting. Reasoning overhead adds no value when the task doesn't actually require multi-step inference.
- No guarantee of correctness: the authors are explicit that an LLM can produce a coherent-looking reasoning trace that leads to a wrong answer. The trace and the answer are generated jointly, and neither is independently verified.
What holds up — and what doesn't
The empirical results hold up. The gains on GSM8K are replicated in follow-on work, the scale threshold matches what was observed elsewhere, and the symbolic reasoning numbers are consistent with what you'd expect from in-context learning mechanics. This paper did real science.
What I find underexplored is the precision/recall asymmetry. Wei et al. report aggregate accuracy — they don't break out false positive versus false negative rates. But if you think about how CoT changes the answer distribution, the mechanism is suggestive: CoT prompts the model to generate and commit to a reasoning path, so it tends to assert a positive only when it can construct a step-by-step justification. That selectivity raises precision at the expense of recall: the model flags fewer items overall, and the ones it does flag tend to be better justified — but it may pass over true positives that don't fit a neat step-by-step narrative. For anomaly detection in financial data, where the "anomaly" class is rare and atypical by definition, this is exactly the failure mode you'd expect.
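A toy confusion-matrix calculation makes the direction of the effect concrete (the counts below are synthetic, chosen purely for illustration, not taken from any experiment):

```python
# Toy illustration of the precision/recall trade-off on a rare-event task.
# Counts are synthetic: 1,000 transactions, 50 of them true anomalies.

def precision_recall(tp, fp, fn):
    """Precision = tp/(tp+fp); recall = tp/(tp+fn)."""
    return tp / (tp + fp), tp / (tp + fn)

# "Liberal" detector (standard prompting): flags broadly.
lib_p, lib_r = precision_recall(tp=40, fp=120, fn=10)

# "Conservative" detector (CoT-style): flags only when it can construct
# a step-by-step justification, so it makes far fewer positive calls.
con_p, con_r = precision_recall(tp=25, fp=15, fn=25)

print(f"liberal:      precision={lib_p:.2f} recall={lib_r:.2f}")
print(f"conservative: precision={con_p:.2f} recall={con_r:.2f}")
```

Shrinking the set of positive calls to the well-justified ones moves both numbers at once: precision goes up, recall goes down. Aggregate accuracy can improve while half the anomalies go unflagged.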
The paper also leaves the mechanistic question open. The authors are careful not to claim the model is "actually reasoning" in any strong sense. Whether CoT elicits genuine multi-step inference or a sophisticated pattern-matching shortcut that mimics such inference is unresolved. A 2025 Wharton report testing modern reasoning models (o3-mini, o4-mini) found that explicit CoT instructions produced only 2–3% marginal gains, and sometimes reduced "perfect accuracy" by triggering errors on questions the model would otherwise have answered correctly. The paper's scale threshold may have shifted as models have gotten better at implicit reasoning — but the variability problem, where CoT introduces a non-zero chance of derailing an otherwise-correct answer, persists.
Why this matters for finance AI
Three connections to the Bean Labs agenda:
First, the write-back safety problem. A CoT-prompted agent explaining its reasoning before taking a ledger action provides an audit trail — but the reasoning trace is not a guarantee of correctness. The agent can produce a plausible-looking explanation for a wrong action. This means showing users a reasoning trace may create false confidence rather than genuine auditability.
Second, the anomaly detection asymmetry. If CoT raises precision but lowers recall on rare-event detection tasks, then for Beancount use cases — finding misclassified transactions, flagging duplicate entries, catching policy violations — using CoT naively may produce fewer false alarms at the cost of missing real problems. That's potentially the wrong trade-off. A finance agent that confidently explains why it didn't flag something suspicious is more dangerous than one that over-flags.
Third, the scale dependency. If production finance agents run on smaller models for cost or latency reasons, the CoT gains evaporate — and can reverse. Any evaluation of a CoT-based finance agent needs to be done at the same model scale used in production.
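One way to encode the over-flagging preference in an evaluation (my suggestion, not something from the paper) is to score flaggers with an F-beta metric at beta > 1, which weights recall above precision:

```python
# F-beta scoring for rare-event flagging: beta > 1 weights recall above
# precision, encoding the view that a missed anomaly costs more than a
# false alarm. Precision/recall figures below are illustrative.

def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A CoT-style flagger: precise but low recall.
cot_score = f_beta(precision=0.80, recall=0.40, beta=2.0)
# A blunter flagger: noisier but catches more.
blunt_score = f_beta(precision=0.45, recall=0.85, beta=2.0)

print(f"CoT-style  F2 = {cot_score:.3f}")
print(f"blunt      F2 = {blunt_score:.3f}")
```

Under F2, a noisy-but-thorough flagger can beat a precise-but-blind one, which is the ranking you want when a missed policy violation is costlier than a false alarm.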
What to read next
- "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (Wang et al., 2022, arXiv:2203.11171) — samples multiple CoT paths and takes the majority vote; directly addresses the variance problem Wei et al. flag
- "Large Language Models are Zero-Shot Reasoners" (Kojima et al., 2022, arXiv:2205.11916) — shows "Let's think step by step" without any exemplars also elicits reasoning; tests the boundary of what CoT actually needs
- "Is Chain-of-Thought Reasoning of LLMs a 'Reasoning' or 'Searching' Process?" (arXiv:2508.01191) — directly attacks the mechanistic question the original paper leaves open
