LLMs Cannot Self-Correct Reasoning Yet — ICLR 2024 Findings and Finance AI Implications
This paper is the direct counterpoint to the CRITIC and Reflexion lines of work I have been reading. Huang et al. (ICLR 2024) make a simple, uncomfortable argument: when LLMs try to self-correct their reasoning without any external signal, they do not improve — they get worse. Coming right after LOG-013 on CRITIC, where tool-grounded critiquing genuinely helped, this paper clarifies exactly what kind of "self-correction" is real and what is an artifact of the experimental setup.
The paper
"Large Language Models Cannot Self-Correct Reasoning Yet" by Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou (Google DeepMind / UIUC) was published at ICLR 2024. The central claim is narrow but devastating for a certain class of agent designs: intrinsic self-correction — asking an LLM to review and revise its own answer using only its own judgment, with no ground-truth signal — consistently degrades performance on reasoning benchmarks. The gains reported in several prior self-correction papers, the authors argue, result from a subtle methodological flaw: those papers used oracle labels to decide when to stop correcting, which means the model only corrects already-wrong answers. That is not self-correction; it is oracle-guided filtering.
Key ideas
- On GSM8K, GPT-4 starts at 95.5% accuracy. After one round of intrinsic self-correction it falls to 91.5%, and after a second round to 89.0%. GPT-3.5 drops from 75.9% to 74.7% over two rounds.
- The drop is more dramatic on CommonSenseQA: GPT-3.5 falls from 75.8% to 38.1% after a single self-correction round, recovering slightly to 41.8% in round two — but still catastrophically below the baseline.
- The analysis of answer changes on GSM8K shows the model flips correct answers to wrong more often than it flips wrong answers to correct. The net direction of change is harmful.
- Oracle-guided self-correction does improve things: GPT-4 on GSM8K with oracle labels goes from 95.5% to 97.5%, and GPT-3.5 on CommonSenseQA from 75.8% to 89.7%. But this requires knowing which answers are wrong — which you cannot know in deployment.
- Multi-agent debate, another popular idea, underperforms simple self-consistency when you match the inference budget. With 9 total responses, self-consistency reaches 88.2% on GSM8K; multi-agent debate reaches only 83.0%.
- Constrained generation (CommonGen-Hard) seems like a win for self-correction at first (44% → 67%), but that gain evaporates if you just improve the initial prompt (81.8%). When the starting prompt is already good, self-correction hurts, dropping accuracy to 75.1%.
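The matched-budget comparison in the last two bullets is easy to operationalize. Here is a minimal sketch of self-consistency: majority vote over final answers extracted from independently sampled reasoning chains. The answer strings are illustrative, not taken from the paper.

```python
from collections import Counter

def self_consistency(final_answers: list[str]) -> str:
    """Majority vote over final answers from independent samples.

    No oracle, no revision round: the only cost is the number of
    samples drawn, which makes the budget directly comparable to
    multi-agent debate's total response count.
    """
    return Counter(final_answers).most_common(1)[0][0]

# Spend a 9-response budget on voting rather than on debate rounds:
votes = ["18", "18", "20", "18", "18", "22", "18", "18", "20"]
print(self_consistency(votes))  # prints "18"
```

The point of the sketch is that self-consistency never asks the model to judge or revise anything; it only aggregates, which is exactly why it sidesteps the unreliable self-evaluation the paper documents.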
What holds up — and what doesn't
The core finding is solid: the numbers are what they are. If you prompt GPT-4 to re-examine its math answers without telling it which ones are wrong, the answers get worse on average. The intuition the paper offers is also right — LLMs cannot reliably judge the correctness of their own reasoning, so when they decide to change an answer, they are guessing, and they guess wrong at least as often as they guess right.
The paper is less convincing on its generalization claims. It tests reasoning and knowledge tasks exclusively. There are domains — writing style, adherence to format constraints, toxicity reduction — where iterative revision arguably does help, and the paper largely sidesteps these. The authors acknowledge this in passing, noting that "self-correction may be more effective for tasks where evaluation is simpler," but do not test it carefully. The CommonGen constrained generation experiment is suggestive, but using an inadequate initial prompt as the baseline and calling the resulting improvement "self-correction" is the same methodological flaw the paper is criticizing in other work.
The paper also does not engage with the question of trained self-correction. A 2025 follow-up (SCoRe, ICLR 2025, arXiv:2409.12917) shows that RL-trained self-correction on the model's own outputs achieves +15.6% on MATH and +9.1% on HumanEval, a genuine intrinsic improvement. The "yet" in the title turns out to be the load-bearing word: the correct interpretation is "cannot be prompted into self-correction," not "cannot learn to self-correct."
Why this matters for finance AI
The implication for ledger write-back agents is concrete. An agent that generates a Beancount journal entry, then asks itself "does this look right?" and revises, is not getting a second opinion; it is introducing noise. The data here says that unguided self-review is at least as likely to corrupt a correct entry as to fix a wrong one.
What this paper confirms is the design constraint I drew from CRITIC: self-validation without an external oracle is unreliable. For Beancount specifically, the external oracle is available and cheap — balance assertions run in milliseconds, account names are validated against a known chart of accounts, amounts must reconcile to the cent. An agent architecture that submits a tentative entry, runs bean-check, and routes any error back as concrete structured feedback is fundamentally different from one that asks the model to "review your journal entry." The former uses the ledger engine as the oracle. The latter relies on the same reasoning mechanism that produced the error in the first place.
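That architectural split can be made concrete. Below is a minimal sketch of the oracle-gated repair loop, where `llm_generate` and `bean_check` are hypothetical stand-ins: the first wraps a model call, the second wraps Beancount's validator (e.g. the bean-check CLI) and returns a list of error strings, empty when the entry passes.

```python
from typing import Callable

def write_back(prompt: str,
               llm_generate: Callable[[str], str],
               bean_check: Callable[[str], list[str]],
               max_repairs: int = 2) -> str:
    """Generate a ledger entry, then revise ONLY when the external
    checker rejects it. `llm_generate` and `bean_check` are hypothetical
    stand-ins for a model call and a Beancount validation wrapper."""
    entry = llm_generate(prompt)
    for _ in range(max_repairs):
        errors = bean_check(entry)
        if not errors:
            return entry  # the oracle accepts; no self-review happens
        # Route the checker's concrete errors back as structured
        # feedback, not a vague "review your entry" instruction.
        feedback = "Fix these validation errors:\n" + "\n".join(errors)
        entry = llm_generate(prompt + "\n" + feedback + "\n" + entry)
    return entry  # caller decides what to do with a still-failing entry
```

Note that the revision prompt is only ever constructed after a failed check, so the loop gates self-review on the external oracle rather than running it on every generation, which is the design constraint the paper's data supports.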
There is also a more subtle lesson here about prompt design. The CommonGen experiment shows that when the prompt is already precise and explicit, self-correction degrades performance. This means that if we invest effort in writing very clear transaction parsing prompts — ones that state all Beancount syntax rules explicitly — adding a self-review loop on top of them may actively hurt accuracy. The right architecture probably gates self-review on a failed external check, not on every generation.
What to read next
- SCoRe: Training Language Models to Self-Correct via Reinforcement Learning (arXiv:2409.12917, ICLR 2025) — RL-based approach that achieves the first genuine intrinsic self-correction gains; necessary context for understanding what the current paper rules in vs. out
- When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs (TACL 2024) — systematic taxonomy of when self-correction works, distinguishing intrinsic, training-based, and tool-assisted variants
- Self-Refine: Iterative Refinement with Self-Feedback (NeurIPS 2023) — the primary paper that Huang et al. critique; reading it back-to-back clarifies exactly where the oracle label assumption is embedded
