PAL: Program-Aided Language Models for Reliable Financial Arithmetic
After spending time with the tabular reasoning literature, I wanted to understand the complementary approach: rather than making LLMs reason about tables in natural language, what happens when you let them write code and hand off the computation entirely? PAL (Gao et al., 2022, arXiv:2211.10435) is the canonical answer, and it has obvious implications for any system that needs to do arithmetic over financial data reliably.
The paper
"PAL: Program-Aided Language Models" (Gao, Madaan, Zhou, Alon, Liu, Yang, Callan, Neubig; ICML 2023) starts from a straightforward observation: LLMs decompose problems well but execute arithmetic badly. Chain-of-thought prompting fixes the first problem but leaves the second untouched. The proposed fix is to change what the LLM produces as its "reasoning steps" — instead of natural language arithmetic, it generates Python code. A Python interpreter then runs that code and returns the answer.
The decomposition-execution split is clean: the LLM handles problem understanding and program structure; the interpreter handles everything numerical. A question like "Olivia has $23, buys five bagels at $3 each — how much is left?" becomes money_left = 23 - (5 * 3), not a sequence of prose arithmetic that the model might botch.
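The split can be made concrete with a minimal harness sketch: the model's output is a program string, and a trivial wrapper executes it and reads a designated answer variable. The variable names and the `answer` convention here are illustrative, not prescribed by the paper.

```python
# What a PAL-style model might generate for the bagel question:
# the reasoning steps are Python statements, not prose arithmetic.
generated_program = """
money_initial = 23          # Olivia starts with $23
bagels = 5                  # she buys five bagels
bagel_cost = 3              # at $3 each
money_spent = bagels * bagel_cost
answer = money_initial - money_spent
"""

# The harness executes the program and reads out the answer;
# the LLM never performs the arithmetic itself.
namespace = {}
exec(generated_program, namespace)
print(namespace["answer"])  # → 8
```

The point of the pattern is that a decomposition error is still possible, but an execution error is not.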
Key ideas
- On GSM8K (grade-school math word problems), PAL with Codex reaches 72.0% accuracy versus 65.6% for chain-of-thought with the same Codex model — a +6.4pp gain — and 56.9% for CoT with the much larger PaLM-540B model. A smaller model wins by delegating arithmetic to Python.
- On GSM-hard, a version of GSM8K where numbers are replaced with larger values, PAL achieves 61.2% versus CoT's 23.1% — a +38.1pp absolute gap. Large numbers break token-level arithmetic; they do not break Python.
- With majority voting over 40 samples, PAL reaches 80.4% on GSM8K, edging out Minerva-540B (78.5%) with a model roughly 1/10th the size.
- The gains generalize to symbolic reasoning. On BIG-Bench Hard tasks like Object Counting, PAL scores 96.7% versus CoT's 73.0%; on Penguins in a Table, 93.3% versus 79.2%.
- An ablation reveals where the work actually happens: when the LLM acts as its own interpreter (no external Python), GSM8K accuracy collapses to 23.2%. The interpreter is not a minor enhancement — it is doing all the arithmetic.
- Variable naming matters. Replacing meaningful variable names with random characters causes substantial accuracy drops on symbolic tasks. The model reads its own code.
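The majority-voting result in the third bullet is easy to sketch: sample several candidate programs, execute each, and return the most common answer. The hard-coded sample list below is a stand-in for actual LLM sampling, and the `answer` convention is illustrative.

```python
from collections import Counter

def run_program(program: str):
    """Execute a candidate program and return its `answer` variable,
    or None if execution fails."""
    namespace = {}
    try:
        exec(program, namespace)
        return namespace.get("answer")
    except Exception:
        return None

def majority_vote(programs):
    """PAL-style self-consistency: execute every sampled program and
    return the most frequent non-failing answer."""
    results = [r for r in (run_program(p) for p in programs) if r is not None]
    if not results:
        return None
    return Counter(results).most_common(1)[0][0]

# Three hypothetical samples for the bagel question; two agree.
samples = [
    "answer = 23 - 5 * 3",
    "answer = 23 - (5 * 3)",
    "answer = 23 - 5",  # a decomposition error, outvoted
]
print(majority_vote(samples))  # → 8
```

Note that voting happens over executed results, not over program text, so syntactically different but semantically identical programs reinforce each other.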
What holds up — and what doesn't
The core claim is trivially correct and the experiments confirm it convincingly: Python is better than an LLM at arithmetic, and GSM-hard makes this brutally visible. The +38pp there is not a benchmark quirk — it reflects a categorical failure mode of CoT under scale.
What I find less convincing is the framing as a general reasoning breakthrough. PAL works on tasks that happen to have deterministic, Python-expressible answers. Much of what matters in financial reasoning does not decompose this way. Deciding whether a transaction pattern is "unusual for this account in Q4" or whether a transfer warrants a manual review flag is not reducible to a Python expression. PAL gives you a reliable arithmetic engine; it does not give you judgment.
The security dimension receives no attention in the paper. The benchmarks run in a controlled environment, but any deployment that generates and executes arbitrary Python in response to user-supplied inputs is a meaningful attack surface. Sandbox escapes from restricted interpreters, access to the filesystem or secrets, adversarially crafted inputs that elicit malicious code — none of this is addressed. For financial applications, this is not a footnote.
The paper also does not deeply analyze failure modes when code generation goes wrong. If PAL emits syntactically invalid Python, there is no fallback; the question simply goes unanswered. The authors report execution success rates but do not characterize what causes generation failures or whether they are random or systematic. Given that the interpreter is doing all the arithmetic, code quality is the entire reliability bottleneck — and it is underanalyzed.
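A production harness would at minimum need to separate the failure classes the paper leaves unexamined. A sketch of such a classifier follows; the category names are my own, not the paper's.

```python
def classify_execution(program: str) -> str:
    """Classify what went wrong (or right) when running generated code.
    The categories are illustrative, not taken from the PAL paper."""
    try:
        code = compile(program, "<generated>", "exec")
    except SyntaxError:
        return "generation_failure"   # model emitted invalid Python
    namespace = {}
    try:
        exec(code, namespace)
    except Exception:
        return "runtime_failure"      # valid syntax, broken structure
    if "answer" not in namespace:
        return "missing_answer"       # ran, but no designated output
    return "ok"

print(classify_execution("answer = 23 - 5 * 3"))  # → ok
print(classify_execution("answer = 23 -"))        # → generation_failure
print(classify_execution("answer = x + 1"))       # → runtime_failure
```

Tracking these classes separately is what would reveal whether failures are random noise or systematic to certain problem shapes.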
Why this matters for finance AI
This is one of the most directly applicable papers to Beancount I have read. Ledger operations are almost perfectly aligned with what PAL does well: summing transactions by category, applying foreign exchange rates, computing tax basis across multiple lots, reconciling bank statement totals against ledger balances. These are deterministic, arithmetic-heavy, and Python-expressible. CoT-based agents will hallucinate numbers here; PAL will not, as long as the program structure is right.
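The kind of program a PAL-style agent would generate for these ledger operations is exactly what Python executes reliably. A sketch of category summation with `Decimal` follows; the flattened transaction tuples are invented for illustration and are not Beancount's actual data model.

```python
from decimal import Decimal

# Hypothetical flattened transactions; a real agent would obtain these
# from a Beancount ledger via its query interface.
transactions = [
    ("Expenses:Groceries", Decimal("42.17")),
    ("Expenses:Groceries", Decimal("18.50")),
    ("Expenses:Transport", Decimal("3.20")),
]

# The PAL-generated "reasoning": deterministic decimal arithmetic,
# with no token-level number manipulation anywhere in the loop.
totals = {}
for account, amount in transactions:
    totals[account] = totals.get(account, Decimal("0")) + amount

print(totals["Expenses:Groceries"])  # → 60.67
```

Using `Decimal` rather than floats matters here: ledger arithmetic must be exact, and that choice is itself part of "program structure" that the LLM has to get right.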
Program of Thoughts (arXiv:2211.12588), a concurrent paper that independently developed the same idea, evaluated it on three financial QA datasets — FinQA, ConvFinQA, and TAT-QA — and reported an average gain of roughly 12% over chain-of-thought. That is the most direct evidence that the program-generation approach helps on finance-domain reasoning, not just grade-school math.
The write-back safety question, though, is sharper in a ledger context than on benchmarks. An agent that generates Python to read Beancount data is low-risk. One that generates Python to write ledger entries needs a tightly restricted execution environment — one that can touch only ledger objects and nothing else, that fails closed on any exception, and that requires the generated code to pass a whitelist before execution. PAL treats the interpreter as a neutral computation engine. A production finance agent cannot.
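The whitelist requirement can be made concrete with an AST check that rejects any program containing node types outside a small allowed set, failing closed on syntax errors. The allowed set below is deliberately minimal and illustrative; AST filtering alone is not a vetted sandbox and would need to sit inside a properly isolated execution environment.

```python
import ast

# Node types permitted in generated ledger arithmetic; anything else
# (imports, attribute access, function calls, ...) fails closed.
# Illustrative only — not a complete or audited sandbox policy.
ALLOWED_NODES = (
    ast.Module, ast.Assign, ast.Expr, ast.Name, ast.Store, ast.Load,
    ast.BinOp, ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Constant,
)

def passes_whitelist(program: str) -> bool:
    """Fail closed: reject on syntax errors or any disallowed node."""
    try:
        tree = ast.parse(program)
    except SyntaxError:
        return False
    return all(isinstance(node, ALLOWED_NODES) for node in ast.walk(tree))

print(passes_whitelist("total = 23 - 5 * 3"))         # → True
print(passes_whitelist("import os; os.remove('x')"))  # → False
```

The design choice to fail closed is the important part: an unrecognized construct is treated as hostile by default, which is the opposite of PAL's benchmark stance of executing whatever the model emits.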
What to read next
- Program of Thoughts Prompting (Chen et al., arXiv:2211.12588) — concurrent work that evaluates on FinQA, ConvFinQA, and TAT-QA and reports a ~12% average gain over CoT; the finance-specific evaluation that PAL itself lacks.
- FinQA: A Dataset of Numerical Reasoning over Financial Reports (Chen et al., EMNLP 2021) — the benchmark underlying the PoT financial evaluations; understanding what is actually being tested calibrates how much to trust the transfer to real Beancount use cases.
- Self-Refine: Iterative Refinement with Self-Feedback (Madaan et al., arXiv:2303.17651) — same first author as PAL, extends the code-generation insight to iterative self-correction loops; relevant for whether PAL-style agents can recover from their own code generation failures.
