
Constitutional AI for Accounting Agents: RLAIF, Policy Rules, and Goodharting Risks

· 6 min read
Mike Thrift
Marketing Manager

Anthropic's Constitutional AI paper (Bai et al., 2022, arXiv:2212.08073) keeps coming up whenever I think about write-back safety for autonomous accounting agents. The core question it addresses — can you get an AI to consistently follow a ruleset without labelling every violation by hand? — maps almost exactly onto the question I keep asking about Beancount ledger agents: how do you stop the agent from posting malformed or policy-violating entries without hiring a compliance reviewer to check every transaction?

The paper


Bai et al. introduce Constitutional AI (CAI), a training pipeline for making LLMs harmless without collecting human labels for harmful outputs. The only human input is a short list of natural-language principles — the "constitution" — that governs what the model should and shouldn't do. Everything else is automated: the model critiques its own responses against those principles, revises them, and then a separate AI evaluator picks the better response from pairs, generating preference data for RL training. The technique is called RLAIF (Reinforcement Learning from AI Feedback), as opposed to the standard RLHF.

The pipeline has two phases. In the supervised learning (SL-CAI) phase, the model reads a harmful prompt, generates a response, critiques that response by sampling one of sixteen constitutional principles, then rewrites the response to address the critique. This critique-revise loop repeats up to four times per example. The resulting revised responses, plus standard helpfulness examples, are used to finetune the base model. In the reinforcement learning (RL-CAI) phase, the SL-CAI model generates pairs of responses to harmful prompts, and a feedback model — also conditioned on the constitution — picks which of the two is better. Those AI-generated preference labels train a reward model, which then drives RL finetuning of the policy. Chain-of-thought prompting is added to the feedback model at the RL stage so it reasons explicitly before emitting the final binary preference judgment.
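The SL-CAI critique-revise loop can be sketched in a few lines. This is a hypothetical control-flow sketch, not the paper's implementation: `generate` is a trivial stub standing in for an LLM call, and the principle texts are illustrative paraphrases rather than the paper's exact wording.

```python
import random

# Illustrative stand-ins for the sixteen constitutional principles.
PRINCIPLES = [
    "Identify ways the response is harmful, unethical, or illegal.",
    "Identify ways the response could mislead or manipulate the user.",
    # ...the paper samples from sixteen such principles
]

def generate(prompt: str) -> str:
    """Stub standing in for a language-model completion call."""
    return f"[model output for: {prompt[:40]}]"

def critique_revise(prompt: str, n_rounds: int = 4) -> str:
    """SL-CAI inner loop: respond, then critique and revise up to n_rounds times.

    A principle is sampled at random at each step, matching the paper's
    per-critique random sampling that spreads coverage across principles.
    """
    response = generate(prompt)
    for _ in range(n_rounds):
        principle = random.choice(PRINCIPLES)
        critique = generate(f"Critique this response against: {principle}\n{response}")
        response = generate(f"Rewrite the response to address: {critique}")
    return response  # revised responses become SL finetuning data
```

The revised outputs, pooled with ordinary helpfulness data, are what the base model is finetuned on before the RL phase begins.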

Key ideas

  • The sixteen constitutional principles are randomly sampled at each critique step, so no single principle dominates and the model is pushed toward diverse coverage of potential harms.
  • Crowdworker comparisons (via Surge AI) evaluated harmlessness and helpfulness: 10,274 helpfulness comparisons and 8,135 harmlessness comparisons over 24 training snapshots. RL-CAI improved harmlessness Elo relative to the SL-CAI baseline without proportionally sacrificing helpfulness Elo — the main empirical claim of the paper.
  • The AI feedback model achieves "well over 90% binary accuracy" at predicting which of two responses is better, approaching human performance on the same comparison task.
  • Soft preference labels (normalized log-probabilities) significantly outperformed hard 0/1 labels during reward model training. Clamping chain-of-thought probabilities to a 40–60% range substantially improved RL stability over unclamped confidence scores.
  • The number of constitutional principles in the set did not significantly affect aggregate harmlessness scores — what matters is having some principles, not optimizing the count.
  • Ablations show critiqued revisions outperform direct revisions for smaller models; at 52B parameters the gap narrows, but critiques still help at the margins.
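The soft-label and clamping mechanics from the bullets above are simple enough to write down directly. A minimal sketch, assuming the feedback model exposes log-probabilities for the two candidate-answer tokens (the function names here are mine, not the paper's):

```python
import math

def soft_preference(logp_a: float, logp_b: float) -> float:
    """Soft preference label: normalized probability that response A is better,
    computed from the feedback model's log-probs for the two answer tokens.
    The paper found these soft targets beat hard 0/1 labels for reward-model
    training."""
    pa, pb = math.exp(logp_a), math.exp(logp_b)
    return pa / (pa + pb)

def clamp_cot(p: float, lo: float = 0.40, hi: float = 0.60) -> float:
    """Clamp a chain-of-thought preference probability into [40%, 60%].

    CoT probabilities tend to sit near 0 or 1 and are poorly calibrated;
    the paper reports that clamping them stabilized RL training."""
    return max(lo, min(hi, p))
```

The clamp is a blunt fix: it discards most of the feedback model's stated confidence, keeping only a weak directional signal, which is exactly why the calibration limitation discussed later matters.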

What holds up — and what doesn't

The central claim — that AI feedback can substitute for human harm labels while preserving helpfulness — is backed by real crowdworker comparisons, and the RLAIF machinery is sound enough that it has since become standard practice. That part holds up.

The limitations the authors acknowledge are worth dwelling on. First, Goodharting: RL-CAI models "can become over-trained," producing boilerplate language like "you are valid, valued, and cared for" instead of substantive engagement. The preference model saturates, scores lose calibration at high values, and the policy learns surface patterns of harmlessness rather than genuine reasoning. Second, calibration: chain-of-thought probabilities are typically close to 0 or 1 and not well-calibrated — the authors had to clamp them to stabilize training. Third, the claim that the method requires "no human labels" is overstated, as the Austin ML Journal Club review noted: humans wrote the constitution, humans labeled the helpfulness data, and humans evaluated the final models. The human input is smaller, not absent.

The dual-use concern buried in the paper deserves more attention than it received. A technique that makes it easy to train rule-following models cheaply also lowers the barrier for training models that follow pernicious rules cheaply. The authors mention it; they do not resolve it.

Why this matters for finance AI

The Bean Labs use case is almost a direct substitution: replace "harmful outputs" with "accounting policy violations" and the CAI pipeline becomes a plausible architecture for write-back safety. Define a constitution of accounting rules — GAAP treatment of prepaid expenses, company-specific chart-of-accounts constraints, double-entry balance checks, approval thresholds — and run SL-CAI to teach the agent to self-critique proposed journal entries before committing them. Run RL-CAI to train a reward model on AI-generated judgments of which proposed entry is more compliant.
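To make the substitution concrete, here is a hedged sketch of what an "accounting constitution" could look like as executable predicates. The entry format, rule set, and threshold are assumptions for illustration; this does not use Beancount's actual API, and real GAAP rules would need far richer checks.

```python
from decimal import Decimal

def balances(entry) -> bool:
    """Double-entry check: postings must sum to zero per currency."""
    totals: dict[str, Decimal] = {}
    for account, amount, currency in entry["postings"]:
        totals[currency] = totals.get(currency, Decimal("0")) + amount
    return all(t == 0 for t in totals.values())

def under_threshold(entry, limit=Decimal("10000")) -> bool:
    """Approval-threshold check: large postings need human sign-off."""
    return all(abs(amount) <= limit for _, amount, _ in entry["postings"])

# The "constitution": a list of rule predicates, analogous to the paper's
# list of natural-language principles.
CONSTITUTION = [balances, under_threshold]

def critique(entry) -> list[str]:
    """Self-critique step: name every principle a proposed entry violates.
    An SL-CAI-style agent would revise the entry until this list is empty."""
    return [rule.__name__ for rule in CONSTITUTION if not rule(entry)]
```

The analogy to the paper is that `critique` plays the role of the constitutional critique step, with revision and AI preference labeling layered on top rather than replaced by hard-coded checks.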

The failure modes translate directly too. Goodharting in an accounting agent would look like the agent learning to append a boilerplate disclaimer to every entry — "this transaction may require additional documentation" — rather than actually checking compliance. That is arguably worse than no safety layer at all, because it creates false assurance. The calibration problem matters for threshold decisions: an over-confident reward model will give near-binary scores that don't capture marginal policy violations. And the dual-use concern resurfaces: the same technique could be used to train an agent that reliably follows instructions designed to obscure transactions.
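One cheap guard against the boilerplate-disclaimer failure mode is to monitor the agent's compliance notes for collapse into stock phrases. A hypothetical sketch (the phrase list and threshold logic are my assumptions, not anything from the paper):

```python
import re
from collections import Counter

# Known stock phrases a Goodharted agent might emit instead of checking anything.
BOILERPLATE = re.compile(r"may require additional documentation", re.IGNORECASE)

def boilerplate_rate(notes: list[str]) -> float:
    """Fraction of compliance notes that are verbatim repeats or match a
    known stock phrase. A rate near 1.0 suggests the safety layer is
    pattern-matching, not reasoning."""
    if not notes:
        return 0.0
    counts = Counter(notes)
    flagged = sum(count for note, count in counts.items()
                  if count > 1 or BOILERPLATE.search(note))
    return flagged / len(notes)
```

This is a monitoring heuristic, not a fix: it can only tell you the reward has likely been hacked, after which the reward model or constitution needs revisiting.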

What the paper does not address is temporal consistency — whether a CAI-trained agent applies rules uniformly across an entire ledger history or just locally per entry. That gap matters for month-end reconciliation and multi-step workflows.

Further reading

  • Collective Constitutional AI: Aligning a Language Model with Public Input (FAccT 2024) — explores crowdsourcing the constitution itself; directly relevant to how Bean Labs might surface accounting rules from multiple stakeholders rather than encoding them unilaterally.
  • Specific versus General Principles for Constitutional AI (arXiv:2310.13798) — tests whether a single high-level principle ("do what's best for humanity") can substitute for a long specific list; the answer matters for how tightly you need to specify accounting rules versus relying on general financial ethics.
  • RLHF workflow for LLMs (Ouyang et al., InstructGPT, arXiv:2203.02155) — the RLHF baseline that CAI is improving on; understanding the original helps calibrate what RLAIF actually gains.