2026: The Year We Stop Taking AI Vendors at Their Word
We’ve all seen the claims. Every AI accounting tool promises 98-99% categorization accuracy. Some even claim their AI is “better than human bookkeepers.” But here’s what bothers me as someone who tracks every penny toward FIRE: How do we actually verify these claims?
The AI accounting market hit $10.87 billion in 2026. Investors are pouring money into these tools. Yet when I dig into the actual benchmarks, the story changes dramatically. DualEntry’s AI Accounting Benchmark tested 19 AI models across 101 real accounting tasks. The winner? OpenAI GPT-5.4 at 77.3% accuracy. That means even the best model fails nearly a quarter of real-world accounting scenarios.
Even more telling: 97% of finance teams report they’re still buried in manual work despite having AI in their stack. Something doesn’t add up.
The Accuracy Theater Problem
After testing three popular AI categorization tools against my own Beancount ledger (2,000+ transactions over 6 months), I discovered what I call “accuracy theater.”
All three vendors claimed 99% accuracy. Here’s what I actually measured:
- Tool A: 85% accurate out of the box
- Tool B: 91% accurate out of the box
- Tool C: 78% accurate out of the box
None of them hit 99%. Not even close.
The secret? Vendors measure “accuracy after user corrections.” Of course the AI looks accurate once you’ve taught it the right answers! That’s like a student taking an open-book test and claiming 99% mastery.
What we need is out-of-the-box accuracy on real-world transactions. No training period. No user corrections. Just raw categorization performance.
Why Beancount is the Perfect Ground Truth System
This is where plain text accounting becomes a superpower. Here’s why Beancount is ideal for validating AI claims:
1. Transparent and Auditable
Every transaction is human-readable. No black box. No vendor lock-in. You can see exactly what the AI got wrong.
2. Version Controlled
Track every categorization change with Git. Want to measure how many corrections you had to make? Just diff the branches (see the sketch after this list).
3. Programmable Validation
Beancount’s query language and Python API let you calculate precision, recall, and F1 scores programmatically. No manual counting.
4. No Vendor Bias
Your Beancount ledger doesn’t care which AI tool you test. Same dataset, fair comparison.
5. Real Financial Data
Not synthetic test data. Not cherry-picked examples. Your actual messy, complex, real-world transactions.
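On point 2, counting corrections really is one command away. A minimal sketch, assuming (hypothetically) that the AI’s first pass is committed on a branch called ai-pass and your corrected ledger lives on main:

```python
import subprocess

# Hypothetical setup: the AI's raw pass is committed on branch
# "ai-pass"; the human-corrected ledger lives on "main".
LEDGER = "ledger.beancount"

# --numstat prints "added<TAB>deleted<TAB>path" for each changed file.
diff = subprocess.run(
    ["git", "diff", "--numstat", "ai-pass", "main", "--", LEDGER],
    capture_output=True, text=True, check=True,
).stdout

for line in diff.splitlines():
    added, deleted, path = line.split("\t")
    print(f"{path}: {added} added, {deleted} deleted lines of corrections")
```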
My Proposed Testing Framework
After running my own experiments, here’s the methodology I recommend:
Phase 1: Baseline (3-6 months)
Maintain a meticulously categorized Beancount ledger. Human-verified, audited, correct. This is your ground truth.
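Before you trust the baseline, make sure it loads cleanly; a ledger with parse or balance errors is not ground truth. A minimal sketch using Beancount’s Python loader (ledger.beancount is a placeholder path):

```python
from beancount import loader

# Load the ground-truth ledger and refuse to proceed if Beancount
# reports any parse or balance errors.
entries, errors, options = loader.load_file("ledger.beancount")

assert not errors, f"{len(errors)} problems to fix first: {errors[:3]}"
print(f"Ground truth OK: {len(entries)} directives loaded")
```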
Phase 2: Blind Test
Export your transactions (dates, payees, descriptions, amounts, and the funding account, e.g. the credit card) but strip the expense accounts; in Beancount, the account is the category. Feed this to the AI tool. Let it categorize blind.
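Here’s one way to produce that blind export with Beancount’s Python API; a sketch, with placeholder file names:

```python
import csv
from beancount import loader
from beancount.core import data

entries, errors, _ = loader.load_file("ledger.beancount")

with open("blind.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "payee", "narration", "amount", "currency"])
    for entry in entries:
        if not isinstance(entry, data.Transaction):
            continue
        # Write only the funding-side postings (card, bank) so the
        # Expenses:* account, i.e. the category, never leaves the house.
        for posting in entry.postings:
            if not posting.account.startswith("Expenses:"):
                writer.writerow([
                    entry.date, entry.payee or "", entry.narration,
                    posting.units.number, posting.units.currency,
                ])
```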
Phase 3: Comparison
Use Beancount queries to compare AI categories against your ground truth:
```
SELECT
  account_ai,
  account_truth,
  COUNT(*) AS mismatches
WHERE account_ai != account_truth
GROUP BY account_ai, account_truth
ORDER BY mismatches DESC
```
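One caveat: account_ai and account_truth are schematic columns, since bean-query only ever sees a single ledger. A minimal Python sketch that builds the same mismatch table, assuming the AI’s output has been written into a parallel ledger holding the same transactions in the same order (truth.beancount and ai.beancount are placeholder names):

```python
from collections import Counter
from beancount import loader
from beancount.core import data

def expense_labels(path):
    """Return each transaction's Expenses:* account, in file order."""
    entries, _, _ = loader.load_file(path)
    labels = []
    for entry in entries:
        if isinstance(entry, data.Transaction):
            expenses = [p.account for p in entry.postings
                        if p.account.startswith("Expenses:")]
            # Splits carry several expense postings; keep the first here.
            labels.append(expenses[0] if expenses else "NONE")
    return labels

truth = expense_labels("truth.beancount")
ai = expense_labels("ai.beancount")

mismatches = Counter((t, a) for t, a in zip(truth, ai) if t != a)
for (account_truth, account_ai), n in mismatches.most_common():
    print(f"{account_truth:30} -> {account_ai:30} {n}")
```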
Calculate standard metrics (a code sketch follows this list):
- Precision: Of the transactions AI categorized as “Expenses:Dining”, how many actually were?
- Recall: Of all actual “Expenses:Dining” transactions, how many did AI catch?
- F1 Score: Harmonic mean of precision and recall
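Spelled out in code, the per-account arithmetic is short. The sample pairs below are made up; in practice you would feed in the zipped truth/AI labels from the mismatch sketch above:

```python
# (ground truth, AI prediction) pairs -- made-up sample data.
pairs = [
    ("Expenses:Dining", "Expenses:Dining"),
    ("Expenses:Dining", "Expenses:Groceries"),
    ("Expenses:Groceries", "Expenses:Groceries"),
]

def metrics(pairs, account):
    """Precision, recall, and F1 for one account, one-vs-rest."""
    tp = sum(1 for t, a in pairs if t == account and a == account)
    fp = sum(1 for t, a in pairs if t != account and a == account)
    fn = sum(1 for t, a in pairs if t == account and a != account)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

for account in sorted({t for t, _ in pairs}):
    p, r, f1 = metrics(pairs, account)
    print(f"{account:25} P={p:.2f} R={r:.2f} F1={f1:.2f}")
```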
Phase 4: Learning Curve
Correct the AI’s mistakes and measure accuracy again at 1 month, 3 months, 6 months. Does it actually learn? By how much?
Phase 5: Edge Case Analysis
Which transaction patterns consistently trip up the AI?
In my testing, failures clustered around the following patterns (a concrete example follows the list):
- Split transactions (different accounts for single purchase)
- Business meals (50% deductible = needs special handling)
- Mixed personal/business expenses (Costco runs, anyone?)
- First-time vendors (no historical pattern)
- Large unusual amounts (triggers wrong category)
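To make the first failure mode concrete, here is a made-up split transaction, loaded through Beancount’s Python API so you can verify it parses. Any tool forced to emit exactly one category per transaction structurally cannot get this right:

```python
from beancount import loader

# One Costco receipt, two categories -- a single-label classifier
# has no correct answer for this transaction.
LEDGER = """
2026-01-15 open Liabilities:CreditCard
2026-01-15 open Expenses:Groceries
2026-01-15 open Expenses:Business:Supplies

2026-01-16 * "Costco" "Groceries plus printer ink for the home office"
  Liabilities:CreditCard          -142.37 USD
  Expenses:Groceries                98.12 USD
  Expenses:Business:Supplies        44.25 USD
"""

entries, errors, _ = loader.load_string(LEDGER)
assert not errors, errors
```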
What I Learned From My Experiment
The Good News:
AI is genuinely helpful for common, repetitive transactions. Grocery stores, gas stations, recurring subscriptions—the AI nailed these at 95%+ accuracy.
The Bad News:
Everything interesting happens in the edge cases. And edge cases are where tax compliance matters most. That business dinner? That home office percentage split? That professional development course that might be deductible? AI got these wrong 40-60% of the time.
The Scary News:
The AI was confident even when wrong. No “low confidence” flag. No “please review” warning. It just categorized incorrectly with the same confidence as correct categorizations.
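If vendors exposed a confidence score, the review flag would be trivial to build on top. A sketch of what that could look like; categorize() is an entirely hypothetical API, not something any tool I tested actually offers:

```python
REVIEW_THRESHOLD = 0.80  # assumed cutoff; tune to taste

def triage(transactions, categorize):
    """Split AI output into auto-accepted and needs-human-review piles."""
    accepted, review = [], []
    for txn in transactions:
        account, confidence = categorize(txn)  # hypothetical vendor API
        bucket = accepted if confidence >= REVIEW_THRESHOLD else review
        bucket.append((txn, account, confidence))
    return accepted, review

# Stub standing in for a real tool, just to make the sketch runnable.
def fake_categorize(txn):
    return "Expenses:Dining", 0.55

accepted, review = triage(["DINNER @ LUIGI'S 84.20 USD"], fake_categorize)
print(f"{len(review)} transaction(s) flagged for human review")
```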
Questions for the Community
I can’t be the only one thinking about this. So I’m curious:
- Has anyone else tested AI tools against their Beancount data? What were your results?
- What metrics did you use? Did you measure precision/recall, or something else?
- Should we build a community benchmark? Could we create an anonymized, shared dataset for testing AI tools?
- Would you share your data? Obviously sanitized/anonymized, but real transaction patterns?
- What categories fail most often for you? Where should we focus validation efforts?
The Bigger Picture
I’m not anti-AI. I’m pro-transparency. If an AI tool delivers 91% accuracy out of the box and improves to 96% after 6 months, that’s amazing! Tell me that. Show me that. Let me verify that.
But don’t claim 99% when the real number is 78%. That’s not helping anyone make informed decisions.
Beancount gives us the tool to demand better. We have the ground truth. We have the query language. We have the version control. We can build the benchmark that holds AI vendors accountable.
Who’s with me?
For context: I’m a financial analyst working toward FIRE, tracking every transaction in Beancount for 3+ years. I’ve got 2,000+ transactions categorized and verified. Happy to share anonymized methodology details if others want to replicate this testing framework.