Can You Audit What You Cannot Explain? Building Explainable AI for Beancount Workflows

I had a wake-up call recently that changed how I think about AI in accounting.

The Client Question That Stumped Me

I use machine learning to categorize client transactions. The AI achieved 92% accuracy—I was proud!

Then a client asked: “Why did the AI categorize this $500 payment as ‘Consulting Expense’ instead of ‘Software Subscription’?”

I had no answer. The ML model decided based on patterns, but I couldn’t explain why.

The client’s response: “If you can’t explain the categorization logic, how do I trust it? What happens if the IRS audits me?”

That moment crystallized the problem: Black-box AI creates audit liability.

The IRS Won’t Accept “The AI Did It”

In an IRS audit, I need to defend every categorization decision. “The AI decided based on patterns” is not an acceptable explanation.

I needed a solution: Explainable AI (XAI).

My Hybrid Approach: Rules + Transparent ML

I rebuilt my system with three tiers of explainability:

Tier 1: Explicit Rules (100% explainable)

  • Example: “Vendor name contains ‘AWS’ → Cloud Services”
  • Confidence: High
  • Audit defense: “This matches Rule #23 in our categorization policy”

Tier 2: ML with Feature Importance (explainable enough)

  • Example: “Categorized as Consulting because: vendor name similarity (60%), amount pattern (25%), historical date pattern (15%)”
  • Confidence: Medium
  • Audit defense: “AI analyzed vendor patterns and matched to historical consulting expenses”

Tier 3: Human Review Flagged (low confidence or unusual)

  • Anything AI isn’t confident about
  • Anything outside normal patterns
  • Large or unusual transactions
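The three-tier dispatch above can be sketched in a few lines of Python. Everything here is illustrative: the rule table, the stand-in `ml_score` function, and the 80% review threshold are my assumptions, not the actual system.

```python
# Illustrative rule table (Tier 1). A real system would load this from a
# versioned policy file so every rule has a citable ID for audit defense.
RULES = [
    {"id": 23, "match": "AWS", "category": "Expenses:Cloud:Services"},
    {"id": 24, "match": "GITHUB", "category": "Expenses:Software:Subscriptions"},
]

REVIEW_THRESHOLD = 0.80  # below this, Tier 3: flag for human review

def ml_score(vendor, amount):
    """Stand-in for a real model: returns (category, confidence, features)."""
    # Hypothetical feature-importance breakdown, as described in Tier 2.
    if "consult" in vendor.lower():
        return ("Expenses:Consulting", 0.85,
                {"vendor_similarity": 0.60, "amount_pattern": 0.25, "date_pattern": 0.15})
    return ("Expenses:Uncategorized", 0.40, {})

def categorize(vendor, amount):
    # Tier 1: explicit rules, 100% explainable.
    for rule in RULES:
        if rule["match"] in vendor.upper():
            return {"category": rule["category"], "tier": 1,
                    "reasoning": f"Rule #{rule['id']}: vendor contains {rule['match']!r}"}
    # Tier 2: ML with feature importance, if confident enough.
    category, confidence, features = ml_score(vendor, amount)
    if confidence >= REVIEW_THRESHOLD:
        return {"category": category, "tier": 2,
                "reasoning": f"confidence {confidence:.0%}, features {features}"}
    # Tier 3: low confidence or unusual -> human review.
    return {"category": None, "tier": 3, "reasoning": "flagged for human review"}
```

With this shape, `categorize("AWS EMEA SARL", 47.18)` lands in Tier 1 with a citable rule ID, while an unknown vendor falls through to Tier 3.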

Building the Audit Trail

Every transaction in my system now logs:

  • What was categorized
  • How (rule-based or ML)
  • Why (specific rule ID or feature importance breakdown)
  • Confidence score

When I run monthly reviews, I query: “Show me all AI-categorized transactions in March with reasoning.”

The output looks like:

  • 70% matched explicit rules (high confidence, easily defensible)
  • 25% ML with feature breakdown (medium confidence, explainable)
  • 5% flagged for human review (low confidence, full manual decision)
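The monthly review query is essentially a filter-and-count pass over the audit log. A minimal sketch, assuming log records with the four fields listed above (the sample data is made up):

```python
from collections import Counter

# Hypothetical audit-log records: one per categorized transaction,
# mirroring the what / how / why / confidence fields above.
audit_log = [
    {"date": "2026-03-02", "method": "rule",   "why": "Rule #23",              "confidence": 0.98},
    {"date": "2026-03-09", "method": "ml",     "why": "vendor similarity 60%", "confidence": 0.85},
    {"date": "2026-03-15", "method": "rule",   "why": "Rule #47",              "confidence": 0.97},
    {"date": "2026-03-21", "method": "review", "why": "unusual amount",        "confidence": 0.40},
]

def monthly_review(log, month):
    """Count categorizations by method for one month (e.g. '2026-03')."""
    rows = [r for r in log if r["date"].startswith(month)]
    return Counter(r["method"] for r in rows)
```

Calling `monthly_review(audit_log, "2026-03")` gives the rule/ML/review breakdown that the percentages above summarize.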

The Key Insight

Explainable AI isn’t about perfect accuracy—it’s about auditable reasoning.

A black-box model that’s 95% accurate but can’t explain decisions is worse for tax purposes than a rule-based system that’s 90% accurate but fully transparent.

Questions for the Community

  1. How much explainability is enough for audit defense?

  2. Has anyone been through an IRS audit with AI-assisted bookkeeping? What documentation did they request?

  3. Should we develop industry-standard audit trails for AI categorization?

This is new territory. I think we need to get ahead of it before the IRS does.

What are your thoughts on balancing automation efficiency with audit defensibility?

@tax_tina This is absolutely critical for CPA liability. When I sign a client’s tax return or financial statements, I am responsible—not the AI.

Professional Standards Perspective

I can’t hide behind “the AI did it” if categorizations are wrong. The IRS (and state boards) hold me accountable.

That’s why I love your hybrid approach: rules + explainable ML.

How I Document AI Decisions

At my firm, we maintain decision logs as structured comments in the ledger itself, version-controlled in Git:

2026-03-15 * "AWS Invoice"
  ; AI-categorized: Rule #47 (vendor pattern matching)
  ; Confidence: 98%
  ; Reasoning: Vendor contains "AWS", historical pattern matches Cloud:Services
  Expenses:Cloud:Services  47.18 USD
  Assets:Checking

If audited, I can show the IRS:

  • Exact logic for each categorization
  • Confidence scores (how certain was the AI?)
  • Decision trail (was it rule-based or ML?)

The Junior Accountant Standard

I treat AI like I’d treat a junior staff accountant:

Would I accept “I don’t know why I categorized it that way” from a junior? No.

Same standard for AI: Must be able to explain reasoning.

Your three-tier system is perfect. I’m implementing something similar for our firm.

I built a similar XAI system after learning this lesson the hard way. Here’s my practical implementation:

Using Beancount Metadata for Transparency

2026-03-15 * "AWS Services" #ai-categorized
  confidence: "95%"
  rule: "vendor-pattern-aws"
  reasoning: "Vendor name contains 'AWS', historical pattern matches Cloud:Services"
  Expenses:Cloud:Services  27.50 USD
  Assets:Checking

Benefits of This Approach

Can grep the ledger for AI decisions (the tag and the confidence metadata sit on different lines, so grab a few lines of context):

grep -A 3 '#ai-categorized' ledger.beancount | grep 'confidence: "[5-8]'

(Shows all AI categorizations with 50-89% confidence: my review queue)
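Because the `#ai-categorized` tag and the `confidence:` metadata live on different lines, a short Python pass over the ledger text builds the same review queue more reliably than chained greps. This assumes the metadata convention shown above; the function name and thresholds are mine:

```python
import re

def review_queue(ledger_text, low=0.50, high=0.90):
    """Return (narration, confidence%) for #ai-categorized entries in [low, high)."""
    queue = []
    # Beancount entries are separated by blank lines.
    for block in ledger_text.split("\n\n"):
        if "#ai-categorized" not in block:
            continue
        m = re.search(r'confidence:\s*"(\d+)%"', block)
        if m and low <= int(m.group(1)) / 100 < high:
            title = re.search(r'"([^"]*)"', block)  # first quoted string is the narration
            queue.append((title.group(1) if title else "?", int(m.group(1))))
    return queue
```

The 50-89% band matches the grep above: confident enough to auto-apply, uncertain enough to deserve a monthly look.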

Review reasoning later:
Did the AI’s logic make sense? Can I improve the rules?

Build confidence over time:
After 3 months of reviewing AI decisions, I trust 95%+ confidence scores

Easy to find systematic errors:
If AI consistently miscategorizes a vendor, I can see the pattern and fix the rule
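Spotting those systematic errors can be as simple as grouping human corrections by vendor. The correction records here are hypothetical; a real version would diff the AI's output against the human-fixed ledger:

```python
from collections import Counter

# Hypothetical correction log: one record each time a human overrides the AI.
corrections = [
    {"vendor": "Joe's Diner", "ai": "Expenses:Groceries",  "human": "Expenses:Restaurants"},
    {"vendor": "Joe's Diner", "ai": "Expenses:Groceries",  "human": "Expenses:Restaurants"},
    {"vendor": "Joe's Diner", "ai": "Expenses:Groceries",  "human": "Expenses:Restaurants"},
    {"vendor": "CloudCo",     "ai": "Expenses:Consulting", "human": "Expenses:Software"},
]

def systematic_errors(log, min_count=2):
    """Vendors the AI miscategorizes repeatedly: candidates for a new explicit rule."""
    counts = Counter((c["vendor"], c["ai"], c["human"]) for c in log)
    return {key: n for key, n in counts.items() if n >= min_count}
```

A vendor that shows up repeatedly (Joe's Diner above) gets promoted to a Tier 1 explicit rule, so the same mistake can't recur.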

Start Simple, Add Complexity Later

My advice: Don’t overcomplicate early.

Phase 1: Start with simple, explicit rules

  • 85% accuracy, 100% explainable
  • Build confidence in the system

Phase 2: Add ML for edge cases

  • 95% accuracy, mostly explainable
  • Only use ML where rules can’t handle it

Transparency > Accuracy (initially)

Better to understand an 85% accurate system than trust a 95% black box.

Once you’ve proven the foundation, you can optimize for accuracy.

From a personal finance perspective, I don’t face the same IRS audit pressure as professional accountants, but I still want explainability.

Why? Because I still ask questions like "Why did my spending spike?"

If I’m analyzing trends and AI miscategorized restaurants as groceries, my whole analysis is skewed.

I need to understand: Why did the AI make this decision?

My Dashboard: Categorization Transparency

I built a Fava dashboard showing:

Transactions by method:

  • Manual: 15%
  • Rule-based: 60%
  • ML: 25%

Confidence score distribution:

  • High (95%+): 75%
  • Medium (80-94%): 20%
  • Low (<80%): 5%

Monthly review process:
Spot-check the 5% low-confidence categorizations

The 80/20 Rule

I found that reviewing 5% of transactions (the low-confidence ones) catches 80% of errors.

Pareto principle in action: Focus review effort where AI is least certain.

Question for Professionals

Do you need a higher explainability bar than personal finance users like me?

I can tolerate some black-box ML if it saves time. But I imagine CPAs signing tax returns need full transparency.

Where’s the line?

I use XAI for a different purpose: client trust-building.

When Clients Question Categorization

Old approach: “The software categorized it this way.”

  • Client feels frustrated (black box)
  • Doesn’t understand why
  • Questions my competence

New approach: “AI categorized this based on vendor pattern + amount similarity to previous transactions. Here’s the decision log.”

  • Client understands reasoning
  • Feels confident in the process
  • If wrong: I correct and update rules, show them the improvement

The Discovery: Transparency > Accuracy for Trust

Clients don’t need perfect accuracy. They need confidence that you can explain and correct.

90% accurate + fully explainable > 95% accurate black box

When clients see the reasoning:

  • They trust the process
  • They understand value (not just “software magic”)
  • They’re willing to pay a premium for audit-ready bookkeeping

Marketing Angle

I actually market this as: “Audit-Ready AI Bookkeeping with Full Decision Transparency”

Clients (especially those who’ve been audited before) are willing to pay a 15-20% premium for this.

Why? Audit defense has real value.

If the IRS audits you and you can’t explain your categorizations, that audit gets expensive (accountant fees, potential penalties, stress).

If you can show: “Here’s the decision log for every AI categorization, here’s the rule or ML reasoning,” the audit is much smoother.

@tax_tina Your three-tier approach is exactly what I’m implementing. Thank you for sharing!