From 'AI Suggests' to 'AI Acts': My First Agentic Workflow Experiment with Beancount

I’ve been experimenting with what I’m calling “agentic AI” for my Beancount workflow, and I wanted to share both the exciting results and the important questions it raises.

The Fundamental Shift

For the past year, I’ve used generative AI (ChatGPT) to help with transaction categorization. The workflow was: AI suggests categories, I review them, I manually apply them. It was helpful but still required me to be in the driver’s seat for every decision.

Recently, I built something different—an agentic workflow that doesn’t just suggest, it acts autonomously within defined boundaries:

  • Monitors my bank account for new transactions
  • Automatically imports when new data is detected
  • Categorizes transactions based on learned patterns with a 95%+ confidence threshold
  • Flags anomalies for review (unusual vendor, amount outside normal range)
  • Creates draft reconciliations
  • Sends Slack notification: “15 transactions categorized, 2 need review”
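The "act above threshold, flag below it" core of that workflow can be sketched in a few lines. This is a toy, not the actual script: `suggest_category` stands in for whatever model or `smart_importer` predictor produces a (category, confidence) pair, and the pattern table is invented.

```python
# Sketch of the categorize-or-flag step: act autonomously above the
# confidence threshold, queue everything else for human review.
# suggest_category is a stand-in for any model returning (category, confidence).

CONFIDENCE_THRESHOLD = 0.95

PATTERNS = {  # toy pattern table, purely illustrative
    "AWS": ("Expenses:Cloud-Services", 0.99),
    "STARBUCKS": ("Expenses:Coffee", 0.97),
}

def suggest_category(payee: str) -> tuple[str, float]:
    for needle, (category, confidence) in PATTERNS.items():
        if needle in payee.upper():
            return category, confidence
    return "Expenses:Uncategorized", 0.0

def triage(transactions: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split transactions into auto-categorized and needs-review."""
    categorized, review = [], []
    for txn in transactions:
        category, confidence = suggest_category(txn["payee"])
        if confidence >= CONFIDENCE_THRESHOLD:
            categorized.append({**txn, "account": category})
        else:
            review.append(txn)  # flagged for the human
    return categorized, review

if __name__ == "__main__":
    txns = [{"payee": "AWS EMEA", "amount": 12.40},
            {"payee": "BOB'S BAIT SHOP", "amount": 55.00}]
    done, flagged = triage(txns)
    print(f"{len(done)} transactions categorized, {len(flagged)} need review")
```

The final print line is exactly the Slack-notification summary: counts only, with the details waiting in the review queue.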

The time savings are dramatic: my daily accounting workflow went from 30 minutes of manual review to 5 minutes of exception handling.

But Here’s What Keeps Me Up at Night

As a CPA, I can’t just celebrate efficiency—I have to think about risk and liability:

What if categorization logic drifts over time? The AI learns from patterns, but what if those patterns slowly shift in the wrong direction? How do I detect drift before it becomes a systematic problem?
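One practical way to make drift visible is to track the human-correction rate per review batch and alert when a sliding-window average climbs past a band. The window size and 5% alert rate below are illustrative choices, not part of my actual setup:

```python
from collections import deque

# Track what fraction of AI categorizations the human had to correct,
# over a sliding window of review batches. A rising correction rate is
# the earliest practical signal that the model's patterns are drifting.

class DriftMonitor:
    def __init__(self, window: int = 8, alert_rate: float = 0.05):
        self.rates = deque(maxlen=window)  # recent correction rates
        self.alert_rate = alert_rate       # e.g. alert above 5% corrections

    def record_batch(self, total: int, corrected: int) -> None:
        self.rates.append(corrected / total)

    def drifting(self) -> bool:
        if not self.rates:
            return False
        return sum(self.rates) / len(self.rates) > self.alert_rate
```

The point is that drift shows up in your own corrections long before it shows up in the reports, so the review step doubles as the detector.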

How do I define “safe boundaries” for autonomous actions? Import and categorize feel safe. But posting directly to the main ledger without review? That feels reckless. Where’s the line?

Trust threshold: At what AI confidence level do you let it act versus just suggest? I set mine at 95%, but is that conservative enough? Too conservative?

My Current Approach: Staging + Review

I’m using a staging workflow that gives me the best of both worlds:

  1. AI outputs categorized transactions to a staging file (not the main ledger yet)
  2. I review the staging file in Fava (takes 5 minutes vs. 30 minutes of manual work)
  3. If everything looks good, I approve and git commit to the main ledger (audit trail preserved)
  4. If something’s wrong, I fix it and update the AI’s learning
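The staging step can be as simple as writing flagged (`!`) Beancount entries to a separate file, so Fava highlights them until approval flips the flag to `*` and the change is committed. A minimal sketch (the file name, field names, and USD assumption are mine, not necessarily how any given setup does it):

```python
# Write AI-categorized transactions to a staging file as flagged ("!")
# Beancount entries. Fava surfaces flagged entries for review; approving
# means flipping "!" to "*" and committing to the main ledger.

def to_staging_entry(txn: dict) -> str:
    return (
        f'{txn["date"]} ! "{txn["payee"]}" ""\n'
        f'  {txn["account"]}  {txn["amount"]:.2f} USD\n'
        f'  {txn["source"]}  {-txn["amount"]:.2f} USD\n'
    )

def write_staging(txns: list[dict], path: str = "staging.beancount") -> None:
    with open(path, "w") as f:
        for txn in txns:
            f.write(to_staging_entry(txn) + "\n")
```

Because the main ledger only ever changes via a reviewed git commit, every autonomous action stays reversible until a human signs off.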

This approach works because the boundaries are clear:

  • ✅ Safe for AI to do autonomously: Import, categorize, flag anomalies
  • ⚠️ Requires human approval: Posting to the main ledger

The AI does the tedious work. I handle the judgment calls.

Questions for the Community

For those of you thinking about or already using agentic AI in your Beancount workflows:

  1. How do you define safe boundaries for what AI can do autonomously versus what requires human approval?

  2. What confidence thresholds do you use? Does it vary by transaction type or amount?

  3. How do you monitor for drift in AI decision-making over time?

  4. As workflows shift in 2026 from “AI suggests” to “AI acts,” what guardrails are essential for financial accuracy and professional liability?

I believe agentic AI is the future, but we need to get the boundaries right. Would love to hear your experiences and concerns.


Built using: Beancount + smart_importer + custom Python monitoring script + Slack webhooks

This is incredibly timely for me! I’m a FIRE enthusiast tracking across 7 different accounts (banks, brokerages, credit cards), and I currently spend about 45 minutes every Sunday manually importing CSVs from each institution.

The idea of agentic AI monitoring all accounts and auto-importing daily is exactly what I’ve been dreaming about. But your concern about investment accounts really resonates—cost basis tracking is CRITICAL, and if AI miscategorizes even one transaction, it could create a nightmare during tax season.

Confidence Thresholds by Account Type?

I’m thinking about implementing variable confidence thresholds based on stakes:

  • 98%+ for investment transactions (high stakes, conservative approach)
  • 90%+ for routine expenses (groceries, utilities—low risk if wrong)
  • Always human review for large transactions (anything above a set dollar cutoff)

The reasoning: A miscategorized $8 coffee doesn’t really matter. A miscategorized $5,000 stock sale absolutely does.
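Those tiers reduce to a small policy function: look up the threshold by account type, but let a large-amount check override it unconditionally. The $1,000 cutoff below is a placeholder, as is the account-type naming:

```python
# Variable confidence thresholds by account type, with an unconditional
# human-review override for large amounts regardless of confidence.

THRESHOLDS = {
    "investment": 0.98,  # high stakes, conservative
    "expense":    0.90,  # low risk if wrong
}
LARGE_AMOUNT = 1000.00   # placeholder cutoff; always reviewed above this

def requires_review(account_type: str, amount: float, confidence: float) -> bool:
    if abs(amount) >= LARGE_AMOUNT:
        return True  # large transactions always go to a human
    # unknown account types default to a threshold of 1.0, i.e. always reviewed
    return confidence < THRESHOLDS.get(account_type, 1.0)
```

Defaulting unknown account types to "always review" keeps the failure mode conservative: a new account falls back to full human control until a threshold is deliberately assigned.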

Multi-Institution Monitoring

@accountant_alice Has anyone built agentic monitoring that works across multiple financial institutions? Each bank has different data formats, different update schedules, different APIs (or no APIs at all).

I’m imagining a system that:

  • Checks each account for new transactions (some daily, some real-time)
  • Normalizes different CSV formats into consistent Beancount entries
  • Applies account-specific categorization rules (my Chase credit card patterns differ from my Schwab brokerage patterns)
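The normalization piece is mostly a per-institution column map: each bank's CSV headers translate into one common record shape, so adding a bank means adding a map, not new parsing code. A minimal sketch (institution keys and column headers are made up, not real export formats):

```python
import csv
import io

# Normalize per-institution CSV exports into one common record shape.
# Each institution contributes only a header-to-field column map.

COLUMN_MAPS = {
    "chase":  {"date": "Transaction Date", "payee": "Description", "amount": "Amount"},
    "schwab": {"date": "Date", "payee": "Description", "amount": "Amount"},
}

def normalize(institution: str, csv_text: str) -> list[dict]:
    cmap = COLUMN_MAPS[institution]
    rows = csv.DictReader(io.StringIO(csv_text))
    return [{"date": row[cmap["date"]],
             "payee": row[cmap["payee"]],
             "amount": float(row[cmap["amount"]]),
             "institution": institution}
            for row in rows]
```

Tagging every record with its institution also makes the account-specific categorization rules easy to apply downstream: the rules engine can branch on `record["institution"]`.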

Is this overkill for personal finance, or is this the natural evolution once you have 5+ accounts?

The time math is compelling: 45 min/week × 52 weeks = 39 hours per year spent on manual imports. If agentic AI reduces that to 5 min/week of review, that’s 35 hours saved annually. That’s almost a full work week I could spend analyzing investments instead of downloading CSVs.

Anyone else tracking this many accounts with automation? What does your setup look like?

@accountant_alice Your staging + review approach is exactly right, and I want to share a cautionary tale about why it matters.

When I Over-Automated Too Soon

Early in my Beancount journey (about 3 years ago), I got excited about automation and built an importer that auto-posted directly to my main ledger without review. It felt amazing for about 2 months—my accounting just “happened” in the background.

Then I discovered a problem: The importer had been miscategorizing Amazon purchases. Instead of routing them to Expenses:Personal:Shopping, they were going to Expenses:Office-Supplies.

By the time I noticed, I had 60+ incorrect transactions polluting my historical data. Cleaning it up was a nightmare—I had to manually review months of transactions, fix the categories, and rebuild reports.

The “Start Simple” Rule Applies to Agentic AI Too

The lesson I learned: Automation is only as good as your review checkpoints.

Your staging file approach is the right balance:

  • ✅ AI does the tedious work (importing, categorizing)
  • ✅ Human catches systematic errors before they contaminate the ledger
  • ✅ Git commit creates an explicit approval moment (and audit trail)

Trust, But Verify

I like to think of agentic AI as giving your teenager the car keys. Yes, they can drive. Yes, they’ve proven competent. But you still want to know where they’re going and when they’ll be back.

The AI can handle routine tasks autonomously, but you need:

  • Visibility into what it’s doing
  • Review checkpoints before permanent changes
  • Override capability when it makes mistakes

Don’t fear AI autonomy—it’s incredibly powerful and time-saving. But build review checkpoints into your workflow. The 5 minutes you spend reviewing the staging file will save you hours of cleanup later.

@finance_fred To your multi-account question: I track 4 accounts (2 banks, 1 brokerage, 1 credit card) with semi-automation. Each has a custom importer, but I still manually trigger imports and review before committing. The key is: automate the boring parts (parsing CSVs, matching patterns), keep human judgment where it matters (final categorization approval).

As a former IRS auditor turned tax preparer, I have to raise a critical concern about agentic AI: audit defense.

The “Why” Question

If AI categorizes transactions autonomously, can you defend those decisions during an IRS audit?

Example scenario: AI categorizes a $500 payment as “Consulting Expense” instead of “Software Subscription.” If the IRS questions it, can you explain why that categorization was made?

If your answer is “the AI decided based on patterns,” that’s not going to satisfy an auditor. They want to understand the reasoning.

The Solution: Explainable AI (XAI)

I’ve been building what I call an XAI approach for Beancount:

Tier 1: Explicit Rules (100% explainable)

  • Example: “Vendor name contains ‘AWS’ → Cloud Services”
  • Confidence: High
  • Audit defense: “This transaction matches Rule #23 in our categorization policy”

Tier 2: ML with Feature Importance (explainable enough)

  • Example: “Categorized as Consulting because: vendor name similarity (60%), amount pattern (25%), date pattern (15%)”
  • Confidence: Medium
  • Audit defense: “AI analyzed vendor patterns and matched to historical consulting expenses”

Tier 3: Human Review (low confidence or unusual)

  • Anything AI isn’t confident about
  • Anything outside normal patterns
  • Large or unusual transactions
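The three tiers reduce to a categorizer that never returns a category without also returning its reasoning. Here's a sketch of the rule tier falling through to human review; the rule IDs and patterns are illustrative, and a real system would try an ML tier between the two:

```python
import re

# Tier 1: explicit rules, each carrying an audit-ready explanation.
# Anything no rule matches falls through to human review (Tier 3);
# a full system would attempt ML-with-feature-importance in between.

RULES = [
    ("Rule #23", r"AWS", "Expenses:Cloud-Services"),
    ("Rule #24", r"STARBUCKS", "Expenses:Coffee"),
]

def categorize_explained(payee: str) -> dict:
    for rule_id, pattern, account in RULES:
        if re.search(pattern, payee, re.IGNORECASE):
            return {"account": account, "method": "rule",
                    "why": f"{rule_id}: payee matches /{pattern}/"}
    return {"account": None, "method": "human-review",
            "why": "no rule matched; outside known patterns"}
```

The `why` string is the audit-defense artifact: "this transaction matches Rule #23 in our categorization policy" is a sentence an auditor can actually evaluate.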

Building an Audit Trail

Every transaction in my system logs:

  • What was categorized
  • How (rule-based or ML)
  • Why (specific rule or feature importance)
  • Confidence score

When I run my monthly review, I can query: “Show me all transactions categorized by AI in March with reasoning.”

The output looks like:

  • 70% matched explicit rules (high confidence, easily defensible)
  • 25% ML categorization with feature breakdown (medium confidence, explainable)
  • 5% flagged for human review (low confidence, full manual decision)
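That monthly breakdown falls straight out of the log if every decision is stored with its method. A sketch of the summary query; the record fields mirror the what/how/why/confidence list above, but the sample data is invented:

```python
from collections import Counter

# Every categorization decision is logged with what/how/why/confidence.
# The monthly review is then a one-line aggregation over the log.

LOG = [
    {"payee": "AWS", "month": "2025-03", "method": "rule",
     "confidence": 0.99, "why": "Rule #23"},
    {"payee": "ACME CONSULTING", "month": "2025-03", "method": "ml",
     "confidence": 0.91,
     "why": "vendor similarity 60%, amount pattern 25%, date pattern 15%"},
    {"payee": "UNKNOWN VENDOR", "month": "2025-03", "method": "human-review",
     "confidence": 0.40, "why": "no pattern matched"},
]

def monthly_breakdown(log: list[dict], month: str) -> dict[str, float]:
    """Fraction of decisions per method for one month."""
    records = [r for r in log if r["month"] == month]
    counts = Counter(r["method"] for r in records)
    return {method: n / len(records) for method, n in counts.items()}
```

Storing `why` as free text keeps the log cheap to write; the structured `method` field is what makes the percentages queryable.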

Key Insight

Explainable AI isn’t about achieving perfect accuracy—it’s about auditable reasoning.

A black-box ML model that’s 95% accurate but can’t explain decisions is WORSE for tax purposes than a rule-based system that’s 90% accurate but fully transparent.

@accountant_alice I love your staging approach. I’d add: Log the reasoning for each AI categorization. Not just “95% confident,” but “95% confident because X, Y, Z.”

That way, if you’re ever audited (or if a client questions a categorization), you can show your work.

Questions for the Community

  1. How much explainability is enough for audit defense?
  2. Has anyone been through an IRS audit with AI-assisted bookkeeping? What documentation did they request?
  3. Should we be developing industry-standard audit trails for AI categorization?

This is new territory, and I think we need to get ahead of it before the IRS does.

I run a small bookkeeping practice with 12 clients, and this discussion is incredibly relevant to my work. The potential for agentic AI is huge, but so are the complications.

The Current Pain Point

Right now, I manually check each client’s bank portal every week:

  • Log in to 12 different banking websites
  • Download CSVs for each client
  • Import into their respective Beancount ledgers
  • Categorize, reconcile, generate reports

This takes 4-6 hours every week. It’s tedious, error-prone (easy to forget a client or miss a week), and honestly the part of bookkeeping I hate most.

The Agentic AI Dream

What I’d love: A system that monitors all 12 clients’ accounts automatically, flags when new transactions are available, and handles the import + categorization for me.

But here’s the complication that keeps me from diving in:

Client-Specific Categorization Rules

Each client has different rules for how transactions should be categorized:

Client A (restaurant owner):

  • ALL meals → “Business:Meals” (it’s their business, everything is deductible)

Client B (consultant):

  • Meals with clients → “Business:Meals” (deductible)
  • Personal dinners → “Personal:Entertainment” (not deductible)
  • Must review context to decide

Can agentic AI learn client-specific rules? Or does one-size-fits-all create categorization errors?

The Liability Question

If AI makes a mistake on a client’s books, who’s liable?

  • Is it me (the bookkeeper who set up the system)?
  • Is it the AI vendor?
  • Is it the client (who approved the automation)?

This matters because if the IRS audits my client and finds miscategorized expenses, I’m the one who signed off on the books. I can’t hide behind “the AI did it.”

My Proposed Solution

I’m thinking about implementing client-specific confidence thresholds and review workflows:

High-automation clients (simple businesses, clear categorization rules):

  • 90% confidence threshold
  • Weekly review of flagged transactions
  • Monthly full reconciliation

High-touch clients (complex businesses, lots of judgment calls):

  • 98% confidence threshold
  • Daily review of flagged transactions
  • Bi-weekly full reconciliation

This way, I can use agentic AI where it makes sense, but keep tighter controls where the stakes or complexity are higher.
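Those per-client policies fit naturally in a config table keyed by client, with one policy object holding the threshold and both review cadences. A sketch using the two profiles above (client keys and exact numbers are illustrative):

```python
from dataclasses import dataclass

# One policy per client: confidence threshold plus review cadences.
# Client-specific categorization rules would hang off the same object.

@dataclass
class ClientPolicy:
    confidence_threshold: float
    review_cadence_days: int       # flagged-transaction review
    reconciliation_days: int       # full reconciliation

POLICIES = {
    "restaurant_a": ClientPolicy(0.90, 7, 30),   # high-automation client
    "consultant_b": ClientPolicy(0.98, 1, 14),   # high-touch client
}

def auto_categorize_ok(client: str, confidence: float) -> bool:
    policy = POLICIES.get(client)
    # clients without an explicit policy get no automation at all
    return policy is not None and confidence >= policy.confidence_threshold
```

Refusing to automate for clients without an explicit policy keeps the liability posture conservative: automation is something you opt a client into, never a default.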

@accountant_alice @tax_tina How do you think about professional liability when using AI for client work? Do you disclose to clients that AI is part of your workflow? Do you adjust your E&O insurance?

I want the efficiency gains, but I can’t afford to lose clients or face liability claims because of AI errors.