I’ve been experimenting with what I’m calling “agentic AI” for my Beancount workflow, and I wanted to share both the exciting results and the important questions it raises.
The Fundamental Shift
For the past year, I’ve used generative AI (ChatGPT) to help with transaction categorization. The workflow was: AI suggests categories, I review them, I manually apply them. It was helpful but still required me to be in the driver’s seat for every decision.
Recently, I built something different—an agentic workflow that doesn’t just suggest, it acts autonomously within defined boundaries:
- Monitors my bank account for new transactions
- Automatically imports when new data is detected
- Categorizes transactions based on learned patterns with a 95%+ confidence threshold
- Flags anomalies for review (unusual vendor, amount outside normal range)
- Creates draft reconciliations
- Sends Slack notification: “15 transactions categorized, 2 need review”
The time savings are dramatic: my daily accounting workflow went from 30 minutes of manual review to 5 minutes of exception handling.
But Here’s What Keeps Me Up at Night
As a CPA, I can’t just celebrate efficiency—I have to think about risk and liability:
What if categorization logic drifts over time? The AI learns from patterns, but what if those patterns slowly shift in the wrong direction? How do I detect drift before it becomes a systematic problem?
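One cheap drift signal I can imagine: track how often a human has to correct the AI over a rolling window, and alert when that rate climbs meaningfully above its historical baseline. A minimal sketch (the window size and thresholds are made-up numbers, not recommendations):

```python
from collections import deque


class DriftMonitor:
    """Track the fraction of AI categorizations a human had to correct.

    If the rolling correction rate exceeds baseline + tolerance, the
    categorizer may be drifting and deserves a closer look.
    """

    def __init__(self, window: int = 200, baseline: float = 0.05,
                 tolerance: float = 0.05):
        self.outcomes: deque[bool] = deque(maxlen=window)  # True = corrected
        self.baseline = baseline
        self.tolerance = tolerance

    def record(self, corrected: bool) -> None:
        self.outcomes.append(corrected)

    @property
    def correction_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def drifting(self) -> bool:
        return self.correction_rate > self.baseline + self.tolerance
```

This only catches drift the human reviewer notices, of course; systematic errors you stop checking for slip straight through, which is part of why the question is hard.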
How do I define “safe boundaries” for autonomous actions? Import and categorize feel safe. But posting directly to the main ledger without review? That feels reckless. Where’s the line?
Trust threshold: At what AI confidence level do you let it act versus just suggest? I set mine at 95%, but is that conservative enough? Too conservative?
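One possible answer is that a single threshold is the wrong shape: the confidence required should scale with the stakes. A hypothetical tiering by amount (the tiers are illustrative only):

```python
def required_confidence(amount: float) -> float:
    """Larger amounts demand more confidence before the AI may act alone.

    These tiers are made up for illustration, not a recommendation.
    """
    if amount < 50:
        return 0.90    # small, routine purchases
    if amount < 500:
        return 0.95    # everyday bills
    return 1.01        # > 1.0 means large amounts always get human review


def may_act(amount: float, confidence: float) -> bool:
    return confidence >= required_confidence(amount)
```

Returning a value above 1.0 for the top tier is a small trick: no model confidence can clear it, so big-ticket transactions are structurally forced to review rather than relying on the model to be appropriately unsure.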
My Current Approach: Staging + Review
I’m using a staging workflow that gives me the best of both worlds:
- AI outputs categorized transactions to a staging file (not the main ledger yet)
- I review the staging file in Fava (takes 5 minutes vs. 30 minutes of manual work)
- If everything looks good, I approve and git commit to the main ledger (audit trail preserved)
- If something’s wrong, I fix it and update the AI’s learning
This approach works because the boundaries are clear:
Safe for AI to do autonomously: Import, categorize, flag anomalies
Requires human approval: Posting to the main ledger
The AI does the tedious work. I handle the judgment calls.
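Mechanically, the approve step can be as simple as appending the staging file to the main ledger and clearing it. A sketch with hypothetical file names; the `git commit` that preserves the audit trail is shown as a comment because it depends on your repo layout:

```python
from pathlib import Path


def approve(staging: Path, ledger: Path) -> int:
    """Promote reviewed entries from the staging file into the main ledger.

    Returns the number of non-blank lines promoted. A real workflow would
    follow this with `git add` / `git commit` so every promotion is an
    auditable commit.
    """
    entries = staging.read_text()
    if not entries.strip():
        return 0  # nothing staged, nothing to do
    with ledger.open("a") as f:
        f.write(entries)
    staging.write_text("")  # clear staging once promoted
    # subprocess.run(["git", "commit", "-am", "approve staged txns"], check=True)
    return sum(1 for line in entries.splitlines() if line.strip())
```

Keeping the staging file and the ledger in the same git repo means a bad approval is one `git revert` away, which is a big part of why this boundary feels safe.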
Questions for the Community
For those of you thinking about or already using agentic AI in your Beancount workflows:
- How do you define safe boundaries for what AI can do autonomously versus what requires human approval?
- What confidence thresholds do you use? Do they vary by transaction type or amount?
- How do you monitor for drift in AI decision-making over time?
- When 2026 shifts from “AI suggests” to “AI acts,” what guardrails are essential for financial accuracy and professional liability?
I believe agentic AI is the future, but we need to get the boundaries right. Would love to hear your experiences and concerns.
Built using: Beancount + smart_importer + custom Python monitoring script + Slack webhooks