Finally got 95% automated expense categorization working with Beancount + LLMs

After months of iteration, I finally have a workflow that auto-categorizes about 95% of my transactions correctly. Wanted to share what worked.

The Setup

I’m using beancount-reds-importers as the foundation, but added an LLM layer for edge cases. The key insight came from reading about Relay Financial’s RAG approach - they use Retrieval-Augmented Generation to look up similar past transactions before categorizing new ones.

My Architecture

  1. Rule-based first pass - beanhub-import handles 60-70% with declarative rules
  2. LLM for ambiguous transactions - Claude API call for anything the rules miss
  3. Confidence scoring - LLM returns confidence %, anything below 80% gets flagged for review
  4. Feedback loop - My corrections get added to the RAG context

Results After 3 Months

  • Started at ~70% accuracy with rules alone
  • Now at 95% with the hybrid approach
  • Review time dropped from 2 hours/week to about 15 minutes
  • The LLM particularly excels at vendor name normalization (e.g., “SQ *COFFEE SHOP” → Expenses:Food:Coffee)

Code Snippet

import anthropic

claude_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def categorize_transaction(txn, historical_context):
    """Return a (category, confidence) pair for a transaction."""
    # Rule-based first pass; rule hits are treated as fully trusted
    category = apply_rules(txn)
    if category:
        return category, 1.0

    # Fall back to the LLM, with RAG over similar past transactions
    similar_txns = vector_search(txn.description, historical_context, k=5)
    prompt = build_categorization_prompt(txn, similar_txns)

    response = claude_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )

    # Extracts (category, confidence) from the model's reply
    return parse_category_response(response)
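The snippet calls vector_search without showing it. Here's a minimal, self-contained sketch of the idea; it stands in a toy bag-of-words similarity for a real embedding model, and every name besides vector_search is my own:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real pipeline would call an
    # actual embedding model here instead.
    return Counter(text.upper().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def vector_search(description, historical_context, k=5):
    """Return the k past transactions most similar to description.

    historical_context: list of (description, category) pairs.
    """
    query = embed(description)
    scored = sorted(
        historical_context,
        key=lambda pair: cosine(query, embed(pair[0])),
        reverse=True,
    )
    return scored[:k]
```

With a real embedding model the structure is the same; only embed() changes.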

Costs

I'm making maybe 200-300 LLM calls per month, which costs about $2-3 total. The time savings are absolutely worth it.

Has anyone else experimented with this approach? Curious if you’ve found better embedding models for financial transaction similarity search.

This is really interesting, but I’m curious about your verification workflow for catching hallucinations.

Stanford Law did some research on LLMs for tax applications and found that while GPT-4 performs surprisingly well on complex questions, it’s not perfect - and in financial contexts, even small errors compound over time.

A few specific concerns:

  1. Category drift - How do you ensure the LLM doesn’t slowly drift to incorrect categorizations that “look” right but violate your accounting principles?

  2. Edge case handling - What happens when the LLM encounters a genuinely new transaction type it’s never seen? Does it flag it or just guess?

  3. Audit trail - Do you keep logs of LLM decisions for later review?

I’ve been using pure rule-based imports for 5+ years and the predictability is really valuable. Not saying LLMs aren’t worth it, but I’d want robust guardrails before trusting $2/month AI with my financial records.

I appreciate the detailed write-up, but I’m wondering if the LLM layer is overkill for personal use.

I’ve been running a pure Python + regex approach for 3 years now:

RULES = [
    (r'AMAZON|AMZN', 'Expenses:Shopping:Online'),
    (r'UBER|LYFT', 'Expenses:Transport:Rideshare'),
    (r'NETFLIX|SPOTIFY|HULU', 'Expenses:Subscriptions'),
    # ... 200+ rules
]
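Applying a rules list like this takes only a few lines; something like the following sketch, where the first matching pattern wins:

```python
import re

RULES = [
    (r'AMAZON|AMZN', 'Expenses:Shopping:Online'),
    (r'UBER|LYFT', 'Expenses:Transport:Rideshare'),
    (r'NETFLIX|SPOTIFY|HULU', 'Expenses:Subscriptions'),
]

def apply_rules(description):
    # First matching pattern wins; returning None lets the caller
    # fall through to manual review (or an LLM layer).
    for pattern, account in RULES:
        if re.search(pattern, description.upper()):
            return account
    return None
```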

Gets me to about 85% accuracy. The remaining 15% takes maybe 30 minutes per month to manually categorize - and honestly, that manual review helps me stay aware of my spending patterns.

My questions:

  • How much time did you spend building the RAG pipeline and prompt engineering?
  • What’s your total setup time vs the ongoing 1h45m/week savings?
  • Are you worried about API changes breaking your workflow?

Not trying to dismiss the approach - I’m genuinely curious if the ROI makes sense for a single person vs a business with thousands of transactions.

This is exactly what I’ve been looking for! I switched to Beancount 6 months ago from Mint (RIP) and the CSV import struggle is real.

I have accounts at:

  • Chase (checking + credit)
  • Fidelity (brokerage)
  • Marcus (savings)

Each has a different CSV format and I’m spending hours every month manually fixing categorizations. The rule-based approach makes sense but I don’t even know where to start with beancount-reds-importers.

Could you share more details on:

  1. How you set up the initial importers for different banks?
  2. What your RAG context structure looks like (just past transactions, or do you include account metadata too)?
  3. Any resources for someone who knows Python but is new to the Beancount ecosystem?

Would happily pay for a course or detailed tutorial on this. The plaintextaccounting.org site is great but feels overwhelming when you’re just trying to get started.

Great questions everyone! Let me address them:

@ledger_vet - Verification & Guardrails

You’re right to be cautious. Here’s how I handle each concern:

Category drift: I run a monthly consistency check that compares LLM categorizations against my rule-based baseline. If they diverge more than 5%, I investigate. Also, I use autobean.xcheck to cross-check against bank statements.
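A consistency check like that can be very small. Here's a sketch under my own naming, comparing the two layers only on transactions the rules could categorize at all:

```python
def divergence_rate(llm_categories, rule_categories):
    """Fraction of shared transactions where the LLM and the rules disagree.

    Both arguments: dict mapping txn_id -> category.
    """
    common = set(llm_categories) & set(rule_categories)
    if not common:
        return 0.0
    disagreements = sum(
        1 for txn_id in common
        if llm_categories[txn_id] != rule_categories[txn_id]
    )
    return disagreements / len(common)

def needs_review(llm_categories, rule_categories, threshold=0.05):
    # Flag the month for investigation past the 5% divergence threshold
    return divergence_rate(llm_categories, rule_categories) > threshold
```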

Edge cases: The LLM returns a confidence score. Anything below 80% goes into a TODO category that I review manually. New transaction types almost always trigger this - which is exactly what I want.
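In code, that routing is a tiny wrapper; a sketch, where the threshold value and the TODO account name are assumptions:

```python
CONFIDENCE_THRESHOLD = 0.80
REVIEW_ACCOUNT = "Expenses:TODO"  # bucket for manual review

def route_categorization(category, confidence):
    # Low-confidence LLM answers never land directly in the ledger;
    # they go to the review bucket instead.
    if confidence < CONFIDENCE_THRESHOLD:
        return REVIEW_ACCOUNT, confidence
    return category, confidence
```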

Audit trail: Every LLM call is logged with input, output, confidence, and timestamp. I keep 2 years of logs. Actually saved me once when I realized a prompt change had degraded accuracy.
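An append-only JSON Lines file is enough for that kind of audit trail; a minimal sketch with my own field names:

```python
import json
import time

def log_llm_decision(path, description, category, confidence, prompt_version):
    # One record per LLM call, appended as a JSON line, so a prompt
    # regression can be traced back to the version that caused it.
    record = {
        "ts": time.time(),
        "description": description,
        "category": category,
        "confidence": confidence,
        "prompt_version": prompt_version,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```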

@py_scripter - ROI Analysis

Fair question! For me:

  • Initial setup: ~15 hours (spread over 2 weekends)
  • Ongoing maintenance: ~1 hour/month
  • Time saved: ~7 hours/month (1h45m × 4 weeks)

So I broke even after month 3. But honestly, the bigger win is cognitive load reduction. Not having to think about categorization lets me focus on actual financial analysis.

For someone with 200 solid rules already, you might be right that it’s overkill. The LLM shines when you have: (1) high transaction volume, (2) lots of edge cases, or (3) frequently changing spending patterns.

@pta_newbie - Getting Started

For your setup, I’d recommend:

  1. Start with beancount-reds-importers - they have built-in support for Chase and Fidelity. For Marcus, you’ll need a custom CSV importer (not hard).

  2. RAG context structure - I include:

    • Transaction description
    • Amount range bucket (helps distinguish “is this a regular coffee or a coffee machine?”)
    • Account source
    • 5 most similar historical transactions with their categories
  3. Resources:

    • awesome-beancount.com - curated tool list
    • The beancount mailing list is incredibly helpful
    • I learned importers from the beancount-reds-importers examples on GitHub
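The RAG context structure from point 2 above could be sketched like this; field names and bucket boundaries are illustrative, not my actual code:

```python
def build_rag_context(txn, similar_txns):
    """Assemble the context dict fed into the categorization prompt.

    txn: dict with "description", "amount", "account" keys.
    similar_txns: list of (description, category) pairs from vector search.
    """
    def amount_bucket(amount):
        # Bucketing keeps "regular coffee" and "coffee machine"
        # distinguishable without putting exact amounts in the prompt.
        if amount < 10:
            return "under-10"
        if amount < 100:
            return "10-100"
        if amount < 1000:
            return "100-1000"
        return "over-1000"

    return {
        "description": txn["description"],
        "amount_bucket": amount_bucket(txn["amount"]),
        "account": txn["account"],
        "similar": [
            {"description": d, "category": c} for d, c in similar_txns[:5]
        ],
    }
```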

Happy to share my Chase importer config if that helps!