Building a personal finance AI agent that reads my Beancount ledger - architecture discussion

I’ve been experimenting with building a personal finance AI agent that uses my Beancount ledger as its data source. Wanted to share my architecture and get feedback.

The Goal

Natural language queries against my financial data:

  • “What did I spend on restaurants in Q3?”
  • “Am I on track for my annual savings goal?”
  • “Compare my utility bills year-over-year”
  • “Flag any unusual transactions this month”

Current Architecture

┌─────────────┐     ┌──────────┐     ┌─────────┐
│  Telegram   │────▶│  Agent   │────▶│ Claude  │
│    Bot      │◀────│  Layer   │◀────│   API   │
└─────────────┘     └────┬─────┘     └─────────┘
                         │
                    ┌────▼─────┐
                    │ Beancount│
                    │  + BQL   │
                    └──────────┘

Components:

  1. Telegram Bot - Interface (inspired by the Beancount Telegram Bot project)
  2. Agent Layer - Translates NL → BQL, handles context
  3. Claude API - Powers the NL understanding and response generation
  4. Beancount - Source of truth, queried via BQL
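
Concretely, the agent layer’s NL → BQL step is just one model call with a dated system prompt. A simplified sketch (model name, prompt rules, and account names are illustrative, not my exact prompt):

```python
# Sketch of the NL -> BQL translation step (model name, prompt rules, and
# account names are illustrative).
import datetime

import anthropic

SYSTEM_PROMPT = """You translate natural-language finance questions into
Beancount Query Language (BQL). Today is {today}.
Rules:
- "last month" means the previous calendar month, not the past 30 days.
- Top-level accounts: Assets, Liabilities, Income, Expenses.
Return only the BQL query, nothing else."""

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def translate_to_bql(question: str) -> str:
    """Ask the model for a single BQL query that answers `question`."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=300,
        system=SYSTEM_PROMPT.format(today=datetime.date.today().isoformat()),
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text.strip()

# translate_to_bql("What did I spend on restaurants in Q3?")
# -> e.g. SELECT sum(position) WHERE account ~ "Expenses:Restaurants"
#         AND date >= 2024-07-01 AND date < 2024-10-01
```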

What’s Working

Simple queries work great:

  • “Total expenses last month” → translates to BQL, returns accurate number
  • “List all Amazon purchases” → finds transactions, formats nicely
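
“Total expenses last month” comes back as a one-line BQL aggregate; running it against the ledger looks roughly like this (Beancount v2’s query module; the month is hard-coded here just for illustration):

```python
# Executing the generated BQL against the ledger. Assumes Beancount v2
# (beancount.query); v3 moved the query engine to the separate beanquery package.
from beancount import loader
from beancount.query import query

entries, errors, options_map = loader.load_file("ledger.beancount")

# "Total expenses last month" -- month hard-coded here for illustration.
bql = 'SELECT sum(position) WHERE account ~ "^Expenses" AND year = 2024 AND month = 10'

# numberify=True turns Inventory columns into plain per-currency number columns.
rtypes, rows = query.run_query(entries, options_map, bql, numberify=True)
for row in rows:
    print(row)
```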

Proactive insights are the real win:

  • “Your utility bills increased 30% vs last quarter - PG&E specifically went from $85/mo to $120/mo. Want to investigate?”
  • “You’ve spent 80% of your dining budget with 10 days left in the month”
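
The utilities alert is essentially a scheduled comparison of two date ranges. A stripped-down sketch of that check (account filter, quarters, and the 30% threshold are illustrative, and it assumes a single-currency ledger):

```python
# Stripped-down sketch of the "utilities up 30% vs last quarter" check.
# Account filter, date ranges, and threshold are illustrative; assumes a
# single-currency ledger and Beancount v2's query module.
from beancount import loader
from beancount.query import query

THRESHOLD = 0.30

def period_total(entries, options_map, start, end):
    """Sum Expenses:Utilities postings over [start, end)."""
    bql = ('SELECT sum(position) WHERE account ~ "^Expenses:Utilities" '
           f'AND date >= {start} AND date < {end}')
    _, rows = query.run_query(entries, options_map, bql, numberify=True)
    if not rows or not rows[0] or rows[0][0] is None:
        return 0.0
    return float(rows[0][0])

entries, _, options_map = loader.load_file("ledger.beancount")
prev = period_total(entries, options_map, "2024-04-01", "2024-07-01")
curr = period_total(entries, options_map, "2024-07-01", "2024-10-01")

if prev and (curr - prev) / prev > THRESHOLD:
    change = (curr - prev) / prev
    print(f"Utility bills up {change:.0%} vs last quarter "
          f"(${prev:,.0f} -> ${curr:,.0f}). Want to investigate?")
```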

I’ve seen research claiming ML-integrated personal finance systems can hit ~97% recommendation accuracy. I’m not there yet, but the potential is clear.

Challenges

  1. Complex queries - “What percentage of my income goes to fixed vs variable expenses?” requires multiple BQL queries and reasoning
  2. Time context - “Last month” vs “past 30 days” vs “November” - LLM sometimes picks wrong interpretation
  3. Account hierarchy - Need to teach the model my specific chart of accounts
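
For the time-context problem, I’m leaning toward resolving relative phrases in code rather than trusting the model to do it. Something like this (only two phrases handled, just to show the idea):

```python
# Resolving relative time phrases in code instead of trusting the LLM.
# Only two phrases handled here, just to show the idea.
import datetime

def resolve_period(phrase: str, today: datetime.date | None = None):
    """Return (start, end) dates for a relative phrase, end exclusive."""
    today = today or datetime.date.today()
    if phrase == "last month":
        first_of_this_month = today.replace(day=1)
        last_of_prev_month = first_of_this_month - datetime.timedelta(days=1)
        return last_of_prev_month.replace(day=1), first_of_this_month
    if phrase == "past 30 days":
        return today - datetime.timedelta(days=30), today + datetime.timedelta(days=1)
    raise ValueError(f"unhandled phrase: {phrase}")

# resolve_period("last month", datetime.date(2024, 11, 12))
# -> (datetime.date(2024, 10, 1), datetime.date(2024, 11, 1))
```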

Open Questions

  1. Local vs cloud LLM - Currently using Claude API, but sending financial data to the cloud feels uncomfortable. Has anyone tried local models (Llama 3, Mistral) for this?

  2. Caching strategy - Re-running BQL for every query is slow. Thinking about pre-computing common aggregations.

  3. Multi-step reasoning - For complex questions, should I use tool-calling to let the LLM iterate, or pre-define query patterns?

Anyone else building something similar? Would love to compare approaches.

This is a cool project, but I have to push back on the architecture for privacy reasons.

The Data Security Problem

Your Beancount ledger contains:

  • Every merchant you’ve ever paid
  • Your income sources and amounts
  • Your investment holdings
  • Your spending patterns and habits

Sending this to Claude API (or any cloud LLM) means Anthropic now has a complete picture of your financial life. Even with their privacy policies, that data exists on their servers.

A recent survey found 60% of financial executives acknowledge AI’s value, but data security remains the #1 concern preventing adoption. This isn’t paranoia - it’s rational risk assessment.

Local LLM Alternatives

I’ve been running Llama 3 8B locally for similar tasks. Performance comparison:

Task                    Claude 3.5   Llama 3 8B (local)
Simple BQL generation   Excellent    Good
Complex multi-step      Excellent    Moderate
Response latency        1-2s         3-5s (M2 Mac)
Privacy                 ❌           ✅

For personal finance queries, Llama 3 is “good enough.” You don’t need frontier model capabilities to translate “spending last month” into BQL.
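
For reference, my local setup just POSTs to an Ollama server on the default port and asks for a BQL query and nothing else. A sketch (model tag and prompt are whatever you run locally; nothing leaves the machine):

```python
# Local BQL translation via an Ollama server on the default port.
# Model tag and prompt are whatever you run locally; nothing leaves the machine.
import requests

PROMPT = ("Translate this question into a single Beancount Query Language (BQL) "
          "query. Return only the query.\n\nQuestion: {question}")

def translate_locally(question: str, model: str = "llama3:8b") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT.format(question=question),
              "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()

# translate_locally("Total spending last month?")
```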

Hybrid Approach

If you need Claude’s capabilities for complex reasoning:

  1. Pre-aggregate sensitive data locally
  2. Send only anonymized summaries to Claude
  3. Never send raw transactions

Example: Instead of sending “Paid $5,000 to Dr. Smith Psychiatry”, send “Medical expense: $5,000”
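
A minimal sketch of that anonymization step. The category mapping is made up and the input format is simplified; the point is that payees and narrations never leave your machine:

```python
# Aggregating and anonymizing locally before anything is sent to a cloud model.
# The category mapping is made up; payees and narrations never leave your machine.
from collections import defaultdict

CATEGORY_MAP = {
    "Expenses:Medical": "Medical expense",
    "Expenses:Food:Dining": "Dining expense",
    "Expenses:Utilities": "Utility expense",
}

def anonymize(transactions):
    """transactions: iterable of (account, payee, narration, amount) tuples."""
    totals = defaultdict(float)
    for account, _payee, _narration, amount in transactions:
        for prefix, label in CATEGORY_MAP.items():
            if account.startswith(prefix):
                totals[label] += amount
                break
        else:
            totals["Other expense"] += amount
    return [f"{label}: ${total:,.2f}" for label, total in sorted(totals.items())]

# anonymize([("Expenses:Medical", "Dr. Smith Psychiatry", "Session", 5000.00)])
# -> ['Medical expense: $5,000.00']
```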

Would love to hear if others have found good local model setups for this use case.

I’ve been using the Beancount Telegram Bot project you mentioned for about 6 months. Some thoughts:

What Works Well

Multi-user support - My wife and I both use it. She can log expenses without learning Beancount syntax. She just types “Coffee at Starbucks $5.50” and it creates the transaction.

Web interface - There’s a companion web UI for reviewing/editing what the bot created. Essential for catching mistakes.

LLM integration - The newer versions have LLM support for parsing complex transactions. “Split dinner with John, my share was $45 including tip” → correctly creates the transaction with proper accounts.

Limitations I’ve Hit

  1. Query capabilities are basic - It’s designed more for data entry than analysis. Your architecture with BQL translation is more powerful.

  2. Context window - For complex queries, you need transaction history in context. The bot doesn’t handle this well.

  3. No proactive insights - It’s reactive only. Your “utility bills increased 30%” feature is the killer app I wish it had.

Suggestions for Your Build

  • Tool calling is the way to go for multi-step reasoning. Let the LLM decide what queries to run, examine results, run follow-ups. Much more flexible than pre-defined patterns.

  • For caching, consider SQLite with pre-computed monthly/quarterly aggregates. Update on ledger change (use file watcher).

  • Time context: I solved this by always including “Today is {date}” in the system prompt, plus explicit rules like “last month = previous calendar month, not past 30 days”.

Happy to share my prompt engineering if helpful.

Honest question: what does the LLM add that Fava + BQL doesn’t already do?

Fava Already Handles Most of This

Your example queries:

  • “What did I spend on restaurants in Q3?” → Fava’s Income Statement, filter by account and date
  • “Am I on track for my annual savings goal?” → Fava’s Net Worth chart + custom budget queries
  • “Compare my utility bills year-over-year” → BQL query, 2 minutes to write
  • “Flag any unusual transactions” → Fava has built-in duplicate detection; for anomalies, write a custom query

I have a queries.beancount file with ~30 saved BQL queries covering 95% of my analysis needs. Takes maybe 10 seconds to run one.

Where I See LLM Value

Maybe:

  1. First-time users who don’t know BQL
  2. Truly ad-hoc questions you’ve never asked before
  3. Conversational context - follow-up questions

But for someone who already knows Fava and BQL, is the LLM layer worth the:

  • Privacy concerns (per @privacy_first)
  • Added complexity
  • API costs
  • Potential for hallucinated wrong answers

Not trying to be negative - genuinely curious what the compelling use case is for experienced Beancount users. The proactive insights are interesting, but couldn’t those be cron jobs with BQL + threshold alerts?

Really valuable feedback, all of you. Let me address each point:

@privacy_first - Local LLMs

You’ve convinced me. I’m going to try Llama 3 8B for the BQL translation layer. Your point about “good enough” is key - I don’t need GPT-4 level reasoning to parse “spending last month.”

The anonymization strategy is smart for when I do need cloud capabilities. I could do:

  • Local Llama for query translation
  • Local aggregation and anonymization
  • Cloud LLM only for high-level insights on anonymized data

That keeps raw transactions entirely local while still getting sophisticated analysis. Will report back on latency on my M1 Max.

@bot_builder - Practical Tips

Tool calling it is! I was hesitant because of complexity, but you’re right - pre-defined patterns are too rigid for open-ended questions.
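
For concreteness, here’s roughly the loop I’m planning, sketched against the Anthropic Messages API with a single run_bql tool (untested; run_bql_query is my local BQL executor):

```python
# Rough sketch of the tool-calling loop (untested). One run_bql tool; the model
# decides which queries to run, sees the results, and iterates until it answers.
import anthropic

TOOLS = [{
    "name": "run_bql",
    "description": "Run a Beancount Query Language (BQL) query against the "
                   "ledger and return the result rows as text.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

client = anthropic.Anthropic()

def answer(question: str, run_bql_query) -> str:
    """run_bql_query: callable that executes a BQL string locally and returns rows."""
    messages = [{"role": "user", "content": question}]
    while True:
        response = client.messages.create(
            model="claude-3-5-sonnet-latest",
            max_tokens=1024,
            tools=TOOLS,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute every requested query locally and feed the results back.
        messages.append({"role": "assistant", "content": response.content})
        results = [{
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": str(run_bql_query(block.input["query"])),
        } for block in response.content if block.type == "tool_use"]
        messages.append({"role": "user", "content": results})
```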

The SQLite caching idea is perfect. I’ll use inotify to watch the .beancount file and rebuild aggregates on change. Most queries hit pre-computed data; complex ones fall back to BQL.
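
Rough sketch of the rebuild path, using the watchdog package (inotify-backed on Linux) and a single pre-aggregated table; table name and schema are placeholders:

```python
# Rough sketch of the cache rebuild: watchdog (inotify-backed on Linux) watches
# the ledger and rebuilds a monthly-aggregates table in SQLite on every change.
# Table name and schema are placeholders; assumes a single-currency ledger.
import sqlite3
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

from beancount import loader
from beancount.query import query

LEDGER = "ledger.beancount"
CACHE_DB = "cache.db"

def rebuild_cache():
    entries, _, options_map = loader.load_file(LEDGER)
    bql = ('SELECT account, year, month, sum(position) '
           'WHERE account ~ "^Expenses" GROUP BY account, year, month')
    _, rows = query.run_query(entries, options_map, bql, numberify=True)
    with sqlite3.connect(CACHE_DB) as db:
        db.execute("DROP TABLE IF EXISTS monthly_expenses")
        db.execute("CREATE TABLE monthly_expenses "
                   "(account TEXT, year INT, month INT, total REAL)")
        db.executemany("INSERT INTO monthly_expenses VALUES (?, ?, ?, ?)",
                       [(r[0], r[1], r[2], float(r[3] or 0)) for r in rows])

class LedgerChanged(FileSystemEventHandler):
    def on_modified(self, event):
        if event.src_path.endswith(LEDGER):
            rebuild_cache()

if __name__ == "__main__":
    rebuild_cache()
    observer = Observer()
    observer.schedule(LedgerChanged(), path=".", recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    finally:
        observer.stop()
        observer.join()
```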

And yes, please share your prompts! The time context handling is exactly where I’ve been struggling. DM me or happy to start a separate thread.

@viz_fan - The “Why” Question

Fair challenge. For me, the answer is cognitive accessibility.

Yes, I can write that BQL query for year-over-year utility comparison. It takes 2 minutes. But there’s friction:

  1. Open Fava
  2. Navigate to query page
  3. Remember/look up BQL syntax
  4. Type it out
  5. Run and interpret

With the bot, I’m lying in bed and think “hm, electricity bill felt high” → type question → get answer. The barrier to asking is near zero.

It’s like the difference between “I could Google that” and asking a knowledgeable friend. Technically equivalent, but one happens more often.

The proactive insights are the bigger deal though. You’re right they could be cron jobs, but:

  • Cron jobs need threshold configuration upfront
  • LLM can notice unexpected patterns I didn’t think to watch for
  • Natural language alerts are easier to grok than “ALERT: Expenses:Food:Dining > $500”

That said, I think we’re targeting different use cases. If you have 30 BQL queries dialed in and know exactly what you want to monitor, the LLM adds marginal value. If you want exploratory, conversational analysis, it’s a big upgrade.

Thanks all - this thread gave me a clear action plan. Will post updates as I implement.