Using Claude/GPT to generate tax reports from Beancount - my 2024 workflow

Just finished my 2024 taxes using a workflow that combines Beancount with Claude for analysis and report generation. Sharing what worked in case others want to try it.

The Problem

Every tax season I’d spend 8-10 hours:

  • Categorizing gray-area expenses
  • Calculating home office deduction
  • Pulling together Schedule C numbers
  • Double-checking everything for audit triggers

This year I tried using LLMs to speed things up.

My Workflow

Step 1: Export Beancount Data

BQL queries to extract everything the LLM needs:

SELECT date, narration, account, position 
WHERE account ~ "Expenses" AND year = 2024

Export to JSON for easier LLM parsing.
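The conversion can be scripted. A minimal sketch, assuming you export with bean-query's CSV output (-f csv) and feed the text to a helper; bql_csv_to_json is a hypothetical name, not part of Beancount:

```python
import csv
import io
import json

def bql_csv_to_json(csv_text: str) -> str:
    """Convert CSV output from a BQL query into a JSON array of
    row dicts, which pastes cleanly into an LLM prompt."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [dict(row) for row in reader]
    return json.dumps(rows, indent=2)

# Columns match the query above; the values are made up for illustration.
sample = """date,narration,account,position
2024-03-02,Adobe Creative Cloud,Expenses:Business:Software,54.99 USD
2024-03-05,Client dinner,Expenses:Business:Meals,3000.00 USD
"""
print(bql_csv_to_json(sample))
```

Keeping the column names in the JSON keys helps the LLM reference specific transactions in its answers.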

Step 2: Claude for Analysis

I use Claude 3.5 Sonnet for:

Expense categorization review

  • “Review these transactions and flag any that might be miscategorized for tax purposes”
  • Found 12 expenses I had wrong (e.g., software subscription I marked as Office Supplies should be Computer & Internet)

Deduction optimization

  • “Given these business expenses, identify any potential deductions I might be missing”
  • Suggested I could deduct a portion of my phone bill (hadn’t thought of it)

Audit risk assessment

  • “Flag any expenses that might trigger IRS scrutiny”
  • Identified a $3,000 client dinner that needs better documentation

Step 3: Generate Reports

Ask Claude to format Schedule C categories:

Based on this expense data, generate a summary for Schedule C including:
- Line 8: Advertising
- Line 17: Legal and professional services
- Line 18: Office expense
...
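Since this summary is just a grouped sum, it's easy to sanity-check the LLM's totals deterministically. A sketch over the JSON export; the account-to-line mapping and field names are assumptions, so adapt them to your own chart of accounts:

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical mapping from Beancount accounts to Schedule C lines.
SCHEDULE_C_LINES = {
    "Expenses:Business:Advertising": "Line 8: Advertising",
    "Expenses:Business:Legal": "Line 17: Legal and professional services",
    "Expenses:Business:Office": "Line 18: Office expense",
}

def schedule_c_summary(rows):
    """Sum exported expense rows into Schedule C line totals.
    Each row is a dict with 'account' and a 'position' like '54.99 USD'."""
    totals = defaultdict(Decimal)
    for row in rows:
        line = SCHEDULE_C_LINES.get(row["account"])
        if line:
            amount = Decimal(row["position"].split()[0])
            totals[line] += amount
    return dict(totals)

rows = [
    {"account": "Expenses:Business:Advertising", "position": "120.00 USD"},
    {"account": "Expenses:Business:Office", "position": "54.99 USD"},
    {"account": "Expenses:Business:Office", "position": "30.01 USD"},
]
print(schedule_c_summary(rows))
```

If the LLM's Schedule C numbers differ from this pass, one of the two made a categorization call worth investigating.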

Results

  • Time spent: ~3 hours (down from 8-10)
  • Found ~$2,000 in deductions I would have missed
  • Much more confident in categorizations

Tools Used

  • TaxGPT - I also tested this. It uses GPT-4o and Claude 3.5, specifically trained on tax law. Less hallucination risk than raw ChatGPT.
  • Claude 3 Opus - Some third-party write-ups I found claim it performs best on tax return review tasks. I haven't benchmarked this myself.

Concerns & Caveats

Hallucination risk is real. One study from Stanford Law found GPT-4 does well on tax questions but isn’t perfect. I verify everything Claude suggests against IRS publications.

This is NOT tax advice. I still have a CPA review my return. The LLM is for initial analysis and organization, not final decisions.

Privacy - Yes, I’m sending financial data to Anthropic. I’m comfortable with their privacy policy, but you might not be. Local LLMs are an alternative.

Questions for the Community

  1. Anyone using the beancount.io IRS audit preparation guide? How does it compare?
  2. Better ways to structure the BQL → LLM pipeline?
  3. Any Beancount plugins specifically for tax reporting?

Would love to hear other approaches.

This is a creative approach, but I want to add some professional cautions. (I’m a CPA who uses Beancount for my own books.)

Liability Concerns

When you use LLM suggestions for tax decisions, you’re taking on risk:

  1. No professional liability protection - If Claude tells you something is deductible and it’s not, you pay the penalty. A CPA has malpractice insurance.

  2. No engagement letter - There’s no defined scope. Claude will answer anything you ask, even if it’s the wrong question.

  3. No representation - If you get audited, Claude can’t represent you. Your CPA can.

What I Think LLMs Are Good For

✅ Organization and formatting - Generating Schedule C summaries from your data
✅ First-pass review - Flagging obvious miscategorizations
✅ Research starting point - “What are the rules for home office deduction?”

What I Think They’re NOT Good For

❌ Complex tax situations - Multi-state, international, crypto
❌ Gray areas - “Is this a hobby or business?”
❌ Audit defense - Anything involving IRS correspondence

Re: beancount.io IRS Audit Guide

Yes, I’ve used it! The guide is excellent for documentation practices:

  • How to structure your chart of accounts for tax categories
  • Attaching receipt images as metadata
  • Generating audit-ready reports

It’s more about process than LLM assistance. The idea is that if you set up Beancount correctly, you can produce IRS-ready documentation in minutes during an audit. Git history provides timestamp proof.
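For reference, attaching a receipt can be as simple as transaction metadata; the account names and file path below are made-up examples:

```beancount
2024-06-20 * "Office Depot" "Printer paper"
  receipt: "receipts/2024-06-20-office-depot.pdf"
  Expenses:Business:Office     23.50 USD
  Liabilities:CreditCard      -23.50 USD
```

A query filtering on the receipt metadata then tells you instantly which expenses still lack documentation.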

My Recommendation

Use the LLM as a first draft generator, not a decision maker. Treat its output like you’d treat a junior accountant’s work - helpful but needs review.

Your CPA review step is exactly right. Keep doing that.

Interesting workflow! I take a different approach - pure BQL for tax reporting, no LLM involved.

My Tax Query Library

I maintain a tax-reports.beancount file with queries for every Schedule C line:

2024-01-01 query "schedule-c-line-8-advertising"
  SELECT sum(position) WHERE 
    account ~ "Expenses:Business:Advertising" 
    AND year = 2024

2024-01-01 query "schedule-c-line-17-legal"
  SELECT sum(position) WHERE 
    account ~ "Expenses:Business:(Legal|Professional)" 
    AND year = 2024

2024-01-01 query "schedule-c-line-18-office"
  SELECT sum(position) WHERE 
    account ~ "Expenses:Business:Office" 
    AND NOT account ~ "HomeOffice"
    AND year = 2024

I include tax-reports.beancount from my main ledger, then run each stored query from the bean-query shell (e.g. RUN schedule-c-line-8-advertising) and get exact numbers for every line.

Why I Prefer This

  1. Deterministic - Same query, same result. No hallucination risk.
  2. Auditable - I can show exactly how each number was calculated
  3. Fast - Sub-second execution vs waiting for API response
  4. Free - No API costs

For Gray Area Categorization

I handle this at transaction entry time, not at tax time:

2024-06-15 * "Adobe Creative Cloud" #tax-deductible
  Expenses:Business:Software   54.99 USD
    tax_category: "computer-and-internet"
    business_use_percent: 80
  Liabilities:CreditCard      -54.99 USD

The tax_category metadata means I never have to re-categorize later.
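At tax time the metadata also makes the deductible portion a pure calculation rather than a judgment call. A sketch, with field names following the example above:

```python
from decimal import Decimal

def deductible_amount(amount, business_use_percent):
    """Apply a business-use percentage to an expense amount,
    rounded to cents as it would be reported."""
    portion = Decimal(str(amount)) * Decimal(business_use_percent) / Decimal(100)
    return portion.quantize(Decimal("0.01"))

# The Adobe example above: 80% business use of a 54.99 USD subscription.
print(deductible_amount("54.99", 80))  # 43.99
```

Scripting this over all postings carrying business_use_percent gives per-category deductible totals with no LLM in the loop.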

When I’d Use an LLM

Maybe for:

  • Initial chart of accounts setup (“what categories do I need for Schedule C?”)
  • One-time research (“what’s the mileage rate for 2024?”)

But not for ongoing tax prep. The queries are a solved problem - why add complexity?

That said, your “flag unusual expenses” use case is interesting. I might try that as a one-time audit.

I’m curious about the hallucination risk in practice. You mentioned Stanford Law research - can you share more specifics?

My Concern

Tax law is full of edge cases and exceptions. An LLM that’s “mostly right” can be dangerous because:

  1. You might trust incorrect advice that sounds confident
  2. The wrong answer might be more expensive than no answer
  3. Penalties for incorrect deductions are real ($$$)

Specific Questions

On categorization:
When Claude suggested moving a software subscription from Office Supplies to Computer & Internet - did it cite the specific IRS publication? Or just make a judgment call? Both might be technically correct in different contexts.

On the phone bill deduction:
Did it explain the safe harbor rules vs actual expense method? Did it warn you about documentation requirements? A partial answer here could actually hurt you.

On the $3k dinner flag:
Good catch, but did it explain why large entertainment expenses get scrutiny? The business purpose requirement, attendee documentation, etc?

What I’d Want to See

If LLMs are going to be used for tax prep, I’d want:

  1. Citations - “Per IRS Publication 535, Section 12…”
  2. Confidence scores - “80% certain this applies to your situation”
  3. Explicit limitations - “I cannot determine if this qualifies as a business expense without knowing X”

TaxGPT claims to reduce hallucinations by searching tax law in real-time. Have you compared its answers to raw Claude? Curious if it’s actually better.

I’m not anti-AI for taxes - just want to understand the failure modes before adopting.

Great discussion - exactly the kind of critical feedback I needed. Let me address each point:

@cpa_view - Liability Framework

You’re absolutely right about the risk allocation. I should have been clearer: the LLM is my pre-CPA filter, not a replacement.

The workflow is:

  1. LLM generates first draft and flags issues
  2. I review and research anything flagged
  3. CPA gets organized, pre-reviewed package
  4. CPA makes final decisions

My CPA actually likes this - she said I’m one of her most organized clients now. The LLM does the tedious categorization work, she focuses on judgment calls.

Thanks for the beancount.io audit guide link. I’ll implement their metadata practices for receipts.

@query_master - BQL Tax Library

Your tax_category metadata approach is really smart. I’m going to adopt that.

2024-06-15 * "Adobe Creative Cloud"
  Expenses:Business:Software   54.99 USD
    tax_category: "computer-and-internet"
    business_use_percent: 80
  Liabilities:CreditCard      -54.99 USD

This moves the categorization decision to entry time when context is fresh, rather than tax time when I’ve forgotten why I bought something.

For the query library - would you be open to sharing your full set of Schedule C queries? I’d love to start with those and only use the LLM for edge cases.

@ai_skeptic - Fair Challenges

You’re right that I glossed over the verification details. Let me be more specific:

On the software categorization: Claude didn’t cite specific IRS pubs initially. When I asked it to justify, it referenced Pub 535 (Business Expenses). I then verified in the actual publication. It was correct, but you’re right - I shouldn’t have trusted it without the citation.

On phone bill: It mentioned needing to document business vs personal use but didn’t specify safe harbor vs actual expense method. This is exactly the kind of incomplete answer that could cause problems. I researched separately.

Stanford reference: The paper is “Large Language Models as Tax Attorneys” from Stanford Law - you can find the PDF on their site. Key finding: GPT-4 performs well but makes errors, especially on multi-step reasoning. They recommend human verification for anything beyond simple lookups.

TaxGPT comparison: I did test it. It does cite sources in real-time which is better than raw Claude. Still needs verification, but the citations make that faster.

Updated Recommendations

Based on this thread, I’d refine my advice:

  1. Use LLM for organization and formatting - Low risk, high value
  2. For tax rules, require citations - “Explain why, citing IRS publications”
  3. Verify everything - Even with citations, check the source
  4. Categorize at entry time - @query_master’s metadata approach
  5. CPA reviews everything - The LLM is never the final word

Thanks for the sanity checks. This community is great for catching blind spots.