Building Custom Importers for Weird Data Sources: A Developer's Guide

I’ve been using Beancount for 3 years now to track my journey to FIRE, and one of the biggest initial hurdles was getting my financial data into Beancount in the first place. Not every bank plays nice with standard formats.

The Problem with “Weird” Data Sources

When I first started, my local credit union provided CSV exports, but they were… special. Columns in random order, date formats that changed between exports, transaction descriptions split across multiple fields, and my favorite: negative numbers for deposits (yes, really).

The standard importers couldn’t handle it. I tried tweaking CSV configurations, but eventually realized I needed to write a custom importer.

Why Standard Importers Fail

Most institutions provide data in one of these formats:

  • OFX/QFX: The gold standard - structured, includes balances, rarely breaks
  • CSV: The wild west - every bank does it differently
  • PDF statements: The nightmare - presentation format, not data format
  • Proprietary exports: Good luck with that

When you hit a “weird” data source, you’re usually dealing with CSV quirks or PDF-only statements. Here’s what I’ve encountered:

Credit Union CSV Quirks

  • Non-standard column names (“Trans Date” vs “Date” vs “Posted Date”)
  • Multiple date formats in the same file
  • Missing or inconsistent transaction IDs
  • Description fields that span multiple columns
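
The “multiple date formats in one file” problem is worth a concrete sketch. This is a hedged example, not from any particular importer: the format list is an assumption - put in whatever formats your institution has actually shipped.

```python
from datetime import date, datetime

# Assumption: these are the formats your exports have used over time.
KNOWN_FORMATS = ['%m/%d/%Y', '%m-%d-%y', '%Y-%m-%d']

def parse_flexible_date(date_str):
    """Try each known format in turn; fail loudly on a new one."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(date_str.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {date_str!r}")
```

Failing loudly on an unknown format is deliberate: you want to notice when the bank changes the export again, not silently misparse dates.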

Investment Platform PDF Statements

  • No CSV export at all
  • Tables with varying column layouts
  • Multi-line transactions
  • Summary sections mixed with transaction data

International Banks

  • Date formats (DD/MM/YYYY vs MM/DD/YYYY)
  • Currency symbols embedded in amounts
  • Unicode characters in descriptions
  • Time zones that affect posting dates
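
For the embedded-currency-symbol problem, a small cleaning helper goes a long way. A minimal sketch, assuming a `.` decimal point - European `1.234,56` style amounts would need the separators swapped first:

```python
import re
from decimal import Decimal

def clean_amount(raw):
    """Strip currency symbols, thousands separators, and NBSPs."""
    cleaned = raw.replace('\u00a0', ' ').strip()
    cleaned = re.sub(r'[^\d.,\-]', '', cleaned)   # drop €, $, £, letters
    cleaned = cleaned.replace(',', '')            # thousands separators
    return Decimal(cleaned)
```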

Building a Custom Importer: The Framework Approach

The good news: You don’t have to start from scratch. The beancount-reds-importers framework makes this much easier.

Here’s the basic structure:

from datetime import datetime
from decimal import Decimal

from beancount_reds_importers.libreader import csvreader

class WeirdCreditUnionImporter(csvreader.Importer):
    def initialize_reader(self, file):
        self.reader = csvreader.Reader(file.name)
        self.reader.set_header_map({
            'Trans Date': 'date',
            'Description': 'payee', 
            'Amount': 'amount'
        })
    
    def parse_date(self, date_str):
        # Handle their weird date format
        return datetime.strptime(date_str, '%m-%d-%y').date()
    
    def parse_amount(self, amount_str):
        # Fix their backwards negative amounts
        clean = amount_str.replace('$', '').replace(',', '')
        value = Decimal(clean)
        # They mark deposits as negative, so flip the sign
        return -value if 'DEP' in self.current_row['Description'] else value

This isn’t a complete importer, but it shows the key parts: header mapping, date parsing, and amount handling with institution-specific quirks.

PDF Statements: The Nuclear Option

For institutions that only provide PDF statements, you need to extract text first. I’ve had success with pdfplumber:

import pdfplumber

def extract_transactions_from_pdf(pdf_path):
    transactions = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Extract tables
            tables = page.extract_tables()
            for table in tables:
                for row in table:
                    # looks_like_transaction / parse_row are your own
                    # helpers, written for your statement's layout
                    if looks_like_transaction(row):
                        transactions.append(parse_row(row))
    return transactions

Fair warning: PDFs are fragile. The layout changes, your importer breaks. This is why we beg institutions for OFX or at least clean CSV.

Testing Your Importer

The bean-identify and bean-extract commands are your best friends:

# Test if your importer recognizes the file
bean-identify importers.py statement.csv

# Extract transactions
bean-extract importers.py statement.csv > output.beancount

# Review before importing
cat output.beancount

I always do a manual review of extracted transactions before merging them into my main ledger. Trust, but verify.

Pro Tips

  1. Start simple: Get one statement working before adding features
  2. Version control your importers: They’re code, treat them like code
  3. Document quirks: Future you will forget why you added that regex
  4. Build test cases: Save sample statements for regression testing
  5. Consider the maintenance burden: Sometimes manual entry is faster than maintaining a brittle importer
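
Tip 4 in practice: a minimal pytest-style regression test. The `extract` stub and inline sample are placeholders for your real importer and saved statements - the point is the golden-output pattern.

```python
import csv
import io

def extract(csv_text):
    """Toy stand-in for an importer: returns (date, amount) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row['Date'], row['Amount']) for row in reader]

SAMPLE = "Date,Amount\n2024-01-05,-12.34\n"

def test_known_good_statement():
    # Golden output, saved when the importer last worked correctly.
    assert extract(SAMPLE) == [('2024-01-05', '-12.34')]
```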

The Community Needs You

If you’ve built custom importers, please share them on GitHub. I’ve learned so much from other people’s importers, even for institutions I don’t use - they taught me patterns and approaches I wouldn’t have found on my own.

I’m putting my importers (anonymized) on GitHub this week. If there’s interest, I can do a follow-up post on testing strategies and handling edge cases.

What’s the weirdest data source you’ve had to import from? Anyone dealt with POS systems like Square or international banks with multi-currency statements?

This is an excellent guide, Fred! Really appreciate the detailed code examples and the practical advice.

I went through a similar journey about 3 years ago when I migrated from GnuCash. GnuCash had this… let’s call it “unique” CSV export format that included internal account IDs that meant nothing to Beancount. I spent a weekend fighting with it before I finally gave up and wrote a custom importer.

The Most Important Lesson I Learned

Start with the simplest possible importer that works.

I see a lot of newcomers (and I definitely did this myself) trying to handle every edge case from day one. Don’t do that. Get 80% of your transactions importing cleanly, then gradually add handling for the weird ones as you encounter them.

My first credit union importer was literally:

  • Read CSV
  • Map date/amount/description columns
  • Assume everything is an expense
  • Done

Then over the next few months I added:

  • Detection of income vs expense based on amount sign
  • Better payee extraction from messy description fields
  • Duplicate detection
  • Balance assertion generation

Each addition came from a real need, not anticipating hypothetical problems.
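
For what it’s worth, my duplicate detection started as something this simple - key each transaction on (date, amount, description) and skip anything already in the ledger. A sketch, not my actual code; real importers often key on a bank transaction ID when one exists:

```python
def dedupe(new_txns, existing_keys):
    """Yield only transactions whose key isn't already in the ledger."""
    seen = set(existing_keys)
    for txn in new_txns:
        key = (txn['date'], txn['amount'], txn['description'])
        if key not in seen:
            seen.add(key)
            yield txn
```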

Debugging Tips

The bean-identify and bean-extract commands Fred mentioned are essential. But here’s my workflow:

  1. Test on a SINGLE statement first - Don’t throw your entire year of data at a new importer
  2. Print debug output liberally - I add print statements showing what the importer sees vs what it outputs
  3. Keep “known good” test files - When your importer works on a particular statement, save that as a test case
  4. Check the actual bytes - Sometimes banks put weird invisible characters in CSVs. hexdump has saved me multiple times.
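
If you’d rather stay in Python than reach for hexdump, a quick scan for bytes outside printable ASCII catches BOMs and invisible characters. (Crude on purpose: it will also flag legitimate UTF-8 text, but for a bank CSV that’s usually exactly what you want to see.)

```python
def find_suspect_bytes(raw: bytes, limit=5):
    """Return (offset, byte) pairs for bytes outside printable ASCII."""
    suspects = []
    for i, b in enumerate(raw):
        # Allow tab, LF, CR, and the printable ASCII range.
        if b not in (0x09, 0x0a, 0x0d) and not (0x20 <= b <= 0x7e):
            suspects.append((i, b))
            if len(suspects) >= limit:
                break
    return suspects
```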

My Offer to the Community

If anyone here is struggling with a specific bank or data format, feel free to share a sanitized sample (remove account numbers, real names, amounts, etc.) and I’m happy to help troubleshoot. The Beancount community helped me when I was starting out - just paying it forward.

I migrated from GnuCash, dealt with rental property tracking across multiple banks, and imported 5+ years of historical data. If I could figure it out, you can too. And honestly, the first importer is the hardest - after that, you start recognizing patterns.

This thread hits close to home! I work with small business clients who use various POS systems, and their data exports are consistently challenging.

The POS Problem

Fred asked about Square - that’s actually one of the better ones because at least they provide a comprehensive Transaction Report. But here’s what makes it tricky:

The Transaction Report includes:

  • Card payments (with processing fees)
  • Cash sales (no processing fees)
  • Tips (which may or may not be included in the sale line)
  • Refunds (shown as negative amounts)
  • Deposit transfers (the actual money hitting your bank)

Each needs different accounting treatment, and they’re all in one CSV file.

Last year for a restaurant client, I built an importer that could:

  1. Separate gross sales from processing fees
  2. Track tips separately for payroll purposes
  3. Reconcile daily deposits to the bank account
  4. Handle refunds properly (credit the original expense category, not create negative income)
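
The core of step 1 is just splitting one report row into balanced postings. A hedged sketch - the column names (‘Gross Sales’, ‘Fees’) and account names are assumptions, not Square’s actual export schema:

```python
from decimal import Decimal

def split_sale(row):
    """Split a POS sale row into balanced (account, amount) postings."""
    gross = Decimal(row['Gross Sales'])
    fee = Decimal(row['Fees'])
    net = gross - fee  # what actually hits the bank
    return [
        ('Income:Sales', -gross),
        ('Expenses:Fees:Processing', fee),
        ('Assets:Bank:Checking', net),
    ]
```

The postings always sum to zero, which is what lets the daily deposit reconcile against the bank statement.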

The client kept asking “can’t QuickBooks just import this automatically?” I had to explain that their accounting software doesn’t understand their specific POS quirks any better than a custom script would.

The Value of Getting It Right Once

Here’s my pitch to clients: You build it once, you benefit forever.

That restaurant now has automated monthly books that previously took 20+ hours of manual work. The importer runs in 5 minutes. In the first year, they saved over 200 hours of bookkeeping time. That completely justifies the upfront development investment.

Other POS Systems I’ve Encountered

  • Clover: Multiple report types that need to be combined (Sales + Fees reports)
  • Toast (restaurant POS): They change their CSV format periodically
  • Lightspeed: Pretty good, but their “payments” vs “sales” distinction confuses people
  • Shopify: E-commerce with similar challenges - sales, shipping, taxes, fees, refunds need separation

Has anyone dealt with legacy POS systems that export to proprietary formats? I have one client with an ancient system that only exports .dbf format from the 1990s.

To Mike’s point about starting simple: Absolutely correct. My first POS importer just imported gross sales and ignored everything else. Then I incrementally added fee handling, then tips, then refunds. Each version improved accuracy without overwhelming complexity.

This is exactly what I needed! I’ve been procrastinating on setting up Beancount because I’m intimidated by the data import part.

I’m coming from spreadsheets where I manually entered everything (yes, really :sweat_smile:). My bank provides CSV exports but they’re… not great. The description column is a mess of merchant names, transaction IDs, and random codes all smooshed together.

My Specific Questions

1. PDF statements - which library should I use?

Fred mentioned pdfplumber - is that the current recommended tool? I’ve seen tabula-py and pdfminer mentioned elsewhere. Does it matter? My investment brokerage only provides PDF statements and I have like 3 years of historical data I want to import.

2. How do I structure my importer project?

I’m used to git workflows from my dev job. Should I create a separate repo for my importers? Keep them with my main ledger? What’s the typical directory structure?

my-finances/
  ├── ledger/
  │   └── main.beancount
  ├── importers/
  │   ├── my_bank.py
  │   └── my_credit_card.py
  └── documents/
      └── statements/

Something like that?

3. Starter template?

Is there a minimal working example importer I can copy and modify? The beancount-reds-importers framework Fred mentioned looks promising but the documentation is… sparse. I learn best from working code examples.

What Excites Me

The automation possibilities here are incredible. My DevOps brain is already thinking:

  • Git workflow for my finances (!!!)
  • Automated balance assertions as tests
  • CI/CD for financial reports (okay, maybe that’s overkill)
  • Version-controlled importers that I can improve over time

Mike’s advice about starting simple really resonates. I’m going to try to get ONE month of bank transactions importing cleanly before I worry about historical data or investment accounts.

Has anyone written a “Your First Importer” tutorial that walks through the absolute basics? I’d happily contribute back once I figure this out myself.

Wow, this thread took off! Love the engagement. Let me address some of the questions and share resources.

Responding to Sarah’s Questions

PDF Libraries: I’ve used both pdfplumber and tabula-py. Here’s my take:

  • pdfplumber: Better for complex layouts, gives you more control over table extraction, but requires more code
  • tabula-py: Simpler API, works great for straightforward tables, but can struggle with weird layouts
  • pdfminer: Lower-level, only use if the other two fail

For investment statements, start with tabula-py. If it can’t handle the layout, switch to pdfplumber. I’ve found brokerage statements are usually table-based enough for tabula.

Project Structure: Your proposed structure is almost exactly what I use!

my-finances/
  ├── ledger/
  │   ├── main.beancount
  │   └── accounts/
  ├── importers/
  │   ├── __init__.py
  │   ├── my_bank.py
  │   ├── my_credit_card.py
  │   └── tests/
  ├── downloads/           # Raw statement files
  └── documents/           # Filed statements (organized by bean-file)

I keep importers in the same repo as my ledger because they’re tightly coupled. But I know others who maintain a separate “financial-tools” repo they use across multiple years.

Starter Template: I’m putting together a GitHub repo this week with:

  • Minimal working CSV importer (20 lines of code)
  • Medium complexity importer with edge cases
  • PDF statement importer example
  • Test setup using pytest

Will share the link here when it’s ready. The goal is “Your First Importer” - copy, modify, run.

Responding to Tina’s POS Challenges

Square Importer: I actually have a working Square importer! It handles:

  • Gross sales vs net deposits
  • Processing fees as separate expense
  • Tips tracked separately
  • Refund handling

Want me to anonymize and share it? I built it for a side project (friend’s food truck) but it should work for any Square merchant.

Legacy .dbf format: Oh wow, that’s ancient. I think you’d need to use the dbf Python package to read those files, then convert to a standard format before importing. That’s… painful. Any chance you can convince them to upgrade their POS?

What I’m Learning From This Thread

Mike’s point about “start simple” is so important. I think I scared people with my feature-complete examples.

Here’s what “start simple” looks like in practice:

import csv
from datetime import datetime
from decimal import Decimal

def simple_csv_to_beancount(csv_file):
    with open(csv_file) as f:
        reader = csv.DictReader(f)
        for row in reader:
            date = datetime.strptime(row['Date'], '%m/%d/%Y').strftime('%Y-%m-%d')
            amount = Decimal(row['Amount'])
            desc = row['Description']
            
            print(f"{date} * \"{desc}\"")
            print(f"  Assets:Bank:Checking  {amount} USD")
            print("  Expenses:Unknown")
            print()

That’s it. 15 lines. It’s not perfect, but it converts CSV to Beancount format. You can iterate from there.

Community Resource Idea

Would there be interest in a Beancount Importer Gallery?

A GitHub org where people contribute anonymized importers for specific institutions? Not as maintained packages, but as reference implementations. “Oh, you bank at Chase? Here’s how someone else handled their CSV quirks.”

The hard part of writing importers isn’t the Python - it’s figuring out the institution-specific quirks. A gallery would crowdsource that knowledge.

Thoughts?