My bank is stuck in the stone age. They offer PDF statements and that’s it - no CSV export, no OFX download, nothing that plays nice with importers. I finally cracked the problem and wanted to share my workflow for anyone else dealing with stubborn financial institutions.
The Problem: PDF-Only Banks
Some institutions, especially credit unions, foreign banks, and older regional banks, only provide PDF statements. This creates a few headaches:
- No direct import path - Beancount importers expect structured data
- Tables look like text - PDFs store visual layout, not data semantics
- Inconsistent formatting - Each bank’s PDF is a unique snowflake
- Manual entry is slow - Typing 50+ transactions per statement is brutal
After months of manual entry hell, I decided to automate this.
The Tool Stack
Here’s what I ended up using:
1. pdftotext (from poppler-utils)
# Install on Ubuntu/Debian
sudo apt install poppler-utils
# Install on macOS
brew install poppler
# The magic flag: -layout preserves table structure
pdftotext -layout statement.pdf statement.txt
The -layout flag is essential. Without it, pdftotext mangles tables into unusable text. With it, columns stay aligned and you can actually parse the output.
2. Tabula (for complex tables)
When pdftotext doesn’t cut it, Tabula is the nuclear option:
# Install tabula-py (requires Java)
pip install tabula-py
# Extract tables to CSV
tabula statement.pdf --pages all --outfile transactions.csv
Tabula uses machine learning to detect table boundaries and does surprisingly well with bank statements.
3. The Simple Importer Pattern
Here’s my basic workflow for converting extracted text to Beancount:
import re
from datetime import datetime
def parse_statement_line(line):
"""Parse a transaction line from pdftotext output."""
# Pattern: MM/DD/YYYY Description Amount
pattern = r'(\d{2}/\d{2}/\d{4})\s+(.+?)\s+(-?\$[\d,]+\.\d{2})'
match = re.search(pattern, line)
if match:
date_str, desc, amount_str = match.groups()
date = datetime.strptime(date_str, '%m/%d/%Y')
amount = float(amount_str.replace('$', '').replace(',', ''))
return {
'date': date,
'description': desc.strip(),
'amount': amount
}
return None
def generate_beancount(transactions, account):
"""Generate Beancount entries from parsed transactions."""
output = []
for txn in transactions:
entry = f'''{txn['date'].strftime('%Y-%m-%d')} * "{txn['description']}"
{account} {txn['amount']:.2f} USD
Expenses:Uncategorized
'''
output.append(entry)
return '\n'.join(output)
My Actual Workflow
Step 1: Extract with pdftotext
pdftotext -layout ~/Downloads/statement-2025-01.pdf ~/tmp/statement.txt
Step 2: Inspect the Output
cat ~/tmp/statement.txt
This shows me the raw layout. Usually transactions look something like:
01/03/2025 AMAZON MARKETPLACE -47.23
01/05/2025 DIRECT DEPOSIT ACME INC 2,450.00
01/07/2025 SHELL OIL 12345 -45.67
Step 3: Run Through My Parser
python parse_credit_union.py ~/tmp/statement.txt >> journal/2025/01-january.beancount
Step 4: Validate and Categorize
bean-check main.beancount
# Then manually review/categorize in Fava
Tips for Better Extraction
Tip 1: Handle Multiple Formats
Different banks format amounts differently:
def parse_amount(text):
"""Handle various amount formats."""
text = text.strip()
# Remove currency symbols and spaces
text = re.sub(r'[$€£\s]', '', text)
# Handle parentheses as negative
if text.startswith('(') and text.endswith(')'):
text = '-' + text[1:-1]
# Handle CR/DR suffixes
if text.endswith('CR'):
text = text[:-2]
elif text.endswith('DR'):
text = '-' + text[:-2]
return float(text.replace(',', ''))
Tip 2: Save Your Templates
Once you figure out a bank’s format, save the parser. My folder structure:
importers/
pdf/
credit_union_xyz.py
foreign_bank_abc.py
investment_firm.py
Tip 3: Use Tabula Templates
Tabula can save templates for recurring statement formats:
import tabula
# First time: define table areas manually
tables = tabula.read_pdf(
'statement.pdf',
pages='all',
area=[100, 0, 700, 600], # top, left, bottom, right
columns=[50, 200, 400, 500] # column boundaries
)
# Save template for future use
tabula.convert_into_by_batch(
'statements/*.pdf',
output_format='csv',
template='credit_union_template.json'
)
When to Use Which Tool
| Situation | Tool | Why |
|---|---|---|
| Simple text tables | pdftotext -layout | Fastest, no dependencies |
| Complex multi-column tables | Tabula | Better table detection |
| Scanned/image PDFs | Tesseract OCR | Text recognition needed |
| Consistent bank format | Custom parser | Most reliable long-term |
The 80% Rule
Here’s my honest take: I don’t try to automate 100%. My goal is:
- Automate 80% - Parse what’s clearly structured
- Manual 20% - Handle edge cases, fees, adjustments
If a transaction doesn’t parse cleanly, I print it for manual review rather than guessing. Better to type 10 weird transactions than debug a parser that’s wrong about 50.
Questions for Discussion
-
What tools do you use for PDF extraction? I’ve heard good things about Camelot but haven’t tried it.
-
Anyone using cloud OCR services? Curious about accuracy vs privacy tradeoffs.
-
How do you handle statement format changes? Banks love to redesign their PDFs every few years.
Would love to hear how others solve this problem. PDF-only banks aren’t going away anytime soon.