Bank Statement PDFs to Beancount: OCR and Parsing for the Automation-Averse

My bank is stuck in the stone age. They offer PDF statements and that’s it - no CSV export, no OFX download, nothing that plays nice with importers. I finally cracked the problem and wanted to share my workflow for anyone else dealing with stubborn financial institutions.

The Problem: PDF-Only Banks

Some institutions, especially credit unions, foreign banks, and older regional banks, only provide PDF statements. This creates a few headaches:

  1. No direct import path - Beancount importers expect structured data
  2. Tables look like text - PDFs store visual layout, not data semantics
  3. Inconsistent formatting - Each bank’s PDF is a unique snowflake
  4. Manual entry is slow - Typing 50+ transactions per statement is brutal

After months of manual entry hell, I decided to automate this.

The Tool Stack

Here’s what I ended up using:

1. pdftotext (from poppler-utils)

# Install on Ubuntu/Debian
sudo apt install poppler-utils

# Install on macOS
brew install poppler

# The magic flag: -layout preserves table structure
pdftotext -layout statement.pdf statement.txt

The -layout flag is essential. Without it, pdftotext mangles tables into unusable text. With it, columns stay aligned and you can actually parse the output.

2. Tabula (for complex tables)

When pdftotext doesn’t cut it, Tabula is the nuclear option:

# Install tabula-py (requires Java)
pip install tabula-py

# Extract tables to CSV
tabula statement.pdf --pages all --outfile transactions.csv

Tabula uses machine learning to detect table boundaries and does surprisingly well with bank statements.

3. The Simple Importer Pattern

Here’s my basic workflow for converting extracted text to Beancount:

import re
from datetime import datetime

def parse_statement_line(line):
    """Parse a transaction line from pdftotext output."""
    # Pattern: MM/DD/YYYY  Description  Amount
    pattern = r'(\d{2}/\d{2}/\d{4})\s+(.+?)\s+(-?\$[\d,]+\.\d{2})'
    match = re.search(pattern, line)
    if match:
        date_str, desc, amount_str = match.groups()
        date = datetime.strptime(date_str, '%m/%d/%Y')
        amount = float(amount_str.replace('$', '').replace(',', ''))
        return {
            'date': date,
            'description': desc.strip(),
            'amount': amount
        }
    return None

def generate_beancount(transactions, account):
    """Generate Beancount entries from parsed transactions."""
    output = []
    for txn in transactions:
        entry = f'''{txn['date'].strftime('%Y-%m-%d')} * "{txn['description']}"
  {account}  {txn['amount']:.2f} USD
  Expenses:Uncategorized
'''
        output.append(entry)
    return '\n'.join(output)

My Actual Workflow

Step 1: Extract with pdftotext

pdftotext -layout ~/Downloads/statement-2025-01.pdf ~/tmp/statement.txt

Step 2: Inspect the Output

cat ~/tmp/statement.txt

This shows me the raw layout. Usually transactions look something like:

01/03/2025  AMAZON MARKETPLACE          -47.23
01/05/2025  DIRECT DEPOSIT ACME INC   2,450.00
01/07/2025  SHELL OIL 12345            -45.67

Step 3: Run Through My Parser

python parse_credit_union.py ~/tmp/statement.txt >> journal/2025/01-january.beancount

Step 4: Validate and Categorize

bean-check main.beancount
# Then manually review/categorize in Fava

Tips for Better Extraction

Tip 1: Handle Multiple Formats

Different banks format amounts differently:

def parse_amount(text):
    """Handle various amount formats."""
    text = text.strip()

    # Remove currency symbols and spaces
    text = re.sub(r'[$€£\s]', '', text)

    # Handle parentheses as negative
    if text.startswith('(') and text.endswith(')'):
        text = '-' + text[1:-1]

    # Handle CR/DR suffixes
    if text.endswith('CR'):
        text = text[:-2]
    elif text.endswith('DR'):
        text = '-' + text[:-2]

    return float(text.replace(',', ''))

Tip 2: Save Your Templates

Once you figure out a bank’s format, save the parser. My folder structure:

importers/
  pdf/
    credit_union_xyz.py
    foreign_bank_abc.py
    investment_firm.py

Tip 3: Use Tabula Templates

Tabula can save templates for recurring statement formats:

import tabula

# First time: define table areas manually
tables = tabula.read_pdf(
    'statement.pdf',
    pages='all',
    area=[100, 0, 700, 600],  # top, left, bottom, right
    columns=[50, 200, 400, 500]  # column boundaries
)

# Save template for future use
tabula.convert_into_by_batch(
    'statements/*.pdf',
    output_format='csv',
    template='credit_union_template.json'
)

When to Use Which Tool

Situation Tool Why
Simple text tables pdftotext -layout Fastest, no dependencies
Complex multi-column tables Tabula Better table detection
Scanned/image PDFs Tesseract OCR Text recognition needed
Consistent bank format Custom parser Most reliable long-term

The 80% Rule

Here’s my honest take: I don’t try to automate 100%. My goal is:

  • Automate 80% - Parse what’s clearly structured
  • Manual 20% - Handle edge cases, fees, adjustments

If a transaction doesn’t parse cleanly, I print it for manual review rather than guessing. Better to type 10 weird transactions than debug a parser that’s wrong about 50.

Questions for Discussion

  1. What tools do you use for PDF extraction? I’ve heard good things about Camelot but haven’t tried it.

  2. Anyone using cloud OCR services? Curious about accuracy vs privacy tradeoffs.

  3. How do you handle statement format changes? Banks love to redesign their PDFs every few years.

Would love to hear how others solve this problem. PDF-only banks aren’t going away anytime soon.

This is exactly what I needed! I manage bookkeeping for about 15 small business clients, and at least 5 of them have banks that only provide PDF statements. Your workflow mirrors what I’ve cobbled together over the years.

The Multi-Client PDF Challenge

When you’re dealing with PDFs from multiple banks across multiple clients, consistency becomes critical. Here’s how I’ve adapted your approach:

Standardized Parser Structure

I created a base class that all my PDF parsers inherit from:

class PDFStatementParser:
    """Base class for PDF statement parsers."""

    def __init__(self, pdf_path):
        self.pdf_path = pdf_path
        self.text = self._extract_text()
        self.transactions = []

    def _extract_text(self):
        """Extract text using pdftotext with layout preservation."""
        import subprocess
        result = subprocess.run(
            ['pdftotext', '-layout', self.pdf_path, '-'],
            capture_output=True, text=True
        )
        return result.stdout

    def parse(self):
        """Override in subclass with bank-specific logic."""
        raise NotImplementedError

    def to_beancount(self, account):
        """Convert parsed transactions to Beancount format."""
        entries = []
        for txn in self.transactions:
            entry = f'{txn["date"]} * "{txn["payee"]}" "{txn["narration"]}"\n'
            entry += f'  {account}  {txn["amount"]:.2f} USD\n'
            entry += f'  Expenses:Pending:ToReview\n'
            entries.append(entry)
        return '\n'.join(entries)

Then each client’s bank gets its own subclass:

class FirstNationalParser(PDFStatementParser):
    """Parser for First National Bank PDF statements."""

    TRANSACTION_PATTERN = r'(\d{2}-\d{2})\s+(.+?)\s+\$([\d,]+\.\d{2})'

    def parse(self):
        for line in self.text.split('\n'):
            match = re.search(self.TRANSACTION_PATTERN, line)
            # ... parsing logic

The “PDF Profile” Approach

For each bank, I maintain a profile file that captures the quirks:

{
    "bank_name": "First National Credit Union",
    "statement_type": "monthly",
    "date_format": "%m-%d",
    "amount_format": "currency_symbol",
    "debit_indicator": "negative",
    "header_skip_lines": 8,
    "footer_skip_lines": 3,
    "transaction_pattern": "MM-DD  description  $amount"
}

This documentation saves me hours when I come back to a parser after months.

Tabula Has Saved Me Dozens of Hours

For one particularly nasty client - a restaurant with statements from a small regional bank - the pdftotext output was totally mangled. Tabula was the only thing that worked:

import tabula

def parse_regional_bank(pdf_path):
    # The bank's PDF has tables that span multiple pages
    tables = tabula.read_pdf(
        pdf_path,
        pages='all',
        multiple_tables=True,
        pandas_options={'header': None}
    )

    transactions = []
    for df in tables:
        # Skip non-transaction tables (summaries, ads, etc.)
        if len(df.columns) < 3:
            continue

        for _, row in df.iterrows():
            # Column 0: Date, Column 1: Description, Column 2: Amount
            transactions.append({
                'date': parse_date(row[0]),
                'description': str(row[1]),
                'amount': parse_amount(row[2])
            })

    return transactions

My Advice: Don’t Over-Engineer Early

Fred, you mentioned the 80% rule and I completely agree. When I first started automating PDF imports, I tried to handle every edge case. Big mistake.

Now I follow this progression:

  1. Month 1-2: Manual entry, note patterns
  2. Month 3: Basic pdftotext + regex
  3. Month 6+: Refined parser only if the basic one fails often

Most of my parsers are surprisingly simple - under 50 lines of Python. The complex ones are for banks with genuinely weird formats.

One Gotcha: Encoding Issues

Watch out for character encoding in PDFs. I’ve seen:

  • Smart quotes that break regex
  • Non-breaking spaces instead of regular spaces
  • Unicode currency symbols

My fix:

def clean_text(text):
    """Normalize text extracted from PDF."""
    import unicodedata
    # Normalize unicode
    text = unicodedata.normalize('NFKD', text)
    # Replace smart quotes
    text = text.replace('\u2018', "'").replace('\u2019', "'")
    text = text.replace('\u201c', '"').replace('\u201d', '"')
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text)
    return text

This has saved me more debugging sessions than I can count.

Question for the Group

Anyone dealing with scanned PDF statements? One of my clients receives paper statements and scans them. Tesseract OCR works okay, but accuracy is maybe 85-90%. That 10-15% error rate means I end up manually reviewing everything anyway.

Would love to hear if anyone has found a better approach for scanned documents.

As someone who manages finances for a couple of rental properties plus my personal accounts, I’ve been through this PDF extraction journey. Let me share what I’ve learned over 4+ years.

Start Simpler Than You Think

When I first hit the PDF problem, I immediately tried to build the perfect automated solution. That was a mistake. Here’s what I wish I’d done instead:

The “Screenshot to Notes” Phase

Before I automated anything, I spent three months just:

  1. Opening the PDF
  2. Typing transactions by hand
  3. Writing notes about the pattern I saw

This sounds tedious, but it taught me exactly what I was dealing with. When I finally wrote code, I knew every quirk of that bank’s format.

The Hybrid Approach That Actually Works

For statements with 20-30 transactions, sometimes the fastest approach is semi-manual:

# Extract to text
pdftotext -layout statement.pdf - | tee /dev/tty | pbcopy
# (On macOS, this shows output AND copies to clipboard)

Then paste into a text file and use multi-cursor editing to transform it into Beancount. In VS Code or Sublime, this takes maybe 5 minutes per statement.

Real Talk: When NOT to Automate

I’ve learned to recognize when automation isn’t worth it:

Don’t automate if:

  • You only get 4-6 statements per year from this source
  • The format changes every few months
  • The statement has fewer than 15 transactions
  • You’re spending more time debugging than you’d spend typing

Do automate if:

  • Monthly statements with 50+ transactions
  • Stable format that hasn’t changed in years
  • Multiple accounts with the same bank
  • You’re managing this for clients (Bob’s situation)

My Actual PDF Toolkit

After trying everything, here’s what I actually use:

1. pdftotext (90% of cases)

pdftotext -layout statement.pdf - > statement.txt

Works for most bank statements where tables are rendered as text.

2. pdf2text with different options

Sometimes -raw works better than -layout:

pdftotext -raw statement.pdf -  # No layout preservation
pdftotext -fixed 4 statement.pdf -  # Fixed pitch, 4-column

Try both and see which gives cleaner output for your bank.

3. Tabula (complex tables)

For statements where pdftotext fails completely:

import tabula
df = tabula.read_pdf('statement.pdf', pages='all')[0]
print(df.to_csv(index=False))

4. The Manual Fallback

For really stubborn PDFs, I use a simple template:

; Paste amounts here, one per line:
; 45.23
; 123.45
; 67.89

; Then search-replace to build transactions

A Lesson Learned: Version Your Parsers

One thing I didn’t do early enough: version control my parsers. Banks change their PDF formats more often than you’d think.

Now I tag each parser with the statement date it was tested against:

"""
Credit Union XYZ PDF Parser
Last tested: 2025-01 statement
Known working range: 2023-06 through 2025-01
"""

class CreditUnionXYZ:
    VERSION = "2.1"
    LAST_VERIFIED = "2025-01"
    # ...

When a parser fails, I can check if it’s a format change or a bug.

Encouragement for Newcomers

If you’re new to Beancount and facing PDF-only statements, don’t let this discourage you. Start with manual entry. Seriously.

I know it feels inefficient, but:

  • You’ll learn your bank’s patterns
  • You’ll understand your spending better
  • You’ll know when automation is worth the effort

The time to automate is when manual entry becomes genuinely painful - usually around month 4-6 of any new data source.

Fred’s approach is solid. The pdftotext + simple parser combo works for most situations. Don’t over-engineer until you have to.

This discussion has been incredibly helpful. Let me consolidate what we’ve learned into a practical decision tree for anyone facing the PDF extraction problem.

The PDF Extraction Decision Tree

START: You have a PDF bank statement
│
├─ Is it a native PDF (text is selectable)?
│  │
│  ├─ YES: Use pdftotext -layout
│  │  │
│  │  └─ Does output preserve table structure?
│  │     ├─ YES: Write a simple regex parser
│  │     └─ NO: Try Tabula or pdfplumber
│  │
│  └─ NO (scanned/image PDF):
│     │
│     └─ Is scan quality good (300+ DPI, clear text)?
│        ├─ YES: Tesseract with preprocessing
│        │  └─ Accuracy > 95%? Use it. Otherwise: manual.
│        └─ NO: Manual entry (or cloud OCR if privacy allows)

My Updated Toolkit (Post-Discussion)

Based on everyone’s input, here’s my refined stack:

Layer 1: Native PDFs (Most Common)

# Always try this first
pdftotext -layout statement.pdf statement.txt

# If layout is mangled, try raw mode
pdftotext -raw statement.pdf statement.txt

Layer 2: Table-Heavy PDFs

# pdfplumber gives more control than pdftotext
import pdfplumber

def extract_with_pdfplumber(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        all_text = []
        for page in pdf.pages:
            # Extract tables separately from text
            tables = page.extract_tables()
            text = page.extract_text()

            # Tables are returned as list of lists
            for table in tables:
                for row in table:
                    all_text.append('\t'.join(str(cell) for cell in row if cell))

        return '\n'.join(all_text)

Layer 3: Scanned Documents

# Only when necessary - accuracy varies
import pdf2image
import pytesseract

def ocr_scanned_pdf(pdf_path):
    # Convert PDF pages to images
    images = pdf2image.convert_from_path(pdf_path, dpi=300)

    all_text = []
    for img in images:
        # Tesseract with bank statement-optimized settings
        text = pytesseract.image_to_string(
            img,
            config='--psm 6 --oem 3'
        )
        all_text.append(text)

    return '\n'.join(all_text)

Layer 4: Manual Entry Template

For when automation isn’t worth it:

; ===========================================
; MANUAL ENTRY TEMPLATE
; Bank: [Bank Name]
; Statement Period: [Start Date] to [End Date]
; ===========================================

; Instructions:
; 1. Open PDF side-by-side with this file
; 2. Copy dates and amounts directly
; 3. Use smart categorization below

; Common payees (copy-paste these):
; "AMAZON" -> Expenses:Shopping:Online
; "WHOLE FOODS" -> Expenses:Food:Groceries
; "SHELL OIL" -> Expenses:Auto:Gas

; === TRANSACTIONS ===

2025-01-03 * "PAYEE" "Description"
  Assets:Bank:CreditUnion  -XX.XX USD
  Expenses:Uncategorized

; === END OF STATEMENT ===

Key Takeaways from This Thread

Principle Implementation
Start simple pdftotext before anything else
Don’t over-automate Manual is fine for < 20 txns
Document your parsers Comments, version numbers, test dates
Validate OCR output Never trust extracted data blindly
Handle encoding Normalize unicode before parsing
Privacy matters Local tools for sensitive documents

The “Good Enough” Parser

Here’s a minimal, battle-tested pattern that works for most bank statements:

#!/usr/bin/env python3
"""
Minimal PDF statement parser.
Handles 80% of cases with 20% of code.
"""

import re
import subprocess
import sys
from datetime import datetime

def extract_text(pdf_path):
    """Get text from PDF using pdftotext."""
    result = subprocess.run(
        ['pdftotext', '-layout', pdf_path, '-'],
        capture_output=True, text=True
    )
    return result.stdout

def parse_transactions(text, pattern, date_fmt):
    """Extract transactions matching pattern."""
    transactions = []
    for match in re.finditer(pattern, text):
        try:
            date_str = match.group(1)
            desc = match.group(2).strip()
            amount_str = match.group(3)

            date = datetime.strptime(date_str, date_fmt)
            amount = float(amount_str.replace(',', '').replace('$', ''))

            transactions.append({
                'date': date.strftime('%Y-%m-%d'),
                'description': desc,
                'amount': amount
            })
        except (ValueError, IndexError):
            continue  # Skip malformed lines

    return transactions

def to_beancount(transactions, account):
    """Format as Beancount entries."""
    output = []
    for txn in transactions:
        output.append(
            f'{txn["date"]} * "{txn["description"]}"\n'
            f'  {account}  {txn["amount"]:.2f} USD\n'
            f'  Expenses:Uncategorized\n'
        )
    return '\n'.join(output)

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print("Usage: parse_statement.py statement.pdf")
        sys.exit(1)

    # Configure for your bank
    PATTERN = r'(\d{2}/\d{2}/\d{4})\s+(.+?)\s+\$?([\d,]+\.\d{2})'
    DATE_FMT = '%m/%d/%Y'
    ACCOUNT = 'Assets:Bank:CreditUnion'

    text = extract_text(sys.argv[1])
    transactions = parse_transactions(text, PATTERN, DATE_FMT)
    print(to_beancount(transactions, ACCOUNT))

Copy this, adjust the pattern and date format for your bank, and you’re done.

Thanks everyone for the excellent discussion. This thread has significantly improved my PDF workflow.

Great thread! I wanted to add a different perspective as someone who deals with a LOT of varied document types for my freelance consulting clients.

The Modern OCR Landscape (2025-2026)

The tools mentioned here are solid, but the OCR/extraction space has evolved significantly. Here’s what I’ve learned from processing hundreds of PDFs:

Commercial vs Open Source

Approach Cost Accuracy Privacy
pdftotext/Tabula Free 90-95% (native PDFs) Local
Tesseract OCR Free 85-90% (scanned) Local
AWS Textract ~$1.50/1000 pages 98%+ Cloud
Google Document AI ~$1.50/1000 pages 98%+ Cloud
DocuClipper/Klippa Subscription 99%+ Cloud

For personal use, the free tools are plenty. For client work where accuracy is critical, sometimes the cloud services are worth it.

My Hybrid Workflow

I use different tools depending on the document:

def smart_extract(pdf_path):
    """Choose extraction method based on PDF type."""

    # Check if PDF contains selectable text
    if has_native_text(pdf_path):
        # pdftotext is fast and accurate for native PDFs
        return pdftotext_extract(pdf_path)

    # For scanned documents, need OCR
    if is_high_quality_scan(pdf_path):
        # Tesseract works well for clean scans
        return tesseract_extract(pdf_path)

    # Low quality or complex scans
    # Fall back to cloud service
    return cloud_ocr_extract(pdf_path)

def has_native_text(pdf_path):
    """Check if PDF has embedded text or is image-based."""
    import subprocess
    result = subprocess.run(
        ['pdftotext', pdf_path, '-'],
        capture_output=True, text=True
    )
    # If output is mostly whitespace, it's probably scanned
    text = result.stdout.strip()
    return len(text) > 100

Dealing with Scanned Statements

Bob asked about scanned PDFs. This is where things get tricky. Here’s my approach:

Step 1: Pre-process the Image

Before OCR, clean up the scan:

from PIL import Image
import cv2
import numpy as np

def preprocess_for_ocr(image_path):
    """Clean up scanned image for better OCR accuracy."""
    img = cv2.imread(image_path)

    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Remove noise
    denoised = cv2.fastNlMeansDenoising(gray)

    # Increase contrast
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
    enhanced = clahe.apply(denoised)

    # Binarize (black and white)
    _, binary = cv2.threshold(enhanced, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    return binary

Step 2: Use Tesseract with Proper Config

import pytesseract

def ocr_statement(image):
    """Extract text from preprocessed image."""
    # Use PSM 6 for uniform text blocks (like statements)
    # OEM 3 uses LSTM neural network
    config = '--psm 6 --oem 3 -c preserve_interword_spaces=1'

    text = pytesseract.image_to_string(image, config=config)
    return text

Step 3: Validate and Correct

The key insight: don’t trust OCR output blindly.

def validate_transaction(txn):
    """Check if extracted transaction makes sense."""
    issues = []

    # Date should be within statement period
    if not (statement_start <= txn['date'] <= statement_end):
        issues.append('date_out_of_range')

    # Amount should be reasonable
    if abs(txn['amount']) > 100000:
        issues.append('amount_suspicious')

    # Description should have some letters
    if not re.search(r'[a-zA-Z]', txn['description']):
        issues.append('description_garbled')

    return issues

I flag any transaction with validation issues for manual review. This catches most OCR errors before they corrupt my ledger.

The Real Cost of Errors

Here’s why I’m careful with OCR accuracy: I had a client whose scanned statement showed “$1,234.56” but OCR read it as “$1,284.56”. A single digit difference of $50.

It took me 2 hours to find that error during quarterly reconciliation. The time “saved” by automation was completely lost - plus some.

Now I follow this rule: if OCR accuracy is below 95%, I do manual entry. The mental overhead of constant validation isn’t worth it.

Privacy Considerations

For anyone tempted by cloud OCR services: think carefully about what you’re uploading.

Bank statements contain:

  • Account numbers
  • Transaction history
  • Balance information
  • Sometimes SSN/tax IDs

I only use cloud services for:

  • Anonymized test documents
  • Clients who explicitly consent
  • Documents without sensitive identifiers

For personal finances, I stick to local tools exclusively. The accuracy tradeoff is worth the privacy.

Tools I’d Recommend Trying

If pdftotext + Tabula aren’t cutting it:

  1. Camelot - Python library specifically for PDF tables
  2. pdfplumber - More control than pdftotext, pure Python
  3. pdf2image + Tesseract - For scanned documents
  4. PyMuPDF (fitz) - Low-level PDF manipulation

Each has strengths for different PDF types. Worth experimenting to find what works for your specific statements.