Building Smart Transaction Categorization Rules for Plaid Imports

One of the most time-consuming parts of bank transaction imports is categorization. After helping numerous clients set up Beancount with automated imports, I’ve developed a systematic approach to building categorization rules that handle most transactions automatically.

The Categorization Challenge

When you pull transactions from Plaid (or any bank feed), you get:

  • A payee name (often cryptic: “CHECKCARD 0127 WHOLEFDS MKT #10847”)
  • An amount
  • A date
  • Plaid’s AI-suggested category (helpful but not always accurate)

Your goal is to map these to proper Beancount accounts like Expenses:Food:Groceries.

My Categorization Strategy

Level 1: Exact Match Rules (Highest Priority)

For recurring, predictable transactions:

# Monthly subscriptions - exact amounts help identify
NETFLIX.COM|Expenses:Entertainment:Streaming
SPOTIFY.*|Expenses:Entertainment:Streaming
GITHUB.*|Expenses:Software:Development

# Utilities with predictable names
XCEL ENERGY.*|Expenses:Utilities:Electric
COMCAST.*|Expenses:Utilities:Internet

Level 2: Pattern-Based Rules

For common vendors with variable transaction descriptions:

# Groceries - multiple store formats
WHOLE FOODS.*|Expenses:Food:Groceries
WHOLEFDS.*|Expenses:Food:Groceries
TRADER JOE.*|Expenses:Food:Groceries
KROGER.*|Expenses:Food:Groceries
SAFEWAY.*|Expenses:Food:Groceries

# Gas stations
SHELL.*|Expenses:Transportation:Fuel
CHEVRON.*|Expenses:Transportation:Fuel
EXXON.*|Expenses:Transportation:Fuel
BP.*GAS.*|Expenses:Transportation:Fuel

# Amazon - tricky because it could be anything
# Use subcategories based on common patterns
AMAZON.COM.*AMZN.*|Expenses:Shopping:Online
AMAZON PRIME.*|Expenses:Entertainment:Streaming
AMZN MKTP.*|Expenses:Shopping:Online

Level 3: Category Fallbacks

For less common transactions, use broader patterns:

# Restaurants - catch-all after specific favorites
DOORDASH.*|Expenses:Food:Delivery
UBER EATS.*|Expenses:Food:Delivery
GRUBHUB.*|Expenses:Food:Delivery
.*RESTAURANT.*|Expenses:Food:DiningOut
.*CAFE.*|Expenses:Food:DiningOut
.*PIZZA.*|Expenses:Food:DiningOut

# Generic retail
.*PHARMACY.*|Expenses:Health:Pharmacy
.*HARDWARE.*|Expenses:Home:Maintenance

Advanced Techniques

Using Plaid Categories as Hints

Plaid provides categories like “Food and Drink > Restaurants”. You can use these in your template:

{date} * "{payee}"
  plaid_category: "{category}"
  {account}  {amount} {currency}
  {posting_account}

Then write a post-processing script that uses uncategorized transactions with Plaid hints:

def suggest_account_from_plaid_category(plaid_cat):
    mapping = {
        "Food and Drink > Restaurants": "Expenses:Food:DiningOut",
        "Food and Drink > Groceries": "Expenses:Food:Groceries",
        "Travel > Airlines": "Expenses:Travel:Flights",
        "Transfer > Payroll": "Income:Salary",
    }
    return mapping.get(plaid_cat, "Expenses:Uncategorized")

Handling Regex Edge Cases

Some payee names are tricky. Here’s a debugging approach:

import re

test_payees = [
    "CHECKCARD 0127 WHOLEFDS MKT #10847 CA",
    "POS DEBIT VISA CHECKCARD 0814 SHELL OIL 57442",
    "ELECTRONIC/ACH DEBIT XCEL ENERGY",
]

rules = [
    (r"WHOLEFDS|WHOLE FOODS", "Expenses:Food:Groceries"),
    (r"SHELL.*|CHEVRON.*", "Expenses:Transportation:Fuel"),
    (r"XCEL ENERGY", "Expenses:Utilities:Electric"),
]

for payee in test_payees:
    matched = False
    for pattern, account in rules:
        if re.search(pattern, payee, re.IGNORECASE):
            print(f"{payee[:40]:40} -> {account}")
            matched = True
            break
    if not matched:
        print(f"{payee[:40]:40} -> UNCATEGORIZED")

Split Transaction Handling

Some transactions need manual splitting (e.g., Target where you bought groceries AND household items). I mark these for review:

# Big box stores - flag for manual review
TARGET.*|Expenses:Shopping:BigBox:NeedsReview
WALMART.*|Expenses:Shopping:BigBox:NeedsReview
COSTCO.*|Expenses:Shopping:BigBox:NeedsReview

Then I have a weekly task to review anything in :NeedsReview accounts and properly split them.

My Categorization Accuracy Over Time

After building rules for 6 months:

  • Month 1: ~60% auto-categorized correctly
  • Month 3: ~80% auto-categorized correctly
  • Month 6: ~92% auto-categorized correctly

The key is reviewing uncategorized transactions weekly and adding new rules as patterns emerge.

Questions for Discussion

  1. How do others handle the “Amazon could be anything” problem?
  2. Anyone using machine learning for categorization (like smart_importer)?
  3. What’s your strategy for handling split transactions from big box stores?

Would love to hear how others approach this challenge!

This is an excellent systematic approach! Let me share some of my regex patterns that handle common edge cases:

Complex Payee Name Patterns

Bank transaction descriptions are notoriously inconsistent. Here are patterns I’ve refined:

# Handle various bank prefixes
PATTERNS = [
    # Strip common prefixes first
    (r"^(CHECKCARD|POS DEBIT|VISA |DEBIT |ACH |ELECTRONIC/)?\s*\d{4}\s*", ""),
    
    # Then match the actual merchant
    (r"AMZN\.COM/BILL.*|AMAZON\.COM\*.*|AMZN MKTP US\*.*", "Expenses:Shopping:Amazon"),
    (r"SQ \*|SQUARE \*", ""),  # Square merchants need special handling
]

The Amazon Problem

I’ve actually given up on auto-categorizing Amazon. Instead, I:

  1. Put all Amazon in Expenses:Shopping:Amazon:ToReview
  2. Cross-reference with my Amazon order history CSV (you can download this!)
  3. Use a Python script to suggest splits based on order details
def match_amazon_order(transaction_amount, transaction_date):
    """Try to match a transaction to an Amazon order."""
    # Load Amazon order history
    orders = load_amazon_orders()
    
    # Find orders within 3 days of transaction
    candidates = [o for o in orders 
                  if abs((o.date - transaction_date).days) <= 3
                  and abs(o.total - transaction_amount) < 0.01]
    
    return candidates

It’s not fully automated, but it makes the review process much faster.

smart_importer Experience

Yes! I’ve used smart_importer and it’s quite good once trained:

from smart_importer import PredictPostings

class MyImporter(Importer):
    @PredictPostings()
    def extract(self, file):
        # ... your extraction logic

After ~500 manually categorized transactions, it started predicting with ~85% accuracy. The key is correcting its mistakes - it learns from your corrections.

Downside: Requires maintaining a training dataset and can be confused by unusual transactions.

The split transaction problem is my biggest headache with automation!

My Big Box Store Approach

For clients who shop at Target/Walmart/Costco regularly, I’ve tried a few things:

Option 1: Default Category + Manual Review

Like @accountant_alice, I flag them for review. But I go further by defaulting to the most common category:

# Default to groceries since that is 70% of Costco trips
COSTCO.*|Expenses:Food:Groceries:CostcoReview

Then during review, I only need to split the non-grocery items rather than categorizing everything from scratch.

Option 2: Use Receipt Data

This is more work upfront but pays off:

  1. Client photographs receipts (using an app like Expensify or just their phone)
  2. I match receipts to transactions by date/amount
  3. Split based on actual line items

For one client who does a lot of Costco runs, we set up a simple spreadsheet where they log their Costco trips with rough splits (groceries/household/office). Not perfect, but way better than guessing.

Option 3: Fixed Percentage Split

For clients who don’t want to track receipts, we use historical averages:

# Based on 6 months of data, this client's Costco is:
# 65% groceries, 25% household, 10% office supplies

def split_costco(amount):
    return [
        ("Expenses:Food:Groceries", amount * 0.65),
        ("Expenses:Home:Household", amount * 0.25),
        ("Expenses:Office:Supplies", amount * 0.10),
    ]

Not perfect for any single transaction, but accurate over time.

Question Back to the Group

For those using Plaid: does Plaid provide any itemized transaction data, or is it always just the total? I know some banks show item-level detail in their apps - does that flow through to Plaid?

To answer @bookkeeper_bob’s question: No, Plaid only provides transaction-level data (date, amount, merchant), not itemized line items. The item-level detail you see in banking apps typically comes from the merchant’s receipt data, which doesn’t flow through bank feeds.

That said, I want to add a tax compliance perspective on split transactions:

Why Splits Matter for Taxes

If you’re using Beancount for business or self-employment:

  1. Deductibility varies by category: Groceries aren’t deductible (usually), but office supplies are. Getting splits wrong means over or under-claiming deductions.

  2. Audit risk: The IRS may question large retail purchases that are fully categorized as business expenses. Having proper documentation for splits protects you.

  3. Mixed-use purchases: If you buy both business and personal items in one transaction, you MUST split them. The “percentage split” approach @bookkeeper_bob mentioned is acceptable if consistently applied.

Documentation Requirements

For business deductions, the IRS requires:

  • Amount of the expense
  • Time and place (date and store)
  • Business purpose
  • Business relationship (if applicable)

A bank transaction covers the first two. You need additional documentation (receipt, note) for the latter two.

My Recommendation

For business users of Plaid imports:

  1. Flag all big-box stores for review (as discussed)
  2. Keep receipts for any purchase over $75 or any purchase that might be partially business
  3. Add metadata to your Beancount entries documenting business purpose:
2025-01-15 * "Costco"
  business_purpose: "Office supplies for home office, household items for personal use"
  Expenses:Office:Supplies  85.00 USD
  Expenses:Personal:Household  115.00 USD
  Liabilities:CreditCard:Visa  -200.00 USD

This kind of documentation is exactly what you’d want in an audit.