CausalTAD: Causal Column Ordering for LLM Tabular Anomaly Detection

· 6 min read
Mike Thrift
Marketing Manager

The previous log covered AnoLLM, which fine-tunes a small LLM to score tabular anomalies via negative log-likelihood. CausalTAD (arXiv:2602.07798) asks a sharp follow-up question: does the order in which you feed columns to that LLM matter? The answer, it turns out, is yes — and injecting causal structure into the ordering gives you a consistent, reproducible lift.

The paper

Wang et al. propose CausalTAD, a method that sits on top of AnoLLM-style LLM anomaly detectors and makes one targeted change: instead of serializing tabular rows in random or arbitrary column order, it discovers causal dependencies between columns and reorders them to respect those dependencies before the LLM reads the row.

The paper has two moving parts. First, a causal-driven column ordering module. The authors adapt the COAT factor-extraction framework: an LLM reads column metadata and samples to extract high-level semantic factors (for credit card transactions, a factor like "Compensation" might span the amount and merchant columns). From these factors, three causal discovery algorithms — PC, LiNGAM, and FCI — each build a directed causal graph over factors. Column reordering then reduces to a Linear Ordering Problem: find the permutation π that maximizes the sum of directed-edge weights, so that cause columns appear before effect columns in the serialized text. Because this problem has many near-optimal solutions, they sample K ≈ 10 orderings that score within 90% of the optimum and average over them.
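A minimal sketch of the ordering step (not the authors' code): the factor names and edge weights below are invented toy values standing in for whatever a causal discovery run would produce, and the Linear Ordering Problem is solved by brute force, which is only feasible for a handful of factors.

```python
from itertools import permutations

# Toy causal graph over factors: weight[(a, b)] > 0 means "a causes b".
# These values are hypothetical, not from the paper.
edge_weight = {
    ("Compensation", "Spending"): 0.9,
    ("Compensation", "Merchant"): 0.4,
    ("Spending", "Merchant"): 0.6,
}
factors = ["Merchant", "Spending", "Compensation"]

def ordering_score(order):
    """Sum of weights of edges whose cause precedes its effect."""
    pos = {f: i for i, f in enumerate(order)}
    return sum(w for (a, b), w in edge_weight.items() if pos[a] < pos[b])

# Brute-force Linear Ordering Problem: rank all permutations by score.
scored = sorted(permutations(factors), key=ordering_score, reverse=True)
best = ordering_score(scored[0])

# Keep every ordering within 90% of the optimum; the paper samples
# K of these and averages scores over them.
near_optimal = [o for o in scored if ordering_score(o) >= 0.9 * best]
print(near_optimal[0])  # ('Compensation', 'Spending', 'Merchant')
```

With these weights only the fully causal ordering clears the 90% bar, but on real graphs with conflicting edges several orderings typically qualify, which is what makes averaging over K of them meaningful.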

Second, a causal-aware reweighting module. Not all columns are equally relevant. A column that influences many factors gets a higher weight αj = |M⁻¹(cj)|, the count of factors it contributes to. The final anomaly score is the weighted average of per-column negative log-likelihoods across the K orderings.
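The reweighted score can be sketched as follows; the column names, weights, and NLL values are invented placeholders (a real system would take the per-column negative log-likelihoods from the LLM), and the normalization choice is an assumption rather than something the paper specifies.

```python
# Hypothetical sketch of causal-aware reweighting, not the authors' code.
# alpha[j] = |M^{-1}(c_j)|: how many factors column j contributes to.
alpha = {"amount": 2, "merchant": 1, "memo": 1}

# Per-column negative log-likelihoods for one row, under each of the
# K near-optimal column orderings (here K = 2, toy numbers).
nll_per_ordering = [
    {"amount": 3.2, "merchant": 1.1, "memo": 0.7},
    {"amount": 3.0, "merchant": 1.3, "memo": 0.8},
]

def anomaly_score(nlls, weights):
    """Weighted average of per-column NLLs, then averaged over orderings."""
    total_w = sum(weights.values())
    per_ordering = [
        sum(weights[c] * nll[c] for c in weights) / total_w
        for nll in nlls
    ]
    return sum(per_ordering) / len(per_ordering)

score = anomaly_score(nll_per_ordering, alpha)
```

Higher scores mean the row was harder for the LLM to predict, with surprise on causally central columns (high α) counting for more.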

Key ideas

  • Column ordering is a non-trivial inductive bias for autoregressive LLMs: placing a cause column before its effect column lets the model condition on the correct context when assigning likelihood to the effect.
  • Causal discovery at the factor level (rather than the raw column level) lets the method handle mixed-type tables where direct causal discovery between heterogeneous columns is noisy.
  • On 6 mixed-type benchmark datasets, CausalTAD with SmolLM-135M reaches average AUC-ROC 0.834 vs AnoLLM's 0.803 — a 3.1-point absolute improvement with the same backbone model.
  • On the Fake Job Posts dataset specifically, CausalTAD scores 0.873 vs AnoLLM's 0.800 — a 9.1% relative gain, which is large enough to matter in a real triage system.
  • Across 30 numerical ODDS benchmark datasets, CausalTAD achieves the best average AUC-ROC, consistently outperforming classical baselines (Isolation Forest, ECOD, KNN) and deep methods (DeepSVDD, SLAD).
  • All three causal discovery algorithms beat random ordering in ablation; LiNGAM edges out PC and FCI slightly on the mixed datasets.

What holds up — and what doesn't

The core claim — that causal column order helps — is well-supported. The ablation is clean: swapping random ordering for any of the three causal discovery methods improves results on the Fake Job Posts benchmark (from 0.832 to 0.870–0.873), and factor-count reweighting further helps in every configuration. That's a credible story.

What I find less convincing is the bootstrapping assumption. The causal graph is constructed by using an LLM to extract semantic factors from the very data the system is meant to analyze. If the LLM misunderstands the domain — say, for a bespoke accounting system with non-standard column names — the factor extraction will be wrong, and a bad causal graph is arguably worse than random ordering because it introduces a systematic bias. The authors acknowledge this risk ("relies on the capability of LLMs for factor extraction") but do not benchmark factor extraction accuracy independently.

There's also a computational overhead issue that is more serious than the paper suggests. Running three causal discovery algorithms, solving an LP, sampling K orderings, and then running inference on K serialized versions of every test point multiplies the inference cost by K. For a ledger with millions of entries, this matters. The paper notes "future work may focus on improving efficiency" but offers no concrete profiling.

Finally, the 30 numerical ODDS datasets are well-studied and arguably saturated for methods like this. The more meaningful signal is in the 6 mixed-type datasets — which are the realistic ones for finance — and the improvements there, while real, are somewhat modest in absolute terms.

Why this matters for finance AI

Beancount transactions have genuine causal structure: the posting amount causally drives the account selection, the account drives the counterparty expectation, and the memo text is causally downstream of all three. Random column serialization discards this, so an AnoLLM-style model is just as happy to read "memo: groceries | account: Expenses:Food | amount: $4200" as the correctly ordered version, conditioning on effects before their causes.

CausalTAD gives a principled way to encode "amount and account come first" without hardcoding it as a rule. For Bean Labs audit agents, this suggests a practical architectural choice: before scoring a batch of transactions for anomalies, spend one pass discovering the causal graph over the ledger's column schema, then use that fixed ordering for all subsequent inference. The overhead is paid once at the schema level, not per transaction.

The credit card fraud detection example in the paper is essentially the same task structure as ledger anomaly detection: heterogeneous features, rare labels, and a causal order that domain experts know intuitively but that LLMs would otherwise ignore.

Further reading

  • AD-LLM: Benchmarking Large Language Models for Anomaly Detection (arXiv:2412.11142, ACL Findings 2025) — the systematic benchmark across three LLM anomaly detection paradigms that CausalTAD fits into; reading it gives the full landscape rather than the single AnoLLM vs CausalTAD comparison.
  • COAT: Boosting Large Language Model-Based In-Context Learning for Tabular Data (Liu et al., 2024) — the factor-extraction framework CausalTAD adapts; understanding how it works clarifies where the causal graph quality can fail.
  • Causal discovery in heterogeneous data: a survey — for understanding the relative merits of PC vs LiNGAM vs FCI on mixed-type tabular data, since the paper treats all three as interchangeable but they make different independence assumptions.