
JSONSchemaBench: Real-World Schema Complexity Breaks LLM Structured Output Guarantees

· 6 min read
Mike Thrift
Marketing Manager

Most teams treat constrained decoding as a solved problem — add a JSON schema, get back valid JSON. JSONSchemaBench (arXiv:2501.10868) is the first systematic attempt to test that assumption against 9,558 real-world schemas, and the results are less reassuring than the marketing would suggest.

The paper


Saibo Geng, Hudson Cooper, Michał Moskal, and colleagues at Microsoft Research introduce JSONSchemaBench, a benchmark of 9,558 schemas drawn from real production sources: GlaiveAI function-call signatures, GitHub repositories stratified by complexity from trivial to ultra, Kubernetes API configs, Snowplow event analytics schemas, and the JSONSchemaStore collection. They evaluate six constrained decoding frameworks — Guidance, Outlines, Llamacpp, XGrammar, OpenAI Structured Outputs, and Gemini — across three axes: coverage (what fraction of schemas the framework can handle at all), efficiency (tokens-per-second overhead versus unconstrained generation), and quality (downstream task accuracy). The evaluation grid also includes the official JSON Schema Test Suite, which documents 45 feature categories that any compliant engine should support.

The core claim is that schema complexity is the decisive variable that separates capable frameworks from fragile ones, and that no single framework dominates across all three axes.
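"Complexity" can be made concrete with even a crude proxy. The sketch below measures maximum nesting depth of a schema document; this metric is ours, for intuition only, and is much simpler than the benchmark's actual trivial-through-ultra stratification criteria.

```python
def schema_depth(schema, depth=1):
    """Crude complexity proxy: max nesting depth over the dicts and
    lists that make up a JSON Schema document."""
    if isinstance(schema, dict):
        children = schema.values()
    elif isinstance(schema, list):
        children = schema
    else:
        return depth  # scalar leaf: string, number, bool, None
    return max((schema_depth(c, depth + 1) for c in children), default=depth)

# A flat function-call-style schema vs. one level of object nesting.
flat = {"type": "object", "properties": {"name": {"type": "string"}}}
nested = {
    "type": "object",
    "properties": {
        "account": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
        }
    },
}
```

On real corpora, GlaiveAI-style schemas cluster at low depth while Kubernetes-style manifests go much deeper, which is the gradient along which the paper reports coverage collapsing.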

Key ideas

  • Coverage collapses under schema complexity. On simple GlaiveAI schemas all frameworks score above 86%. But on GitHub-Hard schemas — multi-level nesting, recursive definitions, complex pattern constraints — Guidance drops to 41%, Llamacpp to 39%, XGrammar to 28%, and Outlines to a catastrophic 3%. OpenAI reaches only 9% on GitHub-Hard, and Gemini produces no valid outputs at all on medium-complexity or harder schemas.
  • Kubernetes exposes a specific weakness in XGrammar. Despite XGrammar's speed claims, it achieves only 7% coverage on Kubernetes schemas, likely because those schemas rely on context-dependent patterns that XGrammar's context-independent precomputation cannot handle. Coverage against a benchmark that includes Kubernetes configs is not optional for production agents.
  • Under-constrained is more dangerous than compilation failure. XGrammar exhibits 38 under-constrained failures against the JSON Schema Test Suite — meaning it emits JSON that violates the declared schema while silently reporting success. Guidance has only 1 such failure. For a write-back agent, a compilation error is caught at design time; an under-constrained failure corrupts data at runtime without any signal.
  • Guidance's fast-forwarding delivers a genuine 50% speedup. When long deterministic sequences are present (e.g., field names in a fixed object structure), Guidance can advance multiple tokens per decoding step. On Llama-3.1-8B on an A100, Guidance runs at 6–9 ms per output token while unconstrained generation runs at 15–16 ms. Outlines is slower than unconstrained generation at 30–46 ms, largely due to its up-front automaton compilation taking 3–8 seconds per schema.
  • Constrained decoding modestly improves reasoning accuracy. On GSM8K (math), Guidance lifts accuracy from 80.1% (unconstrained) to 83.8%. On Last Letter and Shuffle Objects, gains are in the 1–3 point range. This contradicts the widely cited concern that forcing JSON format degrades answer quality; the effect size, however, is small enough that it should not drive framework selection on its own.
  • No framework covers all 45 JSON Schema feature categories. Guidance covers 13, Llamacpp and XGrammar each cover 1, and Outlines covers 0. The practical implication is that any schema using if/then/else, unevaluatedProperties, or recursive $ref definitions will behave unpredictably depending on which engine is under the hood.

What holds up — and what doesn't

The benchmark's strongest contribution is schema sourcing. Prior evaluations used toy schemas or single-source collections. Including Kubernetes configs alongside function-call signatures is the right kind of adversarial diversity. The complexity stratification (trivial through ultra) also gives practitioners a calibration curve: if your schemas look like GlaiveAI function calls, XGrammar or Guidance are both fine; if they look like Kubernetes manifests, your options narrow fast.

The main weakness is the single-sample greedy evaluation. Measuring coverage with one generation per schema understates true capability — a framework might fail 20% of the time but succeed on retry. The paper acknowledges this but doesn't report temperature-sampled pass@k numbers, which would matter for production systems that retry on failure.
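For systems that do retry, single-sample numbers can be converted into a retry-aware estimate with the standard unbiased pass@k estimator from Chen et al. (2021); the paper does not report this, so the sketch below is ours.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k draws
    (without replacement) from n generations is among the c valid ones.
    Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 8 valid generations out of 10, pass@1 is 0.8 but pass@3 is already 1.0, which is exactly why a framework that "fails 20% of the time" may still be fine behind a retry loop.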

The comparison also mixes incomparable models. Open-source frameworks (Guidance, Outlines, Llamacpp, XGrammar) are tested on Llama-3.2-1B, while OpenAI and Gemini run their own undisclosed models. OpenAI's 9% coverage on GitHub-Hard may therefore reflect model capability as much as constrained decoding architecture. A fair comparison would require controlled access to the same underlying model, which proprietary providers do not offer.

Why this matters for finance AI

Every Beancount write-back agent generates structured output. If the agent emits Beancount directives as JSON before converting to .beancount syntax, or if it calls tools via JSON schemas, the reliability of that JSON generation is not a detail — it is the whole game. The FinTrace paper showed that frontier models fail at reasoning over tool outputs; JSONSchemaBench reveals an orthogonal problem: even before reasoning, the formatting layer may silently emit non-compliant output.

The Kubernetes result is particularly telling for Beancount. Ledger schemas are not flat key-value bags. Account hierarchies, transaction metadata, and tag structures create nested recursive patterns similar to Kubernetes API objects. A framework that scores 7% on Kubernetes is not ready for complex ledger schemas, regardless of how fast its per-token overhead is.

The under-constrained failure mode is the one I would lose sleep over. A Beancount agent using XGrammar could emit a transaction that passes the framework's internal validation check but violates the actual schema — and the agent would have no reason to retry. Silent corruption is worse than visible failure.
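One mitigation is to make the failure visible: never trust the decoder's own success signal, re-validate with an independent checker, and fail loudly rather than write silently corrupt data. A sketch of that guard, where `generate_json`, `validate`, and `append_to_ledger` are hypothetical stand-ins, not APIs from the paper or any framework:

```python
import json

def safe_write_back(generate_json, validate, append_to_ledger,
                    prompt, max_retries=3):
    """Write-back guard: only commit output that parses AND passes an
    independent schema check; surface a visible error otherwise."""
    for attempt in range(max_retries):
        raw = generate_json(prompt)           # constrained decoding call
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue                          # malformed output: retry
        if validate(payload):                 # independent re-validation
            return append_to_ledger(payload)  # only verified data lands
    raise ValueError("no schema-compliant output after retries")
```

The point of the `raise` is the asymmetry the paper exposes: a retry loop converts an under-constrained failure from silent runtime corruption into an error someone actually sees.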

Further reading

  • XGrammar (arXiv:2411.15100, Dong et al.) — the technical paper behind one of the fastest frameworks tested, explaining the context-independent/dependent token split and why Kubernetes schemas stress it.
  • Grammar-Aligned Decoding / ASAp (NeurIPS 2024) — shows that token masking in constrained decoding can distort the model's probability distribution and proposes a corrected sampling algorithm; the theoretical foundation for the quality concerns that the benchmark measures only indirectly.
  • XGrammar-2 (arXiv:2601.04426) — a follow-up that extends XGrammar to dynamic schemas in agentic settings where the schema itself changes during a multi-turn session, directly relevant to Beancount agents that adapt their output format based on which account types are active.