I tested structured output from 288 LLM calls and logged every way JSON breaks. Here's what I found
Posted by kexxty@reddit | Python | View on Reddit | 49 comments
I've been building Python services that consume LLM output for the past few years, and I kept accumulating the same pile of regex fixups for broken JSON in every project. Markdown fences, trailing commas, Python booleans inside JSON, truncated objects, unescaped quotes, the usual.
Instead of keeping a private junk drawer of string manipulations, I decided to actually study the problem. Ran structured output prompts through 288 model calls across every major provider and catalogued what breaks, how often, and whether the failure modes are consistent across model families. (Spoiler: they are. Weirdly consistent.)
Wrote it up here: What Breaks When You Ask an LLM for JSON
The article covers:
- A taxonomy of the 8 most common structured output failures
- Why the order you apply repairs in matters (this was the part that surprised me most)
- Why JSON mode helps but doesn't solve the problem
- What changes when you need to support YAML and TOML alongside JSON
The findings eventually turned into a library (outputguard), but the article stands on its own if you just want to understand the failure modes. Curious if other people are seeing the same patterns.
Motor-Ad2119@reddit
We deal with similar stuff on the scraping side when LLMs extract structured data from raw HTML. Truncated objects are probably the most common failure we see. Good writeup 👍
Current-Tip2688@reddit
useful taxonomy for single-call contexts. agent chains are messier.
in langgraph, a structured output failure at node 1 means node 2 receives corrupted state and proceeds confidently on wrong data. by the time you surface the error it's 3-4 steps downstream and looks nothing like a JSON parse issue.
you also can't just retry the failed call because state has already moved. you need checkpointer rollback or explicit validation at each node boundary.
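something like this is what i mean by a boundary check (framework-agnostic sketch, all the names are made up):

```python
from pydantic import BaseModel, ValidationError

class Node1Output(BaseModel):
    # hypothetical contract for what node 1 must hand to node 2
    user_id: str
    intent: str

def boundary_check(raw_state: dict) -> Node1Output:
    # fail loudly at the boundary so a structured-output failure
    # surfaces here, not 3-4 steps downstream
    try:
        return Node1Output.model_validate(raw_state)
    except ValidationError:
        # trigger your checkpointer rollback / re-run of node 1 here
        raise
```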
does outputguard handle the streaming case? that's where I've found it hardest. you don't know the output is malformed until the stream closes.
nickcash@reddit
Every single post on reddit these days is "I fully believe AI is the future but here's my results showing llms are too shitty to even produce valid json, the simplest task you could ever possibly ask for"
cmdr_iannorton@reddit
I don't understand how anyone can be willing to depend on a third-party web service that randomly produces invalid output. It strikes me as futile if it needs a constantly updated output sanitizer that might break with the next call.
marr75@reddit
It's an important tech. Even the smartest models are still befuddlingly stupid at some tasks. These things co-exist.
nickcash@reddit
Yes, thank you. This is an excellent example of the kind of post I was making fun of.
kexxty@reddit (OP)
Yeah, it's genuinely absurd that we need tooling for this. I built outputguard basically out of spite.
It has 15 repair strategies because there are apparently 15 distinct ways models can screw up a JSON object. Trailing commas, markdown fences, single quotes, truncated output, ellipsis where data should be... I have 288 test cases sourced from real model outputs, and each time I found a new case I became more and more disillusioned with AI lol.
Does it work well? Yes. Should it need to exist? Absolutely not.
kamilc86@reddit
The syntax failures you catalogued are real, but constrained decoding and JSON mode have mostly closed that gap on the major providers. The problem that still bites in production is semantic: the JSON parses fine but the model hallucinated an enum value that does not exist in your schema, or it picked the wrong branch of a union type, or it returned a plausible but completely fabricated ID in a reference field.
Pydantic with Literal types for enums, discriminated unions with explicit tag fields, and custom validators that check referenced IDs against your actual data catch most of this.
The repair ordering work you did is solid for the syntax layer, but I would argue the next version of this taxonomy should include semantic failure modes, because those are the ones that pass json.loads() and then silently corrupt your downstream state.
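A rough sketch of that validation stack (the field names and the KNOWN_IDS lookup are made up for illustration):

```python
from typing import Literal, Union
from pydantic import BaseModel, Field, field_validator

KNOWN_IDS = {"cust_001", "cust_002"}  # stand-in for your actual data

class EmailAction(BaseModel):
    kind: Literal["email"]  # explicit tag field for the union
    customer_id: str

    @field_validator("customer_id")
    @classmethod
    def id_must_exist(cls, v: str) -> str:
        # catches plausible but fabricated reference IDs
        if v not in KNOWN_IDS:
            raise ValueError(f"unknown customer_id: {v}")
        return v

class SlackAction(BaseModel):
    kind: Literal["slack"]
    channel: str

class ModelOutput(BaseModel):
    # discriminated union: the tag decides the branch, so the model
    # can't half-fill one branch with the other's fields
    action: Union[EmailAction, SlackAction] = Field(discriminator="kind")
```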
dysprog@reddit
Do you mind if I ask why? Of all the things LLMs do well, producing highly structured output is not one of them.
I am trying to imagine a task that needs an LLM's ability to digest unstructured english, but outputs json. I'm drawing a blank.
There must be a better tool to use for the LLM part, that can output clean JSON.
droans@reddit
There is a better way - it's called Structured Output.
That link is for the OpenAI docs but it's applicable to almost all paid models plus many local models, too. Basically, you tell it what the schema should be and the inference server will reject any tokens that don't match.
You can still have problems with the response, of course. It's like giving the AI a form instead of a blank piece of paper, it can still fill out the wrong answers.
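With the openai Python library it looks roughly like this (model name and schema are placeholders, check the docs for the current API):

```python
from openai import OpenAI
from pydantic import BaseModel

class Meal(BaseModel):
    name: str
    calories: int

client = OpenAI()
resp = client.responses.parse(
    model="gpt-4o",  # placeholder model name
    input="I had two eggs and toast for breakfast.",
    text_format=Meal,  # the schema constrains decoding
)
print(resp.output_parsed)  # a validated Meal instance
```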
quantum1eeps@reddit
Yeah this post is so stupid. Anthropic's logic will keep retrying in a loop until it produces the specified schema. It will even accept regex patterns to conform to, for dates etc. The era of markdown fences and trailing commas is over; that problem is solved.
kexxty@reddit (OP)
Some ideas:
hstarnaud@reddit
Honestly I don't understand the use case.
For example: if I'm relying on an LLM to return JSON from a query, or to invoke an API with JSON, then if the JSON is not valid, I'm not trying to fix it. Invalid JSON? Try again. You can try to fix some JSON errors, but ultimately you are trying to handle something non-deterministic; it's impossible to guarantee you can fix the JSON properly. I really don't see why you wouldn't query the model again to get proper JSON.
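The whole approach fits in a few lines (call_llm is a stand-in for whatever client you're using):

```python
import json

def ask_for_json(call_llm, prompt, max_attempts=3):
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue  # output is non-deterministic, so just sample again
    raise RuntimeError("no valid JSON after retries")
```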
axonxorz@reddit
Cost and latency.
Personally, I don't want to have highly structured output from a system that outputs categorically invalid data some non-trivial percentage of the time. Then I'm charged again to paper over the provider's failure, on top of the engineering spend I did to mitigate the issue in the first place.
stumblinbear@reddit
Expensive to query them again when fixing it may give you a perfectly valid result
hstarnaud@reddit
Sure but in many examples OP gave like this one:
you would usually have a requirement around guarantees. You will never get to 100% reliability trying to fix the JSON; it's a compromise. If dropping some messages because you can't parse them is an acceptable compromise, fine, but you will never be able to fix arbitrary invalid JSON. Moreover, the model is better at finding the syntax error than whatever algorithm you write to compensate, because it's not a bounded, deterministic problem.
AxisFlip@reddit
For a small personal app, I give the LLM a description of what I ate and I want to get back an estimate of how many calories (and macronutrients) that is, in JSON format. I don't know how many calls OP is making, but so far the output has always been usable.
JohnWowUs@reddit
Why not just use something like pydanticai ?
kexxty@reddit (OP)
I have a bit of a Not Invented Here complex
No_Soy_Colosio@reddit
Please also have a bit of Don't Reinvent The Wheel
KandevDev@reddit
The truncated objects one is the worst because it looks recoverable. you can see the closing brace was about to come, the schema is "almost" right, and you'll write a band-aid that works on 99% of cases until that one production call truncates 3 keys deep. switched to streaming + a real partial-json parser (json-stream / jiter) and the volume of bug reports about "the LLM gave us weird output" dropped massively. structured output APIs help but only if the provider actually supports them, otherwise you're back to the regex pile.
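the jiter version is roughly this (output shape written from memory, double-check against the docs):

```python
import jiter

# a response that got cut off mid-string
truncated = b'{"items": [{"id": 1, "name": "foo"}, {"id": 2, "na'

# partial_mode keeps every complete value and drops the broken tail
data = jiter.from_json(truncated, partial_mode=True)
print(data)  # e.g. {'items': [{'id': 1, 'name': 'foo'}, {'id': 2}]}
```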
kexxty@reddit (OP)
Truncation is genuinely the hardest one. You're right that "just close the braces" falls apart once you're mid-value three levels deep; our fix_truncated strategy handles the common cases, but we're upfront that it's best-effort past simple nesting.
Streaming + partial-json parser is a better solution for that specific problem, since you're validating structure as it arrives rather than reconstructing intent after the fact. Different problem space though: streaming parsers solve "the response got cut off," repair solves "the response completed but the model wrote garbage syntax." You hit both in production.
And yeah, structured output APIs are great until you need to support multiple providers and half of them don't have it. Then you're back to fixing whatever comes out the other end.
KandevDev@reddit
The streaming path is interesting for another reason, it lets you emit partial structured output to the consumer instead of waiting for the full payload. matters less for batch jobs but a lot for UX where you want to start rendering before the LLM is done. The partial-json libs that do this well are rare though, jiter is the only one i've found that's properly maintained.
ammy1110@reddit
This is good, thanks for sharing. I will try this with one of my tools.
aloobhujiyaay@reddit
this is the kind of evaluation work the AI ecosystem desperately needs more of because “structured output support” is still way less reliable than marketing pages imply
kexxty@reddit (OP)
Many responses here are proving they haven’t actually had experience with JSON output over a wide range of models
human09812@reddit
The taxonomy is great, but the “repair order” section is pure gold for hardening production pipelines.
TheBB@reddit
Seems to me like several of these problems could be fixed by a more permissive JSON parser. You can then re-encode as JSON to normalize. Trailing commas, wrong booleans and nulls, comments. Maybe fences too. Then your string manipulation stage would be simpler.
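For example, with the third-party json5 package, which tolerates trailing commas, comments, single quotes and unquoted keys (Python-style True/False would still need a pre-pass):

```python
import json
import json5  # pip install json5

messy = """
{
  name: 'widget',      // model added a comment
  tags: ['a', 'b',],
}
"""

data = json5.loads(messy)   # permissive parse
clean = json.dumps(data)    # re-encode to normalize
print(clean)                # {"name": "widget", "tags": ["a", "b"]}
```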
AreWeNotDoinPhrasing@reddit
kexxty@reddit (OP)
That's the list. The thing that caught me off guard was how much the ordering matters when you fix them. Stripping fences before fixing commas gives you a clean result. Fixing commas first means your regex is operating on text that includes the fence markers, and things get weird. Multiply that across 15 strategies and the interaction effects are where most of the debugging time went.
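A toy version of the ordering point (nothing like the real strategies, just the shape of the problem):

````python
import json
import re

def strip_fences(text: str) -> str:
    # remove a ```json ... ``` wrapper first, so later regexes
    # never operate on text containing fence markers
    return re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())

def fix_trailing_commas(text: str) -> str:
    # naive: also matches commas inside string values, which is
    # exactly the kind of interaction effect that eats debugging time
    return re.sub(r",\s*([}\]])", r"\1", text)

def repair(text: str):
    # order matters: fences out first, then structural fixes
    for step in (strip_fences, fix_trailing_commas):
        text = step(text)
    return json.loads(text)

print(repair('```json\n{"a": [1, 2,]}\n```'))  # {'a': [1, 2]}
````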
campbellm@reddit
Now you have 2 problems.
ibgeek@reddit
Have you ever taken a course on compilers or parsing (e.g., using formally-defined grammars and parsing algorithms)? Those sorts of classes are far less commonly offered in CS curricula these days, but so, so helpful when you actually need them. It won't magically fix the issues, but it might provide robust tools for implementing text transformations, theoretical knowledge to decide the order in which the operations should be applied, and the ability to prove correctness.
kexxty@reddit (OP)
I did take compilers/formal languages coursework, and you're right that it's underrepresented in modern CS curricula. That background actually informed some design decisions here (like the strategy ordering and the two-pass repair architecture).
That said, the core challenge with LLM output repair is that there isn't a formal grammar to parse against. The input is, by definition, malformed: trailing commas, unquoted keys, truncated structures, mixed encodings, markdown fences wrapping JSON. Each of these breaks different grammar rules in different ways, and they combine unpredictably across models and prompting styles.
A traditional parser would reject on the first syntax error, which is exactly the problem we're solving. The approach here is closer to error-recovering parsers (like what GCC/Clang do for diagnostics), but even looser: we're not parsing a known language with known error productions, we're trying to recover intent from text that was never syntactically valid to begin with.
Where formal theory does help is in the strategy ordering (encoding normalization before structural fixes, for the same reason lexing precedes parsing) and in knowing when transformations are safe to compose. Definitely agree it's a useful foundation.
I kept the post pretty casual bc this rabbit hole goes deep (strategy interaction effects alone could be its own post). Might do a more technical writeup on the repair architecture at some point, as opposed to the normie-friendly one.
ibgeek@reddit
What about augmenting the grammar with some of the errors?
Or using a stochastic grammar / parser combination?
latkde@reddit
This article confuses me.
It discusses problems such as invalid JSON responses, but then discards the main solution: "JSON mode" or other constrained decoding features. Pretty much any inference provider now supports Outlines-style structured outputs, where the model is forced to select syntactically valid tokens.
I think the main takeaway should be: if you want JSON, always provide a JSON Schema for inference. This guarantees proper JSON, and that all required fields have been provided. This also makes it possible to force the LLM to produce complete outputs. E.g. instead of a response shape
[{"id": "abc", ...}, ...]where I hope that the LLM provides an item for each ID, I can force it to explicitly consider each known ID via an object structure like{"abc": {...}, ...}.Structured outputs are easily accessible via the
openailibrary, e.g.client.responses.parse(..., text_format=SomePydanticModel).Once we're at this point, the only real remain issue is truncation. It's possible to repair JSON by adding closing braces etc. But I've deleted all such repair routines from our codebase because it pretty much stopped being a problem since early 2025. Also note that JSON Schema can impose length limits on arrays and strings. There are also tools like Pydantic's partial JSON parsing mode, which directly addresses truncation, and can also be used for best-effort handling of streaming responses. Another minor problem is that different providers support different JSON schema subsets.
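For reference, the partial parsing mode is a one-liner in recent Pydantic v2 (the exact output below is illustrative):

```python
from pydantic_core import from_json

truncated = '{"id": "abc", "scores": [0.2, 0.9'
print(from_json(truncated, allow_partial=True))
# {'id': 'abc', 'scores': [0.2]}  (the incomplete trailing value is dropped)
```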
I don't want to discourage you, I just think that structured outputs that are driven by a JSON schema have solved >98% of this problem area, and Pydantic is a well-established library to help create schemas and validate data against it. The only thing your library adds is automated injection of retry prompts, but I'd argue that if retries are acceptable, then we could just raise the token budget for the initial round of inference, and/or increase or reduce the model's reasoning effort level (less reasoning = more actual output tokens).
kexxty@reddit (OP)
Thanks for the thoughtful comment! You're right that structured output modes (JSON mode, Outlines, etc.) have massively improved things and we do recommend using them as a first choice in the docs.
The gap outputguard fills is everything outside that happy path: models and providers that don't support constrained decoding, multi-provider setups where schema support varies, local/open-source models with inconsistent structured output support, and edge cases where even "guaranteed" JSON mode still produces schema-valid but semantically broken output (wrong types in union fields, hallucinated enum values, etc.). JSON mode guarantees syntax — it doesn't guarantee the output matches your business logic.
The retry prompt generation is actually the feature users reach for most, not because retries are a fallback for insufficient token budgets, but because they give the model targeted feedback about what was wrong (with JSON-path precision), which is fundamentally different from just throwing more tokens at it.
You're right that for a straightforward single-provider setup with good structured output support, you may not need this. But "works fine on OpenAI with Pydantic" doesn't describe everyone's reality and that's who this is for.
Beliskner64@reddit
Just ask it to produce YAML instead of JSON
marr75@reddit
This was a much larger problem for us prior to GPT-4.1. In general, after that, most newer models of a certain size started being able to properly use a sensible json schema (as long as it wasn't too deeply nested/insanely named).
If you want the most definitive, helpful method, use constrained decoding (only tokens that fit the schema can be predicted), and try not to adopt rules that can't be expressed in the constrained schema.
kexxty@reddit (OP)
Agreed that the frontier API models have gotten way better at this, and all handle structured output fine most of the time if your schema is reasonable. Constrained decoding is also the correct answer when you have access to it.
But that's not really the problem we're solving. Constrained decoding requires control over the inference stack, meaning it's great if you're running vLLM or SGLang locally, but if you're hitting a hosted API that doesn't expose grammar/schema-guided decoding, you're out of luck. And even APIs with "JSON mode" still produce malformed output at non-trivial rates, especially on longer responses or complex nested schemas.
Re: outlines, instructor, dspy: these are doing different things. Outlines is constrained decoding (inference-level). Instructor is a typed extraction wrapper around API calls. DSPy is a prompt optimization framework. We're none of those; we're a repair layer that sits after generation and fixes malformed output without needing any integration with the model or provider. They're complementary, not competing. You'd use instructor to call the API and outputguard to catch the cases where it still comes back broken.
We looked at contributing to those projects early on but the scope is genuinely different. There's no "repair malformed output" module in any of them because that's not what they do.
Henry_old@reddit
python community watch out for ban bots on links here. json from llms breaks because developers rely on default parsers instead of forcing strict pydantic schemas at the api boundary. regex fixing is garbage. enforce the schema before it hits the main logic block. anything else is a waste of compute and api credits
FarRub2855@reddit
That "private junk drawer of string manipulations" line hits way too close to home lol. I'm definetly going to rethink my own repair sequence after reading this, really appreciate you taking the time to catalog it all.
licjon@reddit
How did you score the github username "YOUR_REPO_HERE"?
kexxty@reddit (OP)
I'll be honest...I don't really understand what you mean
ibgeek@reddit
Your link to your repo is broken
kexxty@reddit (OP)
Thank you so much, I had put the link at the bottom but forgot to fix the mid-post link.
licjon@reddit
"All of this became outputguard, which I eventually packaged properly because I was importing the same unversioned file into too many projects and it was getting ridiculous."
"outputguard" links to https://github.com/YOUR_REPO_HERE?ref=thecrosswalk.news
kexxty@reddit (OP)
THANK YOU very much, I had written up a document before copying/pasting the links in.
Toby_Wan@reddit
wdym json mode doesn't work? If it's proper structured output enforced by grammar then it simply cannot output invalid json.
kexxty@reddit (OP)
There are two different things here. JSON mode only guarantees syntactically valid JSON - it doesn't enforce your schema at all. You can ask for {"name": string} and get back {"completely": "different structure"}. Valid JSON, useless output.
Even with proper grammar-constrained structured output, there are failure modes the grammar can't prevent: truncation when the token budget runs out mid-object, picking the wrong branch of a union type, and values that are schema-valid but fabricated, like IDs that don't exist in your data.
It's more reliable than raw prompting, absolutely. But "simply cannot output invalid json" is incorrect: it cannot output invalid JSON tokens, but it can absolutely fail to produce a complete, schema-conformant response.
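A minimal illustration of the first point, valid JSON that fails the schema:

```python
import json
from pydantic import BaseModel, ValidationError

class Expected(BaseModel):
    name: str

raw = '{"completely": "different structure"}'
data = json.loads(raw)  # parses fine: JSON mode kept its promise
try:
    Expected.model_validate(data)
except ValidationError as e:
    print("valid JSON, wrong shape:", e.errors()[0]["type"])  # 'missing'
```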