Qwen3.6 35B - TXT vs Markdown vs HTML vs HTML+CSS

Posted by BigYoSpeck@reddit | LocalLLaMA | View on Reddit | 25 comments

There's been talk of late about using HTML rather than markdown in Claude Code. I was curious how this worked with a local model, so I loaded up Qwen3.6 35B A3B at Q8 with an F16 KV cache. Then I gave it the same prompt each time, to write a detailed explanation of the Blazor render cycle, first asking for raw text, then markdown, then unstyled HTML, then HTML+CSS, and finally with no constraint (where it chose markdown).

I measured the token counts for reasoning, the total response (including the md or HTML formatting), and the raw response content stripped of formatting. I also recorded the tokens per second (running MTP with 3 draft tokens) and the total time taken.

| Output | Reasoning tokens | Output tokens | Raw content tokens | Tokens per second | Time taken |
|---|---:|---:|---:|---:|---:|
| Raw text | 1,873 | 1,080 | 1,080 | 146 | 20s |
| Markdown | 1,264 | 1,496 | 1,269 | 123.5 | 23s |
| Unstyled HTML | 166 | 7,346 | 4,857 | 139 | 56s |
| Styled HTML | 108 | 10,290 | 3,418 | 139 | 82s |
| No constraint (chose markdown) | 1,465 | 2,256 | 2,002 | 122 | 31s |

Finally I got ChatGPT 5.5 Extended Reasoning to score the quality of their output based on:

* **How much correct useful information is present**
* **How well it is explained**
* **How many errors it contains**
* **How efficiently it uses its length**

| Rank | Format | Cov | Expl | Err | Dens | Total |
|---:|---|---:|---:|---:|---:|---:|
| 1 | Markdown | 31/40 | 21/25 | 18/25 | 8/10 | 78/100 |
| 2 | No constraint (chose markdown) | 32/40 | 18/25 | 13/25 | 8/10 | 71/100 |
| 3 | Raw text | 30/40 | 19/25 | 11/25 | 6/10 | 66/100 |
| 4 | Unstyled HTML | 34/40 | 17/25 | 6/25 | 4/10 | 61/100 |
| 5 | Styled HTML | 33/40 | 19/25 | 3/25 | 3/10 | 58/100 |
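One way to read the token table is to work out how much of each response went to markup rather than content. The sketch below is not part of the original test; it only redoes the arithmetic on the posted numbers, and the names `RUNS`, `formatting_overhead`, and `summarize` are made up for illustration.

```python
# Token counts copied from the post's table: (reasoning, output, raw content).
RUNS = {
    "Raw text": (1873, 1080, 1080),
    "Markdown": (1264, 1496, 1269),
    "Unstyled HTML": (166, 7346, 4857),
    "Styled HTML": (108, 10290, 3418),
    "No constraint": (1465, 2256, 2002),
}


def formatting_overhead(output_tokens, raw_tokens):
    """Return (markup tokens, share of the output spent on markup)."""
    markup = output_tokens - raw_tokens
    return markup, markup / output_tokens


def summarize(runs):
    """Build one summary row per output format."""
    rows = {}
    for name, (reasoning, output, raw) in runs.items():
        markup, share = formatting_overhead(output, raw)
        rows[name] = {
            "markup": markup,              # tokens that were formatting only
            "share": round(share, 3),      # fraction of the response that was formatting
            "total": reasoning + output,   # reasoning plus response tokens
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize(RUNS).items():
        print(f"{name:<14} markup={row['markup']:>5} "
              f"share={row['share']:.1%} total={row['total']:>6}")
```

On these figures, roughly 15% of the markdown response is formatting, versus about a third for unstyled HTML and about two thirds for styled HTML.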

25 Comments

Budget-Juggernaut-68@reddit

I do like html as an output though. Very flexible.

epicfilemcnulty@reddit

Well, good that you have measured that, great work. IMO, though, it was a silly take from the start (that models are better with HTML) because:

1. The majority of training data is some markdown flavor.
2. Markdown already allows raw HTML embedding, if need be.
3. It's really easy for people to write/read "plain" markdown, not so much with HTML.

So what needs to be done is just better support of markdown in all the tooling around LLMs.

my_name_isnt_clever@reddit

My favorite side effect of the AI boom is forcing Markdown into the mainstream. I've always wished I could just write a report for work in plain markdown instead of Word, now I spend my days explaining what markdown is to my coworkers as it's so vital to how LLMs ingest and output information.

epicfilemcnulty@reddit

haha, so true, yep, LLMs did make markdown mainstream, which, imo, is a good thing. Even though I'd been a markdown advocate before LLMs, the LLM boom made me finally write a decent markdown parser/renderer :)

SkyFeistyLlama8@reddit

It's fun to see Markdown finally bringing some kind of semantic sanity to regular documents. Headers for organization, lists for point facts, and both humans and machines can quickly see how a document is organized at a glance.

darknecross@reddit

That’s the same as HTML though. Header tags, list tags, formatting tags, blockquote tags, etc. You can see how it’s converted into HTML here: https://github.com/mundimark/markdown.pl/blob/master/Markdown.pl

pjerky@reddit

The big search engine companies are telling us that proper semantic HTML is as good or better for their processing of web content compared to markdown. For their purposes all we care about is rankings and getting visitors so that's fine. For our purposes, producing and reading documents, markdown extended is likely the best format balance for speed, performance, cost, and readability. Also, the big AI companies will always push you to use the most tokens so that they can save money.

uhuge@reddit

Err(or) higher≈better column is wonderful.

BigYoSpeck@reddit (OP)

**Error burden: 25 points**

Start at 25, then subtract penalties.

| Error type | Penalty |
|:-|:-|
| Minor wording issue or imprecision | -1 |
| Moderate technical error | -3 |
| Major lifecycle/rendering error | -6 |
| Invented API, fake feature, or seriously misleading mechanism | -8 |
| Dangerous/confidently wrong advice likely to mislead implementation | -10 |

This should be partly absolute and partly density-aware. A 3,000-word answer with 10 errors is worse than a 600-word answer with 2 errors, but both may have similar error density.

I would therefore track:

* absolute error penalty = total weighted errors
* error density = weighted errors per 1,000 explanatory words

Then score the error category like this:

Error burden score = 25 - min(25, absolute_error_penalty × 0.65 + error_density × 0.35)

That prevents long answers from hiding behind volume.
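The scoring rule above can be sketched in a few lines of Python. This is only an illustration of the stated formula, not the grader's actual implementation: the function name `error_burden_score` is hypothetical, and the choice to pass penalties as positive magnitudes (3 for a "-3" error) is an assumption.

```python
def error_burden_score(penalties, explanatory_words):
    """Score out of 25 using the rubric's formula.

    penalties: positive penalty magnitudes, one per error found (assumed convention).
    explanatory_words: length of the answer in explanatory words.
    """
    if explanatory_words <= 0:
        raise ValueError("explanatory_words must be positive")
    absolute_error_penalty = sum(penalties)
    # Weighted errors per 1,000 explanatory words.
    error_density = absolute_error_penalty / (explanatory_words / 1000)
    burden = absolute_error_penalty * 0.65 + error_density * 0.35
    return 25 - min(25, burden)


if __name__ == "__main__":
    # The rubric's own comparison, assuming every error is "moderate" (3 points).
    long_answer = error_burden_score([3] * 10, 3000)
    short_answer = error_burden_score([3] * 2, 600)
    print(f"3,000 words, 10 errors: {long_answer:.1f}")
    print(f"  600 words,  2 errors: {short_answer:.1f}")
```

Both answers have the same density (10 weighted points per 1,000 words), so the gap between them comes entirely from the absolute term, which is the "long answers can't hide behind volume" behavior the rubric describes.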

bgravato@reddit

I'm curious how many runs you did of each test: only one, or multiple? I'm asking because in my own tests I've seen it sometimes give very different answers, and take a different amount of time and number of tokens, when I re-run the same prompt in the same format multiple times. Things like asking "who is <some name that doesn't really match anyone famous>?". The answers can go from "I don't know anyone famous with that name" to "he's a famous soccer player who played for this and that team" or "he's an IT engineer, specialist in this and working for that company". (For clarity, it doesn't literally say "this" and "that"; it uses actual team names or company names.) Especially with parameters that allow for more creativity, or with questions that don't have one strong answer, it can go wild.

BigYoSpeck@reddit (OP)

It wasn't a scientific test by any means: two runs of each type to make sure the result wasn't too anomalous. That's not stringent enough to draw a conclusion about the variance between asking for raw text, md, or no constraint. They're all within a margin of error of one another.

But the difference in the length of reasoning, the verbosity of the response content, and the lack of accuracy in the HTML versions can, I think, be attributed to more than just non-deterministic variance from one request to another.

I'm going to try it again later with the 27B dense model to see if the sparsity of active weights was a factor. One thing I suspect influenced this is that certain experts are likely to be routed to when generating HTML content, and while those experts are very capable of generating good-quality HTML, that comes at the cost of world knowledge.

ComplexType568@reddit

I wonder how well it'll do in XML

HokkaidoNights@reddit

There's no reason to burn tokens on any other output format, markdown in 99% of cases is perfect. Just parse into another format after if you have to. Why waste time and context on html bloat?
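As a rough illustration of the "parse into another format after" idea, here is a standard-library-only sketch that turns a tiny Markdown subset (ATX headings, `-`/`*` bullets, paragraphs) into HTML. The function `markdown_to_html` is a made-up helper for this example; it ignores inline formatting, nesting, tables, and code blocks, so a real Markdown library would be the practical choice.

```python
import html
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def markdown_to_html(text):
    """Convert headings, flat bullet lists, and paragraphs to HTML."""
    out = []
    paragraph = []
    in_list = False

    def flush_paragraph():
        # Emit any buffered paragraph lines as a single <p> element.
        if paragraph:
            out.append("<p>" + html.escape(" ".join(paragraph)) + "</p>")
            paragraph.clear()

    def close_list():
        nonlocal in_list
        if in_list:
            out.append("</ul>")
            in_list = False

    for raw in text.splitlines():
        line = raw.strip()
        heading = HEADING.match(line)
        if not line:
            # A blank line ends the current paragraph or list.
            flush_paragraph()
            close_list()
        elif heading:
            flush_paragraph()
            close_list()
            level = len(heading.group(1))
            out.append(f"<h{level}>{html.escape(heading.group(2))}</h{level}>")
        elif line.startswith(("- ", "* ")):
            flush_paragraph()
            if not in_list:
                out.append("<ul>")
                in_list = True
            out.append("<li>" + html.escape(line[2:].strip()) + "</li>")
        else:
            close_list()
            paragraph.append(line)

    flush_paragraph()
    close_list()
    return "\n".join(out)


if __name__ == "__main__":
    print(markdown_to_html("# Render cycle\n\n- first\n- second\n\nSome text."))
```

The conversion happens after generation, so the model only spends tokens on the compact markdown and the HTML is produced locally at no token cost.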

Exciting_Garden2535@reddit

It was started by a guy who wrote a whole article about how useful it is to get HTML output from Claude and how easy it is to read compared to markdown. According to the article's screenshot, the poor guy just doesn't know that you can preview markdown, and always reads the source.

HokkaidoNights@reddit

Appreciate the context. That's not smart at all, though, and shows a fundamental lack of knowledge.

Fuzilumpkinz@reddit

Why not both? Simple files can be MD. Anything viewed by a human or that has tables can go to HTML. This isn't something we have to choose.

Former-Ad-5757@reddit

This is the way I use it, I like the html-output for user output it adds extra readability for me. But everything the agent internally needs goes in MD, and there is just an export-to-user-html skill, right next to my export-to-user-xlsx skill and export-to-user-docx skill.

FortiTree@reddit

There must be a reason why markdown is mainstream for all AI models. Before that I knew what HTML was, but I didn't even know what "md" meant. Another prominent format for agentic coding is JSON, not md. It's even clearer on schematic and relationship.

jonas-reddit@reddit

Thanks. This was an interesting read. I wonder what the next 3-6 months will look like. A lot of “best practices” around how to efficiently consume context are presumably due to the context size limitations and the degradation that still occurs with some or many of the models. Once/when mainstream LLMs support much larger usable context, I wonder what that will impact. Markdown feels like a good format and was popular long before LLMs gained interest. Maybe this will all be less relevant before this year-end.

Popular-Awareness262@reddit

not surprised markdown won. been structuring my agent skill files as .md and the models def use them better than raw text

9gxa05s8fa8sh@reddit

great work!!!!

javasux@reddit

For me this is obvious because of the simplicity. Even looking only at the time taken, I would argue that markdown is the obvious answer. People arguing for HTML don't have even a basic grasp of what LLMs are.

dradik@reddit

I use obsidian and embedded SVGs (from drawIO) for my docs

dirty_mind86@reddit

Lucas Meijer has a good video talking about this on Youtube, but the general idea is that the HTML format makes it easier to discern and visualize the model's output.

Weak_Manufacturer323@reddit

Really interesting results. The HTML outputs having dramatically lower reasoning tokens but much higher verbosity/token waste is a pretty strong signal that markdown still hits the best balance between structure, clarity, and efficiency for current local models.