Use HTML as the primary chat language for your agents so they can draw diagrams
Posted by sdfgeoff@reddit | LocalLLaMA | 23 comments
A week or two ago Thariq published an article on how good AIs are at working with HTML, arguing that there is not really any reason to use markdown anymore. And yet all of our coding agents work with markdown, output markdown, and have been trained on markdown. So as a bit of an experiment I decided to see how good they are at using HTML as part of the main chat. The answer is: pretty good.
So this is a coding agent with the interface running in a web browser. The responses from the agent are piped straight into the page. At first it would still always use markdown, and then I realized that effectively my system prompt was in markdown! Once I switched the system prompt to HTML it got way better. The current system prompt:
<p>
Being helpful doesn't mean doing everything the user says. Neither I nor the user are omniscient or infallible. If the user is making a mistake, I tell them. If I have made a mistake, I mention it and move on. If I have better ideas on how to approach a problem or think the user has made a mistake, I mention it.
</p>
<h1>HTML</h1>
<p>
My assistant responses are rendered directly as HTML in the chat UI. I <i><b>MUST</b></i> use HTML when replying to the user. Plain prose should be wrapped in tags such as `<p>`, `<ul>`, `<ol>`, and heading tags where appropriate. To show the user something visually or as a diagram , I will draw a SVG directly in the chat.
Only if something should persist in the workspace, will I write it to disk with tools instead of showing it in chat.
</p>
(Yeah, I'm also playing around with first person system prompts, benefit/drawbacks unclear)
As a result, it can now choose to render diagrams as part of its chat response, put them in tables, and so on.
In this case I'm using Qwen3.6-27B and it's doing pretty well at making SVG diagrams (ChatGPT isn't much better), though it still has a tendency to try to use markdown. I suspect markdown is just that baked into the models at this point.
Qwen3-vl-4 is pretty bad at SVGs, so I strongly suspect this is an emerging capability of models.
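Since the model still tends to slip back into markdown, a UI like this might want to detect that before rendering. The following is a minimal sketch, not the check used in the post: the pattern list and the `looks_like_markdown` name are hypothetical, and the heuristic will produce false positives (for example on CSS rules such as `* { ... }`).

```python
import re

# Hypothetical heuristic patterns for Markdown syntax leaking into an HTML reply.
MARKDOWN_PATTERNS = [
    re.compile(r"^#{1,6}\s+\S", re.MULTILINE),   # ATX headings such as "## Title"
    re.compile(r"^\s*[-*]\s+\S", re.MULTILINE),  # bullet list items
    re.compile(r"\*\*[^*\n]+\*\*"),              # **bold**
    re.compile(r"^```", re.MULTILINE),           # fenced code blocks
]


def looks_like_markdown(reply: str) -> bool:
    """Return True if the reply contains common Markdown constructs."""
    return any(pattern.search(reply) for pattern in MARKDOWN_PATTERNS)
```

A caller could use the result to re-prompt the model or to fall back to a markdown renderer for that one message.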
R_Duncan@reddit
Nay, use markdown wherever you can, as it's more compact and more readable, and the same LLM performs much better with it. Use HTML only where markdown is not enough (graphs?).
Chupa-Skrull@reddit
If you dope your markdown up with a couple Obsidian plugins you never need to bother with the HTML for most outputs.
You can also mess with trying to get it to generate Typst, but obviously that's not really baked in at all and is prone to massive errors.
ab2377@reddit
i really don't get it.
TrebleCleft1@reddit
You have to distinguish between input and output. Sure HTML can provide a richer output, but if you're iterating, all those HTML tags are going back in as input, and I haven't seen convincing data that as input content HTML is more effective at grounding the LLM. Unless the model reasons over a screenshot of the output instead of the raw img tags / inline SVG?
Also your text content is sitting inside HTML tags too, just consuming input tokens. Diagrams separate from text as HTML / diagram artifacts might be a neat compromise.
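One way to picture that compromise: keep the rendered HTML for the user, but feed a stripped-down text version of earlier replies back into the context. This is a minimal sketch using only the Python standard library; the `html_to_text` name and the `[diagram]` placeholder are assumptions for illustration, and whether this actually helps or hurts the model is untested here.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of svg/script/style."""

    SKIP = {"svg", "script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            # Leave a short marker where an inline diagram used to be.
            if tag == "svg" and self._skip_depth == 0:
                self.parts.append("[diagram]")
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html_reply: str) -> str:
    """Return the reply's text content with tags and inline SVG removed."""
    parser = _TextExtractor()
    parser.feed(html_reply)
    parser.close()
    return " ".join(parser.parts)
```

For example, `html_to_text("<p>Hello <b>world</b></p><svg><text>x</text></svg>")` yields `Hello world [diagram]`, which costs far fewer input tokens than the original markup.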
NineThreeTilNow@reddit
The old Claude models were pretty decent at SVGs but I haven't used their site directly to test in a long time.
sahanpk@reddit
rendered HTML is interesting, but i’d want a sandbox boundary first. generated UI is also generated attack surface.
EndlessZone123@reddit
It's hard to imagine how HTML and CSS alone can do anything malicious.
nicksterling@reddit
You’d need to be very prescriptive to the model. If it wants to generate a diagram then it may use JavaScript and bring in a possibly compromised library from a CDN.
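Instead of relying only on the prompt, the UI could also filter the reply before rendering it. Below is a rough allowlist sketch built on the standard library's `html.parser`; the tag and attribute lists and the `sanitize` name are assumptions chosen for illustration. It is not a hardened sanitizer, and a vetted library would be the safer choice for anything real.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlists: basic prose tags plus a few SVG shapes.
ALLOWED_TAGS = {"p", "ul", "ol", "li", "b", "i", "h1", "h2", "h3",
                "table", "tr", "td", "th",
                "svg", "g", "rect", "circle", "line", "path", "text"}
ALLOWED_ATTRS = {"x", "y", "x1", "y1", "x2", "y2", "cx", "cy", "r", "d",
                 "width", "height", "fill", "stroke"}
DROP_WITH_CONTENT = {"script", "style"}


class _Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []
        self._drop_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self._drop_depth += 1
            return
        if self._drop_depth or tag not in ALLOWED_TAGS:
            return  # unknown tags are dropped, their text is kept
        kept = "".join(f' {k}="{escape(v)}"' for k, v in attrs
                       if k in ALLOWED_ATTRS and v is not None)
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self._drop_depth = max(0, self._drop_depth - 1)
        elif not self._drop_depth and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._drop_depth:
            self.out.append(escape(data))


def sanitize(untrusted_html: str) -> str:
    """Re-emit only allowlisted tags and attributes; drop scripts entirely."""
    parser = _Sanitizer()
    parser.feed(untrusted_html)
    parser.close()
    return "".join(parser.out)
```

With this sketch, event handlers such as `onclick`, any `src` or `href` pointing at a CDN, and whole `<script>` blocks never reach the page, regardless of what the model decided to emit.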
BigYoSpeck@reddit
Depending on the model used, I wouldn't want to count on the quality of the actual content when letting it output pre-formatted HTML. At least with Qwen3.6 35B I've found that while the output looks very nice, it seems to have worse subject knowledge when it's creating HTML.
If you want the best quality of responses, don't trouble the model with additional context on formatting. Let it choose its preferred output format to maximise the quality of its reasoning and knowledge. You can post-process content deterministically to make it nice for people; you don't want to add to the "cognitive" load of a model by getting it to do something trivial.
As for diagrams, again, that's a generative UI problem. If a diagram, graph or other UI element can be defined in code, don't waste the model's capability fabricating SVG for it; let it output a coded version that can be parsed into a visual.
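As a sketch of that idea, the model could emit a small JSON description and the client could turn it into SVG deterministically. The spec format (`{"bars": [{"label", "value"}]}`) and the `bar_chart_svg` function are made up for this example; it assumes at least one bar and non-negative values.

```python
import json
from html import escape


def bar_chart_svg(spec_json: str, bar_width: int = 40, height: int = 120) -> str:
    """Render a hypothetical {"bars": [{"label", "value"}]} spec as an SVG bar chart."""
    bars = json.loads(spec_json)["bars"]
    peak = max(bar["value"] for bar in bars) or 1  # avoid dividing by zero
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" '
             f'width="{bar_width * len(bars)}" height="{height + 20}">']
    for i, bar in enumerate(bars):
        bar_height = round(height * bar["value"] / peak)
        x = i * bar_width
        parts.append(f'<rect x="{x + 5}" y="{height - bar_height}" '
                     f'width="{bar_width - 10}" height="{bar_height}" fill="steelblue"/>')
        # Labels are escaped so model output cannot inject markup.
        parts.append(f'<text x="{x + bar_width // 2}" y="{height + 15}" '
                     f'text-anchor="middle">{escape(str(bar["label"]))}</text>')
    parts.append("</svg>")
    return "".join(parts)
```

The model then only has to get a few numbers and labels right, and layout mistakes become a client bug rather than a generation problem.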
East_Entry_8633@reddit
I think someone in this sub made a comparison between models reading and interpreting markdown vs HTML (styled and raw). If my memory serves me correctly, the HTML uses fewer reasoning tokens too.
BigYoSpeck@reddit
If that's in reference to my recent post it wasn't them reading html, it was a normal prompt with instructions for the output formatting
And yes html outputs used massively fewer reasoning tokens, more output tokens both because of the additional html tags and just because they generated more content, and with MTP they achieved faster tokens per second compared with markdown
But, at least with Qwen3.6 35B this was at the expense of quality. The output was nice and clean, but full of hallucinations on the actual topic
Pleasant-Shallot-707@reddit
But a shit ton more output tokens
jtjstock@reddit
Mtp rips through html like a hot knife through butter
benja0x40@reddit
Nice! Aside from the obvious overhead and security issues, this is a promising direction for UIs that wrap LLMs.
The interactivity with generated content being increasingly granular, dynamic HTML seems the straightforward intermediate between static Markdown and vibe coded Apps.
Long_War8748@reddit
I would prefer a unified markdown that includes interactive elements and works as a kind of React-lite, e.g. what Anthropic and Google are already working on in their web chat frontends. HTML/XML etc. is just too verbose, and if the LLM can write the pure intent as markdown and leave rendering as a browser/client concern, that would be the best of both worlds.
This is something that is kind of easy to implement even, it just needs to become a standard like markdown or mermaid etc that all agree on, so that it ends up in the training data sets, so that our llms know it "for free". Since making a home baked solution would require to give the llm the manual every time and eat up tokens too, haha
fasti-au@reddit
Risk of failure is high. HTML isn't nice, as balanced tags are easy to break. If you make your HTML a framer and use YAML, you are mermaid 8)
Main_Problem_2696@reddit
HTML as primary chat language is clever. Models are trained on it as much as markdown. SVG diagrams in chat without switching contexts is huge. Used Runable to build an agent interface with live HTML rendering. Prototyped in a weekend instead of weeks. The approach definitely works.
noctrex@reddit
But they can already chart diagrams using mermaid in markdown, and do it very well
Jipok_@reddit
https://www.reddit.com/r/LocalLLaMA/comments/1tq7yeq/qwen36_35b_txt_vs_markdown_vs_html_vs_htmlcss/
Weary-Step-8818@reddit
html as agent output makes sense when the interface is actually rendered. markdown is fine for text, but weak for state, layout, diagrams, forms, and inline controls. the real unlock is not prettier answers, it’s agents returning usable UI primitives.
Clear-Ad-9312@reddit
Huh, classic "it's not just, it's " llm kind of output
hapliniste@reddit
If you benchmark it, it is likely to score a ton lower since it will be out of distribution.
But it's true they could start to train the model for this directly. It could also add interactivity directly in chats.
Former-Ad-5757@reddit
I've just made my interface have 2 modes, reader mode and raw mode. Reader mode simply adds a lot of renderers on top of markdown to show mermaid diagrams, draw.io diagrams, excalidraws, images, etc.; raw mode is just the plain markdown.
This way I can copy/paste from raw mode and have a nice readability in reader mode.
HTML just adds too many variables for me for a chat. It is perfect for a report / export, and that is where I use it.
Either I need unlimited tokens and a generation speed of 150+ so the model can add all kinds of fancy tabs / popovers / popunders / tooltips / fancy JS tricks, all to add extra explanations which I don't immediately need to read.
Or I just stay with the current info as readable as possible.
Also, it opens up a whole new kind of attack vector: the browser's same-origin policy does not protect you from randomly generated HTML content that is served from your own host. Basically, with ollama and the like you get a localhost page with all the privileges allowed by that host.
Have fun running a finetune which always generates a 2 MB JavaScript inclusion to insert malware / ads etc.
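One mitigation for the localhost-privileges problem is to render each reply inside an `<iframe>` with an empty `sandbox` attribute, which blocks scripts and puts the content in its own opaque origin. A minimal sketch, assuming the server builds the chat page as a string (the `sandboxed_frame` helper is hypothetical):

```python
from html import escape


def sandboxed_frame(untrusted_html: str) -> str:
    """Wrap model output in an iframe that cannot run scripts or act as the host origin."""
    # An empty sandbox attribute applies every restriction; srcdoc carries the
    # document inline, so it must be escaped as an attribute value.
    return f'<iframe sandbox srcdoc="{escape(untrusted_html, quote=True)}"></iframe>'
```

The sandbox alone does not stop the frame from loading remote images or stylesheets, so a restrictive Content-Security-Policy would still be worth adding on top.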