Announcing html-to-markdown v2: Rust rewrite, full CommonMark 1.2 compliance, and hOCR support
Posted by Goldziher@reddit | Python | 7 comments
Hi Pythonistas,
I'm glad to announce the v2 release of html-to-markdown.
This library started life as a fork of markdownify, a Python library for converting HTML to Markdown. I forked it originally because I needed modern type hints, but then found myself rewriting the entire thing. Over time it became essential for kreuzberg, where it serves as a backbone for both html -> markdown and hOCR -> markdown.
I am working on Kreuzberg v4, which migrates much of it to Rust. This necessitated updating this component as well, which led to a full rewrite in Rust, offering improved performance, memory stability, and a more robust feature set.
v2 delivers Rust-backed HTML → Markdown conversion with Python bindings, a CLI and a Rust crate. The rewrite makes this by far the most performant and complete solution for HTML to Markdown conversion in Python. Here are some benchmarks:
Apple M4 • Real Wikipedia documents • convert() (Python)
| Document | Size | Latency | Throughput | Docs/sec |
|---|---|---|---|---|
| Lists (Timeline) | 129 KB | 0.62 ms | 208 MB/s | 1,613 |
| Tables (Countries) | 360 KB | 2.02 ms | 178 MB/s | 495 |
| Mixed (Python wiki) | 656 KB | 4.56 ms | 144 MB/s | 219 |
V1 averaged ~2.5 MB/s (Python/BeautifulSoup). V2’s Rust engine delivers 60–80x higher throughput.
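The throughput and docs/sec columns follow directly from size and latency, so the table can be sanity-checked with a few lines of Python. This is only an arithmetic sketch: it assumes decimal units (1 KB = 1,000 bytes), and the function names are made up for illustration, not part of the library.

```python
# Recompute the derived columns of the benchmark table from size and latency.
# Assumption: decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes).

V1_THROUGHPUT_MB_S = 2.5  # approximate v1 average quoted in the post

# (document, size in KB, latency in ms), copied from the table
ROWS = [
    ("Lists (Timeline)", 129, 0.62),
    ("Tables (Countries)", 360, 2.02),
    ("Mixed (Python wiki)", 656, 4.56),
]


def throughput_mb_s(size_kb: float, latency_ms: float) -> float:
    """KB per millisecond is numerically the same as MB per second."""
    return size_kb / latency_ms


def docs_per_sec(latency_ms: float) -> float:
    """Number of documents converted per second at the given latency."""
    return 1000.0 / latency_ms


def speedup_vs_v1(size_kb: float, latency_ms: float) -> float:
    """Throughput relative to the quoted v1 average."""
    return throughput_mb_s(size_kb, latency_ms) / V1_THROUGHPUT_MB_S


if __name__ == "__main__":
    for name, size_kb, latency_ms in ROWS:
        print(
            f"{name}: {throughput_mb_s(size_kb, latency_ms):.0f} MB/s, "
            f"{docs_per_sec(latency_ms):.0f} docs/sec, "
            f"{speedup_vs_v1(size_kb, latency_ms):.0f}x vs v1"
        )
```

With these inputs the speedup works out to roughly 58x to 83x, which is in line with the quoted 60–80x.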
The Python package still exposes markdownify-style calls via html_to_markdown.v1_compat, so migrations are relatively straightforward, although v2 did introduce some breaking changes (see CHANGELOG.md for full details).
Highlights
Here are the key highlights of the v2 release aside from the massive performance improvements:
- CommonMark-compliant defaults with explicit toggles when you need legacy behaviour.
- Inline image extraction (convert_with_inline_images) that captures data URI assets and inline SVGs with sizing and quota controls.
- Full hOCR 1.2 spec compliance, including hOCR table reconstruction and YAML frontmatter for metadata to keep OCR output structured.
- Memory is kept in check by dedicated harnesses: repeated conversions stay under 200 MB RSS on multi-megabyte corpora.
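For anyone who wants to check the memory claim on their own documents, here is a minimal sketch of a repeated-conversion RSS check. It is not the project's actual harness: `peak_rss_mb` and `check_repeated_conversions` are hypothetical helpers, the converter is passed in as a plain callable, and the `resource` module it relies on is Unix-only.

```python
import resource
import sys
from typing import Callable


def peak_rss_mb() -> float:
    """Peak resident set size of the current process, in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def check_repeated_conversions(
    convert: Callable[[str], str],
    html: str,
    iterations: int = 1000,
    limit_mb: float = 200.0,
) -> float:
    """Run `convert` repeatedly and fail if peak RSS exceeds `limit_mb`."""
    for _ in range(iterations):
        convert(html)
    peak = peak_rss_mb()
    if peak > limit_mb:
        raise AssertionError(f"peak RSS {peak:.1f} MiB exceeds {limit_mb} MiB")
    return peak


if __name__ == "__main__":
    # Stand-in converter so the sketch runs without any third-party package;
    # swap in the real conversion function to measure it.
    def fake_convert(html: str) -> str:
        return html.replace("<p>", "").replace("</p>", "\n\n")

    sample = "<p>hello</p>" * 100_000
    print(f"peak RSS: {check_repeated_conversions(fake_convert, sample, 50):.1f} MiB")
```

Note that `ru_maxrss` is a high-water mark for the whole process, so it includes the interpreter and anything loaded before the loop; a flat reading across growing iteration counts is the interesting signal.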
Target Audience
- Engineers replacing BeautifulSoup-based converters that fall apart on large documents or OCR outputs.
- Python, Rust, and CLI users who need identical Markdown from libraries, pipelines, and batch tools.
- Teams building document understanding stacks (including the kreuzberg ecosystem) that rely on tight memory behaviour and parallel throughput.
- OCR specialists who need to process hOCR efficiently.
Comparison to Alternatives
- markdownify: the spiritual ancestor, but still Python + BeautifulSoup. html-to-markdown v2 keeps the API shims while delivering 60–80× more throughput, table-aware hOCR support, and deterministic memory usage across repeated conversions.
- html2text: solid for quick scripts, yet it lacks CommonMark compliance and tends to drift on complex tables and OCR layouts; it also allocates heavily under pressure because it was never built with long-running processes in mind.
- pandoc: extremely flexible (and amazing!), but large, much slower for pure HTML → Markdown pipelines, and not embeddable in Python without subprocess juggling. html-to-markdown v2 offers a slim Rust core with direct bindings, so you keep the performance while staying in-process.
If you end up using the rewrite, a ⭐️ on the repo always makes yours truly happy!
Here0s0Johnny@reddit
It would be great if you could compile it with web assembly so that one could use it more easily on any device.
Goldziher@reddit (OP)
Releases following your suggestion
Here0s0Johnny@reddit
Great job!
Imo, you're 95% there: if I were you, I'd add a GitHub pages site where one can paste html into a textbox and click a button to convert to markdown! Since a lot of use cases are probably single use, this would be the perfect solution.
Goldziher@reddit (OP)
Nice. PR is welcome.
Goldziher@reddit (OP)
Crossed my mind
tunisia3507@reddit
What is CommonMark 1.2? The latest version of the specification is 0.31.2.
Goldziher@reddit (OP)
I missed that in the title... it's the hOCR spec that is 1.2. Thanks for drawing my attention to this. I can't edit the title now, so it's there.
For CommonMark, the library is tested against the 0.31.2 specification tests (there is a JSON test suite from the spec).
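For readers wondering how an HTML → Markdown converter can be checked against a suite written for Markdown → HTML, one plausible approach is a round trip: convert each example's HTML to Markdown, render it back, and compare with the original HTML. The sketch below assumes that approach; the thread does not say how the library's tests actually use the suite. The entries mimic the shape of the spec's JSON dump (`markdown`, `html`, `example`, `section`) but are hand-written here, and both converters are passed in as callables.

```python
import json
from typing import Callable

# Hand-written entries in the same shape as the spec's JSON test dump.
# They are illustrative only and not copied from the specification.
EXAMPLES = [
    {"example": 1, "section": "Emphasis", "markdown": "*hi*\n", "html": "<p><em>hi</em></p>\n"},
    {"example": 2, "section": "Headings", "markdown": "# Title\n", "html": "<h1>Title</h1>\n"},
]


def load_examples(path: str) -> list[dict]:
    """Load a JSON test dump from disk (path is supplied by the caller)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def round_trip_failures(
    examples: list[dict],
    html_to_md: Callable[[str], str],
    md_to_html: Callable[[str], str],
) -> list[int]:
    """Return the example numbers whose HTML does not survive a round trip."""
    failures = []
    for ex in examples:
        markdown = html_to_md(ex["html"])
        if md_to_html(markdown) != ex["html"]:
            failures.append(ex["example"])
    return failures


if __name__ == "__main__":
    # Lookup-table stand-ins so the sketch runs on its own; replace them with
    # a real HTML -> Markdown converter and a CommonMark renderer.
    to_md = {ex["html"]: ex["markdown"] for ex in EXAMPLES}
    to_html = {ex["markdown"]: ex["html"] for ex in EXAMPLES}
    print(round_trip_failures(EXAMPLES, to_md.__getitem__, to_html.__getitem__))
```

Comparing rendered HTML rather than Markdown text avoids false failures when the converter picks a different but equivalent Markdown spelling.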