Announcing html-to-markdown v2: Rust rewrite, full CommonMark 1.2 compliance, and hOCR support

Posted by Goldziher@reddit | Python | View on Reddit | 7 comments

Hi Pythonistas,

I'm glad to announce the v2 release of html-to-markdown.

This library started life as a fork of markdownify, a Python library for converting HTML to Markdown. I forked it originally because I needed modern type hints, but then found myself rewriting the entire thing. Over time it became essential for kreuzberg, where it serves as a backbone for both html -> markdown and hOCR -> markdown.

I am working on Kreuzberg v4, which migrates much of it to Rust. This necessitated updating this component as well, which led to a full rewrite in Rust, offering improved performance, memory stability, and a more robust feature set.

v2 delivers Rust-backed HTML → Markdown conversion with Python bindings, a CLI and a Rust crate. The rewrite makes this by far the most performance and complete solution for HTML to Markdown conversion in python. Here are some benchmarks:

Apple M4 • Real Wikipedia documents • convert() (Python)

Document Size Latency Throughput Docs/sec
Lists (Timeline) 129KB 0.62ms 208 MB/s 1,613
Tables (Countries) 360KB 2.02ms 178 MB/s 495
Mixed (Python wiki) 656KB 4.56ms 144 MB/s 219

V1 averaged ~2.5 MB/s (Python/BeautifulSoup). V2’s Rust engine delivers 60–80x higher throughput.

The Python package still exposes markdownify-style calls via html_to_markdown.v1_compat, so migrations are relatively straightforward, although the v2 did introduce some breaking changes (see CHANGELOG.md for full details).

Highlights

Here are the key highlights of the v2 release aside from the massive performance improvements:

Target Audience

Comparison to Alternatives

If you end up using the rewrite, a ⭐️ on the repo always makes yours truly happy!