Improved markdown quality, code intelligence for 248 languages, and more in Kreuzberg v4.7.0

Posted by Eastern-Surround7763@reddit | Python | 3 comments

Kreuzberg v4.7.0 is here. Kreuzberg is a Rust-core document intelligence library with bindings for Python, TypeScript/Node.js, Go, Ruby, Java, C#, PHP, Elixir, R, C, and WASM.

We’ve added several features, integrated OpenWebUI, and made a big improvement in quality across all formats. There is also a new markdown rendering layer, and we now support HTML output. There are many other fixes and features as well (find them in the release notes).

The main highlight is code intelligence and extraction. Kreuzberg now supports 248 programming languages through our tree-sitter-language-pack library. This is a step toward making Kreuzberg an engine for agents. You can parse code efficiently, either by integrating Kreuzberg directly as a library for agents or via MCP. AI agents work with code repositories, review pull requests, index codebases, and analyze source files. Kreuzberg now extracts functions, classes, imports, exports, symbols, and docstrings at the AST level, with code chunking that respects scope boundaries.
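To make "AST-level extraction with scope-aware chunking" concrete, here is a minimal sketch of the general idea. It is not Kreuzberg's API: it uses Python's standard-library `ast` module as a stand-in for tree-sitter, it only handles Python source, and the names `extract_structure` and `chunk_by_scope` are hypothetical helpers invented for this illustration.

```python
import ast

# Sample input; any Python source string would work here.
SOURCE = '''
import os
from pathlib import Path

class Loader:
    """Load files from disk."""

    def read(self, name):
        """Return the file contents."""
        return Path(name).read_text()

def main():
    print(os.getcwd())
'''


def extract_structure(source: str) -> dict:
    """Collect imports, function names, class names, and docstrings (hypothetical helper)."""
    tree = ast.parse(source)
    info = {"imports": [], "functions": [], "classes": [], "docstrings": {}}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            info["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            info["imports"].append(node.module or "")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            key = "classes" if isinstance(node, ast.ClassDef) else "functions"
            info[key].append(node.name)
            doc = ast.get_docstring(node)
            if doc:
                info["docstrings"][node.name] = doc
    return info


def chunk_by_scope(source: str) -> list[str]:
    """Split source into one chunk per top-level statement (hypothetical helper).

    Each chunk spans a whole import, class, or function, so a chunk
    boundary never falls in the middle of a scope. Decorator lines are
    not handled in this sketch.
    """
    lines = source.splitlines()
    tree = ast.parse(source)
    return ["\n".join(lines[node.lineno - 1 : node.end_lineno]) for node in tree.body]


info = extract_structure(SOURCE)
chunks = chunk_by_scope(SOURCE)
```

The point of chunking on node boundaries rather than on a fixed character or line count is that each chunk handed to an agent or an index is a complete definition, not a function cut in half.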

Regarding markdown quality, poor document extraction can lead to further issues down the pipeline. We created a benchmark harness using Structural F1 (SF1) and Text F1 scoring across over 350 documents and 23 formats, then optimized based on that. LaTeX improved from 0% to 100% SF1. XLSX increased from 30% to 100%. PDF table SF1 went from 15.5% to 53.7%. All 23 formats are now at over 80% SF1. The output that downstream pipelines receive is now structurally correct by default.
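The post does not spell out how Structural F1 is computed, so the following is only a sketch of one plausible reading: treat a document as a bag of structural element labels (headings, paragraphs, tables) and compute the usual F1 from precision and recall against a reference. The function name `f1_score` and the element labels are assumptions made for illustration, not taken from the Kreuzberg benchmark harness.

```python
from collections import Counter


def f1_score(expected: list[str], predicted: list[str]) -> float:
    """F1 over two bags of labels (hypothetical scoring helper).

    Matching is by multiset intersection, so duplicates count once
    per occurrence and element order is ignored.
    """
    exp, pred = Counter(expected), Counter(predicted)
    overlap = sum((exp & pred).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(exp.values())
    return 2 * precision * recall / (precision + recall)


# Reference structure vs. an extraction that turned the table into a paragraph.
reference = ["heading", "paragraph", "table", "paragraph"]
extracted = ["heading", "paragraph", "paragraph", "paragraph"]
sf1 = f1_score(reference, extracted)
```

Under the same assumption, a Text F1 could reuse the function with word tokens in place of element labels, for example `f1_score(ref_text.split(), out_text.split())`.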

Kreuzberg is now available as a document extraction backend for OpenWebUI, with options for docling-serve compatibility or direct connection. This was one of the most requested integrations, and it’s finally here.

In this release, we’ve added a unified architecture in which every extractor creates a standard typed document representation. We also added the TOON wire format (a compact document encoding that reduces LLM prompt token usage by 30 to 50%), semantic chunk labeling, JSON output, strict configuration validation, and improved security. GitHub: https://github.com/kreuzberg-dev/kreuzberg

https://kreuzberg.dev

Contributions are always very welcome!

Python-ModTeam@reddit

Hello from the r/Python moderation team,

We appreciate your contribution but have noticed a high volume of similar projects (e.g. AI/ML wrappers, YouTube scrapers, etc.) or submissions that do not meet our quality criteria. To maintain the diversity and quality of content on our subreddit, your post has been removed.

All showcase, code review, project, and AI generated projects should go into the pinned monthly Showcase Thread.

You can also try reposting in one of the daily threads instead.

Thank you for understanding, and we encourage you to continue engaging with our community!

Best, The r/Python moderation team

LevelIndependent672@reddit

tree-sitter for code extraction is smart for agent tooling; the speed bump over ast.parse helps with large repos

adiberk@reddit

Ok I’m upgrading and will give it a shot again - seems you solved a lot of problems from initial versions I have tried