Docling, how does it work with VLM?

Posted by gevorgter@reddit | LocalLLaMA | 1 comment

So I have a need to convert PDFs to text for data extraction. Regular/traditional OCR does a very good job, but unfortunately it does not take the layout into consideration, so while each word is perfectly recognized, the output is gibberish if you try to read it: every word is understood, but the actual text does not make sense.

VLMs, such as Qwen3-VL or OpenAI's models, do a good job of producing text that respects the layout, so that approach makes sense, but unfortunately their actual OCR is not nearly as good: they hallucinate often, and there are no coordinates for where a word was found.

So now I am looking at Docling: it uses its own OCR, but then sends the result to a VLM for processing.

The question is: what is the output of Docling? Docling tags, which would be a "marriage" of the two worlds, OCR and VLM?

How does it do that, i.e., how does it marry the VLM output with the OCR output?
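To make the question concrete: this is a toy sketch of the general idea being asked about, not Docling's actual pipeline. All the names and the merge logic here are my own illustration. The concept: OCR gives you words with coordinates, a layout model gives you regions in reading order, and "marrying" the two means assigning each word to a region and sorting within it.

```python
# Illustrative only: NOT Docling's real algorithm, just a toy showing how
# word-level OCR output (text + coordinates) could be combined with layout
# regions (e.g., from a layout/VLM model) to recover reading order.

def merge_ocr_with_layout(words, regions):
    """words: list of (text, x, y) tuples from OCR.
    regions: list of (name, x0, y0, x1, y1) boxes from a layout model,
    already given in reading order. Returns (region_name, text) pairs."""
    out = []
    for name, x0, y0, x1, y1 in regions:
        inside = [w for w in words if x0 <= w[1] < x1 and y0 <= w[2] < y1]
        # Within a region, sort top-to-bottom, then left-to-right.
        inside.sort(key=lambda w: (w[2], w[1]))
        out.append((name, " ".join(w[0] for w in inside)))
    return out

# Two-column page: naive OCR in raw y-order would interleave the columns.
words = [("Left1", 0, 0), ("Right1", 50, 0), ("Left2", 0, 10), ("Right2", 50, 10)]
regions = [("col-left", 0, 0, 40, 100), ("col-right", 40, 0, 100, 100)]
print(merge_ocr_with_layout(words, regions))
# [('col-left', 'Left1 Left2'), ('col-right', 'Right1 Right2')]
```

Note how plain y-sorted OCR would read "Left1 Right1 Left2 Right2", while the layout-aware merge keeps each column intact, and the word coordinates survive because the text is built from the OCR boxes rather than regenerated by the model.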