If I want to use a small model to "decode" scanned PDFs with graphs, tables, etc., and feed the result to a large non-multimodal model, what is my best option?
Posted by Windowsideplant@reddit | LocalLLaMA | 2 comments
The large one would run in the cloud but isn't multimodal; the small one would run on a laptop.
daviden1013@reddit
I made a package for this (https://github.com/daviden1013/vlm4ocr). In my field (medical), we have a lot of scanned PDF documents that need OCR; it's part of my day-to-day work. I found Qwen3-VL to have the best performance. You can also try other VLM OCR frameworks like deepseek OCR, paddle OCR, smoldoclin...
JLeonsarmiento@reddit
Use the large online model to vibe-code this workflow in plain Python:
Run that orchestrator Python script from your laptop. Qwen3-VL is good and comes in three sizes: 4B, 8B, and 30B.
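A minimal sketch of the orchestrator the comments describe, using only the Python standard library. Assumptions not stated in the thread: the local VLM (e.g. Qwen3-VL served by vLLM or llama.cpp at `localhost:8000`) and the cloud model both expose an OpenAI-compatible `/chat/completions` endpoint, and the model names, URL, and prompt below are placeholders.

```python
# Two-stage workflow: a small local VLM OCRs each page image, then the
# extracted text goes to a large text-only cloud model with a question.
# Endpoint URLs and model names are illustrative placeholders.
import base64
import json
import urllib.request


def image_to_data_url(path: str) -> str:
    """Encode one page image (PNG) as a base64 data URL for the VLM."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:image/png;base64,{b64}"


def build_ocr_payload(image_path: str, model: str = "qwen3-vl-8b") -> dict:
    """Chat payload asking the small local VLM to transcribe a page."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe this page. Render tables as Markdown "
                         "and describe any graph in one sentence."},
                {"type": "image_url",
                 "image_url": {"url": image_to_data_url(image_path)}},
            ],
        }],
    }


def build_cloud_payload(question: str, pages: list[str],
                        model: str = "big-cloud-model") -> dict:
    """Chat payload sending the OCR'd pages plus a question to the
    large text-only cloud model."""
    context = "\n\n".join(f"--- page {i + 1} ---\n{p}"
                          for i, p in enumerate(pages))
    return {
        "model": model,
        "messages": [{"role": "user",
                      "content": f"{context}\n\nQuestion: {question}"}],
    }


def chat(base_url: str, api_key: str, payload: dict) -> str:
    """POST a payload to an OpenAI-compatible endpoint, return the text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Usage would be: convert the PDF to one image per page first (e.g. with pdf2image), call `chat("http://localhost:8000/v1", "none", build_ocr_payload(page_path))` for each page, collect the returned texts, then send them to the cloud endpoint via `build_cloud_payload`. Keeping page boundaries in the context helps the large model cite where a table came from.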