If I want to use a small model to "decode" scanned PDFs with graphs, tables, etc., and feed the result to a large non-multimodal model, what is my best option?
Posted by Windowsideplant@reddit | LocalLLaMA | 2 comments
The large one would run in the cloud but isn't multimodal; the small one would run on a laptop.
daviden1013@reddit
I made a package for this (https://github.com/daviden1013/vlm4ocr). In my field (medical), we have a lot of scanned PDF documents that need OCR; it's part of my day-to-day work. I found Qwen3-VL to have the best performance. You can also try other VLM OCR frameworks like deepseek OCR, paddle OCR, smoldoclin...
JLeonsarmiento@reddit
Use the large online model to vibe-code this workflow in plain Python:
Run that orchestrator Python script from your laptop. Qwen3-VL is good and comes in three sizes: 4B, 8B, and 30B.
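A minimal sketch of the orchestrator the comments describe, using only the Python standard library. Assumptions not stated in the thread: the local VLM (e.g. Qwen3-VL served by vLLM or llama.cpp at `localhost:8000`) and the cloud model both expose an OpenAI-compatible `/chat/completions` endpoint, and the model names, URL, and prompt below are placeholders.

```python
# Two-stage workflow: a small local VLM OCRs each page image, then the
# extracted text goes to a large text-only cloud model with a question.
# Endpoint URLs and model names are illustrative placeholders.
import base64
import json
import urllib.request


def image_to_data_url(path: str) -> str:
    """Encode one page image (PNG) as a base64 data URL for the VLM."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:image/png;base64,{b64}"


def build_ocr_payload(image_path: str, model: str = "qwen3-vl-8b") -> dict:
    """Chat payload asking the small local VLM to transcribe a page."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe this page. Render tables as Markdown "
                         "and describe any graph in one sentence."},
                {"type": "image_url",
                 "image_url": {"url": image_to_data_url(image_path)}},
            ],
        }],
    }


def build_cloud_payload(question: str, pages: list[str],
                        model: str = "big-cloud-model") -> dict:
    """Chat payload sending the OCR'd pages plus a question to the
    large text-only cloud model."""
    context = "\n\n".join(f"--- page {i + 1} ---\n{p}"
                          for i, p in enumerate(pages))
    return {
        "model": model,
        "messages": [{"role": "user",
                      "content": f"{context}\n\nQuestion: {question}"}],
    }


def chat(base_url: str, api_key: str, payload: dict) -> str:
    """POST a payload to an OpenAI-compatible endpoint, return the text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Usage would be: convert the PDF to one image per page first (e.g. with pdf2image), call `chat("http://localhost:8000/v1", "none", build_ocr_payload(page_path))` for each page, collect the returned texts, then send them to the cloud endpoint via `build_cloud_payload`. Keeping page boundaries in the context helps the large model cite where a table came from.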