Using PaddleOCR-VL-1.5 with llama-server for book OCR
Posted by Final-Frosting7742@reddit | LocalLLaMA | 19 comments
I've been running PaddleOCR-VL-1.5 via llama.cpp's server for OCR on book pages. It handles complex layouts, tables, and mixed text/figure pages surprisingly well.
Setup:
- Model: PaddleOCR-VL-1.5-GGUF + mmproj.gguf
- Backend: llama-server (Vulkan on Windows)
- Pipeline: layout detection → region OCR → Markdown with HTML tables
The pipeline can process an entire folder of page photos end to end. You can basically digitise a book with a single command.
Repo: https://github.com/akmalayari/ocr-book
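The folder-level loop can be sketched as a client against llama-server's OpenAI-compatible chat endpoint. This is illustrative, not the repo's actual code: the server URL, the prompt text, and the `*.jpg` glob are all assumptions.

```python
import base64
import json
from pathlib import Path
from urllib import request

SERVER = "http://localhost:8080/v1/chat/completions"  # assumed llama-server address

def page_request(image_path: Path, prompt: str = "OCR this page to Markdown.") -> dict:
    """Build an OpenAI-style chat request with the page image inlined as a data URI."""
    b64 = base64.b64encode(image_path.read_bytes()).decode()
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

def ocr_folder(folder: str) -> str:
    """OCR every page image in the folder, in filename order, and join the results."""
    pages = []
    for img in sorted(Path(folder).glob("*.jpg")):
        req = request.Request(
            SERVER,
            data=json.dumps(page_request(img)).encode(),
            headers={"Content-Type": "application/json"},
        )
        with request.urlopen(req) as resp:
            body = json.load(resp)
        pages.append(body["choices"][0]["message"]["content"])
    return "\n\n".join(pages)
```

Sorting by filename is what makes "photograph pages in order, then run one command" work.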
Has anyone else experimented with vision-language models for OCR?
76vangel@reddit
Anyone know how to do handwriting? I have a pile of WW2 soldier/spy diaries I want transcribed.
Final-Frosting7742@reddit (OP)
Honestly, that's something I want to try too. Once I test it I'll get back to you.
76vangel@reddit
That would be amazing, thanks.
Arkenstonish@reddit
You use the usual VL model pipeline, e.g. Qwen 3.5 or 3.6 (with an mmproj in the GGUF case).
Multilingual cursive is usually in good shape even at small quants (e.g. up to Q3, so even 8GB of VRAM is sufficient).
You can also combine classic OCR layout detection (PaddlePaddle) with a big VL model for the recognition step.
Layout-wise: Qwen 3.5 and 3.6 are also trained for grounding tasks, so you can request "bounding box for
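In the layout-then-recognise approach described above, the step between the two models is putting the detected boxes into reading order before cropping and sending each region to the VL model. A minimal sketch of that ordering; `Region` and `reading_order` are illustrative names, not any library's actual API:

```python
from dataclasses import dataclass

@dataclass
class Region:
    """A layout-detector box: (x, y) of the top-left corner plus width/height."""
    x: int
    y: int
    w: int
    h: int
    kind: str  # e.g. "text", "table", "figure"

def reading_order(regions: list[Region], row_tol: int = 20) -> list[Region]:
    """Sort detected regions top-to-bottom, then left-to-right within a row.

    Regions whose tops are within `row_tol` pixels are treated as one row,
    so side-by-side regions keep their left-to-right order.
    """
    rows: list[list[Region]] = []
    for r in sorted(regions, key=lambda r: r.y):
        if rows and abs(rows[-1][0].y - r.y) <= row_tol:
            rows[-1].append(r)
        else:
            rows.append([r])
    ordered: list[Region] = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda r: r.x))
    return ordered
```

Each ordered region would then be cropped from the page image and sent to the recognition model one at a time. (A real pipeline would also need column handling for multi-column pages.)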
HareMayor@reddit
What sampling parameters are you using ?
Final-Frosting7742@reddit (OP)
I haven't tried messing with the sampling parameters, but PaddleOCR-VL is very faithful to the text; it probably uses temp=0.0.
I've never seen it hallucinate except on mirrored text, and that's an extremely vicious edge case. Even there, it prefers producing gibberish over guessing a word.
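If you do want to force deterministic decoding, llama-server's OpenAI-compatible endpoint accepts per-request sampling overrides. A sketch of the relevant fields (the repo itself may simply rely on server defaults):

```python
# Per-request sampling overrides for llama-server's OpenAI-compatible endpoint.
payload = {
    "temperature": 0.0,  # greedy decoding: deterministic, no creative guessing
    "top_k": 1,          # redundant at temperature 0, but makes the intent explicit
    "messages": [{"role": "user", "content": "[image + OCR prompt go here]"}],
}
```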
HareMayor@reddit
It keeps generating HTML-formatted text instead of Markdown.
Can you tell me if you give it a specific prompt?
Final-Frosting7742@reddit (OP)
I added full Markdown conversion. The default behaviour is now pure Markdown, with an option to keep HTML tables and graphs. Check it out.
Final-Frosting7742@reddit (OP)
No, the prompt is hardcoded in the PaddleX library. The HTML is actually the expected behaviour for tables and graphs: PaddleOCR-VL natively outputs HTML for those, and the postprocess only strips part of it to keep things concise, so it's still HTML. It wasn't an issue for my use case, but you're right, I should add an option to output pure Markdown.
Thanks for the feedback.
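Converting those HTML tables to Markdown pipe tables can be sketched with the stdlib `html.parser`. This is illustrative, not the repo's actual postprocess, and it ignores `colspan`/`rowspan`, which PaddleOCR-VL's table HTML can contain:

```python
from html.parser import HTMLParser

class TableToMarkdown(HTMLParser):
    """Collect <tr>/<td>/<th> cells from an HTML table."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell, self.in_cell = [], [], [], False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.in_cell, self.cell = True, []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.row.append("".join(self.cell).strip())
            self.in_cell = False
        elif tag == "tr":
            self.rows.append(self.row)
            self.row = []

    def handle_data(self, data):
        if self.in_cell:
            self.cell.append(data)

def html_table_to_markdown(html: str) -> str:
    """Render the first row as the header, the rest as body rows."""
    parser = TableToMarkdown()
    parser.feed(html)
    header, *body = parser.rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)
```

Merged cells are the reason keeping the HTML as an option is sensible: Markdown tables simply can't represent them.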
ganonfirehouse420@reddit
I've actually created a Python script to perform OCR with gemma4-e4b-it. The script should be model-independent and work with any model that can produce proper Markdown formatting. My last try, with glm-ocr, didn't work well: the formatting was always wrong.
Final-Frosting7742@reddit (OP)
The small gemma4 models are interesting since they can handle any file type (text, image, audio). How was the processing speed, though? PaddleOCR-VL is only 0.9B parameters, so it's pretty fast for the task. Running gemma4-e4b-it to digitise an entire book on my hardware would probably take a full day.
ganonfirehouse420@reddit
E4B offloads a lot of the processing to the CPU, so I get decent speed even with an average GPU. Even with an 8GB GPU I can reach around 30 tokens per second. It takes a while longer than a small model, but small models have always failed me at table generation.
ready_to_fuck_yeahh@reddit
Also try z.ai ocr locally, it's just 0.9B
Final-Frosting7742@reddit (OP)
I wasn't aware glm-ocr was this good: 94.6 on OmniDocBench.
Speed? About 20s per page (40s per double-page) on my Ryzen AI 9 HX 370. I might give glm-ocr a try since it seems faster.
ready_to_fuck_yeahh@reddit
Could you share the GitHub link once you set up OCR with GLM, hehehe.
Final-Frosting7742@reddit (OP)
Lol sure. I'll post here with the new setup once it's done.
Mkengine@reddit
There are so many OCR / document understanding models out there, here is my personal OCR list I try to keep up to date:
GOT-OCR:
https://huggingface.co/stepfun-ai/GOT-OCR2_0
granite:
https://huggingface.co/ibm-granite/granite-docling-258M
https://huggingface.co/ibm-granite/granite-4.0-3b-vision
MinerU:
https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B
https://huggingface.co/opendatalab/MinerU-Diffusion-V1-0320-2.5B
OCRFlux:
https://huggingface.co/ChatDOC/OCRFlux-3B
MonkeyOCR-pro:
1.2B: https://huggingface.co/echo840/MonkeyOCR-pro-1.2B
3B: https://huggingface.co/echo840/MonkeyOCR-pro-3B
RolmOCR:
https://huggingface.co/reducto/RolmOCR
Nanonets OCR:
https://huggingface.co/nanonets/Nanonets-OCR2-3B
dots OCR:
https://huggingface.co/rednote-hilab/dots.ocr
https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5
https://huggingface.co/rednote-hilab/dots.mocr
olmocr 2:
https://huggingface.co/allenai/olmOCR-2-7B-1025
Light-On-OCR:
https://huggingface.co/lightonai/LightOnOCR-2-1B
Chandra:
https://huggingface.co/datalab-to/chandra-ocr-2
Jina vlm:
https://huggingface.co/jinaai/jina-vlm
HunyuanOCR:
https://huggingface.co/tencent/HunyuanOCR
bytedance Dolphin 2:
https://huggingface.co/ByteDance/Dolphin-v2
PaddleOCR-VL:
https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5
Deepseek OCR 2:
https://huggingface.co/deepseek-ai/DeepSeek-OCR-2
GLM OCR:
https://huggingface.co/zai-org/GLM-OCR
Nemotron OCR:
https://huggingface.co/nvidia/nemotron-ocr-v2
Qianfan-OCR:
https://huggingface.co/baidu/Qianfan-OCR
Falcon-OCR:
https://huggingface.co/tiiuae/Falcon-OCR
FireRed-OCR:
https://huggingface.co/FireRedTeam/FireRed-OCR
Typhoon-OCR:
https://huggingface.co/typhoon-ai/typhoon-ocr1.5-2b
Churro-3B:
https://huggingface.co/stanford-oval/churro-3B
Service-Kitchen@reddit
Yes, it's an amazing model, I've heard this is a competitive model too: https://huggingface.co/datalab-to/chandra-ocr-2
For digitising books, the hard part is getting all the pages scanned. There's no at-home solution for that beyond manual toil.
Final-Frosting7742@reddit (OP)
Clearly. You need to take a photo of every page; I just go chapter by chapter. Either way it beats paying €600 for an OCR machine that will probably butcher your graphs and tables.