Looking to build a local OCR solution to extract data from pdf scans.

Posted by oilman99999@reddit | LocalLLaMA | View on Reddit | 14 comments

Attached is an example of what it can be but the percentage of handwritten docs is not large. Most of them are typed and not too hard to extract. I am wondering if there is an open source solution that would be good at reading and extracting values from either. I am testing it on my home pc (RTX 4090) but at work we have a DGX Spark that I can use for this

[-]

scottgal2@reddit

Use docling. docling.ai I've built several products for customers for doing almost the same thing and really...docling solves many of the issues around 'normal' ocr.

[-]

Top_Fisherman9619@reddit

wow this is a solid project, thanks for mentioning this. Will keep this in mind

[-]

LocalLLaMA-ModTeam@reddit

Rule 1 - Search before asking. The content is frequently covered in this sub. Please search to see if your question has been answered before creating a new post.

[-]

VonDenBerg@reddit

Like how many? GLM or Gemini, no setup required. Pray and spray

[-]

Top_Fisherman9619@reddit

glm ocr is insane

[-]

exaknight21@reddit

Seconded.

[-]

youcloudsofdoom@reddit

I was actually using qwen 3.5 35b MoE for this today, did a brilliant job at OCR on handwritten notes photographed on paper, running locally on my 4060.

[-]

realtag2025@reddit

Also just tried it on the QWEN 3.5 9B . it got everything write except for dates that are handwritten.

[-]

SomeOrdinaryKangaroo@reddit

qwen 3.5 provides next generation ocr capabilities, highly recommend it

[-]

fernandolv3@reddit

Link to tesseract: https://github.com/tesseract-ocr/tesseract

[-]

total_amateur@reddit

Tesseract may be the way to go if you want open source.

I just built a quick JavaScript tool for OCR to Google Sheets using Google APIs if OP wants to reuse the logic. It was built to allow plugging in different APIs.

My use case was a bit different, though - reading receipts to split shared expenses.

[-]

FoxiPanda@reddit

I’ve found Gemma-4-26B to be my favorite for this.

Tesseract, Qwen3.5, Gemma-4, and various small OCR models could do this (Nvidia-ocr2, GLM-OCR, etc).

However I like Gemma-4 because it is willing to put in (?) marks where it isn’t sure about a name and won’t get stuck for 100s trying to decipher it. It straddles the right balance of speed, accuracy, and willingness to mark uncertainty in the analysis that I really find appealing.

[-]

turtleisinnocent@reddit

A dude managed to run IBM Granite to do OCR using WebGL. I think the notebook is on HF. Zero install, just run. Granite is pretty good too, it can handle TeX and maths, if that's something you need.

[-]

NoFaithlessness951@reddit

Even tesseract can do this