Running a non-profit that needs to OCR 64 million pages. Where can I apply for free or subsidized compute to run a local model?
Posted by thereisnospooongeek@reddit | LocalLLaMA | View on Reddit | 99 comments
I'm running a not-for-profit and need to OCR 64 million pages to build a knowledge base. We don't have the funding and have been using a Vast.ai instance for OCR, but we recently ran out of credits. What are some alternatives where I can apply to get compute?
Civil-Image5411@reddit
Hi, if you have a recent Nvidia GPU you can use this one for free: https://github.com/aiptimizer/TurboOCR. It's one line to start the server via Docker, and it gives you back JSON with text, bounding boxes, and layout (with layout it's a bit slower, though). If you don't have a GPU, you can rent a 5090 on Vast.ai or RunPod for ~$0.50 per hour; if the pages are not too dense you will maybe get 300 pages per second, which would cost you around $30 for all of it. If you don't trust the numbers I can spin it up for you on my 5090 and let you test 😁
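(A quick sanity check on those numbers, taking the claimed throughput and rental price at face value; this is just arithmetic, not a benchmark.)

```python
# Back-of-the-envelope cost check, using the commenter's claimed numbers:
# ~300 pages/s on a rented 5090 at ~$0.50/hour.
pages = 64_000_000
pages_per_second = 300      # claimed throughput, not measured
usd_per_hour = 0.50         # typical Vast.ai / RunPod rate for a 5090

gpu_hours = pages / pages_per_second / 3600
print(f"~{gpu_hours:.0f} GPU-hours, ~${gpu_hours * usd_per_hour:.0f} total")
# -> roughly 59 GPU-hours, about $30
```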
Normal-Ad-7114@reddit
Do you have a sample page, preferably of the shittier quality?
thereisnospooongeek@reddit (OP)
Yes, Here you go.
https://imgur.com/a/uuNXLxj
Normal-Ad-7114@reddit
Oh yeah, properly shitty😖
thereisnospooongeek@reddit (OP)
That's a document from 1950, so this is the best we could get!
CATLLM@reddit
What kind of documents are these? Different types of docs can use different OCR methods, some of which are much faster.
thereisnospooongeek@reddit (OP)
There are both scanned documents and digital versions of the documents. I uploaded one example:
https://imgur.com/a/uuNXLxj
CATLLM@reddit
Is that the real quality of the scan? I think for something like this you will want a dedicated OCR model and maybe a preprocessing step to clean it up.
PaddleOCR or DeepSeek OCR come to mind.
thereisnospooongeek@reddit (OP)
I will upload a sample file tomorrow.
Dominican_mamba@reddit
Hey OP, maybe try kreuzberg. It’s free and maybe works for you?
https://github.com/kreuzberg-dev/kreuzberg
I can try volunteering to help you if you want.
For the compute, you could try Google Colab since it has access to TPUs: download the models there, then save the output to Google Drive or to a storage of your choosing.
thereisnospooongeek@reddit (OP)
Thanks for sharing, really appreciate it.
I used Claude to help set up the pipeline with kreuzberg, and it’s working well so far. For now, I just want to finish the current run with PyMuPDF.
Also appreciate the offer to help, I may reach out once I start iterating further.
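(For context on the PyMuPDF step OP mentions: pulling the embedded text layer out of born-digital PDFs is only a few lines. A minimal sketch; the directory names are placeholders.)

```python
# Minimal sketch: extract the embedded text layer from digital PDFs with PyMuPDF.
# This only helps for PDFs that already contain text; scanned pages still need OCR.
import pathlib
import fitz  # PyMuPDF

src = pathlib.Path("pdfs")   # placeholder input directory
out = pathlib.Path("out")    # placeholder output directory
out.mkdir(exist_ok=True)

for pdf_path in src.glob("*.pdf"):
    doc = fitz.open(pdf_path)
    text = "\n".join(page.get_text("text") for page in doc)
    (out / (pdf_path.stem + ".txt")).write_text(text, encoding="utf-8")
    doc.close()
```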
Hour_Inevitable_9811@reddit
Use MinerU CPU pipeline. Using one regular computer dedicated to this, it will take one year to do the job.
thereisnospooongeek@reddit (OP)
This is precisely what I'm using now. Thanks.
RelationshipThink589@reddit
Will 5090s suffice? I can donate GPU hours from my clusters
thereisnospooongeek@reddit (OP)
Yes, that would certainly be helpful. Can I DM you?
RelationshipThink589@reddit
yes
thereisnospooongeek@reddit (OP)
I have built a hybrid approach since I made the post. I'm trying to finish this CPU-only for now, but if I need some GPU hours, I will certainly reach out to you.
trabulium@reddit
I've been doing lower-volume document work and have been looking at solutions, and found this:
https://huggingface.co/lightonai/LightOnOCR-2-1B
Multi-GPU would get it done in a week.
thereisnospooongeek@reddit (OP)
Thanks for the guidance here. I will check this out. I was using olmOCR because of its out-of-the-box pipeline support.
So I hosted my pipeline on Hetzner and connected it to Vast.ai pods. This was working just fine until I ran out of credits. Let me check LightOnOCR today.
trabulium@reddit
I'd love to hear back how it went. I've not had a chance to do any testing yet, especially with u/SexyAlienHotTubWater's suggestions to get that price down.
SexyAlienHotTubWater@reddit
Might be able to push these numbers down further. On Vast, the 3090 is 2/5 the price for similar bandwidth and maybe 3/5 the FLOPs.
The 5090 is 20% more expensive for roughly 1.75x the bandwidth and flops.
Spot prices are also 3/5 the booked price, and this workload is very tolerant to interruption.
If the H100 only gets double the throughput, that suggests to me the workload is bandwidth-constrained, so a B100 would be a better option and the 3090 will give very similar results to the 4090 at much less cost.
thereisnospooongeek@reddit (OP)
// $1K// This is the sweet spot I'm aiming for. Thanks
tophlove31415@reddit
Can't you ocr using Python with some simple strats? It's not perfect, but maybe it's good enough?
thereisnospooongeek@reddit (OP)
Thanks for that. This is what I have done for now.
This is the latest progress:
Rich_Artist_8327@reddit
If you are a business etc. in Europe, there is a load of free GPU capacity available:
https://www.eurohpc-ju.europa.eu/ai-factories/ai-factories-access-calls_en
thereisnospooongeek@reddit (OP)
No, we are not a business. We are a not-for-profit helping students in the legal space. I guess technically the work we are doing can be leveraged by any students, but we are starting with students pursuing law.
Azuriteh@reddit
I think I've seen a client get credits from Azure... maybe you can also try asking lium? Last I heard they were giving some grants.
thereisnospooongeek@reddit (OP)
Thanks, I will explore that.
amberdrake@reddit
Depending on your time frame, I would actually recommend local processing with a cheap two-tier system: first a minimal CPU-and-RAM machine running Tesseract OCR, then pass only the failures to a second machine with a GPU running SGLang and GLM-OCR. I literally did a little over 2 million documents last week on my home computer; it would go faster with the tiers split across machines, but even on one machine I did fine.
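(A rough sketch of that two-tier routing, using Tesseract's word confidences to decide which pages get escalated to the GPU tier; the 60 threshold, paths, and queue handling are placeholder assumptions, not the commenter's actual setup.)

```python
# Tier 1: cheap Tesseract pass on CPU; pages with low mean confidence get queued
# for a second pass on the GPU tier (e.g. a VLM served by SGLang or vLLM).
import pathlib
import pytesseract
from PIL import Image

CONF_THRESHOLD = 60  # placeholder: minimum mean word confidence to accept tier 1

def tesseract_pass(image_path):
    img = Image.open(image_path)
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    confs = [float(c) for c in data["conf"] if float(c) >= 0]  # -1 means "no text here"
    mean_conf = sum(confs) / len(confs) if confs else 0.0
    return pytesseract.image_to_string(img), mean_conf

gpu_queue = []
for page in pathlib.Path("pages").glob("*.png"):  # placeholder directory
    text, conf = tesseract_pass(page)
    if conf < CONF_THRESHOLD:
        gpu_queue.append(page)  # escalate to the GPU/VLM tier
    else:
        page.with_suffix(".txt").write_text(text, encoding="utf-8")

print(f"{len(gpu_queue)} pages escalated to the GPU tier")
```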
thereisnospooongeek@reddit (OP)
Yes, this is more or less what I'm doing now. Instead of running locally, I spun up a VM on Hetzner.
scottgal2@reddit
Docling (docling.ai). 64 million takes a while, but I've gotten to a few thousand a day on laptops with tuning; it's best suited when you need RAG segmentation, since it gives you structural cues etc.
But it depends on the document really; if it's good scans of standard fonts, Tesseract alone would rip through these.
EasyOCR etc. are subsumed by Docling (it has them internally / you can specify their use).
thereisnospooongeek@reddit (OP)
I tried it, it took approximately 202 seconds to process a 768-page document.
thereisnospooongeek@reddit (OP)
I will try it tomorrow
EastZealousideal7352@reddit
How fast do you need it?
If you’re cool with it running 24/7 for a few weeks/months then the cheapest way is likely a local machine with tesseract or similar. Depends on a bunch of things of course like storage, networking, etc…
If you need it quicker than that then you might want to look at AWS textract depending on your needs. If you just want the words off the page and can take responsibility for storing, sorting, indexing, and whatever else you need to do then this is probably the cheapest way.
thereisnospooongeek@reddit (OP)
I'm aiming to get it processed in one month. Thanks to the support and guidance, I have built a hybrid pipeline now.
We are staying as far away as possible from AWS because of the notorious bills. I don't think any small not-for-profit can even afford AWS.
Hetzner all the way for K8s pods, and I also have a 64-core, 128 GB RAM server. GPUs have become super expensive, so we have been renting from Vast.ai.
Academic_Sleep1118@reddit
I used GLM-OCR on an RTX 6000 Blackwell instance that I rented on Vast.ai (should have taken a 5090 instead, much cheaper for the job), and got away with something like $1 per 200 MB of output. Assuming you have around 760 billion characters in your 64 million pages, it would cost 760/0.2 = $3,800. You could lower that price by going with cheaper GPUs, like 5070s or 5090s (multi-GPU is perfectly okay for this kind of job).
thereisnospooongeek@reddit (OP)
I have been using olmOCR because of the native pipeline support. For GLM-OCR, would you mind explaining your setup or pipeline a bit further? I would also like to run a benchmark between GLM-OCR and olmOCR.
one-escape-left@reddit
worth doing the math on what you actually need.
Assuming quantized Qwen3.5 27B at ~30s/page:
- 1 GPU → ~61 years
- 10 GPUs → ~6 years
- 100 GPUs → ~222 days
- 1,000 GPUs → ~22 days
1000x RTX 6000 Ada instances running for 22 days costs ~$264K–$660K on spot markets
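(The arithmetic behind those figures, so you can plug in your own per-page latency; 30 s/page is this commenter's assumption for an unbatched ~27B VLM.)

```python
# GPU-count scaling for 64M pages at an assumed per-page latency (no batching).
pages = 64_000_000
seconds_per_page = 30  # assumption from the comment above

for gpus in (1, 10, 100, 1_000):
    days = pages * seconds_per_page / gpus / 86_400
    print(f"{gpus:>5} GPUs -> {days:>8,.0f} days (~{days / 365:.1f} years)")
# 1 GPU ~61 years, 10 ~6 years, 100 ~222 days, 1,000 ~22 days
```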
thereisnospooongeek@reddit (OP)
//~$264K–$660K//
Jeez! I wish I had that kind of budget. We barely get 500 USD to 1K USD in donations.
Ok-Internal9317@reddit
You forgot vLLM, which allows multiple inferences to run at the same time, so it's more like 30s per 30 pages. And for OCR purposes a smaller Qwen 9B can be enough for the task. If it's PDF to text, then PyMuPDF is way faster and more accurate than vision-only LLMs. So your cost estimate is a bit high.
one-escape-left@reddit
Yeah, I mostly agree. Throughput won't be 1s per page though, even with vLLM. I know because I'm running this setup; it's not that fast. I believe relying on text-based OCR is too unreliable and you'll get more consistent results with a vision-only pipeline. Maybe 9B is fine, but if you want the extra 1% from the benchmarks then 27B is the winner as of now.
guesdo@reddit
Shouldn't smaller models tailored for OCR be way better and faster for this purpose?
Like Qwen3.5 is a beast for complex images, but for a knowledge base out of text, something like IBM Docling might just do better and faster overall.
Additional-Bet7074@reddit
Yes, Q3.5 won't perform as well as the smaller specialized OCR models and frameworks, in either quality or speed.
Cronus_k98@reddit
In my experience Qwen3.5 gives better-quality results than the OCR-specific models I've tried, especially with handwriting. It's very slow though. Qwen3.5 4B is decently fast. I settled on the 35B model because I was doing additional summarization and I'm OK with the slower speed.
Additional-Bet7074@reddit
Interesting. Have you tried cursive? I'd love a reliable way to transcribe my handwritten notes from the field. I've never gotten used to tablets, so I still use paper, but it just ends up being a pile of papers I can't easily search.
one-escape-left@reddit
Benchmarks on document understanding that support this?
Additional-Bet7074@reddit
No benchmark, just my own understanding of the training for OCR models versus larger models like Qwen.
Both can be used; with Docling I use a larger local model for things like image descriptions.
I would love to see some comparison benchmarks, and I could see Qwen3.5 performing better on document structure, particularly if fine-tuned, but it would come at a major expense of speed and memory use.
Using the smaller, quicker models initially for document structure, then extracting with a fast OCR engine like Tesseract, and then doing additional things with a larger model (or maybe a quality pass) is the best method I have found.
moserine@reddit
Docling is extremely slow, though not 30s/page slow. The model OP is using (olmOCR) is a pretty decent tradeoff, though it will perform worse than more recent models. The top of OmniDocBench at the moment is GLM-OCR, which also clocks in at 0.9B params, so it's going to be much, much faster (1.86 pages/s for PDF).
Mashic@reddit
If I were to go the LLM route, I'd use Qwen3-VL-30B-A3B. The 27B is overkill for this.
ML-Future@reddit
Is that necessary? A script can do resize, then use a small model like GLM OCR
Mashic@reddit
The guy came here, asked a question about a gigantic job, and doesn't even care about providing extra necessary information like what type of documents, the quality of the scan, if he has even already scanned them.
thereisnospooongeek@reddit (OP)
Fair point, I should’ve added more context.
They’re legal PDFs, a mix of digital and scanned docs with varying quality. Most are already scanned. I’ve built a pipeline to classify and only OCR where needed, and Tesseract works fine for most cases.
I have updated more info in the replies now.
johnerp@reddit
What’s the use case?
What are you trying to extract from what types of docs?
Can you ocr on demand, or does it all need to be done upfront?
Regular old OCR might be fine.
Those are huge numbers, do you need all of it?
If NFP maybe you need a business case to spend some donations, run a donation drive etc.
thereisnospooongeek@reddit (OP)
The use case is to extract structured, searchable text from legal PDFs to support students with research and thesis work. The documents are a mix of digital PDFs and scanned/image-based files, so the pipeline first classifies them and only applies OCR where needed.
I’m doing OCR on demand rather than upfront to save compute and avoid unnecessary processing. For a large portion of the documents, standard text extraction works well, and realized Tesseract is sufficient for most of the scanned ones.
Since this is a non-profit effort, I'm trying to be resource-conscious first and only consider funding or donations if it becomes absolutely necessary. I don't know, maybe during the RAG and embeddings stage it might be required. Let's see.
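(The classify-then-route step OP describes can be approximated by checking whether a page already has a usable text layer. A sketch with PyMuPDF; the 100-character threshold is an arbitrary placeholder, not OP's actual heuristic.)

```python
# Sketch of "classify first, OCR only where needed": pages with a real text layer
# skip OCR entirely; the rest go to Tesseract (or a VLM for the hard ones).
import fitz  # PyMuPDF

MIN_CHARS = 100  # placeholder heuristic for "this page already has embedded text"

def pages_needing_ocr(pdf_path):
    doc = fitz.open(pdf_path)
    needs_ocr = [i for i, page in enumerate(doc)
                 if len(page.get_text("text").strip()) < MIN_CHARS]
    doc.close()
    return needs_ocr  # page indices to send through the OCR tier

print(pages_needing_ocr("sample.pdf"))  # placeholder file
```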
anykeyh@reddit
64M pages is no joke haha.
If your corpus is consistent (same doc types, same domain) and relatively good quality, a small vision model + a bit of LoRA fine-tuning on a representative sample might get you surprisingly far without breaking the bank.
How I would do it: create a gold-standard dataset of ~500 pages across your hardest document types.
Measure quality and throughput across the different solutions. Without that you're just guessing.
Then you can properly estimate time and cost based on that.
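(For the quality half of that comparison, character error rate against the gold transcriptions is usually enough. A minimal sketch with a plain Levenshtein distance so there's no extra dependency; the sample strings are made up.)

```python
# Character error rate (CER): edit distance between OCR output and a hand-checked
# gold transcription, divided by the gold length. Lower is better.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def cer(gold: str, hypothesis: str) -> float:
    return levenshtein(gold, hypothesis) / max(len(gold), 1)

print(cer("Judgment delivered in 1950", "Judgrnent delivered in 1950"))
```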
For grants: Google TPU Research Cloud, AWS Research Credits, Oracle for Research: all have non-profit tracks, worth applying in parallel.
I run a small data shop and this is the kind of subject we deal with constantly. But in your case, you're in the high tier in terms of volume. And OCR is only the first part of your problem: what about document storage and retrieval?
Good luck !
thereisnospooongeek@reddit (OP)
Baby steps! The first part of the problem was gathering 16 million documents. I ran into a multitude of problems during the scraping, inode exhaustion, and what not!
Currently all the documents are stored in a remote Storage Box. I'm aiming for Backblaze B2 for object storage so that I get S3 compatibility.
Ingesting them for RAG is another piece of the puzzle. I'm pretty much a one-man shop, but I will tackle these issues one at a time.
shansoft@reddit
Use Tesseract OCR; it's open source. If your documents have different languages or funky writing that OCR can't handle, I would suggest Gemma 3 27B (or Gemma 4 31B). I get much better accuracy with Gemma than with any other OCR LLM when dealing with languages other than English.
instantlybanned@reddit
Do you have absolutely no budget? What's your timeline? If you have 1-2k (maybe even less) and it doesn't need to get done within a few days, then there are options.
thereisnospooongeek@reddit (OP)
We get small donations, so we should be able to shell out 500 to 1K USD for the job. 1K USD is quite a stretch, actually.
I'm currently exploring these options.
Superb_Onion8227@reddit
If you get a program that can do OCR on a laptop running at 10W and processing 1 page per second, that's 3,600 pages per 0.01 kWh. For 64M pages that's roughly 18,000 hours of runtime (about two years) and around 180 kWh of energy.
So electricity is only a small part of the cost; the real lower bound here is time, not energy.
instantlybanned@reddit
Look into PaddleOCR; their models are state of the art for OCR. I use PP-OCRv5 on images. It's quite a small model and a million times better than Tesseract. They have other models that are specifically for documents. You can deploy PP-OCRv5 on one or multiple small GPUs; it runs fine on AWS's smallest GPU instance, g4dn.xlarge, for example. You can even deploy two or maybe even three models in parallel on such a small GPU (if you use PP-OCRv5).
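(A minimal sketch of PaddleOCR's Python interface; the exact API has shifted between major releases, so treat the call signatures as approximate and check the current docs.)

```python
# Illustrative PaddleOCR usage (API details vary between major versions).
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="en")            # downloads detection + recognition models on first run
result = ocr.ocr("sample_page.png")   # placeholder image path

for line in result[0]:                # each entry: [bounding box, (text, confidence)]
    box, (text, confidence) = line
    print(f"{confidence:.2f}  {text}")
```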
Torodaddy@reddit
Why would you think there's free compute like that?
No-Flatworm-9518@reddit
Check out Google Cloud research credits.
AWS Imagine grants are another good one for nonprofits.
Some universities have spare HPC capacity if you partner with them.
Just gotta apply and explain the public benefit.
JayPSec@reddit
https://www.reddit.com/r/LocalLLaMA/comments/1shqf5a/using_ocr_models_with_llamacpp_by_ngxson/ worth checking out
Mkengine@reddit
There are so many OCR / document understanding models out there, here is my personal OCR list I try to keep up to date:
GOT-OCR:
https://huggingface.co/stepfun-ai/GOT-OCR2_0
granite:
https://huggingface.co/ibm-granite/granite-docling-258M
https://huggingface.co/ibm-granite/granite-4.0-3b-vision
MinerU:
https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B https://huggingface.co/opendatalab/MinerU-Diffusion-V1-0320-2.5B
OCRFlux:
https://huggingface.co/ChatDOC/OCRFlux-3B
MonkeyOCR-pro:
1.2B: https://huggingface.co/echo840/MonkeyOCR-pro-1.2B
3B: https://huggingface.co/echo840/MonkeyOCR-pro-3B
RolmOCR:
https://huggingface.co/reducto/RolmOCR
Nanonets OCR:
https://huggingface.co/nanonets/Nanonets-OCR2-3B
dots OCR:
https://huggingface.co/rednote-hilab/dots.ocr https://modelscope.cn/models/rednote-hilab/dots.ocr-1.5 https://huggingface.co/rednote-hilab/dots.mocr
olmocr 2:
https://huggingface.co/allenai/olmOCR-2-7B-1025
Light-On-OCR:
https://huggingface.co/lightonai/LightOnOCR-2-1B
Chandra:
https://huggingface.co/datalab-to/chandra-ocr-2
Jina vlm:
https://huggingface.co/jinaai/jina-vlm
HunyuanOCR:
https://huggingface.co/tencent/HunyuanOCR
bytedance Dolphin 2:
https://huggingface.co/ByteDance/Dolphin-v2
PaddleOCR-VL:
https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5
Deepseek OCR 2:
https://huggingface.co/deepseek-ai/DeepSeek-OCR-2
GLM OCR:
https://huggingface.co/zai-org/GLM-OCR
Nemotron OCR:
https://huggingface.co/nvidia/nemotron-ocr-v2
Qianfan-OCR:
https://huggingface.co/baidu/Qianfan-OCR
Falcon-OCR:
https://huggingface.co/tiiuae/Falcon-OCR
FireRed-OCR:
https://huggingface.co/FireRedTeam/FireRed-OCR
Typhoon-OCR:
https://huggingface.co/typhoon-ai/typhoon-ocr1.5-2b
Quiet-Possession-597@reddit
gpuhub! 17c/h for a 4080 absolute steal
turtleisinnocent@reddit
Use IBM Granite on 16GB of VRAM and you'll go through those pages in like a day. For real.
Ikinoki@reddit
While Tesseract and LLMs are free, one is not precise and the other uses a lot of compute. May I suggest FineReader? https://pdf.abbyy.com/
They made their engine before Tesseract even existed, and 26 years ago it was pretty much on par with current Tesseract.
What I'd suggest is to get free ABBYY and try it out with your files. In some cases, when you need to train your own converter, neither FineReader nor Tesseract will outdo a pretrained LLM, even counting the compute costs.
Pristine_Pick823@reddit
Why exactly do you need an LLM for this specific task? Wouldn't old school OCR (tesseract etc) extract the data you need?
moserine@reddit
Only true if your docs don't have tables, columns, diagrams, etc. and are purely text. Tesseract and other LSTMs perform much worse than VLMs on anything more than very clean docs
Pristine_Pick823@reddit
That's true, but I would also argue (from personal experience) that LLMs are also rather prone to getting data wrong when processing tables, especially in large datasets.
Cupakov@reddit
I don’t know, I’ve had great success at extracting tables with GLM-OCR, haven’t managed to find a mistake yet.
Ikinoki@reddit
OCR is also prone to wrong data. Y0|| |ii5+ can't get around this.
p3r3lin@reddit
Agree. Everything with a "layout" can still trip things up. A small local vision model could pre-categorise each document and decide whether to route it to cheap Tesseract OCR or to more expensive LLMs.
Pristine_Pick823@reddit
This is the way!
durika@reddit
Call me dumb but how do you use LSTM for OCR?
p3r3lin@reddit
+1. Pre-LLM, https://github.com/tesseract-ocr/tesseract was already really good at OCR. I would sample a few thousand pages and see how far you can get with it. Runs on almost any hardware.
benno_1237@reddit
If you want, send me a DM about what your non-profit is about (roughly) and which model you need for how long. I can give you access to a few B200s for a while.
thereisnospooongeek@reddit (OP)
Thanks for the kind offer. I have been working for the past 14 hours straight setting up a CPU-based pipeline to check accuracy, since we didn't have any GPU. I will DM you tomorrow if that is okay.
I'm hoping to deploy a model where I can strike a better balance with quality. olmOCR worked fine for us earlier.
benno_1237@reddit
Take your time. What most of the other comments didn't consider: datacenter GPUs shine at concurrency. So even if a single request takes ~1s, you can most likely hit the B200s with 100-200 requests at once. Take that into account.
The only issue is that I would prefer giving you access via a WireGuard tunnel, not directly.
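(On the concurrency point: the client side just needs to keep a couple of hundred requests in flight against whatever OpenAI-compatible endpoint (vLLM/SGLang) ends up serving the model. A rough sketch; the endpoint address, model name, and prompt are placeholders.)

```python
# Sketch: keep ~128 OCR requests in flight against an OpenAI-compatible endpoint.
import asyncio
import base64
import pathlib
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://10.0.0.2:8000/v1", api_key="EMPTY")  # placeholder endpoint
sem = asyncio.Semaphore(128)  # cap on in-flight requests

async def ocr_page(path: pathlib.Path) -> str:
    b64 = base64.b64encode(path.read_bytes()).decode()
    async with sem:
        resp = await client.chat.completions.create(
            model="ocr-model",  # placeholder model name
            messages=[{"role": "user", "content": [
                {"type": "text", "text": "Transcribe this page to plain text."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]}],
        )
    return resp.choices[0].message.content

async def main():
    pages = sorted(pathlib.Path("pages").glob("*.png"))  # placeholder directory
    texts = await asyncio.gather(*(ocr_page(p) for p in pages))
    print(f"done: {len(texts)} pages")

asyncio.run(main())
```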
createthiscom@reddit
You're probably not going to do that in a reasonable amount of time on consumer hardware. You could try renting a B200 and running Gemma-4 or something.
dontfeedagalasponge@reddit
My friend's startup just released a local model for PDF processing. Give this a go?
https://github.com/muna-ai/nomic-layout
They're super responsive so you can reach out for support.
ganonfirehouse420@reddit
I tried several models for OCR, but the formatting was often wrong. Then I deployed LM Studio with qwen3.5:9b. Next I vibe-coded myself a Python script to do the OCR. The setup works on 8 GB GPUs.
Altruistic_Bonus2583@reddit
https://github.com/btbtyler09/shrew-server + https://huggingface.co/btbtyler09/shrew-2b
matt-k-wong@reddit
This is a case where you are better off running a model fine-tuned for the task instead of a large generic model. Check out https://build.nvidia.com/nvidia/nemotron-ocr-v1: you can get some work done for free there, and it's also small enough that it should run on consumer-grade video cards.
thereisnospooongeek@reddit (OP)
I have been using olmOCR. From an example run we did last year, I found the free model is super slow.
miklosp@reddit
Have you tried the basics?
Marker: https://github.com/datalab-to/marker
Markitdown: https://github.com/microsoft/markitdown
Paddle OCR has a tiny model: https://www.paddleocr.ai/latest/en/version3.x/pipeline_usage/PaddleOCR-VL.html
insanemal@reddit
I mean, that's dependent on your hardware. What were you running it on?
And do you have any budget to improve that situation?
Familiar_Text_6913@reddit
Ocrmypdf on a laptop overnight. No hallucinations!
This_Maintenance_834@reddit
There are about 31 million seconds in a year, so you will need 2 pages per second to finish it in a year. Something like an RTX PRO 6000 running a local model can probably do it in that time. Obviously, finding the most efficient model will help with progress a lot. Also, when running a local model, getting high concurrency is very important for throughput.
exaknight21@reddit
Depends on your data type… is it computer generated PDFs? Scanned computer generated documents? Computer generated documents with handwritten text? Mostly handwritten text?
If computer generated -> OCRMyPDF
If scanned Computer Generated documents-> OCRMyPDF
If scanned with Handwritten text -> ZLM OCR
Mostly Handwritten Text -> ZLM OCR.
3060 works pretty fast with vLLM.
OCRMyPDF can work on a shitty CPU too.
https://github.com/ikantkode/exaOCR (fully open source dockerized easy to deploy fastapi app).
I don't think I have pushed the ZLM OCR code yet, but I will check and post back if you're interested.
denoflore_ai_guy@reddit
https://www.computingforhumanity.com/our-story
Snoo_28140@reddit
That will be expensive to do with LLMs. Even a very small 9B model at $0.15 per million output tokens, times 64 million pages times ~800 output tokens per page, will run you some $8K in output alone.
You should really look into traditional OCR and compare the quality. If you need a hybrid approach you will have to be very selective - maybe only use an LLM for pages that contain images.
phreak9i6@reddit
With the cost of AI and the bulk of content you want to OCR, I'd recommend traditional OCR software for this task.
Cute_Obligation2944@reddit
Have you considered Adobe? They were doing OCR before vision models hit the news.
PassengerPigeon343@reddit
You’re going to want to look for stuff like this post (note: I haven't actually tried this myself, but I remembered seeing it when it was posted):
https://www.reddit.com/r/LocalLLaMA/s/rTZwSJEjNp
sir_creamy@reddit
Surya and PaddleOCR are good, complete packages, and fast.
Qwen 3.5 gives the best results but is expensive and slow.
Get a set of shitty-quality docs and start testing; go with the cheapest option that meets your requirements.
CreamPitiful4295@reddit
Some lighter fluid and a lighter would go a long way here.
last_llm_standing@reddit
You might be better off running on local compute.