OCR + LLM for Invoice Extraction
Posted by JumpyHouse@reddit | LocalLLaMA | View on Reddit | 67 comments
I’m starting to get a bit frustrated. I’m trying to develop a mobile application for an academic project involving invoice information extraction. Since this is a non-commercial project, I’m not allowed to use paid solutions like Google Vision or Azure AI Vision. So far, I’ve studied several possibilities, with the best being SuryaOCR/Marker for data extraction and Qwen 2.5 14B for data interpretation, along with some minor validation through RegEx.
I’m also limited in terms of options because I have an RX 6700 XT with 12GB of VRAM and can’t run Hugging Face models due to the lack of support for my GPU. I’ve also tried a few vision models like Llama 3.2 Vision and various OCR solutions like PaddleOCR, PyTesseract, and EasyOCR, and they all came up short due to the lack of layout detection.
I wanted to ask if any of you have faced a similar situation and if you have any ideas or tips because I’m running out of options for data extraction. The invoices are predominantly Portuguese, so many OCR models end up lacking support for the layout detection.
Thank you in advance.🫡
Straight_Ad3312@reddit
Depends a lot on volume and whether you need accounting-grade structured data (header + line items) or just searchable text. Some things I've learned building this out:
Stack that actually works for invoices:
• LLM extraction with a strict JSON schema (header fields plus a `line_items` array). Mistral Large, Claude Sonnet, and GPT-4o-mini all work fine. Validate the output with Zod or Pydantic before writing to DB.
• Coherence checks: `qty × unit_price ≈ line_total` and `sum(line_items) ≈ total_amount`. Catches ~80% of silent OCR errors before they hit your ERP.
Gotchas nobody mentions:
• Check for a native text layer (`pdf-parse`, `pdfplumber`) before falling back to OCR. Saves latency + 90% of the OCR cost.
TL;DR: <100 docs/month → no-code (Parseur, Nanonets, Rossum) is fine. Above that, rolling the stack above is cheaper and gives you full control over the accounting sync, which is usually where no-code tools break down.
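Those coherence checks can be sketched in a few lines of Python (field names like `qty` and `unit_price` are placeholders for whatever your schema emits):

```python
# Arithmetic sanity checks on LLM-extracted invoice data.
# Field names are illustrative; adapt them to your own JSON schema.

def coherent(invoice: dict, tol: float = 0.02) -> bool:
    """Return True if line items are internally consistent with the totals."""
    for item in invoice["line_items"]:
        # qty × unit_price should reproduce the line total
        if abs(item["qty"] * item["unit_price"] - item["line_total"]) > tol:
            return False
    # sum of line totals should reproduce the invoice total
    computed = sum(item["line_total"] for item in invoice["line_items"])
    return abs(computed - invoice["total_amount"]) <= tol

good = {
    "line_items": [
        {"qty": 2, "unit_price": 9.99, "line_total": 19.98},
        {"qty": 1, "unit_price": 5.00, "line_total": 5.00},
    ],
    "total_amount": 24.98,
}
bad = dict(good, total_amount=30.00)  # a silent OCR error in the total

print(coherent(good))  # True
print(coherent(bad))   # False
```

A small tolerance is deliberate: OCR and LLMs both round, so exact float equality would flag valid invoices.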
Disclaimer: I'm the founder of Zerentry - we do exactly this stack in production. Mod, feel free to remove if out of line.
Soft_Willingness_529@reddit
damn this is actually super helpful, thanks for laying it all out like that
the point about not regexing the ocr output and just using an llm with a strict json schema is huge, saved me so much headache when i finally switched over
also the coherence checks for the math are such a simple but genius catch, never wouldve thought of that for catching ocr errors
storage with embeddings is a good call too, retrofitting that later does sound like a massive pain
the bit about bank statements needing a totally different schema is so real, learned that one the hard way after trying to force it
seriously tho, thanks for the detailed breakdown, this is the kind of practical advice you only get from actually building it
FunDaveX@reddit
So I actually built a workflow for this for my software agency's clients on top of my SharpAPI platform.
It's not a single-step thing though, that's where most people get stuck. What actually works is a multi-step pipeline: first ML-based OCR extracts the raw data, then a couple rounds of AI processing clean it up, validate fields, and structure everything into proper JSON.
It handles PDF, TIFF, JPG, and PNG - including messy phone photos and multi-page invoices -- and spits back structured data: seller, buyer, line items, tax breakdowns, payment terms, even e-invoice fields. Has SDKs for PHP, Laravel, and Node so integration is pretty painless.
https://sharpapi.com/en/catalog/ai/accounting-finance/invoice-parser
Feeling_Product1359@reddit
I had the exact same frustration with Docling’s speed on long PDFs. I actually ended up building AI PDF Parser to solve this.
It handles 100+ page docs much faster without needing a local GPU, and it preserves formatting in JSON perfectly. It saved me from the 'splitting manually' nightmare you mentioned. Would love to hear if it works for your 131-page file!
Forward-Sympathy7479@reddit
How is it going? Did you find anything that works?
Apprehensive_Dig3462@reddit
Try out https://olmocr.allenai.org/, it's open source and better than Surya
JumpyHouse@reddit (OP)
Thank you for the response, I tried it and it's amazing. The only problem is it requires at least an RTX 4090 and 20GB of VRAM, so a big no-no for my little RX 6700 XT :-)
Apprehensive_Dig3462@reddit
I rent GPUs as I need them. An RTX 4090 is around 50-70 cents per hour on vast.ai, and with olmOCR a 250-page PDF took around 20 minutes, so I spent very little in the end.
nbzuong@reddit
Hi, I'm currently renting an RTX 3090 Ti on vast.ai and trying to set up olmOCR, but I've hit a problem (I'm using the docker run instructions from the README, on Ubuntu 22.04, but the only model that shows up in the API model list is Qwen/Qwen3-0.6B).
May I have your setup pipeline?
Thanks in advance.
Yes_but_I_think@reddit
For the accuracy you're aiming for, you'll need a larger GPU. Otherwise, stay where you are.
ok-ok-sawa@reddit
Hey, I had a similar project before and honestly the only thing that worked for me in terms of layout detection was Lido
NecessaryTourist9539@reddit
Has to be clevrscan.com
SouthTurbulent33@reddit
Try Unstract. You can write prompts to extract the information you need from docs. Think they give you a trial for 2 weeks.
Or, if you just want to extract data and get an API, check out llmwhisperer - also made by the company I mentioned above.
cppgenius@reddit
What worked for me is Ollama qwen2.5vl.... but took a lot of prompt engineering to get it to do what I wanted. Running it on an i5 with 8GB of RAM, CPU only, so it is slow as crap... so I guess with your GPU thrown into the mix it could work for you. I am using PaddleOCR in cases where I need quicker basic checks to be done. Paddle seems to extract the text in a meaningful manner so that qwen2.5vl can understand it. I also use the same OCR dumps for the base qwen2.5 model. Since I am using a machine with low specs... I am using smaller models.
I agree with PyTesseract being mediocre, between Paddle and Tesseract, Paddle is a true winner. Tesseract struggles a lot giving meaningful results.
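For anyone wanting to reproduce that Paddle-dump-plus-image setup, a rough sketch of the handoff to qwen2.5vl through Ollama's REST API (the model tag and prompt wording are illustrative, and a local Ollama server is assumed):

```python
import base64
import json
import urllib.request

def build_payload(ocr_text: str, image_bytes: bytes, model: str = "qwen2.5vl") -> dict:
    """Combine a PaddleOCR dump with the original image for a vision model via Ollama."""
    prompt = (
        "Below is raw OCR output from an invoice. Use it together with the "
        "attached image to extract vendor, date and total as JSON.\n\n" + ocr_text
    )
    return {
        "model": model,
        "stream": False,
        "messages": [{
            "role": "user",
            "content": prompt,
            # Ollama accepts images as base64 strings on the message
            "images": [base64.b64encode(image_bytes).decode("ascii")],
        }],
    }

if __name__ == "__main__":
    # Requires a running Ollama instance with the model pulled.
    payload = build_payload("TOTAL 24,98 EUR", open("invoice.jpg", "rb").read())
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    print(json.load(urllib.request.urlopen(req))["message"]["content"])
```

Giving the model both the OCR text and the image, as described above, lets the OCR dump anchor the vision model and reduces pure-vision misreads.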
ArmadilloFlaky6440@reddit
I've been stuck for like two years now trying to extract key info from French invoices. I've tried OCR with docTR combined with LayoutXLM (fine-tuned), gave Donut and Pix2Struct a shot, and even tested some zero-shot VLMs like InternLM, Qwen2-VL, and MiniCPM... none of these worked for me. So far, the only things that actually worked are closed-source solutions, which is honestly pretty frustrating.
cppgenius@reddit
I was searching for some feedback on Surya and found this post. I have to agree... I was using PaddleOCR, which works fairly well for an LLM + OCR system I am developing. I wanted something faster than Paddle, and Surya looked like a good candidate. Must say, at first I almost thought of swapping over to Surya completely because it was doing OCR even better than Paddle... until I started working with lower-quality scans. Surya started detecting certain noise as Hebrew, Arabic, Chinese... and I could not find any flag to force it to English only. After two days of intensive testing with Surya I got so frustrated I decided to use Paddle earlier on in my function... and now I can just re-use the OCR dumps wherever I need them anyway.
Surya was an extremely frustrating experience. My only big complaint about Paddle is that it sometimes just quits with zero error feedback... I guess it might be memory related, and when that happens Paddle just leaves without saying goodbye. I must say in that regard I found Surya much more stable... but I would rather go with Paddle, which crashes only once in a blue moon and gives more accurate OCR dumps than the garbage Surya gave me.
If someone can give me a proper Python implementation of Surya that forces it to use English without needing an extensive sanitizing function (I had to strip so many tags and unusable stuff from the dump), I am more than willing to give it another try. It also seemed to hallucinate a lot, adding weird phrases (once even something in German), pulling text out of thin air in places where there was no text to speak of.
Pale-Tie-3691@reddit
If you don't mind, can you share some samples here? Our team is working on an OCR solution for non-English text.
YakFit8581@reddit
Try www.revsig.com they are pretty good
paolovic89@reddit
Hey u/JumpyHouse ,
do you have to handle (ticked) checkboxes by any chance?
If so, how do you do that?
Sharp-Past-8473@reddit
Hey! Totally feel your pain — extracting structured data from invoices with free tools is way trickier than it should be, especially when layout and language come into play.
You’re actually doing a great job breaking it down — using something like SuryaOCR for text and Qwen for parsing is a solid path. But layout is always the bottleneck, and most free OCR tools (like EasyOCR, PyTesseract) aren’t layout-aware, especially for non-English formats like Portuguese invoices.
A few ideas that might help:
⸻
💡 Suggestions:
• Try LayoutParser with PaddleOCR layout mode. It supports multilingual text and basic layout detection — might not be perfect but it’s light and scriptable.
• Run OCR separately from interpretation: extract raw text + coords first, then pass it to a small LLM with system prompts (e.g. “This is the output of an OCR scan of a Portuguese invoice. Extract: Vendor, Date, Total…”). You don’t need a vision model if you isolate text blocks reasonably well.
• Quantized LLMs: you can run Mistral 7B or Qwen 7B with tools like llama.cpp or ggml on an RX 6700 XT. It’s not Hugging Face, but with a clean prompt structure it works decently even on modest hardware.
• Add simple heuristics or regex for field validation (e.g. invoice number format, known keywords in Portuguese like “Total”, “Data”, etc.).
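That last regex-validation point can be a handful of anchored patterns over the OCR dump (the keywords and formats here are illustrative, not exhaustive):

```python
import re

# Illustrative patterns for Portuguese invoices; extend as needed.
PATTERNS = {
    # "Total: 1.234,56" — Portuguese decimal comma, optional thousands dot
    "total": re.compile(r"Total[:\s]+([\d.]+,\d{2})", re.IGNORECASE),
    # "Data: 31/12/2024" or ISO "Data 2024-12-31"
    "date": re.compile(r"Data[:\s]+(\d{2}/\d{2}/\d{4}|\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    # Portuguese NIF: 9 digits, often prefixed "NIF" or "Contribuinte"
    "nif": re.compile(r"(?:NIF|Contribuinte)[:\s]*(\d{9})", re.IGNORECASE),
}

def validate_fields(ocr_text: str) -> dict:
    """Return whichever fields the patterns can confirm in the OCR dump."""
    return {name: m.group(1) for name, p in PATTERNS.items() if (m := p.search(ocr_text))}

sample = "Fatura FT 2024/123\nNIF: 123456789\nData: 31/12/2024\nTotal: 1.234,56 EUR"
print(validate_fields(sample))
# {'total': '1.234,56', 'date': '31/12/2024', 'nif': '123456789'}
```

Fields the regexes can't confirm just drop out of the result, which you can then treat as "ask the LLM again" rather than a hard failure.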
⸻
🔧 Open Source Option You Might Like
I’m working on an open-source extractor that follows this philosophy:
🔗 https://github.com/WellApp-ai/Well/tree/main/ai-invoice-extractor
• OCR-agnostic: use any tool for OCR (Paddle, EasyOCR, etc.)
• Works with OpenAI or Mistral (or your own local model)
• Uses prompt-based parsing to extract structured data like amount, date, supplier, etc.
• MIT license — good fit for academic/non-commercial use
You can tweak the prompt or model as needed and even plug it into your mobile backend.
⸻
Happy to help if you need pointers on prompts or setup. Been deep in this rabbit hole lately 😅 You’re definitely not alone — layout-aware invoice parsing is still very much an open challenge.
Good luck with your project! 💪
paolovic89@reddit
Is PaddleOCR better than SuryaOCR in your opinion?
PerfectBookkeeper286@reddit
I'll follow along, keeping my fingers crossed for your project.
A while ago I tried to build something similar to streamline my own work.
With AI:
First, I scanned the documents with Tesseract and generated bboxes for the tokens, saving it all as JSON. (The text mask itself was OK; the bboxes were marked practically 100% correctly.)
Then I used doccano to label what was the invoice number, what was the issue date, etc. Unfortunately, the version I used only allowed labeling text.
So I switched to Label Studio.
After labeling the documents, I exported JSONL from Label Studio, merged the token bboxes with the Label Studio positions, and fed the result into fine-tuning LayoutLMv3 (microsoft/layoutlmv3-base).
The results were mediocre... but I also had very few invoices. I only labeled 150 of them.
On top of that, manually labeling documents takes an absurdly long time if you do it without automation (and I didn't buy the AI labeling assistant available in Label Studio)... so I gave up on that approach.
With regex:
I went down the regex route... and while dates, invoice numbers, and tax IDs (NIP) are fairly straightforward, the VAT table becomes a real challenge. The summary itself may be simple, but the key part is recognizing which value belongs to which rate, e.g. VAT 8%, VAT 5%, and VAT 23%, plus the gross amounts that follow from those rates. Unfortunately, when all you have is a flat stream of text in a text mask... writing regexes for that becomes an abstraction.
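For what it's worth, the VAT-table rows do become tractable with regex once you have per-line text instead of one flat mask; a rough sketch (the row format and space-thousands/comma-decimal amount style are assumptions):

```python
import re

# Matches rows like "VAT 23% 1 000,00 230,00" — rate, net amount, tax amount.
# Assumes per-line text with Polish-style amounts (space or dot thousands,
# comma decimals); a flat text mask would need line reconstruction first.
VAT_ROW = re.compile(
    r"VAT\s+(?P<rate>\d{1,2})%\s+"
    r"(?P<net>\d{1,3}(?:[ .]\d{3})*,\d{2})\s+"
    r"(?P<tax>\d{1,3}(?:[ .]\d{3})*,\d{2})"
)

def to_float(amount: str) -> float:
    """Normalize '1 000,00' / '1.000,00' to a float."""
    return float(amount.replace(" ", "").replace(".", "").replace(",", "."))

def parse_vat_table(text: str) -> dict:
    """Map each VAT rate to its (net, tax) amounts found in the text."""
    return {
        int(m["rate"]): (to_float(m["net"]), to_float(m["tax"]))
        for m in VAT_ROW.finditer(text)
    }

table = "VAT 23% 1 000,00 230,00\nVAT 8% 100,00 8,00"
print(parse_vat_table(table))
# {23: (1000.0, 230.0), 8: (100.0, 8.0)}
```

A cross-check that `net × rate ≈ tax` per row would also catch OCR digit errors like the stray "7" mentioned below.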
With HTML:
I also tried converting the PDF to HTML. To my surprise, the invoice layout was actually reproduced in the HTML... (the PDF contains token positions)... but what I got underneath was n divs that are of course not tagged — so it's again a guessing game as to what is the invoice number and what is the account number.
To top it off, I ran into a document where the OCR stubbornly added a digit 7 to the VAT amount (regardless of increasing the resolution, enlarging the document, etc.), which ultimately produced an incorrect tax value. Yes, I know it's only one document, but the error was too significant, so for now I've decided to wait ;)
notPascalCasing@reddit
Hey, was wondering what you ended up going with.
JumpyHouse@reddit (OP)
Hey, I'm still searching (almost every day...) for a definitive solution. Right now, given my hardware limitations, I'm using Marker for the OCR and Phi-4 for the data extraction and formatting. I was using Qwen2.5 until last week, but found Qwen3 to be worse at instruction following than the others. I've also tried Gemma 3 and DeepSeek-R1 in the 12/14B variants, but they are simply not enough for my use case. I'll keep testing new solutions until I submit the project in June, but right now I'm getting around 85% precision on real invoices (real images affected by ink quality etc.), so it's not perfect, but it's not the worst either.
notPascalCasing@reddit
Thanks for sharing!
SouvikMandal@reddit
If you are still looking for a solution, can try this: https://github.com/NanoNets/docext
You can define all the fields and table columns you want to extract. There is a webUI that you can use in colab. https://github.com/NanoNets/docext?tab=readme-ov-file#quickstart
No_Afternoon_4260@reddit
SmolDocling, courtesy of IBM's Docling team and Hugging Face
https://huggingface.co/ds4sd/SmolDocling-256M-preview
Their paper is cool
swagonflyyyy@reddit
You can use Mini-CPM-V-2.6 or Gemma3-12b-it for OCR. They seem pretty good at what they do.
JumpyHouse@reddit (OP)
I’ll try them, thank you very much. My only problem with these models is that they tend to hallucinate when they are not sure of the output.
Mkengine@reddit
Try olmOCR, the text anchoring may help with that.
JumpyHouse@reddit (OP)
If I had the computer power for it I would definitely use it🥲
Mkengine@reddit
It's a [7B model](https://huggingface.co/allenai/olmOCR-7B-0225-preview), is that too big for you?
JumpyHouse@reddit (OP)
Trust me Ive tried
Mkengine@reddit
What about SmolDocling?
https://huggingface.co/ds4sd/SmolDocling-256M-preview
JumpyHouse@reddit (OP)
saw a video about it yesterday and I haven’t really tried it yet
swagonflyyyy@reddit
Try Q8
YearZero@reddit
Why not the Mini-cpm-o-Q8? "o" is their newer, updated model over the "v".
swagonflyyyy@reddit
Yeah but that's more for like CSM stuff. V should be good enough. Also, o may or may not have support in llama.cpp last time I checked.
lechiffreqc@reddit
I was trying to achieve something like this for bank statement, to help with bookkeeping.
Best way I have found for the moment was
PDF -> DOCLING -> markdown -> LLM
Docling is great to convert the PDF to structured MD, and LLM understand well MD.
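A minimal sketch of that pipeline, using Docling's documented `DocumentConverter` quickstart (the prompt wording and the LLM stage are placeholders):

```python
# Sketch of the PDF -> Docling -> markdown -> LLM pipeline described above.
# The Docling calls follow its quickstart; the LLM call itself is left to taste.

def pdf_to_markdown(path: str) -> str:
    """Convert a PDF to structured markdown with Docling."""
    from docling.document_converter import DocumentConverter  # pip install docling
    result = DocumentConverter().convert(path)
    return result.document.export_to_markdown()

def build_prompt(markdown: str) -> str:
    """Wrap the markdown in an extraction prompt for the LLM stage."""
    return (
        "The markdown below was converted from a bank statement PDF. "
        "Extract every transaction as JSON with date, description, amount.\n\n"
        "---\n" + markdown + "\n---"
    )

if __name__ == "__main__":
    print(build_prompt(pdf_to_markdown("statement.pdf")))
```

The win here is that Docling emits real markdown tables, so the LLM sees rows and columns instead of a flat text blob.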
JumpyHouse@reddit (OP)
I tried it and it looks really good, maybe similar in performance to Marker. The problem is that the image detection is horrendous, and since most of the invoices will be sent as images I’m afraid it won’t be enough. Nevertheless it’s really easy to use, and if I only used PDF files it would be a nice choice.
AndyHenr@reddit
I second Docling. Remember that OCR is a plugin there and you can select several options for it. It's by far the best one I have come across. So if you can, give it another shot and test out the features. I believe you will be quite satisfied you did.
Steve2606@reddit
You could try Gemma 3 with Ollama + structured output
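Ollama's structured outputs take a JSON schema in the request's `format` field, so a request for this could look roughly like the following (the schema fields and model tag are illustrative):

```python
import json

# Illustrative request body for Ollama structured output with Gemma 3.
# POST this to http://localhost:11434/api/chat with Ollama running.
schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "date": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["vendor", "date", "total"],
}

payload = {
    "model": "gemma3:12b",
    "stream": False,
    "format": schema,  # constrains the model's output to this schema
    "messages": [{"role": "user", "content": "Extract the invoice fields from: ..."}],
}

print(json.dumps(payload)[:80])
```

Because the schema is enforced at decode time, the reply should always parse with `json.loads`, which removes a whole class of "almost JSON" post-processing.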
merotatox@reddit
I helped with a similar project. My approach was to fine-tune Qwen 2.5 3B Instruct on invoice data extraction, and after extraction I would feed the data back for validation. Pretty sure you can get away with Gemma 3, olmOCR, or Florence-2 as well without the need for fine-tuning.
Finanzamt_Endgegner@reddit
Ovis 1b/4b/8b
No-Fig-8614@reddit
Here is an example we put together for starter code https://docs.parasail.io/parasail-docs/cookbooks/multi-modal
RandomRobot01@reddit
To add another option to the mix, I would suggest trying Qwen 2.5 VL 7b. I have had some success using it to extract structured data from engineering drawings and other difficult formats.
loyalekoinu88@reddit
Do you have an example of what makes a Portuguese invoice different?
JumpyHouse@reddit (OP)
Mainly the language 😂. I don’t know how it works in other countries since there aren’t really that many examples, but I noticed that, for example, the layout detection in PaddleOCR only works in Chinese, so I thought it might be relevant to the question.
jeremiah256@reddit
This is weird. Yesterday I needed to extract text written in Portuguese from a document that was not in perfect condition. Unfortunately for this situation, I used an online solution, DeepSeek.
loyalekoinu88@reddit
I mean you’re probably not wrong when it comes to orientation like writing right->left, etc. I think you’d need to verify the model supports the language first. Then do some testing.
Familyinalicante@reddit
Ollama-OCR
Ok_Hope_4007@reddit
Did you try MinerU? It's the last one I tried and in my tests it worked quite well. But I don't have much to compare it with besides Docling and EasyOCR.
GHOST--1@reddit
try doctr from mindee. for my PDFs, it does a great job
Durian881@reddit
Maybe you can try Gemma 3 12B. It's a new vision model from Google.
JumpyHouse@reddit (OP)
It seems I just can't use the Gemma 3 12B-it version, since for some reason my GPU refuses to work with the Hugging Face transformers library, so right now I'm stuck with llama.cpp
etaxi341@reddit
I tried every vision llm and never succeeded.. tell me when you find one
loyalekoinu88@reddit
Just adding to make sure: if you have a clearly outlined output example, it will conform the data to that structure. It sounds like they're trying to use multiple models for something Gemma 3 12B can do in a single run.
un_passant@reddit
Any documentation / examples on how to do that in practice ?
Thx.
yeswearecoding@reddit
By curiosity, have you tried Mistral OCR ?
JumpyHouse@reddit (OP)
Isn't it a paid API tho?
yeswearecoding@reddit
Oh sorry, you're right. My bad
knoodrake@reddit
TL;DR: ~6 months ago: SuryaOCR + LLM
So, I did something somewhat similar a few months ago, except not specifically for invoices, and for EN + FR docs.
Due to the heterogeneity of the contents, the best OCR method I found was SuryaOCR, by far at the time. All the others I tried were worse in some way or another for me. There were many specialized models on HF, but none that fitted my languages, and I didn't want to start the laborious work of (preparing data for) fine-tuning one.
At that time (~6 months ago), multimodal LLMs were nowhere near good enough at both speed and accuracy.
Once the OCR pass was done, I used an LLM (with the option to use either a small local one or the OpenAI API) to check / correct / redo the formatting, and generate some key info about the document (to be indexed for search) like a description, summary, key infos, ...
Since then, maybe some new multimodal models are good enough? Like Gemma 3? No idea. Wasn't the case for me 6 months ago.
JumpyHouse@reddit (OP)
It works well most of the time like this. I just wanted to know if there was anything new (since most Reddit posts are from 2023/24) that would let me improve the quality of the data, since what I’m concerned about is the 1% of cases where it just misses the info.
g0pherman@reddit
I know that docling uses EasyOCR, but it does a good job in most cases.
jonahbenton@reddit
Try the python utility pdftotext with the -layout option. For produced, not scanned, PDFs, it should give you the text, preserving the spacing, visual columns, etc. Many models with sufficient prompting can do column aware data extraction from plain text.
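A small sketch of that approach (`pdftotext` comes from poppler-utils; writing to `-` sends the text to stdout, and the prompt wording is just an example):

```python
import subprocess

def pdf_to_layout_text(path: str) -> str:
    """Extract text from a produced (non-scanned) PDF, preserving visual columns."""
    # -layout keeps the original spacing so columns stay aligned;
    # "-" as the output file writes the text to stdout.
    return subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        capture_output=True, text=True, check=True,
    ).stdout

def column_prompt(text: str) -> str:
    """Frame the layout-preserved text so the LLM knows columns are positional."""
    return (
        "The text below preserves the visual layout of an invoice; "
        "columns are aligned by spaces. Extract each line item as JSON.\n\n" + text
    )

if __name__ == "__main__":
    # Requires poppler-utils installed and a text-based (not scanned) PDF.
    print(column_prompt(pdf_to_layout_text("invoice.pdf")))
```

Telling the model explicitly that alignment carries meaning is the key bit; without that hint, many models treat the runs of spaces as noise.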
infernalr00t@reddit
Somewhat related: yesterday I was playing with Gemma 3 and text extraction from PDFs, and noticed that from time to time it tries to summarize the output. Eventually I got a ~99% accurate OCR file.
As output formats I tried HTML and LaTeX; this seems like a great niche that I hope gets explored.
JumpyHouse@reddit (OP)
I’ll certainly try it for text extraction. I actually used it for text interpretation, and while it worked very well I just found Qwen2.5 to be more accurate on the guesses.