Best LLM for OCR on invoices, producing JSON, and calculating values?

Posted by Difficult-Bluejay-52@reddit | LocalLLaMA | 6 comments

Hi. I have been using GPT-4 and GPT-4o for a while and recently switched to Claude 3.5 Sonnet. I'd like to know which other LLMs you have tried for OCR.

Here is our current setup and our requirements.
We send a batch of 1 to 10 images containing pages from one or more invoices.

The LLM has to go through each image, extract this information, and make this JSON (and sum up values):

{
  "Currency": "",
  "Vendor": "",
  "CourierName": "",
  "CourierNumber": "",
  "Consignee": "",
  "ACC number": "",
  "Items": [{"Description": "", "QTY": "", "Unit Price": "", "FileID": ""}],
  "Subtotal": "",
  "Tax": "",
  "Shipping&Handling": "",
  "Shipping&HandlingDiscount": "",
  "Discount": "",
  "Refund": "",
  "Coupon": "",
  "GiftCard": "",
  "Credit": "",
  "Total": ""
}

This works 70-80% of the time, but sometimes the summed values are wrong: the model gets the addition wrong and reports an incorrect subtotal, tax, shipping, or total. I would like to try other LLMs to see if they can do better!

Thanks.


brewhouse@reddit

Like others have said, if the LLM has already correctly identified the individual numbers, just use a simple function to do the calculation...

On another note, for something like this Gemini 1.5 Flash is actually very competent and much cheaper than Sonnet 3.5. With Flash, the cost is fixed at 258 tokens per image for both 768x768 and 3072x3072 input. I urge you to test it out, because the cost per image is orders of magnitude lower than with Claude 3.5 Sonnet, and for a task like this Flash is more than good enough.

cost / image using gemini flash (3072x3072):

$0.00001935

cost/image using claude 3.5 sonnet(1092x1092):

$0.0048

That makes Flash roughly 250x cheaper per image.
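The arithmetic behind those two figures can be reproduced in a few lines. This is only a sketch: the 258-token figure comes from the comment above, but the per-token prices ($0.075 per million input tokens for Flash, $3 per million for Sonnet) and the width x height / 750 token estimate for Claude images are assumptions that should be checked against current pricing pages.

```python
def cost_per_image(tokens: float, usd_per_million_tokens: float) -> float:
    """Input cost in USD for one image that consumes `tokens` tokens."""
    return tokens * usd_per_million_tokens / 1_000_000

# Gemini 1.5 Flash: flat 258 tokens per image (from the comment above).
# The $0.075 per million input tokens price is an assumption.
flash_cost = cost_per_image(258, 0.075)

# Claude 3.5 Sonnet: tokens estimated as width * height / 750 (assumption),
# priced at $3 per million input tokens (assumption).
sonnet_tokens = 1092 * 1092 / 750
sonnet_cost = cost_per_image(sonnet_tokens, 3.00)

ratio = sonnet_cost / flash_cost

print(f"Flash:  ${flash_cost:.8f} per image")
print(f"Sonnet: ${sonnet_cost:.4f} per image")
print(f"Sonnet / Flash: about {ratio:.0f}x")
```

Under these assumptions the ratio comes out slightly below 250, which is why "roughly 250x" is the safer way to state it.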

AnotherAvery@reddit

Summing up values with LLMs is not a good idea. It should be easy to do in your wrapper code (you have to call the LLM somehow, I suppose?).

On a tangent, it is staggering to think of the billions of calculations needed to produce a sum with an LLM, and then it's wrong nonetheless...

Eugr@reddit

Right, I'd use the LLM to extract information into the JSON and then perform any calculations outside of it (and also some basic sanity checks).
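A minimal sketch of what that could look like in the wrapper code, using the field names from the JSON in the original post. The helper names (`to_decimal`, `compute_totals`, `check_invoice`) are made up for illustration, and the total formula (subtotal plus tax and shipping, minus every discount-like field) is an assumption about how these invoices add up, so adjust it to your documents.

```python
from decimal import Decimal, InvalidOperation

# Assumption: these fields are added to the subtotal...
ADDED = ("Tax", "Shipping&Handling")
# ...and these are subtracted from it.
SUBTRACTED = ("Shipping&HandlingDiscount", "Discount", "Refund",
              "Coupon", "GiftCard", "Credit")

def to_decimal(value) -> Decimal:
    """Parse an extracted string such as "1,234.50" or "$12"; empty means 0."""
    text = "" if value is None else str(value)
    text = text.replace(",", "").replace("$", "").strip()
    if not text:
        return Decimal("0")
    try:
        return Decimal(text)
    except InvalidOperation as exc:
        raise ValueError(f"not a number: {value!r}") from exc

def compute_totals(invoice: dict) -> dict:
    """Recompute Subtotal and Total from the extracted line items and fields."""
    subtotal = sum(
        (to_decimal(item["QTY"]) * to_decimal(item["Unit Price"])
         for item in invoice["Items"]),
        Decimal("0"),
    )
    total = subtotal
    for key in ADDED:
        total += to_decimal(invoice.get(key, ""))
    for key in SUBTRACTED:
        total -= to_decimal(invoice.get(key, ""))
    return {"Subtotal": subtotal, "Total": total}

def check_invoice(invoice: dict, tolerance: Decimal = Decimal("0.01")):
    """Return the computed totals plus a list of mismatches with printed values."""
    computed = compute_totals(invoice)
    problems = []
    for field in ("Subtotal", "Total"):
        printed = invoice.get(field, "")
        if printed == "":
            continue  # nothing was read from the page, so nothing to compare
        if abs(to_decimal(printed) - computed[field]) > tolerance:
            problems.append(
                f"{field}: printed {printed}, computed {computed[field]}")
    return computed, problems

# Example: the model read the line items correctly but misread the total.
invoice = {
    "Items": [
        {"Description": "Widget", "QTY": "2", "Unit Price": "9.99", "FileID": "1"},
        {"Description": "Gadget", "QTY": "1", "Unit Price": "5.00", "FileID": "1"},
    ],
    "Subtotal": "24.98", "Tax": "2.00", "Shipping&Handling": "4.50",
    "Discount": "1.48", "Total": "31.00",
}
computed, problems = check_invoice(invoice)
print(computed)
print(problems)
```

`Decimal` is used instead of `float` so that currency values do not pick up binary rounding errors. A mismatch between the printed and computed total is also a useful signal that a page was misread and should be reviewed by hand.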

Inevitable-Start-653@reddit

Have you tried GOT-OCR?

I have it integrated into a project, and it works very well.

https://github.com/RandomInternetPreson/Lucid_Autonomy

Difficult-Bluejay-52@reddit (OP)

How do you run it? On Hugging Face?

GOT-OCR has an online demo:

https://huggingface.co/stepfun-ai/GOT-OCR2_0

If it finds all of the text you are looking for regarding your JSON requirements, you'd still need a way to output the JSON. That's why I linked to my project: with it, any LLM can talk to various vision and OCR models.

My project is an extension for oobabooga's text-generation-webui, so you would need to install that and then install the extension.

There are also some pretty good vision models, such as Aria and Molmo. You can try asking them to convert the image directly to JSON for you.

https://huggingface.co/rhymes-ai/Aria

https://huggingface.co/allenai/Molmo-72B-0924

Both of those HF links have links to the model demos. If either of those works for you, it is possible to run them locally. However, they both use a lot of VRAM.
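Whichever model ends up doing the extraction, its reply still has to be turned into the JSON object from the original post, and models often wrap the JSON in extra text. Below is a rough sketch of that step using only the standard library; `parse_invoice_reply` and `EXPECTED_KEYS` are hypothetical names, and the "first `{` to last `}`" heuristic is an assumption that only holds when the reply contains a single JSON object.

```python
import json

# Top-level keys taken from the JSON template in the original post.
EXPECTED_KEYS = {
    "Currency", "Vendor", "CourierName", "CourierNumber", "Consignee",
    "ACC number", "Items", "Subtotal", "Tax", "Shipping&Handling",
    "Shipping&HandlingDiscount", "Discount", "Refund", "Coupon",
    "GiftCard", "Credit", "Total",
}

def parse_invoice_reply(reply: str) -> dict:
    """Pull the JSON object out of a model reply and check its top-level keys."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])  # raises an error on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("reply JSON is not an object")
    missing = EXPECTED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(data["Items"], list):
        raise ValueError('"Items" must be a list')
    return data

# Example: a reply with chatty text around the JSON payload.
template = {key: "" for key in EXPECTED_KEYS}
template["Items"] = [
    {"Description": "Widget", "QTY": "2", "Unit Price": "9.99", "FileID": "1"}
]
reply = "Here is the extracted invoice:\n" + json.dumps(template) + "\nDone."
invoice = parse_invoice_reply(reply)
print(invoice["Items"][0]["Description"])
```

Rejecting a reply with missing keys makes it easy to retry the request or flag the invoice instead of silently storing partial data.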