Vision (for bank account statements): is it better to OCR an account statement and have an LLM analyze the resulting markdown/JSON, OR have a vision model extract the info you need directly?
Posted by dirtyring@reddit | LocalLLaMA | 3 comments
My use case: extract certain information from a bank account statement (I can't use Plaid for this app).
e.g., the highest transaction in March.
I have a PDF full of bank transactions. Should I use a library to OCR it and then have an LLM interpret the results, or does it work better to just have the vision model find that information directly?
I'm currently exploring Docling and Llama 3.2 Vision.
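Roughly the OCR-route pipeline I'm picturing, as a sketch (the file path, model name, and local OpenAI-compatible endpoint are placeholders, not a tested recipe):

```python
# Sketch: Docling converts the statement PDF to markdown,
# then a local LLM answers the extraction question over that text.
import requests
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("statement.pdf")  # placeholder path
markdown = result.document.export_to_markdown()

prompt = (
    "Below is a bank statement as markdown. "
    "What is the single highest transaction amount in March?\n\n" + markdown
)

# Assumes an OpenAI-compatible server (e.g. Ollama) running locally.
resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "llama3.2",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```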
GHOST--1@reddit
Extract the data using an OCR tool, e.g. Surya OCR or docTR (Document Text Recognition), then send the text to an LLM.
Giving the document directly to an LLM/VLM for OCR can result in hallucinated text.
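Rough sketch of that two-step approach with docTR (the package is `python-doctr`; the file path is a placeholder):

```python
# Sketch: OCR the statement with docTR, then pass the raw text to an LLM.
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)  # default detection + recognition models
pages = DocumentFile.from_pdf("statement.pdf")
result = model(pages)

text = result.render()  # plain-text dump of the OCR output
# `text` then goes into your LLM prompt for the actual extraction step.
```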
jonahbenton@reddit
All PDF bank statements encode the actual data as text, extractable with non-AI PDF processing tools (see the sketch below). That data is sufficient for a lot of tasks. LLM vision can be a good assist if there is some specific question that can be posed as an either/or based on the data.
Old-school (lol) OCR in general tends not to be reliable enough on its own, and you can't query it.
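A sketch of the non-AI route, assuming pdfplumber and transaction lines shaped like "03/14 POS PURCHASE ... 1,234.56" (real statement layouts vary, so the regex is only illustrative):

```python
# Sketch: extract embedded text directly (no OCR, no model) and find
# the largest March transaction with a naive regex over each line.
import re
import pdfplumber

line_re = re.compile(r"^03/\d{2}\s+.*?([\d,]+\.\d{2})\s*$")  # assumed layout

highest = 0.0
with pdfplumber.open("statement.pdf") as pdf:  # placeholder path
    for page in pdf.pages:
        for line in (page.extract_text() or "").splitlines():
            m = line_re.match(line)
            if m:
                highest = max(highest, float(m.group(1).replace(",", "")))

print(f"Highest March transaction: {highest:.2f}")
```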
ismaaiil933@reddit
Llama 3.2 is not super good with vision tasks.