Vision (for bank account statements): is it better to OCR an account statement and have an LLM analyze the resulting markdown/JSON, OR have a vision model extract the info you need directly?
Posted by dirtyring@reddit | LocalLLaMA | 3 comments
My use case: extract certain information from a bank account statement (I can't use Plaid for this app).
e.g., the highest transaction in March.
I have a PDF full of bank transactions. Should I use a library to OCR it and then have an LLM interpret the results, or does it work better to just have the vision model find that information directly?
I'm currently exploring Docling and Llama 3.2 Vision.
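Roughly the OCR-route pipeline I'm picturing, as a sketch (the file path, model name, and local OpenAI-compatible endpoint are placeholders, not a tested recipe):

```python
# Sketch: Docling converts the statement PDF to markdown,
# then a local LLM answers the extraction question over that text.
import requests
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("statement.pdf")  # placeholder path
markdown = result.document.export_to_markdown()

prompt = (
    "Below is a bank statement as markdown. "
    "What is the single highest transaction amount in March?\n\n" + markdown
)

# Assumes an OpenAI-compatible server (e.g. Ollama) running locally.
resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "llama3.2",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```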
GHOST--1@reddit
Extract the data using an OCR tool, e.g. Surya OCR or docTR (Document Text Recognition), then send the text to an LLM.
Giving the document directly to an LLM/VLM for OCR can result in hallucinated text.
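Rough sketch of that two-step approach with docTR (the package is `python-doctr`; the file path is a placeholder):

```python
# Sketch: OCR the statement with docTR, then pass the raw text to an LLM.
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)  # default detection + recognition models
pages = DocumentFile.from_pdf("statement.pdf")
result = model(pages)

text = result.render()  # plain-text dump of the OCR output
# `text` then goes into your LLM prompt for the actual extraction step.
```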
jonahbenton@reddit
All PDF bank statements encode the actual data as text, extractable with non-AI PDF processing tools (see the sketch below). That data is sufficient for a lot of tasks. LLM vision can be a good assist if there is some specific question that can be posed as an either/or based on the data.
Old-school (lol) OCR in general tends not to be reliable enough on its own, and you can't query it.
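A sketch of the non-AI route, assuming pdfplumber and transaction lines shaped like "03/14 POS PURCHASE ... 1,234.56" (real statement layouts vary, so the regex is only illustrative):

```python
# Sketch: extract embedded text directly (no OCR, no model) and find
# the largest March transaction with a naive regex over each line.
import re
import pdfplumber

line_re = re.compile(r"^03/\d{2}\s+.*?([\d,]+\.\d{2})\s*$")  # assumed layout

highest = 0.0
with pdfplumber.open("statement.pdf") as pdf:  # placeholder path
    for page in pdf.pages:
        for line in (page.extract_text() or "").splitlines():
            m = line_re.match(line)
            if m:
                highest = max(highest, float(m.group(1).replace(",", "")))

print(f"Highest March transaction: {highest:.2f}")
```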
ismaaiil933@reddit
Llama 3.2 is not super good with vision tasks.