LLM for data extraction
Posted by Impressive_Refuse_75@reddit | LocalLLaMA | View on Reddit | 4 comments
Hi everyone,
I just started working for a company that needs to process many different RFQ (Request for Quotation) files in formats like .xls, .xlsx, .pdf, and .docx and extract certain data from them. Worth noting that the files usually follow a tabular format, and sometimes they're just plain lines of text.
The thing is that each file comes with its own columns and names, so extracting data is really a mess. My idea was to use docling/marker/markitdown to convert each file to a .md, then pass that through an LLM hosted locally in LMStudio to "intelligently" extract the actual variables I want as JSON and use them.
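A minimal sketch of that second stage, assuming LM Studio's OpenAI-compatible server is running on its default port (localhost:1234); the model name and the RFQ field names here are placeholders, not anything from the thread:

```python
import json
import urllib.request

LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"

# Hypothetical target fields -- replace with the actual RFQ variables.
SCHEMA = {
    "type": "object",
    "properties": {
        "part_number": {"type": "string"},
        "quantity": {"type": "integer"},
        "due_date": {"type": "string"},
    },
    "required": ["part_number", "quantity", "due_date"],
}

def build_payload(md_text: str, model: str = "local-model") -> dict:
    """Build a chat-completions request that constrains output to SCHEMA."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Extract the requested fields from the RFQ below. "
                        "Answer only with JSON matching the schema."},
            {"role": "user", "content": md_text},
        ],
        # LM Studio supports OpenAI-style structured output; this forces
        # the model to emit JSON that validates against the schema instead
        # of free text that merely "looks like" JSON.
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "rfq", "schema": SCHEMA},
        },
        "temperature": 0,
    }

def extract(md_text: str) -> dict:
    """POST to the local LM Studio server and parse the JSON reply."""
    req = urllib.request.Request(
        LMSTUDIO_URL,
        data=json.dumps(build_payload(md_text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return json.loads(body["choices"][0]["message"]["content"])
```

Constraining the output with a schema, rather than just asking for JSON in the prompt, is usually what stops a small local model from skipping or mangling fields.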
The problem is that the LLM sometimes skips words or doesn't extract correctly from the document. Also, when the .md is large, the LLM takes very long on my GPU (an RTX 5060 with 8 GB of VRAM), so I don't really know what else to try for this task.
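For the large-file problem, one common workaround (a sketch under my own assumptions, not tied to any particular library) is to split the .md into chunks that fit comfortably in the model's context, extract per chunk, and merge the partial JSON results:

```python
def chunk_markdown(md_text: str, max_chars: int = 4000,
                   overlap: int = 200) -> list[str]:
    """Split markdown into chunks, preferring to break at blank lines so
    tables and paragraphs are not cut mid-row."""
    chunks = []
    start = 0
    while start < len(md_text):
        end = min(start + max_chars, len(md_text))
        if end < len(md_text):
            # Back up to the nearest blank line inside the window, if any.
            cut = md_text.rfind("\n\n", start, end)
            if cut > start:
                end = cut
        chunks.append(md_text[start:end])
        if end == len(md_text):
            break
        # Keep a little overlap for context; the max() guards against
        # an infinite loop when overlap exceeds the chunk size.
        start = max(end - overlap, start + 1)
    return chunks

def merge_results(partials: list[dict]) -> dict:
    """Merge per-chunk extractions, keeping the first non-empty value
    found for each field."""
    merged: dict = {}
    for part in partials:
        for key, value in part.items():
            if key not in merged and value not in (None, ""):
                merged[key] = value
    return merged
```

Smaller prompts also help latency directly on an 8 GB card, since prompt processing time grows with context length.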
I would like to hear what you do or what methods you use for things like this, thanks :)
SouthTurbulent33@reddit
You're on the right track. I suggest you try a different parser, like llmwhisperer or landing AI. These are miles better and faster than the ones you're using now.
My company uses llmwhisperer + Claude Sonnet 4.6 for structured data extraction, and it's more or less been a breeze so far.
I have tried reducto as well, but it just turned out to be too slow + expensive for the performance.
Impressive_Refuse_75@reddit (OP)
Thanks for the advice, I'll check them out and see how it goes
ps5cfw@reddit
You are not going to get reliable results without big enough models, and unless you operate at a very low workload (hundreds of documents a day at most), you'll either need to spend decent bucks to run a model quickly enough or it's not going to be worth the hassle.
And again, even the best models are only reliable up to a point, so YMMV.
Impressive_Refuse_75@reddit (OP)
Yeah, but do you think it's worth getting more GPUs or something to host it locally, or should I just run a bigger model on some cloud server either way?