LLM for data extraction

Posted by Impressive_Refuse_75@reddit | LocalLLaMA | View on Reddit | 4 comments

Hi everyone,

I just started working for a company that needs to process many different RFQ (Request for Quotation files) formats of incoming files like .xls .xlsx .pdf .docx to extract certain data from them, woth to say that the files usually follow a tabular format and sometimes they just have lines.

The thing is that each file comes with its own columns and names so extracting data it´s really a mess. The idea I thought was to extract by docling/marker/markitdown the data of the file to a .md and then pass it through a LLM hosted locally in LMStudio to "intelligently" extract the actual variables I want in a JSON and use them.

The problem is that the LLM sometimes skips words or doesn´t extract correctly from the document. Also when its a large .md the LLM takes so long with my GPU, which is RTX 5060 8GB, so I actually don´t know what else to do for this task.

I would like to hear what you do or methods you have for things like this, thanks :)