numind/NuExtract3 · Hugging Face

Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 6 comments

**NuExtract3** is a unified **4B** vision-language reasoning model for document understanding. It combines strong **structured information extraction** with high-quality **image-to-Markdown** conversion, making it suitable for extraction pipelines, OCR, and RAG preprocessing for all types of documents such as scans, receipts, forms, invoices, contracts or tables. # Overview * **Structured extraction**: input (text/images) + JSON template + instructions --> JSON output * **Markdown conversion**: input (text/images) --> Markdown * **Multimodal inputs**: text, images, or text + images. * **Multilingual** documents. * **Reasoning** and non-reasoning inference modes. * **Template generation** for structured extraction from natural language or input document. # [](https://huggingface.co/numind/NuExtract3#benchmark-results) GGUF, NVFP4, MLX, VLLM, etc., already there [https://huggingface.co/models?other=base\_model:quantized:numind/NuExtract3](https://huggingface.co/models?other=base_model:quantized:numind/NuExtract3)

6 Comments

[-]

Steuern_Runter@reddit

Can someone recommend me an easy to use tool to transcribe a bunch of scanned documents using a VLLM like this one?

computehungry@reddit

I've never found a package that perfectly does this out of the box. docx, pptx, whatever random files you have will break during some conversion. What I did was: - Convert all docs to PDF (you can make AI automate this) - Check the docs if they're broken (you can skip this if you have a million docs... or probably search some more for a nicer solution) - Run the VLM as an OpenAI-compatible server (llama-server, vllm serve, etc) - Make AI write a Python script that will loop the VLM processing, while processing pdfs as images, and output the transcripts (I do doc->pdf->image because pdf->image is pretty robust, doc->pdf/image can be bad, and pdfs are way easier to check than images. You can skip pdfs if you don't care) - If you have a lot of documents, you can try to make it run in parallel (but this may break llama.cpp) You can feed my whole comment into some big enough llm and it can do everything for you lol

Thanks. I already used OpenCode to handle that process with classic OCR tools. I just thought there could be a more streamlined workflow since this is a big area of application for VLMs.

Agree... I'd think so too. All those existing packages try to do "document understanding" instead of just taking in images, then call 30 tools to figure out what to look for. idk why they're going this direction. Perhaps images take too many tokens for API.

Il_Signor_Luigi@reddit

Interesting, might test later

pmttyji@reddit (OP)

Share your evaluations later

Reply to Post

6 Comments