NuExtract3 released: open-weight 4B VLM for Markdown, OCR and structured extraction (self-hostable)
Posted by Gailenstorm@reddit | LocalLLaMA | View on Reddit | 24 comments
Disclaimer: I work for Numind, the company behind this open-weight model
TLDR: Image/text to Markdown :-)
We just released a 4B model based on Qwen3.5-4B, under Apache-2.0 license. The goal is to make information extraction from complex documents more practical with an open model: PDFs, screenshots, forms, tables, receipts, invoices, multi-page documents, and other visually structured inputs.
Try it: we have a Hugging Face Space that is completely free (you don't even have to sign up): https://huggingface.co/spaces/numind/NuExtract3
If you've ever used NuMarkdown, NuExtract3 is the successor.
There are some examples to guide you. Feel free to re-use this model for any task.
A few things it is designed for:
- converting document images to Markdown
- extracting structured data from documents using a target JSON template
- handling tables, forms, and layout-heavy pages
- working with both text and visual document inputs
- serving as a local/open-weight alternative for document extraction pipelines
It was trained on a node of 8xH100 for 3 days, on as much context as we could fit, so it should perform fairly well even on long documents. For Markdown, we'd still recommend going page by page for the best results and inference speed, since pages can be processed in parallel.
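A minimal sketch of the page-by-page approach: each page is converted independently and the results are joined in page order. `convert_page` is a hypothetical placeholder for whatever call sends one page image to your inference server (vLLM, SGLang, llama.cpp); it is not part of NuExtract3 or of any library.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def convert_page(page: bytes) -> str:
    """Hypothetical placeholder: send one page image to the model, return Markdown."""
    raise NotImplementedError("wire this up to your own inference endpoint")


def convert_document(
    pages: Sequence[bytes],
    convert: Callable[[bytes], str] = convert_page,
    max_workers: int = 8,
) -> str:
    """Convert pages concurrently and join the Markdown in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, even if pages finish out of order.
        markdown_pages = list(pool.map(convert, pages))
    return "\n\n".join(markdown_pages)
```

Threads are enough here because the work is waiting on the server; `max_workers` is an arbitrary default and should be tuned to what your backend can batch.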
It's very easy to self-host, since we provide fairly extensive documentation, Safetensors, GGUF and MLX weights. With as little as 4GB of VRAM, you should be good to go. We provide multiple quantizations (GPTQ, W8A8, FP8, Q4, Q6...) so you should be able to run it anywhere.
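A rough back-of-the-envelope for where a figure like 4GB comes from, counting the weights only. It assumes "4B" means exactly 4 billion parameters and ignores the KV cache, activations and quantization metadata, so real usage will be higher than these numbers.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NOMINAL_PARAMS = 4e9  # assumption: "4B" taken as exactly 4 billion parameters

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(NOMINAL_PARAMS, bits):.1f} GB")
```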
We mostly tried vLLM, SGLang, llama.cpp. Ollama support would be nice but I'm not a big fan of their chat template engine.
We have a blog post and a pretty decent model card:
- https://about.nuextract.ai/blog/nuextract-3-release
- https://huggingface.co/numind/NuExtract3
- https://huggingface.co/collections/numind/nuextract3
I'm currently writing a paper on this model, so I'll post it as soon as it's accepted. It's not on arXiv yet, as it has been submitted to a peer-reviewed journal/conference.
I'll try to answer as many questions as possible if you have any. We would really appreciate feedback from the community.
We also have a discord if you're interested
https://discord.com/invite/3tsEtJNCDe
laul_pogan@reddit
If you're loading this in vLLM and hitting weight key errors or silent load failures, Qwen3.5 VLM weights sometimes serialize with a model.language_model.* prefix in the safetensors. vLLM expects the flat layout, so you need to strip that prefix before loading. Same issue with mrope_section_size left in config.json; vLLM's Qwen2-VL backend chokes on it. Two-line fix in a weight conversion script, or just patch config.json to drop the mrope key. --load-format safetensors also loads shards 4-7x faster than the default on multi-shard checkpoints.
Gailenstorm@reddit (OP)
Hm, I did not see any error or silent failure on my end, I'm only using vLLM 0.20.0+. Are you using an older version?
laul_pogan@reddit
Well thank you! I've just realized I was in fact on 0.17.0 when I thought I was up to date. Cheers :)
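For anyone stuck on an older vLLM, the two fixes described above could be sketched roughly as follows. This only illustrates the key renaming on plain dictionaries, using the prefix and config key exactly as named in that comment; the function names are hypothetical, and per OP none of this should be needed on vLLM 0.20.0+.

```python
from typing import Any, Dict

PREFIX = "model.language_model."  # prefix as described in the comment above
MROPE_KEY = "mrope_section_size"  # config key as named in the comment above


def strip_prefix(state_dict: Dict[str, Any], prefix: str = PREFIX) -> Dict[str, Any]:
    """Return a copy of the weight mapping with `prefix` removed from matching keys."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


def drop_config_key(config: Dict[str, Any], key: str = MROPE_KEY) -> Dict[str, Any]:
    """Return a copy of the config without `key` (no error if it is absent)."""
    return {k: v for k, v in config.items() if k != key}
```

Both helpers return copies, so the original checkpoint mapping and config stay untouched if the renamed version turns out not to load.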
Fit_Advice8967@reddit
Good stuff!!! Anybody who used this for academic papers plz let us know how it performs!
Wise_Stick9613@reddit
I'm trying to accomplish this (OCR: what is the best way to extract data in JSON format from this old French book?): can your model help me?
Should I:
Gailenstorm@reddit (OP)
Nice and complicated task you have here. NuExtract3 can do it all in a single pass I think. I just tried on the huggingface space! https://huggingface.co/spaces/numind/NuExtract3 (no sign-up required)
I slightly changed your template to "help" the model
{"Page number": "integer",
"Title": "verbatim-string",
"Verse content": "string",
"Commentaire content": "string"}
And when I "Extract JSON" I get
{
"Page number": 192,
"Title": "II. — MACCABÈES. — VI. - MORT D'ÉLEAZAR",
"Verse content": "20. Intuens autem, quemadmodum oporteret accedere, patienter sustinens, destinavit non admittere illicita propter vitam amore. 21. Hic autem qui astabant, iniqua miseratione commoti, propter antiquam viri amicitiam, tollentes eum secreto, rogabant afferri carnes quibus vesci ei licebat, ut simularetur manducasse, sicut rex imperaverat, de sacrificii carnibus. 22. Ut, hoc facto, a morte liberaretur; et propter veterem viri amicitiam, hanc in eo faciebant humanitatem. 23. At ille cogitare coepit ætatis ac senectutis suæ eminentiam dignam, et ingentis nobilitatis caniticem, atque a puero optimæ conversationis actus; et secundum sanctæ et a Deo conditæ legis constituta, respondit cito, dicens, præmitti se velle in infernum. 24. Non enim ætati nostræ dignum est, inquit, fingere, ut multi adolescentium, arbitrantur Eleazarum nonaginta annorum transisse ad vitam alienigenarum. 25. Et ipsi, propter meam simulationem, et propter modicum corruptibilis vitæ tempus, decipiuntur, et per hœc maculam, atque execrationem meæ senectuti conquiram.",
"Commentaire content": "20. INTUENS AUTEM QUEMADMODUM OPORTERET ACCEDERE, PATIENTER SUSTINENS, DESTINAVIT NON ADMITTERE ILLICITA PROPTER VITÆ AMORE. Le grec est entortillé et peu compréhensible, à moins de reconstituer la phrase en hébreu (1). Le voici mot à mot en latin : Expuens autem secundum quem oportebat modum accedere suslinens les puniri, qux non licebat gustare propter vitam amore. C'est-à-dire, littéralement, mais selon la manière dont devaient se conduire ceux qui avaient la force de souffrir, il crachait ces choses qu'il ne lui était pas permis de goûter pour sauver sa vie. La phrase a une tournure hébraïque très accentuée. 21. QUI ASTABANT INIQUA MISERATIONE COMOTI. Le grec (2) : Ceux qui étaient préposés à cet injuste festin, où l'on servait les entrailles des bêtes immolées aux idoles. Ce grand homme aime mieux souffrir la mort que de scandaliser les faibles par une action permise en elle-même, mais qui serait prise infailliblement pour une prévarication. Il suivait dès lors, dans la pratique, les règles de morale que Jésus-Christ (3), que saint Paul (4), que les martyrs (5), ont depuis, enseignées et pratiquées. 24. NON ENIM ÆTATI NOSTRÆ DIGNUM EST, INQUIT, FINGERE. Cette fiction n'aurait été digne d'aucun âge, puisque tous les hommes, de quelque âge qu'ils puissent être, sont obligés de donner aux autres l'exemple d'une foi sincère, et d'une piété sans déguisement. Mais le scandale que cause la chute d'un homme chargé d'années et consommé dans les exercices de la piété, est d'une autre conséquence sans comparaison, que celui que causerait la chute d'une personne ordinaire. La grande estime que l'on a conçue pour la vertu et pour la lumière du premier, donne un poids particulier à toutes ses actions. Il ne fait rien qui ne soit, non pas seulement un fruit, mais une semence de vie ou de mort pour plusieurs personnes, qui le regardent comme un modèle que l'on doit suivre. 
Ainsi Éléazar avait raison, quand il répondit à ceux qui voulaient lui inspirer cette fiction, qu'elle aurait été indigne de son grand âge : non qu'elle eût pu convenir à un âge moins avancé, mais parce que sa vieillesse aurait rendu son exemple plus dangereux pour plusieurs jeunes personnes qu'il aurait trompées par ce déguisement, et à qui il serait ainsi devenu un sujet de scandale. Il préfèra donc, comme il le dit, laisser aux jeunes gens un exemple de fermelé, plutôt que de conserver un petit reste de cette vie corruptible, par une dissimulation si pernicieuse tant à son salut qu'à celui de ses frères."
}
Let me know what you think! But I'd say it's pretty good
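If you run this over many pages, a quick sanity check of each result against the template can catch missing or mistyped fields. A minimal sketch, assuming the type labels map to Python types as shown; "verbatim-string" is only checked as a plain string here, and `check_against_template` is a hypothetical helper, not part of NuExtract3.

```python
import json
from typing import Any, Dict, List

# Assumption: how each template type label maps to a Python type check.
TYPE_CHECKS = {
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
    "verbatim-string": lambda v: isinstance(v, str),
}


def check_against_template(template: Dict[str, str], output: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the output matches the template."""
    problems = []
    for field, type_label in template.items():
        check = TYPE_CHECKS.get(type_label)
        if check is None:
            problems.append(f"unknown type label for {field}: {type_label}")
        elif field not in output:
            problems.append(f"missing field: {field}")
        elif not check(output[field]):
            problems.append(f"wrong type for {field}: expected {type_label}")
    problems.extend(f"unexpected field: {f}" for f in output if f not in template)
    return problems


template = {
    "Page number": "integer",
    "Title": "verbatim-string",
    "Verse content": "string",
    "Commentaire content": "string",
}
# Output abbreviated for the example: the long text fields are placeholders.
output = json.loads(
    '{"Page number": 192, "Title": "...", "Verse content": "...", "Commentaire content": "..."}'
)
print(check_against_template(template, output))
```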
Wise_Stick9613@reddit
The result looks great. Thanks a lot, I'll give it a try!
fishylord01@reddit
This is the type of post we need more of here. My company's SaaS has a paid feature called "digital forms": it recreates customers' manual/physical forms as digital ones that can be filled in through our software as part of a job/task, with that data used to auto-populate the form. The form can then be sent to people, e-signed, have images attached, etc.
Expand this into a feature that goes automatically from manual form -> polished translation layer -> digital form, and you can easily sell it to top companies like ClickUp, monday.com, etc. Instead of taking 1-2 days to create a new digital form that closely resembles the real one within the system (whose design tools can be hard to learn), it would take a minute: just take a picture. I could explain more in DMs about how it all works if you'd like.
PferdOne@reddit
Pretty good so far. It already succeeded where I had problems with Qwen and Gemma (MoE, 4B, 9B, dense). I'll probably integrate it into my workflow. Thanks a lot!
Gailenstorm@reddit (OP)
You're welcome, thanks a mil for the compliment
Don't hesitate to put a little heart on the huggingface repo 😉
ECrispy@reddit
this looks very interesting, thank you!
can this be used in place of tools like trafilatura etc to convert web pages to markdown? I have a lot of saved web pages I'd like to extract content from.
Gailenstorm@reddit (OP)
To be honest, this is not the primary use case; it has not been trained "that" much on websites. I'd say it would still be good for its size, but not exceptional.
If you give it a try, test both kinds of input: the HTML as text, and an image of the page (render/screenshot).
ECrispy@reddit
sorry, I do realize that after reading some more. I think html extraction is a different problem and this seems to understand images as well.
Odd-Ordinary-5922@reddit
you can always take a screenshot of the html
akisviete@reddit
Is the model any good for OCR of burned-in Chinese video subtitles? Any recommendations for that? Using the videocr app now.
Bubulela@reddit
Tried it on Friday, I think results were good without much iteration. Do you have any suggestions for digital newspapers? I'm trying to replace gemini flash 3, which works really well but the cost adds up fairly quickly.
Gailenstorm@reddit (OP)
I'd say playing with DPI might help. In our tests, 150-170 DPI works best, but it might be worth increasing the number. Splitting the pages with something like OpenCV might help too if you are doing Markdown.
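To get a feel for what those DPI values mean in pixels, here is a small worked calculation. It assumes A4 pages purely for illustration (newspaper pages are often larger), and `page_pixels` is a hypothetical helper.

```python
from typing import Tuple

MM_PER_INCH = 25.4


def page_pixels(width_mm: float, height_mm: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


A4_MM = (210, 297)  # assumption: A4 pages; adjust to your actual page size

for dpi in (150, 170, 200):
    print(dpi, page_pixels(*A4_MM, dpi))
```

Pixel count grows with the square of the DPI, so a modest DPI increase can noticeably raise the number of image tokens and the inference time per page.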
Bubulela@reddit
Thanks for the quick answer! Will try when I get the chance. Awesome job!
BunchaQuestion@reddit
I need an AI to help me understand this in TLDR, be right back
Gailenstorm@reddit (OP)
LLM takes image and produces text that looks like image (but is text)
tmp1571@reddit
LLM takes image and produces text that looks like image (but is text)
Celestialien@reddit
Nice, the 4GB VRAM floor is what makes this actually usable for a lot of us - appreciate that you shipped GGUF and MLX weights day one instead of leaving it to the community.
Quick question: how does it hold up on multi-column layouts and dense tables compared to something like dots.ocr or Qwen3-VL? Markdown OCR tends to fall apart on reading order once you've got sidebars, footnotes, or merged table cells. Also curious whether it handles handwriting at all, or if that's out of scope for this release.
Either way, will have a play around with it this week!
Gailenstorm@reddit (OP)
We have trained it quite a lot on this type of document, and the table rendering is actually HTML inside the Markdown. We found it works so much better than Markdown tables, because the base models are trained heavily on HTML, and it's also less prone to "forgetting" a single | which messes up the table.
It also allows for more complex layouts, such as multiple boxes inside boxes.
So I'd say it works well.
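To illustrate why HTML helps with layout, here is a minimal sketch that renders a table with a merged cell, which plain pipe tables cannot express. The `render_html_table` helper and the sample rows are hypothetical and only show the general idea, not the model's exact output format.

```python
from html import escape
from typing import List, Tuple

Cell = Tuple[str, int]  # (text, colspan)


def render_html_table(rows: List[List[Cell]]) -> str:
    """Render rows of (text, colspan) cells as a minimal HTML table."""
    lines = ["<table>"]
    for row in rows:
        cells = "".join(
            f'<td colspan="{span}">{escape(text)}</td>' if span > 1
            else f"<td>{escape(text)}</td>"
            for text, span in row
        )
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# A header cell spanning two columns, followed by a normal two-cell row.
print(render_html_table([[("Totals", 2)], [("Net", 1), ("Gross", 1)]]))
```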
Concerning handwriting, it's not the main use case but we still trained for it. You should definitely try it out on the Hugging Face Space; I think it should suit most people.
silenceimpaired@reddit
Thanks for sharing and for the model. Now I just need a book scanner that doesn’t require I cut out all the pages or even turn the pages :)