Recommendations for summarization and structured data extraction
Posted by cachophonic@reddit | LocalLLaMA | View on Reddit | 10 comments
Hi all, I’m looking for people’s current favourites/recommendations for models that are great at following instructions for text summarization and structured data extraction.
For a bit of context, the model needs to fit within 48GB of VRAM, and the use case is largely extracting specific information (e.g. question and answer pairs, specific assessment info) and structured JSON data from appointment transcripts, usually around 30k tokens including prompts per generation.
Our current go-to is still Mistral 24B Instruct at fp8 running in vLLM.
This is a production project, so the priority is accuracy, instruction-following, and avoiding confabulation over raw t/s.
We tried several other models like gpt-oss-20b, Qwen3-30B-A3B, and several smaller Qwen models when we initially got started, but it's hard to keep up with all the changes, so I thought I'd see if people have particular go-tos so we can reduce the shortlist of models to experiment with. Thanks!
daviden1013@reddit
I do medical structured data extraction for work. I found Qwen3-30B-A3B-Thinking and gpt-oss-120b have the best accuracy. Both output JSON pretty well. gpt-oss is faster because its thinking is more concise, while Qwen3 often overthinks. Given your VRAM, gpt-oss-120b won't work, so I'd suggest Qwen3-30B-A3B-Thinking at int8 quantization. Besides the choice of LLM, to get accurate JSON results, context processing and post-processing are important too. I made a Python package for structured data extraction pipelines (https://github.com/daviden1013/llm-ie). Might be relevant to your use case.
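On the post-processing point, a minimal sketch of a lenient JSON parser (my own illustration, not part of the llm-ie package; the function name is hypothetical) that handles models wrapping their output in markdown fences or surrounding it with prose:

```python
import json
import re

def parse_llm_json(raw: str):
    """Leniently parse a JSON object from an LLM response.

    Handles responses wrapped in markdown code fences and responses
    with extra prose around the JSON object.
    """
    # Strip markdown code fences like ```json ... ```
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} span
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        return json.loads(text[start:end + 1])
    raise ValueError("no JSON object found in response")
```

A fallback like this is a safety net, not a substitute for constrained decoding; vLLM's guided/structured output options are the more reliable first line of defence.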
cachophonic@reddit (OP)
Thanks - that’s great info. We did try A3B (non-thinking which might have been the issue) but found it more prone to confabulation than Mistral Small in our testing. But I might have to give it another go. I’ll check out your repo too - looks interesting. We’re using Mastra to help with workflows and structured output but do still run into issues with schema compliance at times. Thanks again - very helpful.
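On the schema-compliance issues: one cheap guard (independent of Mastra; the field names below are hypothetical, just to illustrate) is to validate required keys and types before accepting an extraction, and retry the generation when it fails:

```python
def check_schema(record: dict, required: dict[str, type]) -> list[str]:
    """Return a list of compliance errors; an empty list means the record passes."""
    errors = []
    for key, expected_type in required.items():
        if key not in record:
            errors.append(f"missing field: {key}")
        elif not isinstance(record[key], expected_type):
            errors.append(
                f"wrong type for {key}: expected {expected_type.__name__}, "
                f"got {type(record[key]).__name__}"
            )
    return errors

# Hypothetical schema for a Q&A pair pulled from an appointment transcript
QA_SCHEMA = {"question": str, "answer": str, "confidence": float}
```

Feeding the error list back into a retry prompt ("your previous output was missing field X") often fixes compliance without changing models.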
daviden1013@reddit
Yeah, the non-thinking one is fine for simple tasks. I found its performance much worse than the thinking one's (~40% vs. 70% on a cardiology NER task).
cachophonic@reddit (OP)
Have you had any issues with hallucinations/confabulation? Or has it been pretty good for you? I guess the thinking variant would be less prone to it as it can sort of check its own work to some degree.
daviden1013@reddit
Yes, that's a real problem. My solutions are:
1. Prompt engineering. Often it's not the model's capability but ambiguity in the instructions. Error analysis and iteratively developing the prompt helps.
2. Use a reasoning model. The reasoning process includes reviewing the instructions and self-correction. I found this very, very helpful on my benchmarks and real-world tasks.
3. Chunk the inputs. LLMs often perform poorly on long input text, with especially poor recall. Dividing the input into paragraphs or sentences and having the LLM process them one by one helps a lot. This is shown in my paper and supported in the package (UnitChunker class, SentenceFrameExtractor). My default at work (medical concept extraction) is sentence-by-sentence prompting.
4. Divide a hard task into subtasks. Sometimes the task is complex, for example extracting a few hundred types of entities and attributes following a hierarchical schema. The complexity is beyond an LLM's capacity. In such cases, I'd review the schema, isolate it into independent subtasks, and assign them to multiple LLMs. For example, one LLM extracts diagnosis entities, another extracts medication entities.
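The chunking idea in point 3 can be sketched roughly like this (a naive sentence splitter standing in for a proper chunker, and `extract` is a placeholder for the actual LLM call; names are mine, not the package's API):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter; a real chunker would be more robust
    on clinical text with abbreviations like 'Dr.' or 'b.i.d.'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def extract_per_sentence(text: str, extract) -> list[dict]:
    """Run an extraction callable on each sentence independently and
    merge the results, which tends to improve recall on long inputs."""
    results = []
    for i, sentence in enumerate(split_sentences(text)):
        for entity in extract(sentence):
            entity["sentence_index"] = i  # keep provenance for review
            results.append(entity)
    return results
```

Keeping the sentence index on each extracted entity also makes the error analysis in point 1 much easier, since every extraction traces back to its source span.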
I'd say the rule of thumb is: if your input text looks long to a regularly lazy human, consider chunking it, because LLMs can be lazy in a similar way, which causes poor recall. If your task/schema looks complex to a high school kid, consider dividing it. Don't expect one-and-done; it always takes a dozen iterations of error analysis and prompt revision. That's why I made a visualization tool to make my life easier.
HoldTheMayo25@reddit
Qwen 2.5 Coder 32B Instruct is currently the best-in-class for this specific setup. Coding models are exceptionally good at strict JSON schema adherence and structured extraction, and at 32B, it fits comfortably in your 48GB VRAM with plenty of room for the 30k context KV cache.
If you need higher reasoning capabilities, you could squeeze Llama 3.3 70B in by using 4-bit quantization (AWQ/GPTQ), which fits in ~40-42GB, though you'll have less headroom for batching. Also, ensure you are using --enable-chunked-prefill in vLLM to manage those large 30k prompts efficiently.
cachophonic@reddit (OP)
Interesting - I hadn’t thought about coding models but that makes a lot of sense. I’ll have to give that a try and see how it performs in our setup. And thanks for the tip re chunked prefill. I’ll double-check our setup tomorrow and read up on how it works.
Optimalutopic@reddit
I would suggest going for a proprietary model; you'll have to make sure the extraction model is powerful enough to keep the same accuracy as your requirements change. Local models are for sure great for prod use cases, but even from the inference angle, unless your usage is very high or you have on-prem GPUs, per-request pricing from providers is the most cost-effective option.
cachophonic@reddit (OP)
We need to self host due to privacy requirements, otherwise that would definitely make a lot more sense.