Best sub 14b llm for long text summaries?
Posted by GreenTreeAndBlueSky@reddit | LocalLLaMA | View on Reddit | 16 comments
Speed is not important (it can run overnight if really need be) but accuracy really matters to me. I was wondering if there were good 1M or 512K or even 256K context models that I might not be aware of.
I know Qwen3 4B Instruct has 256K native context, but I'm afraid it might not be accurate enough and might hallucinate quite a bit due to its size.
imoshudu@reddit
Accuracy is hardly defined for a summary, and summarization is basically among the easiest things for an LLM. Context size matters only a little since RAG has become standard and you can guide any LLM to use RAG. Hallucination only happens when the LLM has nothing to work with. Just use qwen3 8b with /nothink, or use the "so cheap it's basically free" gemini flash 2.0 on openrouter for incredible context size and speed.
GreenTreeAndBlueSky@reddit (OP)
It has happened to me that the 4B says things that never happened in the text. And because I need an overall picture, RAG is not gonna cut it. That's why I'm asking.
o0genesis0o@reddit
There is a technique: essentially, you split the text into chunks and get the LLM to process these chunks incrementally. Each iteration produces a partial summary (up to that point), which becomes part of the input for the next iteration. With some careful playing with the system prompt, you can get a quite rock-solid summary (albeit slowly) with a local LLM, and because this approach does not push the LLM into degradation, a local model works just fine (rough sketch below).
The only challenge is whether the text to be summarized can be parsed easily. PDF, for example, is a massive PITA.
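A minimal sketch of that loop, assuming a local OpenAI-compatible endpoint (llama.cpp server, Ollama, etc.); the URL, model name, chunk size and prompt wording are all placeholders to adapt:

# Rolling/incremental summarization: each chunk is summarized together with
# the summary-so-far, and the result is carried into the next iteration.
# Endpoint, model name and chunk size below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

def rolling_summary(text, chunk_chars=8000, model="qwen3-8b"):
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    summary = ""
    for chunk in chunks:
        prompt = (
            "You are building a running summary of a long document.\n"
            f"Summary so far:\n{summary or '(nothing yet)'}\n\n"
            f"Next part of the document:\n{chunk}\n\n"
            "Rewrite the summary so it covers everything seen so far. "
            "Only state things that actually appear in the text."
        )
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,
        )
        summary = resp.choices[0].message.content  # feeds the next iteration
    return summary

Splitting on raw character counts is crude; splitting on paragraph or speaker-turn boundaries keeps the chunks more coherent.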
imoshudu@reddit
You are not setting up RAG properly. Look into LangChain, for instance. 4B is a bit risky but 8B is completely fine. Gemini Flash 2.0 is the best option.
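For reference, a rough sketch using the classic LangChain map_reduce summarize chain pointed at a local OpenAI-compatible server; the endpoint, model name and chunk sizes are placeholders, and newer LangChain releases have moved some of these imports around:

# Map-reduce summarization over chunks with the classic LangChain chain.
# Endpoint/model/chunk sizes are placeholders; adjust to your setup.
from langchain_openai import ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain.chains.summarize import load_summarize_chain

llm = ChatOpenAI(base_url="http://localhost:8080/v1", api_key="not-needed", model="qwen3-8b")
splitter = RecursiveCharacterTextSplitter(chunk_size=8000, chunk_overlap=500)
docs = [Document(page_content=c) for c in splitter.split_text(open("transcript.txt").read())]
chain = load_summarize_chain(llm, chain_type="map_reduce")  # summarizes chunks, then merges
print(chain.run(docs))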
GreenTreeAndBlueSky@reddit (OP)
Thanks. Although I'd rather use local options; I don't trust cloud privacy tbh, especially since we don't have homomorphic encryption yet.
ForsookComparison@reddit
I've done a lot of these tests and the winner in that size range is almost always Llama 3.1 8B for sub-128k and Nemotron-Ultralong-8B for anything higher.
They're older now, but nothing recent has come out in that size that handles massive context so well.
ttkciar@reddit
Thanks for pointing out Nemotron-Ultralong-8B! My usual go-to for long summary is Gemma3-12B or 27B, but their competence drops off sharply after 90K of context. When I get home next week I'll compare them to Nemotron-Ultralong-8B. Having a better long-context summarizer will be great!
Trilogix@reddit
https://hugston.com/uploads/llm_models/irix-12b-model_stock-q6_k.gguf
1 million ctx trained and linear :)
CtrlAltDelve@reddit
It is better etiquette to link directly to a Git repo or a HF repo when sharing a link to a model, just so people can understand what they're downloading before they click :)
https://huggingface.co/DreadPoor/Irix-12B-Model_Stock
Trilogix@reddit
Yes it is, thanks for the main source. It's just that the models are getting into the thousands and it's quite difficult to keep track of the main source for each. As soon as time allows we will include the source in the description.
QFGTrialByFire@reddit
I know it's more than 14B, but the model does give better results for these tasks: oss20B at mxfp4 fits in 11.8GB. Its max context length is 128K though. To be honest, pushing beyond 128K starts to hit diminishing returns: even if a model has that context, attention gets sparse, so even if larger models can go to larger contexts they start to lose accuracy/clarity. At that point you want to use a RAG-like system, or do overlapping sliding-window summarisation and then ask the model to blend the summaries together (rough sketch below).
(Caveat: if you ask it to generate copyrighted stuff, oss20B will spit the dummy. It can summarise copyrighted material, but not generate new content.)
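A minimal sketch of the overlapping-window part; the window and overlap sizes are arbitrary placeholders:

def sliding_windows(text, window_chars=12000, overlap_chars=1500):
    # Each window shares its head with the tail of the previous one,
    # so sentences near a boundary appear intact in at least one window.
    step = window_chars - overlap_chars
    return [text[i:i + window_chars] for i in range(0, len(text), step)]

# Summarize each window separately, then make one final call asking the
# model to blend the per-window summaries into a single coherent summary.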
GreenTreeAndBlueSky@reddit (OP)
Thanks, it won't be copyrighted, it's more meeting transcripts.
QFGTrialByFire@reddit
Ah, that's probably not a problem. Also, how come you need a context window larger than 128K? That's probably like 90k words, or something like 10 hours of talking? I don't imagine meetings go that long :)
MaverickPT@reddit
I have my meeting transcripts in a .json file. I found that it helps with speaker diarization and all, but all the extra .json structure eats into the context budget. I'm happy to hear a better way of doing things, though.
QFGTrialByFire@reddit
Ah yes, all it really needs is structure; it doesn't have to be the full JSON format. You can run a simple script to first strip the JSON down to a simple format, then feed that to the LLM, e.g. something like:
[00:12:31] Alice: We should review the budget.
[00:12:45] Bob: Yes, I’ll send the spreadsheet.
All the extra curly braces, commas, quotes, etc. eat into the context budget without giving the LLM much more structure/context.
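A quick sketch of that stripping script, assuming each JSON entry has "start", "speaker" and "text" keys (adjust to whatever your transcription tool actually emits):

# Flatten a diarized transcript JSON into "[HH:MM:SS] Speaker: text" lines.
# The key names below are assumptions about the transcript format.
import json

def flatten_transcript(path):
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    lines = []
    for e in entries:
        minutes, seconds = divmod(int(e["start"]), 60)
        hours, minutes = divmod(minutes, 60)
        lines.append(f"[{hours:02d}:{minutes:02d}:{seconds:02d}] {e['speaker']}: {e['text']}")
    return "\n".join(lines)

print(flatten_transcript("meeting.json"))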
GreenTreeAndBlueSky@reddit (OP)
That's reassuring, I was a bit scared that 2 hours might not fit.