Best sub 14b llm for long text summaries?
Posted by GreenTreeAndBlueSky@reddit | LocalLLaMA | View on Reddit | 16 comments
Speed is not important (it can run overnight if really need be) but accuracy really matters to me. I was wondering if there were good 1M or 512K or even 256K context models that I might not be aware of.
I know Qwen3 4B Instruct has 256K native context, but I'm afraid it might not be accurate enough and might hallucinate quite a bit due to its size.
imoshudu@reddit
Accuracy is hardly defined for a summary, and summarization is basically among the easiest things for an LLM. Context size matters only a little since RAG has become standard and you can guide any LLM to use RAG. Hallucination only happens when the LLM has nothing to work with. Just use qwen3 8b with /nothink, or use the "so cheap it's basically free" gemini flash 2.0 on openrouter for incredible context size and speed.
GreenTreeAndBlueSky@reddit (OP)
It has happened to me that the 4B says things that never happened in the text. And because I need an overall picture, RAG is not gonna cut it. That's why I'm asking.
o0genesis0o@reddit
There is a technique: essentially, you split the text into chunks and get the LLM to process these chunks incrementally. Each iteration produces a partial summary (up to that point), which becomes part of the input for the next iteration. With some careful playing with the system prompt, you can get a quite rock-solid summary (albeit slowly) with a local LLM, and because this approach does not push the LLM into degradation, a local model works just fine (rough sketch below).
The only challenge is whether the text to be summarized can be parsed easily. PDF, for example, is a massive PITA.
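A minimal sketch of that loop, assuming a local OpenAI-compatible endpoint (llama.cpp server, Ollama, etc.); the URL, model name, chunk size and prompt wording are all placeholders to adapt:

# Rolling/incremental summarization: each chunk is summarized together with
# the summary-so-far, and the result is carried into the next iteration.
# Endpoint, model name and chunk size below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

def rolling_summary(text, chunk_chars=8000, model="qwen3-8b"):
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    summary = ""
    for chunk in chunks:
        prompt = (
            "You are building a running summary of a long document.\n"
            f"Summary so far:\n{summary or '(nothing yet)'}\n\n"
            f"Next part of the document:\n{chunk}\n\n"
            "Rewrite the summary so it covers everything seen so far. "
            "Only state things that actually appear in the text."
        )
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,
        )
        summary = resp.choices[0].message.content  # feeds the next iteration
    return summary

Splitting on raw character counts is crude; splitting on paragraph or speaker-turn boundaries keeps the chunks more coherent.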
imoshudu@reddit
You are not setting up RAG properly. Look into LangChain, for instance. 4B is a bit risky but 8B is completely fine. Gemini Flash 2.0 is the best option.
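For reference, a rough sketch using the classic LangChain map_reduce summarize chain pointed at a local OpenAI-compatible server; the endpoint, model name and chunk sizes are placeholders, and newer LangChain releases have moved some of these imports around:

# Map-reduce summarization over chunks with the classic LangChain chain.
# Endpoint/model/chunk sizes are placeholders; adjust to your setup.
from langchain_openai import ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain.chains.summarize import load_summarize_chain

llm = ChatOpenAI(base_url="http://localhost:8080/v1", api_key="not-needed", model="qwen3-8b")
splitter = RecursiveCharacterTextSplitter(chunk_size=8000, chunk_overlap=500)
docs = [Document(page_content=c) for c in splitter.split_text(open("transcript.txt").read())]
chain = load_summarize_chain(llm, chain_type="map_reduce")  # summarizes chunks, then merges
print(chain.run(docs))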
GreenTreeAndBlueSky@reddit (OP)
Thanks. Although I'd rather use local options; I don't trust cloud privacy tbh, especially since we don't have homomorphic encryption yet.
ForsookComparison@reddit
I've done a lot of these tests and the winner in that size range is almost always Llama 3.1 8B for sub-128k and Nemotron-Ultralong-8B for anything higher.
They're older now, but nothing recent has come out in that size that handles massive context so well.
ttkciar@reddit
Thanks for pointing out Nemotron-Ultralong-8B! My usual go-to for long summary is Gemma3-12B or 27B, but their competence drops off sharply after 90K of context. When I get home next week I'll compare them to Nemotron-Ultralong-8B. Having a better long-context summarizer will be great!
Trilogix@reddit
https://hugston.com/uploads/llm_models/irix-12b-model_stock-q6_k.gguf
1 million ctx trained and linear :)
CtrlAltDelve@reddit
It is better etiquette to link directly to a Git repo or a HF repo when sharing a link to a model, just so people can understand what they're downloading before they click :)
https://huggingface.co/DreadPoor/Irix-12B-Model_Stock
Trilogix@reddit
Yes it is, thanks for the main source. It's just that the models are getting into the thousands and it's quite difficult to keep track of the main source for each. As soon as time allows we will include the source in the description.
QFGTrialByFire@reddit
I know it's more than 14B, but the model does give better results for these tasks: oss20B at mxfp4 fits in 11.8GB. Its max context length is 128K though. To be honest, pushing beyond 128K starts to hit diminishing returns: even if a model has that context, attention gets sparse, so even if larger models can go to larger contexts they start to lose accuracy/clarity. At that point you want to use a RAG-like system, or do overlapping sliding-window summarisation and then ask the model to blend the summaries together (rough sketch below).
(Caveat: if you ask it to generate copyrighted stuff, oss20B will spit the dummy. It can summarise copyrighted material, but not generate new content.)
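A minimal sketch of the overlapping-window part; the window and overlap sizes are arbitrary placeholders:

def sliding_windows(text, window_chars=12000, overlap_chars=1500):
    # Each window shares its head with the tail of the previous one,
    # so sentences near a boundary appear intact in at least one window.
    step = window_chars - overlap_chars
    return [text[i:i + window_chars] for i in range(0, len(text), step)]

# Summarize each window separately, then make one final call asking the
# model to blend the per-window summaries into a single coherent summary.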
GreenTreeAndBlueSky@reddit (OP)
Thanks, it won't be copyrighted, it's more meeting transcripts.
QFGTrialByFire@reddit
Ah, that's probably not a problem. Also, how come you need a context window larger than 128K? That's probably like 90k words, or something like 10 hours of talking? I don't imagine meetings go that long :)
MaverickPT@reddit
I have my meeting transcripts in a .json file. I found that it helps with speaker diarization and all, but all the extra .json structure eats into the context budget. I'm happy to hear a better way of doing things, though.
QFGTrialByFire@reddit
Ah yes, all it really needs is structure; it doesn't have to be the full JSON format. You can run a simple script to first strip the JSON down to a simple format, then feed that to the LLM, e.g. something like:
[00:12:31] Alice: We should review the budget.
[00:12:45] Bob: Yes, I’ll send the spreadsheet.
All the extra curly braces, commas, quotes, etc. eat into the context budget without giving the LLM much more structure/context.
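A quick sketch of that stripping script, assuming each JSON entry has "start", "speaker" and "text" keys (adjust to whatever your transcription tool actually emits):

# Flatten a diarized transcript JSON into "[HH:MM:SS] Speaker: text" lines.
# The key names below are assumptions about the transcript format.
import json

def flatten_transcript(path):
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    lines = []
    for e in entries:
        minutes, seconds = divmod(int(e["start"]), 60)
        hours, minutes = divmod(minutes, 60)
        lines.append(f"[{hours:02d}:{minutes:02d}:{seconds:02d}] {e['speaker']}: {e['text']}")
    return "\n".join(lines)

print(flatten_transcript("meeting.json"))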
GreenTreeAndBlueSky@reddit (OP)
That's reassuring, I was a bit scared that 2 hours might not fit.