llama: limit max outputs of `llama_context` by am17an · Pull Request #23861 · ggml-org/llama.cpp
Posted by pmttyji@reddit | LocalLLaMA | 27 comments
Overview
Continues #23764. This PR reserves logits space only for n_seqs when possible. With -ub 2048 and MTP, this saves another 1.2GB of VRAM for me. I've also tested llama-perplexity and it seems to work fine. But maybe there is a better API, so I'm putting this up as a draft for now. In my opinion, an API in llama-context is a good solution for this: by default it will reserve space for all tokens, but in server-context specifically we can set it to 1 whenever possible.
- u/am17an
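For a rough sense of where a saving of this size can come from, here is a back-of-the-envelope sketch in Python. It assumes the logits buffer holds one 32-bit float per vocabulary entry for every output row. The vocabulary size of 150,000 is an illustrative round number, not taken from any model in this thread, and `logits_buffer_bytes` is a hypothetical helper, not a llama.cpp function.

```python
def logits_buffer_bytes(n_outputs: int, n_vocab: int, bytes_per_logit: int = 4) -> int:
    """Size of a dense logits buffer: one value per vocab entry per output row."""
    return n_outputs * n_vocab * bytes_per_logit


N_VOCAB = 150_000  # hypothetical vocabulary size, for illustration only
UBATCH = 2048      # matches the -ub 2048 setting quoted in the PR description
N_SEQS = 1         # a single sequence that needs one row of logits

full = logits_buffer_bytes(UBATCH, N_VOCAB)     # a row reserved for every token
reduced = logits_buffer_bytes(N_SEQS, N_VOCAB)  # a row reserved per sequence only
saved = full - reduced

print(f"full:    {full / 1e9:.3f} GB")
print(f"reduced: {reduced / 1e9:.6f} GB")
print(f"saved:   {saved / 1e9:.3f} GB")
```

The buffer grows linearly with the number of output rows, so under these assumptions reserving rows for n_seqs instead of a full ubatch shrinks it by roughly the ratio between the two.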
nickm_27@reddit
7900XTX on Vulkan running Gemma4 26B-A4B: this does save 1.2GB for me, which lets me run a higher quant and more context. Awesome to see.
Sensitive_Pop4803@reddit
Vulkan is literally saving AMD’s ass. I am now comfortable abandoning ROCm after months and months of earnest effort to streamline the process. Too much BS.
Just download the Vulkan executable, and done!
legit_split_@reddit
What about tensor parallelism, does Vulkan work?
Mountain_Patience231@reddit
no
pmttyji@reddit (OP)
llama.cpp contributors have done more work on the Vulkan backend, particularly in the last 7-8 months. I just searched the PRs there (Vulkan 1000+ vs ROCm 350). AMD should work more on optimizations.
Now I want ik_llama to get stronger on Vulkan too.
nickm_27@reddit
Yeah, for me it is faster in every way, with smaller and easier-to-download binaries, etc.
pmttyji@reddit (OP)
That was quick feedback. Nice. Try Qwen3.5-35B/27B with MTP too.
nickm_27@reddit
Qwen3.6 27B (Q5_K_S) with MTP enabled went from 22 GiB to 20.4 GiB
nickm_27@reddit
Haha, I have been running it on my own build for a few days now
Pentium95@reddit
Can we merge this and your previous VRAM-saving PR into your Gemma 4 MTP PR branch (i.e. merge master into "am17an:gemma4-mtp")? I use that branch and I'm a bit too lazy and underskilled to merge every time your nice optimizations get into master.
Thanks man
am17an@reddit
Ask and you shall receive! (in this case)
ionizing@reddit
Not OP but just wanted to thank you for all your contributions.
am17an@reddit
Cheers!
No_Lingonberry1201@reddit
I second the previous commenter. You da' man!
VoiceApprehensive893@reddit
Time to download some RAM
ionizing@reddit
On localweights/Qwen3.6-27B-MTP-IMAT-IQ4_XS-Q8nextn.gguf I went from ~111K ctx, q8/q8 KV and 1280 batch/ubatch BEFORE the change, to...
AFTER: ~131072 ctx, q8 on V only, K now at full size, b/ub increased to 1536, and I STILL have ~1GB of space after hitting the server with 25K context from a real workload... so there's still room to tweak. Very nice.
Headless 3090 x1. Now moving on to test Q5/Q6, which is what I really want to see.
Excellent work!!!
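The KV cache trade-off described in this comment can be estimated with a small Python sketch. The bytes-per-block figures follow ggml's block layouts for these cache types (for example, q8_0 stores 32 values in 34 bytes). The layer count, KV-head count and head size are hypothetical placeholders, not the real dimensions of Qwen3.6-27B, and the formula assumes every layer keeps a full K and V cache and that "full size" means f16.

```python
# (bytes per block, values per block) for a few ggml cache types
GGML_BLOCK = {
    "f16":  (2, 1),    # plain 16-bit float
    "q8_0": (34, 32),  # 32 int8 values + one fp16 scale
    "q5_1": (24, 32),  # 32 5-bit values + fp16 scale + fp16 min
    "q4_1": (20, 32),  # 32 4-bit values + fp16 scale + fp16 min
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, type_k, type_v):
    """Estimate K+V cache size for a dense-attention model (an assumption)."""
    values_per_tensor = n_ctx * n_layers * n_kv_heads * head_dim

    def tensor_bytes(cache_type):
        block_bytes, block_values = GGML_BLOCK[cache_type]
        return values_per_tensor * block_bytes // block_values

    return tensor_bytes(type_k) + tensor_bytes(type_v)


# Hypothetical model dimensions, chosen only to make the arithmetic concrete.
DIMS = dict(n_ctx=131072, n_layers=32, n_kv_heads=8, head_dim=128)

for type_k, type_v in [("f16", "f16"), ("f16", "q8_0"), ("q8_0", "q8_0"), ("q5_1", "q4_1")]:
    size = kv_cache_bytes(type_k=type_k, type_v=type_v, **DIMS)
    print(f"K={type_k:5s} V={type_v:5s} -> {size / 2**30:.2f} GiB")
```

With real model dimensions plugged in, a comparison like this shows roughly how much VRAM moving K from q8_0 back to f16 costs, which is the kind of headroom the logits change frees up.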
GotHereLateNameTaken@reddit
Since this change I'm able to run the following on a single 3090, getting between 60 and 70 tok/s on coding tasks:
Qwen3.6-27B-Q5_K_S | llama.cpp | 8080 | 8 threads | all GPU | 100k ctx | MTP spec (2 draft, ubatch 512) | KV q8_0 | FA on | fit off | preserve_thinking
soyalemujica@reddit
7900XTX: this allowed me to run Qwen 3.6 27B Q6_K MTP with ~80k context at 55t/s
donomo@reddit
wait that's huge
soyalemujica@reddit
It really is, can't complain. I got 100k without MTP at 30t/s, using q5_1/q4_1 KV cache for ~93.70% precision
xeeff@reddit
KV cache quantization hits hard. Use it in a Hermes harness and see
soyalemujica@reddit
I am using C++ for agentic coding and I have yet to experience even a single hallucination or numeric disparity
xeeff@reddit
do you quant the MTP draft's KV as well?
soyalemujica@reddit
Yeah at q4
xeeff@reddit
not the same q5_1/q4_1 split?
soyalemujica@reddit
Not really, haven’t needed to at all
SurpriseOk6927@reddit
1.2GB saved just by not being dumb about logits allocation. Everyone chases bigger models, while this kind of low-level optimization makes local inference actually usable on consumer GPUs. More of this, please