llama: limit max outputs of `llama_context` by am17an · Pull Request #23861 · ggml-org/llama.cpp

Posted by pmttyji@reddit | LocalLLaMA | 27 comments

Overview

Continues #23764. This PR only reserves logits space for n_seqs when possible. With -ub 2048 and MTP, this saves another 1.2GB of VRAM for me. I've also tested llama-perplexity and it seems to work fine. But maybe there is a better API, so I'm putting this up as a draft for now. In my view, an API in llama-context is a good solution for this: by default it will reserve all tokens, but in server-context specifically we can set it to 1 whenever possible.

- u/am17an
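To get a feel for why the number of reserved outputs matters, here is a rough back-of-the-envelope sketch. It assumes the logits buffer is a dense array of 32-bit floats with one row of vocabulary-sized logits per output token. The vocabulary size and the single-output case below are illustrative assumptions, not figures from the PR, and `logits_buffer_bytes` is a hypothetical helper, not a llama.cpp function.

```python
def logits_buffer_bytes(n_outputs: int, n_vocab: int, bytes_per_logit: int = 4) -> int:
    """Bytes needed for a dense logits buffer: one row of n_vocab values per output token."""
    return n_outputs * n_vocab * bytes_per_logit


N_VOCAB = 150_000  # assumption: illustrative vocabulary size, not from the thread
N_UBATCH = 2048    # matches the "-ub 2048" setting mentioned in the PR description
N_SEQS = 1         # assumption: a single sequence, so one output row is enough

full = logits_buffer_bytes(N_UBATCH, N_VOCAB)   # reserve a row for every token in the ubatch
limited = logits_buffer_bytes(N_SEQS, N_VOCAB)  # reserve a row per sequence only

print(f"reserved for all ubatch tokens: {full / 1e9:.2f} GB")
print(f"reserved for n_seqs outputs:    {limited / 1e6:.2f} MB")
print(f"difference:                     {(full - limited) / 1e9:.2f} GB")
```

The real saving depends on the model's actual vocabulary size, the backend, and what else lives in the compute buffer, so treat this as an order-of-magnitude estimate only.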


nickm_27@reddit

On a 7900 XTX with Vulkan running Gemma4 26B-A4B, this does save 1.2GB for me, which lets me run a higher quant and more context. Awesome to see.


Sensitive_Pop4803@reddit

Vulkan is literally saving AMD’s ass. I am now comfortable abandoning ROCm after months and months of earnest effort to streamline the process. Too much BS.

Just download the Vulkan executable, and done!


legit_split_@reddit

What about tensor parallelism, does Vulkan work?


Mountain_Patience231@reddit


pmttyji@reddit (OP)

llama.cpp has had more contributions to the Vulkan backend, particularly over the last 7-8 months. I just searched the PRs there (Vulkan 1000+ vs ROCm 350). AMD should work more on optimizations.

Now I want ik_llama to get stronger on Vulkan too.


nickm_27@reddit

Yeah, for me it is faster in every way, with smaller and easier-to-download binaries, etc.


pmttyji@reddit (OP)

That was quick feedback. Nice. Try Qwen3.5-35B/27B with MTP too.


nickm_27@reddit

Qwen3.6 27B (Q5_K_S) with MTP enabled went from 22 GiB to 20.4 GiB.


nickm_27@reddit

Haha, I have been running it on my own build for a few days now


Pentium95@reddit

Can we merge this and your previous VRAM-saving PR into your Gemma 4 MTP PR branch (i.e., merge master into "am17an:gemma4-mtp")? I use that branch, and I'm a bit too lazy and underskilled to merge every time one of your nice optimizations gets into master.

Thanks man


am17an@reddit

Ask and you shall receive! (in this case)


ionizing@reddit

Not OP but just wanted to thank you for all your contributions.


am17an@reddit

Cheers!


No_Lingonberry1201@reddit

I second the previous commenter. You da' man!


VoiceApprehensive893@reddit

Time to download some RAM


ionizing@reddit

On localweights/Qwen3.6-27B-MTP-IMAT-IQ4_XS-Q8nextn.gguf:

BEFORE: ~111K ctx, q8/q8 KV, and 1280 batch/ubatch.

AFTER: ~131072 ctx, q8 on V only, K now at full size, and b/ub increased to 1536. I STILL have ~1GB of space after hitting the server with 25K context from a real workload, so there is still room to tweak. Very nice.

Headless 3090 x1. Now moving on to test Q5/Q6, which is what I really want to see.

Excellent work!!!


GotHereLateNameTaken@reddit

Since this change I'm able to run the following on a single 3090, getting between 60 and 70 tok/s on coding tasks:
