Optimizing Token Generation in llama.cpp's CUDA Backend
Posted by am17an@reddit | LocalLLaMA | 22 comments
Link to the post: https://github.com/ggml-org/llama.cpp/discussions/17621
We've been working over the last few months on kernel fusion in llama.cpp. I wrote a small write-up; it's semi-technical, but one of the things I wanted to raise awareness about is that if you're on a single GPU, you can set GGML_CUDA_GRAPH_OPT=1 to run things slightly faster :)
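For anyone who wants to try it, it's just an environment variable set before launching. A minimal example (the model path and flags here are placeholders, not from the write-up):

```shell
# Enable the CUDA graph optimization for single-GPU token generation
GGML_CUDA_GRAPH_OPT=1 ./llama-server -m model.gguf -ngl 99 -fa on

# Or measure the difference directly with llama-bench
GGML_CUDA_GRAPH_OPT=1 ./llama-bench -m model.gguf -fa 1 -p 0 -n 128 -d 4096
```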
Double_Cause4609@reddit
This seems specific to MoE models. Aren't most people running MoEs generally running quite large models and doing hybrid inference (Attn + Context -> GPU, conditional MoE FFN -> CPU)?
Do these benefits still hold there? I would think anyone who can run a 30B MoE on GPU would generally be running a 32B dense, actually. It looks like some of the improvements you targeted were specific to the MoE routing which I think is actually somewhat rare to throw on raw GPU.
Not trying to diminish the results; this is great work regardless, I just think the most minmax solution for end-users is improvements to hybrid inference.
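For readers unfamiliar with the routing being discussed: in a MoE FFN, a small router picks the top-k experts per token and only those experts run, which is what makes the conditional FFN part cheap enough to offload to CPU. A toy numpy sketch of that idea (names and shapes are illustrative, not llama.cpp's actual implementation):

```python
import numpy as np

def moe_ffn(x, router_w, experts, k=2):
    """Toy top-k MoE FFN: route each token to its k best experts.

    x:        (n_tokens, d_model) activations
    router_w: (d_model, n_experts) router weights
    experts:  list of (w_in, w_out) pairs, one small FFN per expert
    """
    logits = x @ router_w                        # (n_tokens, n_experts)
    topk = np.argsort(-logits, axis=-1)[:, :k]   # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        # softmax over only the selected experts' logits
        sel = logits[t, topk[t]]
        gates = np.exp(sel - sel.max())
        gates /= gates.sum()
        for g, e in zip(gates, topk[t]):
            w_in, w_out = experts[e]
            # only the selected experts' weights are touched for this token
            out[t] += g * (np.maximum(x[t] @ w_in, 0.0) @ w_out)
    return out
```

In the hybrid setup described above, these expert weights sit in system RAM and the conditional part runs on CPU, while attention and the KV cache stay on GPU.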
Chromix_@reddit
GGML_CUDA_GRAPH_OPT is broken for me on the latest commit; it leads to slower TG on an RTX 3090.
VibeThinker is not a MoE, btw.
The error for granite is:
am17an@reddit (OP)
Please create an issue, I’ll take a look! Btw why are you running TG 4096?
Chromix_@reddit
I was testing from 128 to 16k to see how much things slow down with more KV cache usage.
Doesn't it fail for you when using that specific granite model? (just llama-bench -ngl -1 -fa on -p 0)
Maybe others with a 3090 can test it, to rule out issues on my end. I didn't test with different build configurations and driver / CUDA versions.
am17an@reddit (OP)
I think the correct way to do that is to use the depth (-d) parameter in llama-bench.
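That is, instead of generating thousands of tokens, prefill the context to a given depth and measure tg128 on top of it. Something like this (model path is a placeholder; llama-bench accepts comma-separated values to sweep a parameter):

```shell
# tg128 measured at several context depths in one run
./llama-bench -m model.gguf -fa 1 -p 0 -n 128 -d 128,4096,16384
```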
on a 3090 I get, with graph_opt:

| model | size | params | backend | ngl | fa | test | t/s |
| ----- | ---- | ------ | ------- | --- | -- | ---- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 @ d4096 | 203.40 ± 1.23 |

and without:

| model | size | params | backend | ngl | fa | test | t/s |
| ----- | ---- | ------ | ------- | --- | -- | ---- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 @ d4096 | 196.37 ± 0.85 |
Re the granite model, I will download the model and take a look!
Chromix_@reddit
Thanks for sharing these numbers, that was useful for me. It seems building against a different CUDA version locally can come with a speed penalty: it's faster with the official build, and also speeds up a bit with the opt setting, though not as fast/much as yours. In the process I noticed that my VRAM OC had been lost.
Glittering-Call8746@reddit
Can u do the same for ik_llama.cpp ? Pretty pls
a_beautiful_rhind@reddit
ik already does many fused operations. it might be wise to test effect on perplexity when using stuff like this.
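A quick way to sanity-check that is to run the perplexity tool with and without the setting and compare; the two numbers should agree to within noise. A sketch, assuming the usual wikitext setup (file and model paths are placeholders):

```shell
# PPL with the default path
./llama-perplexity -m model.gguf -f wiki.test.raw -ngl 99

# PPL with the graph optimization enabled
GGML_CUDA_GRAPH_OPT=1 ./llama-perplexity -m model.gguf -f wiki.test.raw -ngl 99
```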
DistanceSolar1449@reddit
You'd have to royally fuck up writing the kernel to noticeably degrade perplexity with a fused kernel.
a_beautiful_rhind@reddit
One would think but with so many architectures and hardware, never say never.
am17an@reddit (OP)
The problem is that the CI does not catch PPL errors yet, and llama-perplexity does not catch TG (batch_size=1) bugs. So it is possible to royally fuck up pretty easily :)
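The kind of check that would catch such bugs is just comparing a fused path against an unfused reference at batch size 1. A toy numpy sketch of the idea (RMSNorm + scale fusion is chosen as an arbitrary example here, not llama.cpp's actual test harness):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # reference path: normalization as its own op, scale applied by the caller
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def fused_rms_norm_scale(x, w, eps=1e-6):
    # "fused" path: norm and elementwise scale computed in one pass
    inv = 1.0 / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x * inv * w

def check_fusion(n_embd=64, atol=1e-5):
    rng = np.random.default_rng(42)
    x = rng.standard_normal((1, n_embd))   # batch_size=1, like TG
    w = rng.standard_normal(n_embd)
    ref = rms_norm(x) * w                  # two separate ops
    fused = fused_rms_norm_scale(x, w)     # one fused op
    return np.max(np.abs(ref - fused)) < atol
```

In a real kernel test the fused side would be the CUDA kernel and the reference side the unfused ops, compared at batch size 1 specifically, since that is the path llama-perplexity never exercises.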
rerri@reddit
Seems like full cpu-moe (all layers) works, but partial cpu-moe doesn't.
Works: llama-bench -m gpt-oss-20b-Q8_0.gguf -fa 1 -p 0 -ncmoe 99
Doesn't work: llama-bench -m gpt-oss-20b-Q8_0.gguf -fa 1 -p 0 -ncmoe 10
Crashes with error: ggml-cuda.cu:90: CUDA error
build: fa0465954 (7205), 4090, Win11
Noble00_@reddit
Love these optimizations! ~80% of the theoretical limit for gpt-oss-20B on a 5090 is nice work. The speedup gains look modest at longer depths; have you measured how much?
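For context on what "% of theoretical limit" means here: batch-1 token generation is memory-bandwidth bound, so the ceiling is roughly bandwidth divided by the bytes of active weights streamed per token. A back-of-the-envelope sketch (all numbers below are illustrative assumptions, not measured values):

```python
def theoretical_tg_limit(mem_bw_gb_s, active_bytes_per_token_gb):
    """Upper bound on tokens/s if every generated token must stream
    the active weights from VRAM exactly once."""
    return mem_bw_gb_s / active_bytes_per_token_gb

# Illustrative assumptions: ~1790 GB/s of VRAM bandwidth on a 5090,
# and ~2 GB of active weights touched per token for a MoE at MXFP4.
limit = theoretical_tg_limit(1790, 2.0)   # ceiling in tokens/s
achieved = 0.80 * limit                   # what "80% of theoretical" would mean
```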
am17an@reddit (OP)
A helpful engineer from NVIDIA benchmarked this
https://github.com/ggml-org/llama.cpp/pull/16991#pullrequestreview-3473149194, however that's Blackwell only.
There are some other perf numbers in the PR as well
Noble00_@reddit
Thanks for the quick reply! I'll take a look!
hazeslack@reddit
On a multi-GPU setup with -sm layer, I get a massive speed drop from the latest update. I was on b6402 before with the same launch parameters and model; after updating to the latest build I get half the tps for generation speed. So what happened?
am17an@reddit (OP)
I think I know what the problem is (it's not related to this), but I will be submitting a fix soon
-p-e-w-@reddit
Thank you for your impressive work. I use llama.cpp every day and any performance improvement is very valuable to me.
chiribe@reddit
I just tried it in the Windows cmd, adding this command before calling llama-server. My laptop GTX 1070 had never reached 45 tk/s before. Thanks!
jacek2023@reddit
Is there also some benefit for multi-GPU setup?
am17an@reddit (OP)
Not yet, but we're working on multi-GPU improvements; we'll probably have something early next year.