LLM amateur with a multi-GPU question. How to optimize for speed?
Posted by William-Riker@reddit | LocalLLaMA | View on Reddit | 6 comments
I want to run DeepSeek-V3-0324. Specifically the 2.71-bit 232GB Q2_K_XL version by unsloth. My hardware is the following:
Intel 10980XE, 18C/36T, all-core OC at 4.8GHz
256GB DDR4 3600MHz
2x 3090 (48GB VRAM)
2TB Samsung 990 Pro.
llama.cpp running DeepSeek-V3-0324-UD-Q2_K_XL GGUF.
Between RAM and VRAM, I have ~304GB of memory to load the model into. It works, but the most I can get is around 3 t/s.
I have played around with a lot of the settings through trial and error, but I thought I'd ask how to optimize for speed. How many layers should I offload to the GPU? How many threads should I use? Row split? BLAS batch size?
How to optimize for more speed?
FYI: I know it will never be super fast, but if I could increase it slightly to a natural reading speed, that would be nice.
Tips? Thanks.
segmond@reddit
Most of the parameters are pretty marginal. I have played around with them numerous times trying to run big models and they really make no difference. The best you can do is offload as many layers as you can to the GPU: use nvidia-smi or nvtop to see how much VRAM you are using and increase the layer count until you start running out. Another option is to move the k/v cache to system RAM and load more layers onto the GPU. Outside of that, you gotta add more GPUs. :-D
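For what it's worth, the general shape of that in llama.cpp looks something like this. The flag names are mainline llama.cpp's; the layer count, thread count, and model path are placeholders to tune for your box, not recommendations:

```
# Placeholder values; raise -ngl while watching VRAM until it's nearly full,
# then back off a layer or two.
#   -ngl            : layers offloaded across the two 3090s
#   -t              : CPU threads; one per physical core (18 here) is a common start
#   --no-kv-offload : keep the KV cache in system RAM so VRAM goes to weights instead
./llama-cli \
  -m /path/to/DeepSeek-V3-0324-UD-Q2_K_XL-first-shard.gguf \
  -ngl 30 -t 18 -c 8192 --no-kv-offload

# In a second terminal, watch VRAM usage while you tune -ngl:
watch -n 1 nvidia-smi
```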
MatterMean5176@reddit
Rumor is the fork ik_llama.cpp allows flash attention with DeepSeek models. This could give you a boost if it works, I would think. I had no success, but my GPUs are way too old (no Tensor Cores).
https://github.com/ikawrakow/ik_llama.cpp
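If you want to try it, the rough shape would be the below. I'm assuming the fork builds and runs like mainline llama.cpp (same cmake options, same -fa flag), so check its README for fork-specific options:

```
# Build the fork with CUDA support (cmake options assumed to match mainline llama.cpp)
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Run with flash attention enabled (-fa); other flags as in your current setup
./build/bin/llama-cli \
  -m /path/to/DeepSeek-V3-0324-UD-Q2_K_XL-first-shard.gguf \
  -ngl 30 -t 18 -fa
```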
Also, 3090s should be fine with ktransformers (not that I own any). The performance increase ktransformers claims over llama.cpp is eye-popping; it's in the link:
https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/DeepseekR1_V3_tutorial.md
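From memory of that tutorial, a run looks roughly like the below; treat the flag names as assumptions and check them against the linked doc before relying on them:

```
# Rough shape of a ktransformers run (flag names recalled from the linked tutorial;
# verify them there):
#   --model_path : HF repo used for the config/tokenizer
#   --gguf_path  : directory containing the GGUF shards
#   --cpu_infer  : number of CPU threads given to the expert layers
python ktransformers/local_chat.py \
  --model_path deepseek-ai/DeepSeek-V3-0324 \
  --gguf_path /path/to/DeepSeek-V3-0324-UD-Q2_K_XL/ \
  --cpu_infer 33
```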
tomz17@reddit
> FYI: I know it will never be super fast, but if I could increase it slightly to a natural reading speed, that would be nice.
Since a model of that size is primarily running on the CPU, you are never going to get much faster without switching platforms. The memory bandwidth (quad-channel DDR4-3600) will always be the limit.
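Back of the envelope, assuming quad-channel DDR4-3600 (~115 GB/s theoretical) and ~37B active parameters per token at ~2.71 bpw (~12.5 GB of weights read per token, ignoring whatever is served from VRAM): 115 / 12.5 ≈ 9 t/s as a hard ceiling, and real-world numbers land well below that.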
Phocks7@reddit
Not sure if you've seen this thread, but you might be able to squeeze some more performance out of the imatrix quants.
fizzy1242@reddit
3 t/s is pretty impressive for that model, though. How is the response quality with such a low quant?
solidsnakeblue@reddit
You might try ktransformers, though you may need 4090s.