LLM amateur with a multi-GPU question. How to optimize for speed?

Posted by William-Riker@reddit | LocalLLaMA | View on Reddit | 6 comments

I want to run DeepSeek-V3-0324, specifically Unsloth's 2.71-bit (232GB) UD-Q2_K_XL quant. My hardware is the following:

Intel 10980XE, 18C/36T, all-core OC'd to 4.8GHz

256GB DDR4 3600MHz

2x RTX 3090 (48GB VRAM total)

2TB Samsung 990 Pro.

llama.cpp running the DeepSeek-V3-0324-UD-Q2_K_XL GGUF.

Between RAM and VRAM, I have ~304GB of memory to load the model into. It works, but the most I can get is around 3 t/s.

I've played around with a lot of the settings through trial and error, but I thought I'd ask how to optimize for speed. How many layers should I offload to the GPUs? How many threads should I use? Row split? BLAS batch size?
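For reference, the kind of invocation I've been tuning looks roughly like this (the model filename and every flag value here are placeholders I'm still experimenting with, not known-good settings):

```shell
# Sketch of a llama.cpp run -- values below are guesses to tune, not recommendations.
#   -ngl          layers offloaded to the two 3090s (raise until VRAM is full)
#   -t            CPU threads; physical cores (18) usually beat hyperthreads
#   --split-mode  how tensors are split across GPUs ("layer" or "row")
#   -b            logical batch size (the "BLAS size" knob)
./llama-cli -m DeepSeek-V3-0324-UD-Q2_K_XL.gguf \
    -ngl 12 -t 18 --split-mode row -b 512 \
    -p "Hello"
```

Happy to post exact numbers from my current runs if that helps.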


FYI: I know it will never be super fast, but if I could increase it slightly to a natural reading speed, that would be nice.

Tips? Thanks.