Problem with RTX 3090 and MoE models?

Posted by GodComplecs@reddit | LocalLLaMA | 17 comments

I think I'm having speed issues with the RTX 3090 and big MoE models like Qwen 3 Coder and Step 3.5 Flash. I get around 21 tk/s on Qwen3 Next and 9 tk/s on Step, all offloaded to plenty of 2400 MT/s DDR4 RAM, on a Ryzen 5800X3D. I've tried all kinds of settings, even -ot with regex. Some settings load the model into virtual VRAM and some load it into RAM; it doesn't matter. Same with no-mmap or going to NVMe. I tried the REAP model of Qwen, still slow. Some posts talk about 30-40 tk/s with Qwen 3 Next on similar hardware, which seems like a big gap. I'm on the latest llama.cpp; both models were tested on the Windows CUDA precompiled build and on llama.cpp in WSL Ubuntu. Vulkan did nothing, but that was through LM Studio, which weirdly is VERY slow, like 8 tk/s for Qwen 3 Next. Any tips?

Lorelabbestia@reddit

Are you running with a custom MoE kernel and compiling the graphs? But if it is offloading to the CPU, that is quite bad; you could try quantizing the KV cache to free up some memory. Also, try running on SGLang for great performance and tuning.

GodComplecs@reddit (OP)

OK, I'm trying SGLang; any speedup is a win in my book. Quantizing the KV cache actually slowed down generation but not PP (prompt processing), I think; overall slower though. No custom kernel or graphs, I'll look into it.

Lorelabbestia@reddit

u/GodComplecs, how did it go?

Greenonetrailmix@reddit

Getting 88 tok/s on my PC with a 5090 (x8 Gen 5), a 4090 (x8 Gen 4), and a 14900K, using Q4_K_M. I'm using LM Studio on Vulkan, as using CUDA halved the token output speed! I've been thinking about adding a 3090 on Gen 4 x4 so I can get to Q5 size, but I'm worried about output speed.

Yes-Scale-9723@reddit

offloading even one layer will almost defeat the purpose of having a GPU

fizzy1242@reddit

For dense models, yeah. A handful are fine for MoE, though.

DataGOGO@reddit

How many RAM channels? Memory bandwidth matters a lot when you are doing any offload. For example: 2 channels of DDR4 at 2400 MT/s is 38 GB/s, 2 channels of DDR5 at 8400 MT/s is 134 GB/s, and 8 channels of DDR5 at 6400 MT/s is 409 GB/s.
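
The figures above follow from channels × transfer rate × bus width. A minimal sketch of that arithmetic, assuming each memory channel is 64 bits (8 bytes) wide and using decimal GB; the function name `peak_bandwidth_gb_s` is made up for illustration, and real-world throughput will be lower than this theoretical peak:

```python
def peak_bandwidth_gb_s(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (decimal).

    Assumes each channel is 64 bits (8 bytes) wide, which is an
    assumption of this sketch, not something stated in the thread.
    """
    # MT/s * bytes per transfer = MB/s per channel; divide by 1000 for GB/s.
    return channels * mt_per_s * bus_bytes / 1000


# The three configurations quoted in the comment above.
configs = [
    ("2 channels DDR4-2400", 2, 2400),
    ("2 channels DDR5-8400", 2, 8400),
    ("8 channels DDR5-6400", 8, 6400),
]

for label, channels, rate in configs:
    print(f"{label}: {peak_bandwidth_gb_s(channels, rate):.1f} GB/s")
```

Plugging in your own channel count and RAM speed gives a rough ceiling to compare against the numbers quoted in the thread.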

GodComplecs@reddit (OP)

Well, there's the issue: DDR4 is not gonna cut it. I run 4 channels (I think).

DataGOGO@reddit

Only Xeon W/Threadripper or server CPUs have 4+ memory channels. You need to run smaller models that fit in your VRAM at 4+ bit quants. I have an old 2950X with 4 channels of DDR4-3200 and two 3090s that does great; you just have to keep everything in VRAM, no offloading.

Blindax@reddit

If it’s a 5800X3D, 4 sticks is still 2 channels. You did not mention the quant you are using, but your speeds seem pretty fine to me for 24 GB of VRAM considering the model sizes (in particular Step 3.5, which is huge).

GodComplecs@reddit (OP)

Don't think DDR4 is gonna cut it anymore then, if the speedups are that big, even in theory.

cm8t@reddit

Hit up the NVIDIA Control Panel (not the app) and make sure the CUDA Sysmem Fallback Policy is set to “Prefer No Sysmem Fallback”. Then when you load the model in LM Studio, make sure all layers are loaded onto the GPU.

GodComplecs@reddit (OP)

Didn't work! Still extremely slow; tried the beta runtimes also. Gonna stick with llama.cpp.

Hot_Turnip_3309@reddit

There are major bugs in llama.cpp with Qwen3-Next, both corrupting the output and reducing the speed. For example, no matter how many experts you offload, you'll get the same speed (or slower).

Klutzy-Snow8016@reddit

I think `--fit` will get you close to the fastest available performance. But the issue is probably the 2400 MT/s RAM. DDR4-3200 is 33% faster, DDR4-3600 is 50% faster, and memory bandwidth is the bottleneck for this workload.
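
To make the bandwidth argument concrete, here is a rough sketch. It assumes generation speed is capped by how many bytes of weights must be read from system RAM per token, so the ceiling is bandwidth divided by bytes per token. The 1.5 GB per token figure is a placeholder for illustration only, not a measurement of any model in this thread, and both function names are made up:

```python
def relative_speedup(new_mt_s: int, old_mt_s: int) -> float:
    """Fractional bandwidth gain from a faster RAM speed (same channel count)."""
    return new_mt_s / old_mt_s - 1.0


def max_tokens_per_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on tokens/s if reading offloaded weights from RAM is the only cost."""
    return bandwidth_gb_s / gb_read_per_token


# Relative gains over DDR4-2400, as quoted in the comment above.
for speed in (3200, 3600):
    print(f"DDR4-{speed} vs DDR4-2400: +{relative_speedup(speed, 2400):.0%}")

# Hypothetical placeholder: 1.5 GB of offloaded expert weights read per token.
GB_PER_TOKEN = 1.5
for speed in (2400, 3200, 3600):
    dual_channel_gb_s = 2 * speed * 8 / 1000  # 2 channels, 8 bytes per transfer
    ceiling = max_tokens_per_s(dual_channel_gb_s, GB_PER_TOKEN)
    print(f"DDR4-{speed}: ceiling of about {ceiling:.1f} tk/s")
```

The ceiling scales linearly with RAM speed under this model, which is why faster RAM or more channels helps offloaded MoE inference more than tuning flags does. Actual speeds depend on the quant, how many layers stay in VRAM, and compute overhead.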

GodComplecs@reddit (OP)

OK thanks, that explains a lot! Yeah, I used fit. I was able to eke out 1 tk/s more with the -ot option, though!

Ryanmonroe82@reddit

Any offloading to system ram kills it