Problem with RTX 3090 and MoE models?
Posted by GodComplecs@reddit | LocalLLaMA | 17 comments
I think I'm having speed issues with the RTX 3090 and big MoE models like Qwen 3 Coder and Step 3.5 Flash. I get around 21 tk/s on Qwen 3 Next and 9 tk/s on Step, with everything offloaded to plenty of 2400 MHz DDR4 RAM on a Ryzen 5800X3D. I've tried all kinds of settings, even -ot with a regex. Some settings load the model into virtual VRAM and some load it into RAM; it doesn't matter. Same with no-mmap or loading from NVMe. I also tried a REAP model of Qwen, and it's still slow.
Some posts talk about 30-40 tk/s with Qwen 3 Next on similar hardware, which seems like a big gap.
I'm on the latest llama.cpp, and I tested both the precompiled Windows CUDA build and llama.cpp under WSL Ubuntu.
Vulkan didn't help, but I only tried it through LM Studio, which weirdly is VERY slow, like 8 tk/s for Qwen 3 Next.
Any tips?
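For a rough sanity check on the numbers above, here is a back-of-the-envelope sketch of the decode ceiling set by system RAM bandwidth. It is an estimate, not a measurement, and everything in it is an assumption: that the "2400" figure means DDR4-2400 (2400 MT/s) in dual channel, that Qwen 3 Next activates about 3B parameters per token (the "A3B" in the 80B-A3B release name), that the quant averages about 4.5 bits per weight, and that all active weights stream from system RAM for each token. The function names are made up for illustration.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, channels, bus_bytes=8):
    """Theoretical peak RAM bandwidth in GB/s.

    Each DDR4 channel is 64 bits (8 bytes) wide, so peak bandwidth is
    transfers/s * bytes per transfer * number of channels.
    """
    return mega_transfers_per_s * bus_bytes * channels / 1000


def decode_ceiling_tks(bandwidth_gbs, active_params_billion, bits_per_weight):
    """Upper bound on tokens/s if every active weight is read from RAM once per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gbs / gb_read_per_token


if __name__ == "__main__":
    # Assumed values, not confirmed by the post: dual-channel DDR4-2400,
    # ~3B active parameters, ~4.5 bits per weight on average.
    bandwidth = peak_bandwidth_gbs(2400, channels=2)
    ceiling = decode_ceiling_tks(bandwidth, active_params_billion=3.0, bits_per_weight=4.5)
    print(f"peak bandwidth: {bandwidth:.1f} GB/s")
    print(f"decode ceiling: {ceiling:.1f} tk/s")
```

Under those assumptions the ceiling comes out at roughly 23 tk/s, and real-world RAM throughput sits below the theoretical peak. If the assumptions hold, that would point to memory bandwidth rather than a llama.cpp setting, though any layers kept in VRAM shift the estimate upward.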