Is Qwen3-VL 235B supposed to be this slow?
Posted by shapic@reddit | LocalLLaMA | View on Reddit | 17 comments
Heya, I managed to get access to a server with a 40GB A100 and 96GB of RAM. I tried loading Qwen3-VL-235B-A22B-Thinking-GGUF UD-IQ3_XXS using llama.cpp.
Configuration is: --ctx-size 40000 --n-cpu-moe 64 --prio 2 --temp 1.0 --repeat-penalty 1.0 --min-p 0.0 --top-k 20 --top-p 0.95 --presence_penalty 0.0 --image-min-tokens 1024 --jinja --flash-attn on -ctk q8_0 -ctv q8_0
It takes most of the VRAM, but output speed is 6.2 tps. I had never tried MoE before, but from what I read I thought I would get at least 15. I did not find any comprehensive data on running this specific model outside of a huge cluster (apart from some guy running it at 2 tps), so my question is: were my expectations too high?
Or am I missing something?
fizzy1242@reddit
You did nothing wrong, it's still a very big model despite being a MoE. Have you tried GLM-4.5V? Only 12B active compared to 22B in Qwen.
shapic@reddit (OP)
It's an old GPU, but 40GB. At IQ3, 22B does not even fill half of it. And I think you are speaking about GLM 4.5 Air; I am more interested in the VLM part tbh.
No_Afternoon_4260@reddit
But you still have 200+B parameters to load into RAM, no matter the number of active parameters. With your setup and quant, idk, but 40GB of VRAM isn't that much for a 200+B model; you may have less than half of it in VRAM. Idk your CPU setup, but understand that it is probably two orders of magnitude below H200 performance.
Have you checked that everything sits in RAM/VRAM and that you aren't swapping to disk? (Probably not, that would be an order of magnitude slower, I would guess.)
shapic@reddit (OP)
Ok, that was mmap messing up the output, no swapping to disk. Guess it is just that slow. I thought the gain would be greater, with people reporting stuff like hundreds of tps when fully in VRAM. I'll probably just get the non-thinking one then.
shapic@reddit (OP)
It should be, but something seems off in the htop output. Good catch, I'll try to figure out how to check that in Ubuntu.
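A minimal sketch of how to check that on Ubuntu (assuming stock procps/coreutils; the llama.cpp flag at the end is optional):

```bash
# Watch overall memory and swap while llama.cpp generates; non-zero si/so columns mean you are swapping.
free -h
vmstat 1 5

# Optional: take mmap out of the picture by forcing the weights to be read into RAM up front.
# The run will fail fast if the model does not actually fit.
# llama-server ... --no-mmap
```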
pmttyji@reddit
The remaining thing to try for better t/s is llama-bench with different --n-cpu-moe values. (A rough calculation gave me around 55, so run llama-bench with 55-63 for the --n-cpu-moe parameter.)
Also, your command doesn't have other parameters like -b, -ub, -t. Use those, experiment, and share your final t/s here.
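Something like the sweep below is what that suggestion amounts to (a sketch only; it assumes your llama.cpp build's llama-bench accepts --n-cpu-moe — check llama-bench --help — and the model path is a placeholder):

```bash
#!/usr/bin/env bash
# Sweep --n-cpu-moe to find how many layers can keep their experts on the CPU
# before VRAM overflows or speed drops off.
MODEL=Qwen3-VL-235B-A22B-Thinking-UD-IQ3_XXS.gguf   # placeholder path
for n in $(seq 55 63); do
  echo "=== --n-cpu-moe $n ==="
  llama-bench -m "$MODEL" --n-cpu-moe "$n" -p 512 -n 128
done
```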
shapic@reddit (OP)
Not sure how playing with the number of threads can affect anything, the default already uses all of them. Also, where can I read about -b and -ub? I thought they were taken from the internal template provided by the model creators.
pmttyji@reddit
For my system config (8GB VRAM, 32GB RAM, 20 cores, 28 logical processors), -t 8 gave me the best t/s.
https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md
You have 40GB VRAM & more RAM, so play with -b & -ub.
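To make that concrete, here is roughly how those flags would slot into OP's command (illustrative starting values, not tuned for this box):

```bash
# -t: CPU threads for generation (physical cores usually beat logical ones here)
# -b / -ub: logical and physical batch sizes; raising them mainly helps prompt processing
llama-server -m Qwen3-VL-235B-A22B-Thinking-UD-IQ3_XXS.gguf \
  --ctx-size 40000 --n-cpu-moe 64 --jinja --flash-attn on -ctk q8_0 -ctv q8_0 \
  -t 16 -b 4096 -ub 1024
```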
LegacyRemaster@reddit
If there's even just one layer in your RAM, that's going to be your bottleneck. It runs at the same speed on my DDR4 PC, so I really think you need to load EVERYTHING into VRAM to go fast.
shapic@reddit (OP)
The whole reason behind MoE is that the bulk of the calculations are done on the 22B active part, which speeds the whole thing up significantly even if the experts are loaded on the CPU.
Nepherpitu@reddit
You got the idea wrong. It will not run FAST on CPU, it will run faster than a dense model. The model will not magically use only 22GB of VRAM. It will randomly select multiple different experts for each token. Multiple, different, and for each token. So the matrix multiplication is done on 22B params, but memory utilization is the same as for the full-size model.
For example, say you have 1000GB/s of memory bandwidth and a 235B-A22B model. For a dense 235B model you would get ~5 tokens per second; for the MoE model you would get ~50 tokens per second.
But your RAM has a bandwidth of around 80GB/s. So 22B active params will give you ~4 tps in the ideal case for Q8 without a GPU. A GPU makes everything faster, but also more complicated.
Anyway, the main idea here is that MoE doesn't magically make your CPU stand in the corner and watch the GPU rotate experts. It works as if the CPU is processing ~66% of the weights, while the static weights and a random subset of experts sit on the GPU. In your case that's about 8GB of weights for each token, like running a 14B Q4 model on CPU only. Is it possible? Yes. Is it acceptable? Probably. Is it fast? No.
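The back-of-envelope numbers above, spelled out (all inputs are the rough assumptions from this comment, not measurements):

```bash
# decode-speed upper bound ≈ memory bandwidth / GB of weights read per token
# (Q8 ≈ 1 byte/param, so 22B active params ≈ 22 GB read per token)
awk 'BEGIN {
  active_gb = 22
  printf "HBM GPU (~1000 GB/s):   ~%.0f t/s\n", 1000 / active_gb   # ~45, i.e. the ~50 above
  printf "desktop RAM (~80 GB/s): ~%.1f t/s\n",   80 / active_gb   # ~3.6, i.e. the ~4 above
}'
```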
uti24@reddit
What were your calculations behind that?
shapic@reddit (OP)
Purely speculative. 27B Q8 was running at around 21 tps on that GPU.
uti24@reddit
Yes, that’s the problem. A 27B model in Q8 can be fully loaded into 40GB VRAM, but with a 235B model, even in something like Q3, you can load only about 30 - 40% into the GPU/VRAM. The rest will run on the CPU and system RAM, or be swapped between RAM and VRAM.
Either way, it's much slower; your GPU ends up running at around 20% utilization and spends most of its time waiting for whatever the CPU/RAM is doing with the LLM.
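A rough sanity check on that 30-40% figure (assuming the ~98GB file size quoted later in the thread, and a few GB of the 40GB card set aside for KV cache and activations; both numbers are guesses):

```bash
awk 'BEGIN {
  file_gb = 98; vram_gb = 40; reserved_gb = 6   # reserved_gb: guessed KV cache + overhead
  printf "~%.0f%% of the model fits in VRAM\n", (vram_gb - reserved_gb) / file_gb * 100
}'
```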
No-Refrigerator-1672@reddit
27B Q4 was running on that GPU. A 235B can't fit completely inside no matter what; you're running mostly on CPU with just a few layers being accelerated.
shapic@reddit (OP)
27B is massively smaller, and the whole expert-offloading scheme is supposed to make this much more bearable, with people reporting up to 15 tps on CPU.
MaxKruse96@reddit
On top of that, 6 t/s is already good for 22B active, even if it's IQ3_XXS. It's a 98GB model (with the vision encoder); sparsity is 22B/235B, so roughly 10% (being generous here), so it reads against 98GB × 10% ≈ ~10GB for every token. That's roughly equivalent to a dense 8B model at Q8_K_XL. OP is very likely getting only slight GPU acceleration of the compute, while most of the weights are still in RAM, which slows it down.
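That estimate as plain arithmetic (file size and active/total parameter counts taken from the comment above):

```bash
awk 'BEGIN {
  file_gb = 98; active = 22; total = 235
  printf "~%.1f GB of weights touched per token\n", file_gb * active / total   # ~9.2 GB, i.e. the ~10GB quoted
}'
```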