Is qwen3vl 235B supposed to be this slow?

Posted by shapic@reddit | LocalLLaMA | View on Reddit | 17 comments

Heya, I managed to get access to a server with a 40 GB A100 and 96 GB of RAM. I tried loading Qwen3-VL-235B-A22B-Thinking-GGUF (UD-IQ3_XXS) with llama.cpp.

Configuration is:

```
--ctx-size 40000 --n-cpu-moe 64 --prio 2 --temp 1.0 --repeat-penalty 1.0 --min-p 0.0 --top-k 20 --top-p 0.95 --presence_penalty 0.0 --image-min-tokens 1024 --jinja --flash-attn on -ctk q8_0 -ctv q8_0
```
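For what it's worth, here's a hedged sketch of an alternative layout using llama.cpp's tensor-override flag (`-ot`) instead of `--n-cpu-moe`. The idea is the same (keep the MoE expert tensors in CPU RAM, everything else on the GPU), but the regex gives finer control. The model filename is a placeholder, and the exact pattern may need adjusting for this model's tensor names:

```shell
# Sketch only: pin MoE expert tensors (ffn_*_exps) to CPU RAM and
# push all layers to the GPU otherwise. Filename is a placeholder.
llama-server \
  -m Qwen3-VL-235B-A22B-Thinking-UD-IQ3_XXS.gguf \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" \
  --ctx-size 40000 --jinja --flash-attn on -ctk q8_0 -ctv q8_0
```

If VRAM is left over after loading, moving a few whole expert blocks back to the GPU (a lower `--n-cpu-moe`, or a narrower `-ot` regex) is the usual way to claw back speed.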

It takes up most of the VRAM, but output speed is 6.2 t/s. I've never tried a MoE model before, but from what I read I expected at least 15. I couldn't find any comprehensive data on running this specific model outside of a huge cluster (aside from one guy running it at 2 t/s), so my question is: were my expectations too high?
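A rough back-of-envelope check suggests 6 t/s may actually be in the expected range. Decode on a MoE model with experts offloaded is memory-bound: each token has to stream the active expert weights out of CPU RAM. All the numbers below are assumptions (the bits-per-weight, the fraction of active weights living in RAM with `--n-cpu-moe 64`, and the effective RAM bandwidth all vary by setup), so treat this as a sanity check, not a measurement:

```python
# Back-of-envelope decode speed for a MoE model with experts in CPU RAM.
# All constants are assumptions, not measurements.

ACTIVE_PARAMS = 22e9      # Qwen3-235B-A22B activates ~22B params per token
BITS_PER_WEIGHT = 3.06    # assumed average for an IQ3_XXS-class quant
CPU_FRACTION = 0.8        # assumed share of active weights sitting in RAM
RAM_BANDWIDTH = 5.0e10    # assumed ~50 GB/s effective DDR bandwidth

def estimated_tps():
    active_bytes = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8
    cpu_bytes = active_bytes * CPU_FRACTION
    # tokens/s ~= bandwidth / bytes streamed from RAM per token
    return RAM_BANDWIDTH / cpu_bytes

print(round(estimated_tps(), 1))
```

Under these assumptions the estimate lands around 7 t/s, i.e. roughly what's observed; getting to 15 would need either much faster RAM or a larger share of the experts on the GPU.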

Or am I missing something?