Recommended parameters for Qwen 3.6 35B A3B on an 8GB VRAM card and 24GB RAM?

Posted by FUS3N@reddit | LocalLLaMA | View on Reddit | 21 comments

I was running Q3_K_S with 90k context and getting 21 tok/s, which drops to around 19.5 after a few messages and keeps slowly decreasing (I am using mmproj-F16 since I need vision for some tasks). Is there any way to get a bit better performance while keeping the high context size, or is the context not the issue?

My current params:

llama-server -m model --mmproj mmproj --jinja -fit on -c 90000 -b 4096 -ub 1024 -ngl 99 -ctk q8_0 -ctv q8_0 --flash-attn on --n-cpu-moe 38 --reasoning off --presence-penalty 1.5 --repeat-penalty 1.0 --temp 0.7 --top-p 0.95 --min-p 0.0 --top-k 20 --context-shift --keep 1024 -np 1 --mlock --split-mode layer --n-predict 32768 --parallel 2 --no-mmap

I only started using llama.cpp directly recently, so I still don't know all the params or what most of them even do (there are so many). I just looked up and gathered as many params as I could and mashed them together into the command above; I don't even know if these are the right settings for my setup or if it could be better.
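For reference, here is the same command laid out one option per line, as a sketch only: the paths are placeholders, and flag names are from recent llama.cpp builds so they may differ in yours. One likely bug in the original: `-np` and `--parallel` are aliases for the same setting (number of server slots), and the original passes both `1` and `2`; extra slots also split the context budget, so a single slot is kept here. `--split-mode layer` is omitted because it is the default, and I've dropped `-fit on` and `--reasoning off` since I couldn't verify those flags.

```shell
# Tidied-up sketch of the launch command (placeholder model paths).
# Dropped vs. the original: --parallel 2 (duplicate of -np),
# --split-mode layer (the default), -fit / --reasoning (unverified flags).
llama-server \
  -m model.gguf --mmproj mmproj-F16.gguf --jinja \
  -c 90000 -b 4096 -ub 1024 \
  -ngl 99 --n-cpu-moe 38 \
  -ctk q8_0 -ctv q8_0 --flash-attn on \
  --temp 0.7 --top-p 0.95 --min-p 0.0 --top-k 20 \
  --presence-penalty 1.5 --repeat-penalty 1.0 \
  --context-shift --keep 1024 \
  --n-predict 32768 \
  -np 1 \
  --mlock --no-mmap
```

Whether `--mlock` still helps alongside `--no-mmap` depends on the build; if memory pressure is the cause of the slowdown, trying the command with and without that pair is a cheap experiment.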