Running Qwen-3.6-35B-A3B locally is very slow
Posted by Sad-Duck2812@reddit | LocalLLaMA | 3 comments
Hi Everyone,
I am pretty new to running LLMs locally and I have run into some issues I hope someone can help me with.
I am running the model Qwen-3.6-35B-A3B on my PC and I am getting around 16.7 tk/s prompt evaluation and 65 tk/s token generation.
If I prompt the model with "Hello", it answers back quickly at 65 tk/s. However, if I use an agentic coder such as Cline or Opencode, it takes a long time to process before responding. I understand that Cline or Opencode prompts the model with system instructions, which I noticed are around 12,000 tokens, and it takes around 5-10 minutes to get a response after asking something like "Build a resume page in Tailwind CSS", and that is only the planning stage. Once Opencode or Cline has finished the plan and I ask it to build the page based on its suggested implementation, it takes 30 minutes to an hour to produce the index.html.
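For what it's worth, at ~16.7 tk/s prompt evaluation, the ~12,000-token system prompt alone works out to roughly 12,000 / 16.7 ≈ 720 seconds, or about 12 minutes, before any generation even starts, so the slow prompt evaluation seems to account for most of the wait.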
My question is: is this normal, or am I doing something wrong? I can see people here with dual 5060 Tis (32GB of VRAM total) working comfortably with this model, but I can't figure out why it takes such a long time to get anything done with my setup.
My Setup:
RTX 4070 Ti Super 16GB VRAM
RTX 2070 8GB VRAM
96GB DDR4 RAM
Ryzen 7 5700X
I have tried LM Studio and llama.cpp. I have tried using only the RTX 4070 Ti Super, offloading the rest to system memory as well as some MoE experts to the CPU. I have also tried using both GPUs with a tensor split of 2,1 and was able to reach 65 tk/s; however, even with only the RTX 4070 Ti Super I was getting 27 tk/s, and it still took a similar amount of time for prompt processing and for generating the index.html file.
I have tried 64k and 100k context sizes, and both take similar times.
My llama.cpp command:
llama-cli.exe --model "D:\AI Models\lmstudio-community\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-Q4_K_M.gguf" --ctx-size 100000 --n-gpu-layers 40 --n-cpu-moe 4 --split-mode layer --main-gpu 0 --tensor-split 2,1 --threads 16 --threads-batch 16 --batch-size 2048 --ubatch-size 512 --flash-attn on
I have even tried a batch size of 256 and a ubatch size of 128, and it still takes too long, as well as fitting the entire model across the 2 GPUs with no spill to system memory.
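For reference, since Cline and Opencode connect over an OpenAI-compatible API rather than the interactive CLI, the equivalent llama-server invocation would look roughly like the sketch below (same flags as the llama-cli command above; the port is arbitrary):
llama-server.exe --model "D:\AI Models\lmstudio-community\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-Q4_K_M.gguf" --ctx-size 100000 --n-gpu-layers 40 --n-cpu-moe 4 --split-mode layer --main-gpu 0 --tensor-split 2,1 --threads 16 --threads-batch 16 --batch-size 2048 --ubatch-size 512 --flash-attn on --port 8080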
I'd appreciate any help on this; I'm sure I'm doing something wrong, but I have no clue what. I have spent days on this trying many models, and even Qwen 3.6-27B seems to act the same, taking a long time with Cline or Opencode.
CapsAdmin@reddit
On a 4090 I get 150 tk/s if all the experts fit into VRAM. Enabling ReBAR in the BIOS also helped improve performance a bit when offloading experts.
My models.ini looks like this:
Not sure if all of it makes sense, but it's optimized for a single user and GPU, and for turning the server on and off manually.
fit-target lets you define how much VRAM should be preserved, which you may not want.
Striking_Wishbone861@reddit
I had the exact same experience and went back to 3.5:36b; go figure, it's still faster.
Sad-Duck2812@reddit (OP)
Actually, I have managed to get the right setup and I'm now getting 125 tk/s.
It is possible, but you may need to check your config.