Running Qwen-3.6-35B-A3B locally is very slow

Posted by Sad-Duck2812@reddit | LocalLLaMA | 3 comments

Hi Everyone,

I am pretty new to running LLMs locally and I have run into some issues; I hope someone can help me with this.

I am running the model Qwen-3.6-35B-A3B on my PC and I am getting around 16.7 tk/s prompt evaluation and 65 tk/s token generation.

If I prompt the model with "Hello", it answers back quickly at 65 tk/s. However, if I use an agentic coder such as Cline or opencode, it takes a long time to process before responding. I understand that Cline or opencode prompts the model with system instructions, which I noticed are around 12,000 tokens, and it takes around 5-10 minutes to get a response after saying something like "Build a resume page in Tailwind CSS", and that is only the planning stage. Once the plan is done and I ask it to build the page based on its suggested implementation, it takes 30 minutes to an hour to produce the index.html.
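Doing the math on my own numbers as a rough sanity check (the 12,000-token figure is what I observed for the agent's system prompt):

```python
# Rough sanity check: time to process the agent's system prompt
# at my measured prompt-evaluation speed.
prompt_tokens = 12_000   # approximate Cline/opencode system prompt size
pp_speed = 16.7          # my measured prompt evaluation, tokens/s

seconds = prompt_tokens / pp_speed
print(f"{seconds:.0f} s = {seconds / 60:.1f} min")  # prints "719 s = 12.0 min"
```

So at my measured 16.7 tk/s, the system prompt alone accounts for roughly 12 minutes, which lines up with what I'm seeing.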

My question is: is this normal, or am I doing something wrong? I see people here with dual 5060 Tis (32GB VRAM total) working fine with this setup, but I can't figure out why it takes so long to get anything done on mine.

My Setup:

RTX 4070 Ti Super 16GB VRAM

RTX 2070 8GB VRAM

96GB DDR4 RAM

Ryzen 7 5700X

I have tried LM Studio and llama.cpp. I have tried using only the RTX 4070 Ti Super, offloading the rest to system memory as well as some MoE experts to the CPU. I have also tried using both GPUs with a tensor split of 2,1 and was able to reach 65 tk/s; however, even with only the RTX 4070 Ti Super I was getting 27 tk/s, and it still took a similar time for prompt processing and for generating the index.html file.

I have tried 64k and 100k context sizes, and both take similar times.

My llamacpp command:

llama-cli.exe --model "D:\AI Models\lmstudio-community\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-Q4_K_M.gguf" --ctx-size 100000 --n-gpu-layers 40 --n-cpu-moe 4 --split-mode layer --main-gpu 0 --tensor-split 2,1 --threads 16 --threads-batch 16 --batch-size 2048 --ubatch-size 512 --flash-attn on

I have even tried a batch size of 256 and a ubatch size of 128, and it still takes too long. I also tried fitting the entire model across the two GPUs with no spill to system memory.
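For what it's worth, here is a rough sketch of why a multi-turn agentic session would compound this. The turn counts and per-turn context sizes below are made-up assumptions, just to illustrate how much prompt caching matters when the whole conversation gets re-sent every turn:

```python
# Hypothetical agentic session: each turn sends the full conversation so far.
# Without prefix/KV caching the whole context is re-evaluated every turn;
# with caching only the tokens added since the last turn are processed.
pp_speed = 16.7                                       # measured prompt eval, tokens/s
contexts = [12_000, 16_000, 20_000, 24_000, 28_000]   # assumed context size per turn

no_cache = sum(contexts) / pp_speed
with_cache = (contexts[0] + sum(b - a for a, b in zip(contexts, contexts[1:]))) / pp_speed

print(f"no cache:   {no_cache / 60:.1f} min")
print(f"with cache: {with_cache / 60:.1f} min")
```

Under those assumed numbers the gap is large (roughly 100 minutes vs. under 30), so if caching isn't kicking in between agent turns, that alone could explain the hour-long builds.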

I'd appreciate any help with this; I'm sure I'm doing something wrong, but I have no clue what. I have spent days on this trying many models, and even Qwen3.6-27B seems to behave the same way, taking a long time with Cline or opencode.