Running Qwen-3.6-35B-A3B locally is very slow
Posted by Sad-Duck2812@reddit | LocalLLaMA | 3 comments
Hi Everyone,
I am pretty new to running LLMs locally and I have run into some issues I hope someone can help me with.
I am running the model Qwen-3.6-35B-A3B on my PC and I am getting around 16.7 tk/s prompt evaluation and 65 tk/s token generation.
If I prompt the model with "Hello", it answers back quickly at 65 tk/s. However, if I use an agentic coder such as Cline or Opencode, it takes a long time to process before responding. I understand that Cline or Opencode prompts the model with system instructions, which I noticed are around 12,000 tokens, and it takes around 5-10 minutes to get a response after asking something like "Build a resume page in Tailwind CSS", and that is only the planning stage. Once Opencode or Cline has finished the plan and I ask it to build the page based on its suggested implementation, it takes 30 minutes to an hour to produce the index.html.
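For what it's worth, at ~16.7 tk/s prompt evaluation, the ~12,000-token system prompt alone works out to roughly 12,000 / 16.7 ≈ 720 seconds, or about 12 minutes, before any generation even starts, so the slow prompt evaluation seems to account for most of the wait.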
My question is: is this normal, or am I doing something wrong? I can see people here with dual 5060 Tis (32GB of VRAM total) working comfortably with this model, but I can't figure out why it takes such a long time to get anything done with my setup.
My Setup:
RTX 4070 Ti Super 16GB VRAM
RTX 2070 8GB VRAM
96GB DDR4 RAM
Ryzen 7 5700X
I have tried LM Studio and llama.cpp. I have tried using only the RTX 4070 Ti Super, offloading the rest to system memory as well as some MoE experts to the CPU. I have also tried using both GPUs with a tensor split of 2,1 and was able to reach 65 tk/s; however, even with only the RTX 4070 Ti Super I was getting 27 tk/s, and it still took a similar amount of time for prompt processing and for generating the index.html file.
I have tried 64k and 100k context sizes, and both take similar times.
My llama.cpp command:
llama-cli.exe --model "D:\AI Models\lmstudio-community\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-Q4_K_M.gguf" --ctx-size 100000 --n-gpu-layers 40 --n-cpu-moe 4 --split-mode layer --main-gpu 0 --tensor-split 2,1 --threads 16 --threads-batch 16 --batch-size 2048 --ubatch-size 512 --flash-attn on
I have even tried a batch size of 256 and a ubatch size of 128, and it still takes too long, as well as fitting the entire model across the 2 GPUs with no spill to system memory.
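For reference, since Cline and Opencode connect over an OpenAI-compatible API rather than the interactive CLI, the equivalent llama-server invocation would look roughly like the sketch below (same flags as the llama-cli command above; the port is arbitrary):
llama-server.exe --model "D:\AI Models\lmstudio-community\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-Q4_K_M.gguf" --ctx-size 100000 --n-gpu-layers 40 --n-cpu-moe 4 --split-mode layer --main-gpu 0 --tensor-split 2,1 --threads 16 --threads-batch 16 --batch-size 2048 --ubatch-size 512 --flash-attn on --port 8080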
I'd appreciate any help on this; I'm sure I'm doing something wrong, but I have no clue what. I have spent days on this trying many models, and even Qwen 3.6-27B seems to act the same, taking a long time with Cline or Opencode.
CapsAdmin@reddit
On a 4090 I get 150 tk/s if all the experts fit into VRAM. Enabling ReBAR in the BIOS also helped improve performance a bit when offloading experts.
My models.ini looks like this:
Not sure if all of it makes sense, but it's optimized for a single user and GPU, and for turning the server on and off manually.
fit-target lets you define how much VRAM should be preserved, which you may not want.
Striking_Wishbone861@reddit
I had the exact same experience and went back to 3.5:36b; go figure, it's still faster.
Sad-Duck2812@reddit (OP)
Actually, I have managed to get the right setup and I'm now getting 125 tk/s.
It is possible, but you may need to check your config.