qwen3.6-35b-a3b-mtp running on GTX 1060 6GB

Posted by xxvegas@reddit | LocalLLaMA

I have this 10-year-old Dell T5810 workstation with 32GB of DDR3(?) memory, an E5-2698 v3 (16 cores, 32 threads), and a GTX 1060 6GB that was used for mining back in the old days (it paid for itself many times over). I managed to get the model running with LM Studio on Windows(!). My settings are:

Model: unsloth qwen3.6-35B-a3b-MTP-GGUF UD Q4_K_XL

Ctx length: 131072

GPU offload 41

CPU threadpool size 16

Max concurrent 4

Number of experts 8

Number of MOE layers offloaded to CPU 41

MTP max draft 3

KV quantization both Q4_0

prefill 16k about 130-150tps

decode 4k about 16tps

Very usable for chat.

nickless07@reddit

Try without MTP and offload fewer layers to the CPU. Right now your 1060 only holds the KV cache, vision tower, draft stack, and some overhead; everything else runs on your CPU.

mraurelien@reddit

u/nickless07 would you mind explaining how to find the right parameters?
As for me, I'm running an AMD 7840U with iGPU 780M + dGPU 7700S with 8GB VRAM + 64GB DDR5.
I can't get more than 27 t/s (Vulkan, always trying the latest version available).
I would have thought I'd get more, given some of the numbers in this sub...

```
llama-server \
--model ~/models/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF/Qwen_Qwen3.6-35B-A3B-IQ4_XS.gguf \
--ctx-size 32768 \
--jinja \
--chat-template-kwargs '{"preserve_thinking":true}' \
--port 4141 \
--parallel 1 \
--alias qwen \
--temperature 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0 \
--presence-penalty 0 \
--repeat-penalty 1.0 \
--flash-attn on \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--n-cpu-moe 26 \
--no-mmap \
--gpu-layers 99 \
--mlock \
--verbose
```

If someone can point me to some optimizations I missed, or to a methodology for finding a better sweet spot (in case I'm doing it wrong), I would be really thankful 🙏
Or maybe it's related to AMD being far behind the green team for the same capacity...

ea_man@reddit

use --fit-ctx 32768 --fit-target 100

and llama will tune it for you

mraurelien@reddit

I tried, but so far no better speed... But thanks for the advice!

ea_man@reddit

OK, but keep in mind that if you use --fit-ctx and --fit-target, then you don't specify your:

--n-cpu-moe 26 \
--ctx-size 32768 \

Maybe you can do with

--cache-type-k q8_0 \

--cache-type-v q5_0 \

With ~30k context. It won't change much ;)

mraurelien@reddit

Yes! That's what I realized. I got pretty far by tuning manually, but indeed --fit-ctx and --fit-target are more explicit, without magic numbers that change from one model to another. Thanks a lot!

Now another question, about the best asymmetric KV pattern: why q8/q5? Any papers or benchmarks on it?

ea_man@reddit

It depends a lot on how much VRAM you have to spare, the context length (the longer the context, the bigger the quants), and whether it's MoE or dense (the latter is more stable).

Nice paper: https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context
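
To put rough numbers on the VRAM side of that trade-off, here is a minimal sketch that estimates KV cache size for a few K/V cache types. The per-block byte counts are the ggml block sizes for 32 elements (f16: 64, q8_0: 34, q5_0: 22, q4_0: 18). The model shape (40 attention layers, 4 KV heads, head dimension 128) is a hypothetical placeholder, not the real shape of Qwen3.6-35B-A3B; read the actual values from the GGUF metadata. The helper names `block_bytes` and `kv_cache_mib` are made up for this example, and the estimate ignores compute buffers and any other overhead.

```shell
#!/bin/sh
# Bytes used by one block of 32 cache elements for each cache type.
block_bytes() {
    case "$1" in
        f16)  echo 64 ;;  # 32 x 2 bytes
        q8_0) echo 34 ;;  # 32 x 1 byte + 2-byte scale
        q5_0) echo 22 ;;  # 16 + 4 bytes of packed values + 2-byte scale
        q4_0) echo 18 ;;  # 16 bytes of packed values + 2-byte scale
        *)    echo "unknown cache type: $1" >&2; return 1 ;;
    esac
}

# Usage: kv_cache_mib CTX N_LAYERS N_KV_HEADS HEAD_DIM TYPE_K TYPE_V
# Prints the estimated KV cache size in MiB (integer division).
kv_cache_mib() {
    ctx=$1; layers=$2; kv_heads=$3; head_dim=$4
    bk=$(block_bytes "$5") || return 1
    bv=$(block_bytes "$6") || return 1
    # Number of 32-element blocks stored per token for K (same count for V).
    blocks_per_token=$(( layers * kv_heads * head_dim / 32 ))
    echo $(( ctx * blocks_per_token * (bk + bv) / 1048576 ))
}

# Hypothetical shape: 40 attention layers, 4 KV heads, head dim 128.
kv_cache_mib 32768 40 4 128 f16  f16    # prints 2560
kv_cache_mib 32768 40 4 128 q8_0 q8_0   # prints 1360
kv_cache_mib 32768 40 4 128 q8_0 q5_0   # prints 1120
```

Under these assumed dimensions, going from q8/q8 to q8/q5 at 32k context saves about 240 MiB, which fits the "it won't change much" remark earlier in the thread; the gap grows in proportion to context length.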

nickless07@reddit

Basically, just change --n-cpu-moe until you no longer hit OOM, then check how much VRAM you have left. More than 2GB? Then go for 24. Less than 500MB? Keep it as is, and so on. Check with nvidia-smi for NVIDIA cards or rocm-smi for AMD. What I'm missing in your load params is --threads and maybe mmproj (if used); aside from that, it looks pretty good for that model, quant, and VRAM size.
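
The rule of thumb above can be sketched as a small shell helper. The function name `suggest_n_cpu_moe` is hypothetical, and only the two thresholds (more than 2GB free, less than 500MB free) come from the comment; stepping down by one layer in the range between them is an assumption standing in for the "and so on".

```shell
#!/bin/sh
# Usage: suggest_n_cpu_moe CURRENT_N_CPU_MOE FREE_VRAM_MIB
# Prints the next --n-cpu-moe value to try, given free VRAM measured
# while the model is loaded.
suggest_n_cpu_moe() {
    current=$1
    free_mib=$2
    if [ "$free_mib" -gt 2048 ]; then
        # More than ~2 GB spare: keep two more MoE layers on the GPU (e.g. 26 -> 24).
        next=$((current - 2))
    elif [ "$free_mib" -lt 500 ]; then
        # Under ~500 MB spare: already tight, keep the current value.
        next=$current
    else
        # Assumption: in between, try one more MoE layer on the GPU.
        next=$((current - 1))
    fi
    # --n-cpu-moe cannot go below zero.
    if [ "$next" -lt 0 ]; then
        next=0
    fi
    echo "$next"
}

suggest_n_cpu_moe 26 2500   # prints 24
suggest_n_cpu_moe 26 300    # prints 26
```

On NVIDIA cards the free-memory figure can be read with `nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits`; on AMD, `rocm-smi --showmeminfo vram` reports total and used VRAM, from which free memory can be derived. If a lower value triggers an out-of-memory error at load time or at full context, step back up to the last value that worked.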

xxvegas@reddit (OP)

Thanks for the suggestion.

No MTP, 36 MoE layers offloaded (down from 41): ~18tps decode

MTP, 37 MoE layers offloaded: 19-20tps

I suppose I can get a smaller quant (Q4_K_S?) to fit more layers on the GPU. MTP remains helpful.

nickless07@reddit

Depends... With your initial settings you create a bottleneck on your PCIe bus: KV in VRAM and compute on the CPU? Well, every token has to be sent to VRAM, get quantized to q8, be sent back, get processed, and then be stored again. Yes, the VRAM is much faster, but 6GB is a bit rough. 18-20tps is already decent. Have you tested KV on the CPU too? Or even not using the GTX at all?
Here is some much more in-depth information. This year we got some decent models that can perform great on old hardware.

Clear-Ad-9312@reddit

Try Gemma 4 E2B through LiteRT-LM; it's possible to get about 90 t/s generation. I know, it's not as smart/capable, but if you want something bigger, then the best option is to save up the cash for more VRAM.