Qwen3.5-35B running well on RTX4060 Ti 16GB at 60 tok/s
Posted by Nutty_Praline404@reddit | LocalLLaMA | View on Reddit | 40 comments
Spent a bunch of time tuning llama.cpp on a Windows 11 box (i7-13700F 64GB) with an RTX 4060 Ti 16GB, trying to get unsloth Qwen3.5-35B-A3B-UD-Q4_K_L running well at 64k context. I finally got it into a pretty solid place, so I wanted to share what is working for me.
models.ini entry:
[qwen3.5-35b-64k]
model = Qwen3.5-35B-A3B-UD-Q4_K_L.gguf
c = 65536
t = 6
tb = 8
n-cpu-moe = 11
b = 1024
ub = 512
parallel = 2
kv-unified = true
Router start command
llama-server.exe --models-preset models.ini --models-max 1 --host 0.0.0.0 --webui-mcp-proxy --port 8080
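For sizing intuition, here is a rough KV cache estimate for the 64k setting. The layer count, KV head count, and head dim below are guesses for illustration, not from the model card; check the GGUF metadata for the real values.

```python
# Rough KV cache size estimate for the 64k context above (f16 cache).
# ALL model dimensions here are illustrative guesses -- check the GGUF
# metadata of Qwen3.5-35B-A3B for the real layer/head counts.
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    # 2x for the separate K and V tensors
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

est = kv_cache_bytes(n_layers=48, n_ctx=65536, n_kv_heads=4, head_dim=128)
print(f"~{est / 2**30:.1f} GiB")  # ~6.0 GiB under these assumed dims
```

Under those assumed dims the cache alone is several GiB, which is why the batch/ubatch and offload settings above matter so much on a 16 GB card.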
What I’m seeing now
With that preset, I’m reliably getting roughly 40–60 tok/s on many tasks, even with Docker Desktop running in the background.
A few examples from the logs:
- ~56.41 tok/s on a 1050-token generation
- ~46.84 tok/s on a 234-token continuation after a 1087-token prompt
- ~44.97 tok/s on a 259-token continuation after a checkpoint restore
- ~41.21 tok/s on a 1676-token generation
- ~42.71 tok/s on a 1689-token generation in a much longer conversation
So not “benchmark fantasy numbers,” but real usable throughput at 64k on a 4060 Ti 16GB.
Other observations
- The startup logs can look “correct” and still produce bad throughput if the effective runtime shape isn’t what you think.
- Looking at n_parallel, kv_unified, n_ctx_seq, n_ctx_slot, n_batch and n_ubatch in the startup logs was way more useful than just staring at the top-level command line.
- Keeping VRAM pressure under control mattered more than squeezing out the absolute highest one-off score.
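If it helps, here's the kind of quick check I mean - pulling the effective values out of a saved startup log. The sample line is synthetic, and the exact log format varies between llama.cpp builds:

```python
import re

# Pull the effective runtime values out of a saved llama-server startup log.
# The sample below is synthetic; real log formatting varies between builds.
def runtime_shape(log_text):
    keys = ("n_parallel", "kv_unified", "n_ctx_seq",
            "n_ctx_slot", "n_batch", "n_ubatch")
    shape = {}
    for key in keys:
        m = re.search(rf"{key}\s*=\s*(\S+)", log_text)
        if m:
            shape[key] = m.group(1)
    return shape

sample = "llama_context: n_ctx_seq = 65536\nllama_context: n_batch = 1024\n"
print(runtime_shape(sample))  # {'n_ctx_seq': '65536', 'n_batch': '1024'}
```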
I did not find a database of tuned configs for various cards; that might be something useful to have.
No_Ebb3423@reddit
Did it get dumber since you’re using Q4?
ea_man@reddit
If you tighten it up really well, this runs in your VRAM: https://huggingface.co/unsloth/Qwen3.5-27B-GGUF?show_file_info=Qwen3.5-27B-IQ4_XS.gguf
Use the KV cache at Q4
np 1
It's tight; it would be better if you don't run a desktop on that GPU (or at most LXQt, not Windows), or use integrated graphics for that.
Yet it's much better than the 35B A3B; it runs at ~half the speed.
Nutty_Praline404@reddit (OP)
Thanks for the suggestion. I agree that the 27B is much better at coding. I just finished testing with Q3_K_M: it runs at 17 tok/s, but the output is better, so it's probably worth it.
ea_man@reddit
Actually 17 tok/s isn't bad for the reasoning/planner model; once you do that, you can use a 4-8B for the agent/tools. Omnicoder 2 comes to mind.
I remind you that Qwen models use XML for tool calls while almost everything else uses JSON, so they don't get mixed up; hence, if you want an AI coding agent harness with Qwen models, you'd better stick to Qwencode.
If your Qwen tool calls fail in various editors, it's because of that. Qwencode is really good at tooling with Qwen LLMs (FFS, it uses ~11K of context just for that! https://github.com/QwenLM/qwen-code/blob/main/packages/core/src/core/prompts.ts /RANT)
DeepBlue96@reddit
For me the 27B runs at 1/4 the speed; I have a 3090... 24 t/s vs 98 t/s for the 35B. They both produce more or less the same quality code (I know the 27B is dense and the other has 3B active parameters, but still, its output is good enough for a RooCode agentic workflow).
ea_man@reddit
> For me the 27B runs at 1/4 the speed
I guess most people can't run the 35B A3B fully in VRAM, while you pretty much have to with the dense model, hence the speed difference.
Supposing they both run in VRAM, the MoE has only 3B active parameters vs 27B.
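Back-of-the-envelope version of that: assuming decode is memory-bandwidth bound and both models are fully in VRAM at the same quant, tok/s scales roughly with the inverse of active parameters.

```python
# If decode is memory-bandwidth bound, tok/s scales roughly with the inverse
# of bytes read per token, i.e. active parameters at the same quant.
def rough_speed_ratio(active_params_b_moe, params_b_dense):
    return params_b_dense / active_params_b_moe

# 3B active (35B A3B MoE) vs 27B dense, both fully in VRAM:
print(rough_speed_ratio(3, 27))  # 9.0 in theory; overheads shrink this a lot
```

In practice attention, the KV cache, and scheduling overhead eat into that theoretical 9x, which is roughly consistent with the 4x reported above.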
Danmoreng@reddit
I recommend trying out the fit and fit-ctx parameters: https://github.com/Danmoreng/local-qwen3-coder-env?tab=readme-ov-file#server-optimization-details
Do you build llama.cpp from scratch or use a pre-compiled binary? Self compilation might be slightly better.
Nutty_Praline404@reddit (OP)
This is great! Thanks for sharing. I'm using the pre-compiled binaries.
v01dm4n@reddit
Hey OP, I also have 16 GB VRAM. I found the results with qwen3.5-27b IQ3_XXS UD much better than the 35B MoE model. That dense model is far more intelligent. I use the KV cache at 4 bits and a ctx of 256k. All of this fits in my 16 GB. I get a speed of ~25 t/s with a 5060 Ti.
I use it with hermes or pi and it does a decent job at coding, research, browsing, writing articles etc.
FewBasis7497@reddit
Thanks, interesting. Could you please share your whole config/params.
v01dm4n@reddit
Sure.
llama-server -m <model>.gguf -c 256000 -ctk q4_0 -ctv q4_0 --no-mmproj
The model is from unsloth. Quant: UD-IQ3_XXS
LocalAI_Amateur@reddit
Wow. I have a 5070 Ti with 16 GB VRAM and I'm not getting anywhere near your performance. But then again, my setup is very different: I'm using LM Studio on a laptop with 32 GB RAM, connected to the video card through OCuLink.
I'm getting at best 37 tokens per second, and that's at a 20,000 context window. I wonder which is the biggest factor: OCuLink, 32 GB of RAM, LM Studio, or something else...
dpenev98@reddit
I have a very similar setup but with an integrated 8GB RTX Pro 1000 Blackwell on my laptop.
It runs at 32 t/s with 128k context. Very happy with it.
Nutty_Praline404@reddit (OP)
As suggested by u/guigouz I also tried llm-server in Docker to see whether its automatic hardware/model tuning could reproduce or beat the manual llama.cpp config I ended up with. For my setup, it did not find a working solution for the 35B 64k case.
What happened:
- llm-server correctly detected my RTX 4060 Ti and the model.
- It picked a moe_offload strategy, only placing 17 layers on GPU and 23 on CPU.
So for this specific hardware/model combo, the takeaway was: my hand-tuned native llama.cpp setup beat llm-server's automatic strategy.
I do still think llm-server is interesting, especially for simpler setups or smaller models, but on this 35B MoE / 64k / 16GB VRAM edge case, it seems to be optimizing for safety/conservatism rather than finding the aggressive-but-working configuration.
The practical lesson for me was: check the effective parallel and kv_unified values yourself.
In other words: llm-server was a good experiment, but it did not replace manual tuning here. If anyone has gotten llm-server to successfully discover a working 35B MoE 64k config on a 16GB card, I'd be interested to compare notes.
guigouz@reddit
This is what it suggests here
It's worth noting that it checks the free ram/vram in the moment that you run it, so if you have other models loaded or processes using the gpu, it will affect the estimation.
Jester14@reddit
What do you mean it "doesn't fit"? Did you use the -fit flag? UD-Q4_K_XL is larger than 16 GB so it will overflow to RAM, but it will also "fit" if loaded appropriately. I get 30 t/s on my 4060 8 GB using -fit with that quant at 40k context in VRAM.
ApprehensiveAd3629@reddit
Is it possible to do this with LM Studio?
guigouz@reddit
Did you try any other quants? I'm running Q6 here @ ~30t/s with 128k context (q4 k/v cache), using the cmdline generated by https://github.com/raketenkater/llm-server (llm-server --dry-run)
I started at Q8, now I'm testing Q6 which is a bit faster with similar quality. I wonder how low I can go.
Btw: I also tested qwen3.5 9b Q8 (almost same speed of 35b) and gemma4 26b (slower and in my coding tests, dumber)
tomByrer@reddit
Nice info!
In most of the tests I've seen, the speed/RAM/accuracy-loss curve is best between Q4 and Q6. Exactly where the sweet spot is depends on the model, who is making the quant, and your use case. Also, I'm guessing you could see some benefit if you fine-tune below Q6, since then the quant will keep more of what you want.
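For rough sizing against VRAM, you can estimate file size from parameter count and bits-per-weight. The bpw figures here are approximate community averages, not exact per-file numbers:

```python
# Approximate GGUF size from parameter count and bits-per-weight.
# The bpw values are rough community averages, not exact per-file numbers.
APPROX_BPW = {"IQ4_XS": 4.3, "Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5}

def approx_size_gb(params_billions, quant):
    return params_billions * APPROX_BPW[quant] / 8

for quant in APPROX_BPW:
    print(f"35B at {quant}: ~{approx_size_gb(35, quant):.1f} GB")
```

That's why a 35B Q4 lands around the low-20s of GB and has to split across VRAM and RAM on a 16 GB card, while Q6/Q8 push well past it.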
SirApprehensive7573@reddit
You here too?
I see you everywhere.
guigouz@reddit
Everywhere there's dev and AI stuff :)
Nutty_Praline404@reddit (OP)
I am looking at others, but wanted to start here as it seemed the edge of feasibility. Thanks for pointing out that tool, but I could not get llm-server working right on Windows under WSL.
tomByrer@reddit
Have you tried it with long sessions? I'm wondering if you have enough room for a larger context, or are only able to run short bursts and/or constantly compact/reset the context.
Elegant_Tech@reddit
I use Q8 and it starts to struggle with file edit at around 90k context. It has to take multiple attempts at writing the edit to get it write. At least it catches and fixes it before finishing the prompt.
tomByrer@reddit
* writing the edit to get it right ;)
Thanks, good to know your experience; it helps me decide for my 24 GB VRAM.
Nutty_Praline404@reddit (OP)
It is working with long sessions. It does drop from peak, and it slows a bit as context grows; for example, it's still at 37 tok/s after a coding session that nearly fills the context.
tomByrer@reddit
Thanks (to both of you)!
I'm setting up my RTX 3090, so I'm hoping I can fit all/most of the model & context in 24 GB VRAM?
BTW, I'm not using that GPU for anything but AI (the monitor is on a 2nd GPU or the iGPU), so I have the full VRAM available.
So Qwen3.5-35B is better than Qwen3-coder for coding?
PaceZealousideal6091@reddit
Hey.. thanks for sharing this. Quick question: what's the --kv-unified flag exactly for? How does it work?
Nutty_Praline404@reddit (OP)
If -np (parallel) is set to auto, it can flip kv_unified to false, which splits the context size across the parallel slots, giving an effectively smaller context (i.e. -np 2 can result in 32k context in each slot to make up the 64k total - not really what you want). At least that is what I understood, but I am no expert.
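Here is my mental model as a sketch (this is my understanding of the behavior, not a statement about llama.cpp internals):

```python
# My mental model of the split: with kv_unified=false the total context is
# divided across parallel slots; with kv_unified=true all slots share one pool.
def per_slot_ctx(n_ctx, n_parallel, kv_unified):
    return n_ctx if kv_unified else n_ctx // n_parallel

print(per_slot_ctx(65536, 2, kv_unified=False))  # 32768 per slot - not what you want
print(per_slot_ctx(65536, 2, kv_unified=True))   # 65536 shared pool
```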
PaceZealousideal6091@reddit
So, why not add -np 1 instead of -kvu?
dreamai87@reddit
You are correct. If np is given, the default is 4, with each slot getting a similar share of the 64k context length.
MrTechnoScotty@reddit
I have found Gemma 4 disappointingly slower than Qwen3.5, but I haven't worked as hard at optimizing it yet.
SmartCustard9944@reddit
A4B vs A3B makes some difference. Also the different KV cache.
Serious-Log7550@reddit
llama-server \
  -ncmoe 17 \
  --webui-mcp-proxy \
  --alias "Qwen 3.5 35B A3B" \
  -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL \
  --no-mmproj \
  --cache-ram 134217728 \
  --ctx-size 131072 \
  --kv-unified \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 \
  --presence-penalty 0.0 --repeat-penalty 1.0 \
  --flash-attn on --fit on \
  --no-mmap \
  --jinja \
  --threads -1 \
  --reasoning on \
  --reasoning-budget 4096 \
  --reasoning-budget-message "...Considering the limited time by the user, I have to give the solution based on the thinking directly now."
Gives me a stable 35-40 t/s regardless of used context percentage.
vk3r@reddit
How?
22 GB in 16 GB?
Mashic@reddit
Spill to system RAM.
Nutty_Praline404@reddit (OP)
It’s not “22 GB in 16 GB VRAM.”
It fits because llama.cpp is using a GPU + system RAM split, not pure-VRAM loading.
For this model, the working setup keeps the attention layers and KV cache on the GPU while a chunk of the MoE expert weights is offloaded to system RAM (that's what n-cpu-moe controls).
So the effective footprint is GPU VRAM for the hot path plus system RAM for the offloaded experts.
That’s why it can run on a 16 GB card even though the total model + runtime footprint is larger than 16 GB.
The tradeoff is lower peak speed than a pure-VRAM load, since the CPU-side experts are slower to reach.
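As a hedged sketch of the arithmetic (the per-part sizes are illustrative guesses, not measured values):

```python
# Illustrative split arithmetic: a ~22 GB total footprint on a 16 GB card.
# All sizes are guesses for illustration; measure yours with nvidia-smi.
def fits_on_gpu(total_gb, offloaded_to_ram_gb, vram_gb, overhead_gb=1.0):
    gpu_resident = total_gb - offloaded_to_ram_gb + overhead_gb
    return gpu_resident <= vram_gb

print(fits_on_gpu(total_gb=22.0, offloaded_to_ram_gb=8.0, vram_gb=16.0))  # True
```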
qubridInc@reddit
This proves you don't need expensive GPUs, just tuned configs; someone should turn this into a shared "GPU config zoo" instead of everyone reinventing the same setup.
ducksoup_18@reddit
Your unsloth link goes to the 9b model. Was a bit confused for a sec.
Nutty_Praline404@reddit (OP)
thanks, fixed it.