Qwen3.5-35B running well on RTX4060 Ti 16GB at 60 tok/s
Posted by Nutty_Praline404@reddit | LocalLLaMA | View on Reddit | 40 comments
Spent a bunch of time tuning llama.cpp on a Windows 11 box (i7-13700F 64GB) with an RTX 4060 Ti 16GB, trying to get unsloth Qwen3.5-35B-A3B-UD-Q4_K_L running well at 64k context. I finally got it into a pretty solid place, so I wanted to share what is working for me.
models.ini entry:
[qwen3.5-35b-64k]
model = Qwen3.5-35B-A3B-UD-Q4_K_L.gguf
c = 65536
t = 6
tb = 8
n-cpu-moe = 11
b = 1024
ub = 512
parallel = 2
kv-unified = true
Router start command
llama-server.exe --models-preset models.ini --models-max 1 --host 0.0.0.0 --webui-mcp-proxy --port 8080
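For sizing intuition, here is a rough KV cache estimate for the 64k setting. The layer count, KV head count, and head dim below are guesses for illustration, not from the model card; check the GGUF metadata for the real values.

```python
# Rough KV cache size estimate for the 64k context above (f16 cache).
# ALL model dimensions here are illustrative guesses -- check the GGUF
# metadata of Qwen3.5-35B-A3B for the real layer/head counts.
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    # 2x for the separate K and V tensors
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

est = kv_cache_bytes(n_layers=48, n_ctx=65536, n_kv_heads=4, head_dim=128)
print(f"~{est / 2**30:.1f} GiB")  # ~6.0 GiB under these assumed dims
```

Under those assumed dims the cache alone is several GiB, which is why the batch/ubatch and offload settings above matter so much on a 16 GB card.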
What I’m seeing now
With that preset, I’m reliably getting roughly 40–60 tok/s on many tasks, even with Docker Desktop running in the background.
A few examples from the logs:
- ~56.41 tok/s on a 1050-token generation
- ~46.84 tok/s on a 234-token continuation after a 1087-token prompt
- ~44.97 tok/s on a 259-token continuation after a checkpoint restore
- ~41.21 tok/s on a 1676-token generation
- ~42.71 tok/s on a 1689-token generation in a much longer conversation
So not “benchmark fantasy numbers,” but real usable throughput at 64k on a 4060 Ti 16GB.
Other observations
- The startup logs can look “correct” and still produce bad throughput if the effective runtime shape isn’t what you think.
- Looking at n_parallel, kv_unified, n_ctx_seq, n_ctx_slot, n_batch and n_ubatch in the startup logs was way more useful than just staring at the top-level command line.
- Keeping VRAM pressure under control mattered more than squeezing out the absolute highest one-off score.
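If it helps, here's the kind of quick check I mean - pulling the effective values out of a saved startup log. The sample line is synthetic, and the exact log format varies between llama.cpp builds:

```python
import re

# Pull the effective runtime values out of a saved llama-server startup log.
# The sample below is synthetic; real log formatting varies between builds.
def runtime_shape(log_text):
    keys = ("n_parallel", "kv_unified", "n_ctx_seq",
            "n_ctx_slot", "n_batch", "n_ubatch")
    shape = {}
    for key in keys:
        m = re.search(rf"{key}\s*=\s*(\S+)", log_text)
        if m:
            shape[key] = m.group(1)
    return shape

sample = "llama_context: n_ctx_seq = 65536\nllama_context: n_batch = 1024\n"
print(runtime_shape(sample))  # {'n_ctx_seq': '65536', 'n_batch': '1024'}
```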
I did not find a database of tuned configs for various cards; that might be something useful to have.
No_Ebb3423@reddit
Did it get dumber since you’re using Q4?
ea_man@reddit
If you tighten it up really well, this runs in your VRAM: https://huggingface.co/unsloth/Qwen3.5-27B-GGUF?show_file_info=Qwen3.5-27B-IQ4_XS.gguf
Use the KV cache at Q4
np 1
It's tight; it would be better if you don't run a desktop on that GPU (or at most LXQt, not Windows), or use integrated graphics for that.
Yet it's much better than the 35B A3B; it runs at ~half the speed.
Nutty_Praline404@reddit (OP)
Thanks for the suggestion. I agree that the 27B is much better at coding. I just finished testing with Q3_K_M: it runs at 17 tok/s, but the output is better, so it's probably worth it.
ea_man@reddit
Actually 17 tok/s isn't bad for the reasoning/planner model; once you do that, you can use a 4-8B for the agent/tools. Omnicoder 2 comes to mind.
I remind you that Qwen models use XML for tool calls while almost everything else uses JSON, so they don't get mixed up; hence, if you want an AI coding agent harness with Qwen models, you'd better stick to Qwencode.
If your Qwen tool calls fail in various editors, it's because of that. Qwencode is really good at tooling with Qwen LLMs (FFS, it uses ~11K of context just for that! https://github.com/QwenLM/qwen-code/blob/main/packages/core/src/core/prompts.ts /RANT)
DeepBlue96@reddit
For me the 27B runs at 1/4 the speed; I have a 3090... 24 t/s vs 98 t/s for the 35B. They both produce more or less the same quality code (I know the 27B is dense and the other has 3B active parameters, but still, its output is good enough for a RooCode agentic workflow).
ea_man@reddit
> For me the 27B runs at 1/4 the speed
I guess most people can't run the 35B A3B fully in VRAM, while you pretty much have to with the dense model, hence the speed difference.
Supposing they both run in VRAM, the MoE has only 3B active parameters vs 27B.
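Back-of-the-envelope version of that: assuming decode is memory-bandwidth bound and both models are fully in VRAM at the same quant, tok/s scales roughly with the inverse of active parameters.

```python
# If decode is memory-bandwidth bound, tok/s scales roughly with the inverse
# of bytes read per token, i.e. active parameters at the same quant.
def rough_speed_ratio(active_params_b_moe, params_b_dense):
    return params_b_dense / active_params_b_moe

# 3B active (35B A3B MoE) vs 27B dense, both fully in VRAM:
print(rough_speed_ratio(3, 27))  # 9.0 in theory; overheads shrink this a lot
```

In practice attention, the KV cache, and scheduling overhead eat into that theoretical 9x, which is roughly consistent with the 4x reported above.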
Danmoreng@reddit
I recommend trying out the fit and fit-ctx parameters: https://github.com/Danmoreng/local-qwen3-coder-env?tab=readme-ov-file#server-optimization-details
Do you build llama.cpp from scratch or use a pre-compiled binary? Self compilation might be slightly better.
Nutty_Praline404@reddit (OP)
This is great! Thanks for sharing. I'm using the pre-compiled binaries.
v01dm4n@reddit
Hey OP, I also have 16 GB VRAM. I found the results with qwen3.5-27b IQ3_XXS UD much better than the 35B MoE model. That dense model is far more intelligent. I use the KV cache at 4 bits and a ctx of 256k. All of this fits in my 16 GB. I get a speed of ~25 t/s with a 5060 Ti.
I use it with hermes or pi and it does a decent job at coding, research, browsing, writing articles etc.
FewBasis7497@reddit
Thanks, interesting. Could you please share your whole config/params.
v01dm4n@reddit
Sure.
llama-server -m <model>.gguf -c 256000 -ctk q4_0 -ctv q4_0 --no-mmproj
The model is from unsloth. Quant: UD-IQ3_XXS
LocalAI_Amateur@reddit
Wow. I have a 5070 Ti with 16 GB VRAM and I'm not getting anywhere near your performance. But then again, my setup is very different: I'm using LM Studio on a laptop with 32 GB RAM, connected to the video card through OCuLink.
I'm getting at best 37 tokens per second, and that's at a 20,000 context window. I wonder which is the biggest factor: OCuLink, 32 GB of RAM, LM Studio, or something else...
dpenev98@reddit
I have a very similar setup but with an integrated 8GB RTX Pro 1000 Blackwell on my laptop.
It runs at 32 t/s with 128k context. Very happy with it.
Nutty_Praline404@reddit (OP)
As suggested by u/guigouz I also tried llm-server in Docker to see whether its automatic hardware/model tuning could reproduce or beat the manual llama.cpp config I ended up with. For my setup, it did not find a working solution for the 35B 64k case.
What happened:
- llm-server correctly detected my RTX 4060 Ti and the model.
- It picked a moe_offload strategy, only placing 17 layers on GPU and 23 on CPU.
So for this specific hardware/model combo, the takeaway was: my hand-tuned native llama.cpp setup beat llm-server's automatic strategy.
I do still think llm-server is interesting, especially for simpler setups or smaller models, but on this 35B MoE / 64k / 16GB VRAM edge case, it seems to be optimizing for safety/conservatism rather than finding the aggressive-but-working configuration.
The practical lesson for me was: check the effective parallel and kv_unified values yourself.
In other words: llm-server was a good experiment, but it did not replace manual tuning here. If anyone has gotten llm-server to successfully discover a working 35B MoE 64k config on a 16GB card, I'd be interested to compare notes.
guigouz@reddit
This is what it suggests here
It's worth noting that it checks the free ram/vram in the moment that you run it, so if you have other models loaded or processes using the gpu, it will affect the estimation.
Jester14@reddit
What do you mean it "doesn't fit"? Did you use the -fit flag? UD-Q4_K_XL is larger than 16 GB so it will overflow to RAM, but it will also "fit" if loaded appropriately. I get 30 t/s on my 4060 8 GB using -fit with that quant at 40k context in VRAM.
ApprehensiveAd3629@reddit
Is it possible to do this with LM Studio?
guigouz@reddit
Did you try any other quants? I'm running Q6 here @ ~30t/s with 128k context (q4 k/v cache), using the cmdline generated by https://github.com/raketenkater/llm-server (llm-server --dry-run)
I started at Q8, now I'm testing Q6 which is a bit faster with similar quality. I wonder how low I can go.
Btw: I also tested qwen3.5 9b Q8 (almost same speed of 35b) and gemma4 26b (slower and in my coding tests, dumber)
tomByrer@reddit
Nice info!
In most of the tests I've seen, the speed/RAM/accuracy-loss curve is best between Q4 and Q6. Exactly where the sweet spot is depends on the model, who is making the quant, and your use case. Also, I'm guessing you could see some benefit if you fine-tune below Q6, since then the quant will keep more of what you want.
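For rough sizing against VRAM, you can estimate file size from parameter count and bits-per-weight. The bpw figures here are approximate community averages, not exact per-file numbers:

```python
# Approximate GGUF size from parameter count and bits-per-weight.
# The bpw values are rough community averages, not exact per-file numbers.
APPROX_BPW = {"IQ4_XS": 4.3, "Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5}

def approx_size_gb(params_billions, quant):
    return params_billions * APPROX_BPW[quant] / 8

for quant in APPROX_BPW:
    print(f"35B at {quant}: ~{approx_size_gb(35, quant):.1f} GB")
```

That's why a 35B Q4 lands around the low-20s of GB and has to split across VRAM and RAM on a 16 GB card, while Q6/Q8 push well past it.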
SirApprehensive7573@reddit
You here too?
I see you everywhere.
guigouz@reddit
Everywhere there's dev and AI stuff :)
Nutty_Praline404@reddit (OP)
I am looking at others, but wanted to start here as it seemed the edge of feasibility. Thanks for pointing out that tool, but I could not get llm-server working right on Windows under WSL.
tomByrer@reddit
Have you tried it with long sessions? I'm wondering if you have enough room for a larger context, or are only able to run short bursts and/or constantly compact/reset the context.
Elegant_Tech@reddit
I use Q8 and it starts to struggle with file edit at around 90k context. It has to take multiple attempts at writing the edit to get it write. At least it catches and fixes it before finishing the prompt.
tomByrer@reddit
* writing the edit to get it right ;)
Thanks, good to know your experience; it helps me decide for my 24 GB VRAM.
Nutty_Praline404@reddit (OP)
It is working with long sessions. It does drop from peak, and it slows a bit as context grows; for example, it's still at 37 tok/s after a coding session that nearly fills the context.
tomByrer@reddit
Thanks (to both of you)!
I'm setting up my RTX 3090, so I'm hoping I can fit all/most of the model & context in 24 GB VRAM?
BTW, I'm not using that GPU for anything but AI (the monitor is on a 2nd GPU or the iGPU), so I have the full VRAM available.
So Qwen3.5-35B is better than Qwen3-coder for coding?
PaceZealousideal6091@reddit
Hey.. thanks for sharing this. Quick question: what's the --kv-unified flag exactly for? How does it work?
Nutty_Praline404@reddit (OP)
If -np (parallel) is set to auto, it can flip kv_unified to false, which splits the context size across the parallel slots, giving an effectively smaller context (i.e. -np 2 can result in 32k context in each slot to make up the 64k total - not really what you want). At least that is what I understood, but I am no expert.
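Here is my mental model as a sketch (this is my understanding of the behavior, not a statement about llama.cpp internals):

```python
# My mental model of the split: with kv_unified=false the total context is
# divided across parallel slots; with kv_unified=true all slots share one pool.
def per_slot_ctx(n_ctx, n_parallel, kv_unified):
    return n_ctx if kv_unified else n_ctx // n_parallel

print(per_slot_ctx(65536, 2, kv_unified=False))  # 32768 per slot - not what you want
print(per_slot_ctx(65536, 2, kv_unified=True))   # 65536 shared pool
```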
PaceZealousideal6091@reddit
So, why not add -np 1 instead of -kvu?
dreamai87@reddit
You are correct. If np is given, the default is 4, with each slot getting a similar share of the 64k context length.
MrTechnoScotty@reddit
I have found Gemma 4 disappointingly slower than Qwen3.5, but I haven't worked as hard at optimizing it yet.
SmartCustard9944@reddit
A4B vs A3B makes some difference. Also the different KV cache.
Serious-Log7550@reddit
llama-server \
  -ncmoe 17 \
  --webui-mcp-proxy \
  --alias "Qwen 3.5 35B A3B" \
  -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL \
  --no-mmproj \
  --cache-ram 134217728 \
  --ctx-size 131072 \
  --kv-unified \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 \
  --presence-penalty 0.0 --repeat-penalty 1.0 \
  --flash-attn on --fit on \
  --no-mmap \
  --jinja \
  --threads -1 \
  --reasoning on \
  --reasoning-budget 4096 \
  --reasoning-budget-message "...Considering the limited time by the user, I have to give the solution based on the thinking directly now."
Gives me a stable 35-40 t/s regardless of used context percentage.
vk3r@reddit
How?
22 GB in 16 GB?
Mashic@reddit
Spill to system RAM.
Nutty_Praline404@reddit (OP)
It’s not “22 GB in 16 GB VRAM.”
It fits because llama.cpp is using a GPU + system RAM split, not pure-VRAM loading.
For this model, the working setup keeps the attention layers and KV cache on the GPU while a chunk of the MoE expert weights is offloaded to system RAM (that's what n-cpu-moe controls).
So the effective footprint is GPU VRAM for the hot path plus system RAM for the offloaded experts.
That’s why it can run on a 16 GB card even though the total model + runtime footprint is larger than 16 GB.
The tradeoff is lower peak speed than a pure-VRAM load, since the CPU-side experts are slower to reach.
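As a hedged sketch of the arithmetic (the per-part sizes are illustrative guesses, not measured values):

```python
# Illustrative split arithmetic: a ~22 GB total footprint on a 16 GB card.
# All sizes are guesses for illustration; measure yours with nvidia-smi.
def fits_on_gpu(total_gb, offloaded_to_ram_gb, vram_gb, overhead_gb=1.0):
    gpu_resident = total_gb - offloaded_to_ram_gb + overhead_gb
    return gpu_resident <= vram_gb

print(fits_on_gpu(total_gb=22.0, offloaded_to_ram_gb=8.0, vram_gb=16.0))  # True
```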
qubridInc@reddit
This proves you don't need expensive GPUs, just tuned configs; someone should turn this into a shared "GPU config zoo" instead of everyone reinventing the same setup.
ducksoup_18@reddit
Your unsloth link goes to the 9b model. Was a bit confused for a sec.
Nutty_Praline404@reddit (OP)
thanks, fixed it.