RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context; the --n-cpu-moe flag is the most important part.
Posted by marlang@reddit | LocalLLaMA | View on Reddit | 148 comments
Spent an evening dialing in Qwen3.6-35B-A3B on consumer hardware. Fun side note: I had Claude Opus 4.7 (just the $20 sub) build the config, launch the servers in the background, run the benchmarks, read the VRAM splits from the llama.cpp logs, and iterate on the tuning — basically did the whole thing autonomously. I just told it what hardware I have and what I wanted to run.
Sharing because the common --cpu-moe advice leaves a 54% speedup on the table on 16GB GPUs.
Hardware
- GPU: RTX 5070 Ti (16GB GDDR7, Blackwell)
- CPU: Ryzen 9800X3D (96MB L3 V-Cache)
- RAM: 32GB DDR5
- Stack: llama.cpp b8829 (CUDA 13.1, Windows x64)
- Model: unsloth/Qwen3.6-35B-A3B-GGUF, UD-Q4_K_M (22.1 GB)
The finding — --cpu-moe vs --n-cpu-moe N
Everyone’s using --cpu-moe, which pushes ALL MoE experts to CPU. On a 16GB GPU with a 22GB MoE model that means only ~1.9 GB of your VRAM goes to model weights — the other ~12 GB sits idle.
--n-cpu-moe N keeps experts of the first N layers on CPU and puts the rest on GPU. With N=20 on a 40-layer model, the split uses VRAM properly.
Benchmarks (300-token generation, Q4_K_M)
| Config | Gen t/s | Prompt t/s | VRAM used |
|---|---|---|---|
| --cpu-moe (baseline) | 51.2 | 87.9 | 3.5 GB |
| --n-cpu-moe 20 | 78.7 | 100.6 | 12.7 GB |
| --n-cpu-moe 20 + -np 1 + 128K ctx | 79.3 | 135.8 | 13.2 GB |
+54% generation speed, +54% prompt speed vs. naive --cpu-moe. Jumping to 128K context is essentially free thanks to -np 1 dropping recurrent-state memory.
Startup command that works
llama-server.exe ^
-m "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" ^
--n-cpu-moe 20 ^
-ngl 99 ^
-np 1 ^
-fa on ^
-ctk q8_0 -ctv q8_0 ^
-c 131072 ^
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 ^
--presence-penalty 0.0 --repeat-penalty 1.0 ^
--reasoning-budget -1 ^
--host 0.0.0.0 --port 8080
That’s Unsloth’s “Precise Coding” sampling preset. For general use: --temp 1.0 --presence-penalty 1.5.
Gotchas I hit (well, that Opus hit and fixed)
- -np defaults to auto = 4 slots, which wastes memory on recurrent state (~190 MB). Set -np 1 for single-user setups (OpenCode etc.).
- --fit-target doesn’t help here — -ngl 99 + --n-cpu-moe N already gives you deterministic control.
- -ctk q8_0 -ctv q8_0 is nearly lossless and halves your KV cache vs fp16. 128K ctx only costs 1.36 GB VRAM.
- Qwen3.6 is a hybrid architecture — only 10 layers are standard attention, the other 40 are Gated Delta Net (recurrent). That’s why KV memory is so small (rough size sketch below).
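For a rough sanity check on that 1.36 GB figure, here is a back-of-the-envelope KV-cache estimate in Python. The 10 attention layers and q8_0 cache come from the post; the KV-head count and head dimension are illustrative guesses, not published specs for this model, so treat the output as an order-of-magnitude check only.

```python
# Rough KV-cache size for a hybrid model where only a few layers use standard
# attention. Assumed geometry (4 KV heads x 128 head dim) is a guess for
# illustration; q8_0 is treated as ~1 byte per element, ignoring block overhead.
def kv_cache_gib(ctx, n_attn_layers=10, n_kv_heads=4, head_dim=128, bytes_per_elem=1):
    # 2x for K and V
    return 2 * n_attn_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 2**30

print(f"{kv_cache_gib(131_072):.2f} GiB at 128K ctx")  # ~1.25 GiB, same ballpark as the 1.36 GB above
```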
How to tune N for your GPU
Each MoE layer on GPU costs ~530 MB VRAM. Non-MoE weights are ~1.9 GB fixed. For a 40-layer model:
| GPU VRAM | Recommended N |
|---|---|
| 8 GB | stay with --cpu-moe |
| 12 GB | N=26 |
| 16 GB | N=20 (sweet spot) |
| 24 GB | N=8 (fits almost everything) |
Start conservative, watch VRAM during a long-context generation, then step N down by 2-3 until you have ~2 GB headroom (see the calculator sketch below).
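If you'd rather not do the table math by hand, here is a tiny calculator that just encodes the heuristic above (~530 MB per MoE layer on GPU, ~1.9 GB fixed weights, ~2 GB headroom). The per-layer cost depends on the quant, so treat the output as a starting point for the watch-and-step tuning, not a guaranteed match for the table.

```python
# Suggest --n-cpu-moe N from available VRAM, using the post's rough per-layer
# cost. The headroom covers KV cache and compute buffers and is an assumption here.
def suggest_n_cpu_moe(vram_gb, total_moe_layers=40, fixed_gb=1.9,
                      per_layer_gb=0.53, headroom_gb=2.0):
    budget = vram_gb - fixed_gb - headroom_gb            # VRAM left for expert weights
    on_gpu = max(0, min(total_moe_layers, int(budget / per_layer_gb)))
    return total_moe_layers - on_gpu                     # N = MoE layers kept on CPU

for vram in (8, 12, 16, 24):
    print(f"{vram} GB VRAM -> --n-cpu-moe {suggest_n_cpu_moe(vram)}")
```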
TL;DR
Replace --cpu-moe with --n-cpu-moe 20, add -np 1, and you get 79 t/s + 128K context on a 5070 Ti. The 9800X3D’s V-Cache carries the CPU side effortlessly.
And Claude Opus 4.7 on the $20 Pro sub is genuinely good enough now to run this kind of hardware-tuning loop end-to-end — launch servers in background, parse logs, iterate — without hand-holding. Kind of wild.
Happy to test other configs if anyone wants comparisons.
MykeGuty@reddit
I have the same setup as your configuration and it runs great!! Thanks to you and the community :D
italianguy83@reddit
Does it also work for an RTX 5070 12GB?
Historical_Roll_2974@reddit
I'm getting 30 tokens a second with an rx 9070xt but I'm also using lm studio so I can't get all the customisations
googleaddreddit@reddit
with rx 9070xt I get 42 t/s using vulkan.
Affectionate-Mode766@reddit
56-58 t/s with rx 7800xt using Vulkan
googleaddreddit@reddit
huh
Affectionate-Mode766@reddit
Please share
googleaddreddit@reddit
I did some more testing: https://i.redd.it/aqfjk65dxaxg1.png
googleaddreddit@reddit
I just built with GGML_NATIVE, which is the default anyway, but I had disabled it because of breakage some time ago.
dreamai87@reddit
It’s okay that you are exploring all the possible stuff, but the simple command --fit on will get the best out of your configuration.
marlang@reddit (OP)
Solid tip, I actually went back and tested this properly after your comment. You’re right, --fit on arrives at the same MoE split I calculated manually (20 layers overflowing, 20 on GPU). One command vs hardware math, so yeah, clearly the better advice.
Full numbers for anyone reading:
- --n-cpu-moe 20 (my manual tune)
- --fit on (bare)
- --fit on -c 131072
One caveat people should know: bare --fit on silently reduces your context to 4K because it treats -c as an unset argument and minimizes it for max speed. If you want full context (coding/agentic use), you still have to set -c explicitly — then fit only decides the offload split.
So the final recommendation for a 16GB GPU is basically:
Thanks for pushing back — updated my scripts.
SummarizedAnu@reddit
Do you know why --no-mmap doesn't work for me? It's fine for like 24K context, but after that my PC lags to 1 fps and I have to manually restart using the power button. I'm on Arch Linux with an RTX 3060 and 16GB RAM. Thanks 🙏
marlang@reddit (OP)
Here’s what’s happening: the model is 22 GB total. On your 3060 (12 GB) you can fit maybe 14 MoE layers on GPU, the other 26 MoE layers
stay on CPU = ~14 GB of model in RAM before you even open a context. Then KV cache + compute buffers grow with context size. At 24K
ctx you fit in 16 GB. Above that, you spill past 16 GB.
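As a rough sketch of that arithmetic (reusing the post's ~530 MB-per-MoE-layer estimate; the exact split depends on the quant and context):

```python
# How much of the model stays resident in system RAM on a ~12 GB card.
# KV cache and compute buffers come on top of this and grow with context,
# which is what pushes the total past 16 GB above ~24K ctx.
per_moe_layer_gb, total_moe_layers = 0.53, 40
layers_on_gpu = 14                                  # roughly what a 12 GB card fits
layers_on_cpu = total_moe_layers - layers_on_gpu    # 26 layers stay in RAM
print(f"~{layers_on_cpu * per_moe_layer_gb:.1f} GB of expert weights in system RAM")  # ~13.8 GB
```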
With mmap (default), Linux handles this fine, it just evicts cold model pages from the page cache when pressure hits. With --no-mmap,
every page is pinned in llama.cpp’s heap, so the kernel can only swap it. And because swap on a running model = constant thrashing, the
rest of your system (X, browser, everything) gets evicted first → lockup, needs power button.
Fixes, in order of ease:
SummarizedAnu@reddit
I'm using the IQ3_S model, which is 13 GB, and a context of 60K fills about 1GB of VRAM max. That was a --no-mmap problem where it broke something in Arch, I don't know what.
But now I'm using this.
It's faster than before.
./llama-server -m ../Qwen3.6-35B-A3B-UD-IQ3_S.gguf -ctk turbo4 -ctv turbo4 --jinja --flash-attn on -np 1 -ngl 30 --fit-ctx 65536 -ncmoe 20 --alias Qwen3.6-35B --fit on --fit-target 512 -b 1024 -ub 512 --temp 0.7 --top-p 0.95 --top-k 20 --min-p 0.0
But I don't know what's using all that swap RAM.
Wait, I found the command I was using that was causing problems.
It's: ./llama-server -m ../Qwen3.6-35B-A3B-UD-IQ3_S.gguf -ctk turbo4 -ctv turbo4 --jinja -ngl 30 -c 65536 -ncmoe 20 --no-mmap
But with mmap:
./llama-server -m ../Qwen3.6-35B-A3B-UD-IQ3_S.gguf -ctk turbo4 -ctv turbo4 --jinja -ngl 30 -c 65536 -ncmoe 20
I had no problem running this, but it was getting around 20-30 tps. Not good but not bad speed at that time.
Illustrious-Bid-2598@reddit
The K value is sensitive to turbo. I would keep ctk at regular q8 with your ctv turbo4.
SummarizedAnu@reddit
Is that for speed?
Illustrious-Bid-2598@reddit
For quality. The speed gain from 8-bit to 4-bit is so fractional that it's worth keeping K at q8, as the quality gain is significant, especially if you're going to rely on tool calling.
ecompanda@reddit
The heuristic gets you 90% there for most single-user inference. Where it gets tricky is batched requests, or when you want to bias toward more layers on GPU at lower context. Manual gives you that control, but for the typical case fit works well.
Rangali-1@reddit
First of all, thank you very much for the great work!
With the latest final version, however, I only get just under 69 tokens/second with my 5070 Ti and AMD Ryzen 9 9950X3D with 64GB RAM.
What am I doing wrong?
Auo98@reddit
How do I do this in Unsloth chat?
relmny@reddit
Have you tried something like (relevant part being the "-ot ..."):
with that I get about 22.4t/s with Q6_K_L, while with (tested out of curiosity):
--fit on -c 131072 -np 1 -fa on -ctk q8_0 -ctv q8_0
I get about 7.1t/s
and with:
--fit on --fit-ctx 128000 --fit-target 512
I get 2.6t/s
Fristender@reddit
Thanks for posting the results! I would say the 2t/s tg lost with manual tuning is worth the 34t/s gain with pp
Life-Screen-9923@reddit
I use 'fit-target 256' to emulate 'ngl 99' on my rtx 3060 12gb
And add 'mlock' and 'no-mmap' for performance
dreamai87@reddit
Thanks, glad it helps you.
My comment was mainly to help with allocating a better split of the model between GPU and CPU.
-c you have to provide, otherwise it takes the default.
Danmoreng@reddit
No, you need to use fit together with fit-ctx and fit-target: https://github.com/Danmoreng/local-qwen3-coder-env#server-optimization-details
-np 1 sounds interesting though, haven't tried that one out; it might improve my config
st0n1th@reddit
Yeah, I tried with 20 on CPU and it left 2GB free on my 5080 GPU, and I noticed more CPU than GPU utilization. Switched back to --fit.
iamapizza@reddit
Thank you for testing with fit. Even though it's doing a lot of the heavy lifting you did at least validate and learn stuff along the way (and teach me something too)
IrisColt@reddit
heh
mrgreatheart@reddit
Thank you, super helpful
Ferilox@reddit
Except when the model has a vision component. Then it kind of struggles.
abu_shawarib@reddit
Command line says it's already on by default.
lolwutdo@reddit
lol no, -fit gives absolute trash performance if you want to specify and use max context with 16gb gpu.
Using lmstudio and maxing out gpu and moe layers with max context gives better performance than whatever the fuck -fit does in llama.cpp; quite literally a jump from 8 tokens per second on lcpp to 70+ tokens per second on lmstudio.
Danmoreng@reddit
Use fit with fit-ctx and fit-target: https://github.com/Danmoreng/local-qwen3-coder-env#server-optimization-details
lolwutdo@reddit
I appreciate it, but that's definitely not just one command like the original comment implies.
SlipperyCorruptor@reddit
I'm doing 210k ctx window, 30 MoE offload. 7600x, 32GB RAM, 5080. Getting:
prompt eval time = 43716.75 ms / 20087 tokens ( 2.18 ms per token, 459.48 tokens per second)
eval time = 22200.84 ms / 907 tokens ( 24.48 ms per token, 40.85 tokens per second)
total time = 65917.60 ms / 20994 tokens
Output: 40.85 Tokens/sec
Honestly.. That thing is impressive AF.
I've put it through some tests and it even correctly identified environmental issue with testcontainers and Rancher:
https://github.com/testcontainers/testcontainers-java/issues/11482
prob saved me like two days of investigation
Illustrious-Bid-2598@reddit
What do you guys think of an RTX 5060, i7-14700F, 32GB RAM, llama.cpp on WSL2 with models stored on the Linux fs, Windows debloated?
Kiro369@reddit
I tried running Qwen3.6-35B-A3B-UD-IQ4_NL_XL and was getting 9 t/s
With your last command it went up to 50 t/s!
Thanks a lot!
sherrytelli@reddit
using unsloth/Qwen3.6-35B-A3B:Q4_K_M
With your final config I am able to get a stable 42-47 tg/s
my pc specs:
RTX 5060ti 16gb
i5-12400f
32gb ddr4 @ 3200 MHz
The model I was previously using: unsloth/glm-4.7-flash-23-23B-A3B:Q4_K_M. Used to get around 32-35 tg/s
thanks for the config :)
nasty84@reddit
How do we change some of these settings in LM Studio? I am only getting 37 tokens per second with the same kind of hardware.
nasty84@reddit
Here are LM Studio settings. Change them as you like based on your rig:
Load Qwen3.6-35B-A3B-UD-Q4_K_M from unsloth, then in the load modal → Advanced Configuration:
- Context Length: 131072 (or 65536 if it complains)
- GPU Offload: max (all the way right)
- CPU Thread Pool Size: 8 (matches your 7800X3D's 8 cores)
- Flash Attention: ON
- K Cache Quantization Type: Q8_0
- V Cache Quantization Type: Q8_0
- Offload MoE Experts to CPU: 20 ← the key setting
- Try mmap(): ON
- Keep Model in Memory: OFF
texifornian@reddit
Took some work - but on a similar setup-ish...
The Hardware:
Memory and CPU changes weren't letting me run the Q4, but going to Q3 got me to 70 t/s -
.\llama-server.exe ^
 -hf unsloth/Qwen3.6-35B-A3B-GGUF ^
 -m Qwen3.6-35B-A3B-UD-Q3_K_M.gguf ^
 --device CUDA0 ^
 --n-cpu-moe 15 ^
 --mmproj "" ^
 -t 8 ^
 -fa on ^
 -b 2048 ^
 -ub 2048 ^
 -ctk q8_0 ^
 -ctv q8_0 ^
 -c 128000 ^
 --temp 0.6 ^
 --chat-template-kwargs "{\"preserve_thinking\": true}" ^
 --port 8033
raswill0@reddit
Awesome post!
Sharing my results (RTX 5060 Ti 16GB - 32GB DDR5 RAM - AMD Ryzen 7 8700F):
JustSayin_thatuknow@reddit
Amazing job!! I’ll be waiting for your final final final final final command boss! 😅🙏🏻💪💪💪💪
OldPappy_@reddit
Thanks for this. I'm going to try some of these configurations out on my 9070XT.
admajic@reddit
On a 3090 with 24GB VRAM I get 110 tokens/s, and I can afford a car too. Interesting world we live in.
smolpotat0_x@reddit
we are not a car.
admajic@reddit
The car = cost of a video card... geez
SinnersDE@reddit
Thanks for your hard work!
Sharing my results in case somebody cares (RTX 4080 16GB, 32GB DDR4):
.\llamacpp\llama-server.exe -m "./models/Qwen-3.6-35B-A3B-Q4_K_XL/Qwen-3.6-35B-A3B-Q4_K_XL.gguf" --fit on --fit-ctx 128000 --fit-target 256 -np 1 -fa on --no-mmap --mlock -b 2048 -ub 2048 -ctk q8_0 -ctv q8_0 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 --reasoning-budget -1 --chat-template-kwargs "{\"preserve_thinking\": true}" --host 0.0.0.0 --port 8033
Getting 58 t/s at low context, dropping to 43 t/s after the context window fills up to 60-70%. Didn't get further.
SinnersDE@reddit
Just a short question:
How do I set the --no-mmap and --mlock flags in a preset.ini?
[ini_THINK_GENERAL_Qwen-3.6-35B-A3B-Q4_K_XL]
m = ./models/Qwen-3.6-35B-A3B-Q4_K_XL/Qwen-3.6-35B-A3B-Q4_K_XL.gguf
temp = 1.0
top-p = 0.95
top-k = 20
min-p = 0.0
presence-penalty = 1.5
repeat-penalty = 1.0
fit = on
; whether to adjust unset arguments to fit in device memory ('on' or 'off', default: 'on')
fitc = 128000
; minimum ctx size that can be set by --fit option, default: 4096
fitt = 256
; target margin per device for --fit, comma-separated list of values, single value is broadcast across all devices, default: 1024
np = 1
; number of server slots (default: -1, -1 = auto)
fa = on
; set Flash Attention use ('on', 'off', or 'auto', default: 'auto')
b = 2048
; logical maximum batch size (default: 2048)
ub = 2048
; physical maximum batch size (default: 512)
ctk = q8_0
; KV cache data type for K (allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1, default: f16)
ctv = q8_0
; KV cache data type for V (allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1, default: f16)
-no-mmap ???
; --mmap, --no-mmap whether to memory-map model. (if mmap disabled, slower load but may reduce pageouts if not using mlock) (default: enabled)
-mlock ???
; force system to keep model in RAM rather than swapping or compressing
chat-template-kwargs = {"preserve_thinking":true}
reasoning-budget = -1
milpster@reddit
how do you deal with having such low context?
Emergency-Most1859@reddit
Bro 🔥🔥 Running this model with qwen code and it works better and kinda smarter than Alibaba's cloud Qwen that I've used before. They discontinued the free tier so I started to look for alternatives.. Really impressed with that model quality. Works fine on an RX 6800 with a 7900X3D (changed some flags though).
Dreeseaw@reddit
To add a datapoint, my recently-purchased prebuilt gaming PC (iBP Y40 Pro with a 5080 (16gb vram), 32gb ram, 9800) is executing fat 100k context prompts on the order of 45s, and breezing through opencode driven workflows (largely replacing the analysis portion of an optimization loop I work with).
OP this is black magic. Thank you.
Mister_bruhmoment@reddit
Hey, I basically have the last-gen version of your rig besides the RAM: 4070 Ti Super, R7 7800X3D. Are those settings applicable in LM Studio? I am still figuring out how everything works with LLMs atm.
marlang@reddit (OP)
LM Studio settings for your rig
Load Qwen3.6-35B-A3B-UD-Q4_K_M from unsloth, then in the load modal → Advanced Configuration:
- Context Length: 131072 (or 65536 if it complains)
- GPU Offload: max (all the way right)
- CPU Thread Pool Size: 8 (matches your 7800X3D's 8 cores)
- Flash Attention: ON
- K Cache Quantization Type: Q8_0
- V Cache Quantization Type: Q8_0
- Offload MoE Experts to CPU: 20 ← the key setting
- Try mmap(): ON
- Keep Model in Memory: OFF
x10der_by@reddit
Wow it's magic. With "GPU Offload" to max and "Offload MoE Experts to CPU" to 20 speed increased from 15 to 50 t/s on my config O_o
moahmo88@reddit
Try this @ 59 t/s with 5070ti:
LM Studio settings for you:
Load Qwen3.6-35B-A3B-UD-Q5_K_M from unsloth (you can use Q5_K_M):
- GPU Offload: max (all the way right)
- Offload MoE Experts to CPU: 24 ← the key setting
nixudos@reddit
Thank you for the tip! I was struggling to get any meaningful speed on a 4090 with the Q6_K_XL size. This doubled my speed from 18 t/s to 42!
Comfortable_Dog1610@reddit
I have the same speed on a 7900 XTX using the Q6_K_XL model. I can sleep tonight if I see you
BubrivKo@reddit
Thanks bro. On my AMD configuration I get an additional 15-20 tk/s :P
The_Dung_Beetle@reddit
Thanks! I get about 25 tok/sec. Rig I've tested this on : 6950XT/5700X3D/32GBDDR4@3200.
monacoax@reddit
What settings would you suggest for a 4090 + 12700K? Thanks for the info!
marlang@reddit (OP)
If you want the full 128K ctx: with 24 GB VRAM the KV cache at Q8_0 eats ~1.4 GB + compute buffers ~0.6 GB + non-MoE weights ~1.9 GB, leaving ~19 GB for MoE experts. Each expert layer costs ~530 MB on GPU, so ~36 of the 40 layers fit (a quick worked budget is sketched after the settings list below).
And leave the CPU thread pool at 8, I think that's best for the 12700K.
- Context Length: 131072 (or 65536 if it complains)
- GPU Offload: max (all the way right)
- CPU Thread Pool Size: 8
- Flash Attention: ON
- K Cache Quantization Type: Q8_0
- V Cache Quantization Type: Q8_0
- Offload MoE Experts to CPU: 4 ← the key setting (40 layers minus the ~36 that fit on GPU)
- Try mmap(): ON
- Keep Model in Memory: OFF
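For reference, a minimal sketch of the VRAM budget behind that 36-layer figure, reusing the post's estimates; the ~1 GB safety margin is my own assumption.

```python
# 24 GB card, full 128K context: budget what's left for MoE expert layers.
vram, fixed_weights, kv_q8_128k, buffers, margin = 24.0, 1.9, 1.4, 0.6, 1.0
per_layer, total_moe_layers = 0.53, 40

budget = vram - fixed_weights - kv_q8_128k - buffers - margin   # ~19 GB for experts
on_gpu = min(total_moe_layers, int(budget / per_layer))         # ~36 layers fit
print(f"{on_gpu} MoE layers on GPU -> offload {total_moe_layers - on_gpu} to CPU")
```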
yoohjm@reddit
lmstudio user with a RTX 5070ti here. This is amazing, such a speed up from my previous config and that context length is much more than i thought i would be able to fit
many thanks!
Mister_bruhmoment@reddit
Thank you so much!
Embarrassed_Elk_4733@reddit
Yes, absolutely. My setup is a Ryzen 7 5800X3D + RTX 4070 Ti Super + 32GB DDR4. Running a 128K context window in LM Studio gives me around 39–40 tokens/s. However, when I used the Llama configuration provided by the original poster, the same hardware achieved 45–46 tokens/s in Llama. Sharing this for your reference!
Guilty_Rooster_6708@reddit
This is literally perfect for me. Thanks for the tip on mlock and ub !!
TodayExcellent9756@reddit
Hey, thanks a lot for this!
I’m able to achieve 55 tok/s using your config on these specs: RTX 5070 Ti (16GB) with 32GB DDR4 and an Intel Core i5-14600KF. The coding results are amazing too. Now I won’t get stuck when my Codex runs out of tokens. 🤣
BitGreen1270@reddit
Thank you for sharing. I only have a 780M, which I'm using with Vulkan and gemma:e4b. I assume most of what you've shared is not applicable to me because I don't have dedicated VRAM?
CriticalCup6207@reddit
The --n-cpu-moe flag is doing serious work here. For anyone who hasn't seen it: it offloads the MoE routing to CPU, which frees VRAM for the active expert weights and meaningfully improves throughput on cards that would otherwise bottleneck. On our setup (3090 + i9) we saw ~40% throughput improvement. The 9800X3D's cache size probably also helps with the routing overhead on the CPU side.
Ok-Palpitation-905@reddit
Nice.
moahmo88@reddit
Amazing work! Thanks a million for sharing.
rebelSun25@reddit
Nicely done
Cool-Cap2509@reddit
I just tried it. Getting 24 t/s in processing. What am I doing wrong? I got the same model, 9950X3D + 64GB RAM + 4080 Super. Can you please suggest any solution?
marlang@reddit (OP)
- CUDA0 model buffer size = XXXX MiB — if this is 0 or tiny, nothing's on GPU.
On 64 GB you can safely drop --no-mmap --mlock — you don't need them.
-t 8 --cpu-mask 0xFF
llama-bench.exe -m "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -fitt 256 -fitc 65536 -fa 1 -ctk q8_0 -ctv q8_0 -b 2048 -ub 2048 -p 2048 -n 128 -r 3
Should give 3000+ pp2048 and ~100 tg128.
Cool-Cap2509@reddit
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 26614 MiB of device memory vs. 14997 MiB of free device memory
llama_params_fit_impl: cannot meet free memory target of 256 MiB, need to reduce device memory by 11873 MiB
llama_params_fit_impl: context size reduced from 262144 to 128000 -> need 2668 MiB less memory in total
llama_params_fit_impl: with only dense weights in device memory there is a total surplus of 9456 MiB
llama_params_fit_impl: filling dense-only layers back-to-front:
llama_params_fit_impl: - CUDA0 (NVIDIA GeForce RTX 4080 SUPER): 41 layers, 5748 MiB used, 9248 MiB free
llama_params_fit_impl: converting dense-only layers to full layers and filling them front-to-back with overflow to next device/system memory:
llama_params_fit_impl: - CUDA0 (NVIDIA GeForce RTX 4080 SUPER): 41 layers (21 overflowing), 14720 MiB used, 276 MiB free
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.34 seconds
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4080 SUPER) (0000:01:00.0) - 15061 MiB free
llama_model_loader: loaded meta data with 54 key-value pairs and 733 tensors from q35.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
marlang@reddit (OP)
Your setup is actually fine, 21 layers overflow, GPU is being used correctly. This is netting you 24t/s?
Cool-Cap2509@reddit
BTW, am I getting half the token rate because it is one generation older?
Cool-Cap2509@reddit
I didn't use code. I gave it a New Yorker magazine and asked it to summarize an article, and the processing was 24 t/s. Tried different variations with Gemini's help and was able to get 1300 t/s at 50% context; then it slowly drops down to less than 100 t/s. But hey, as you said, I don't need to use those two memory-related lines. That solved the 90% memory usage issue. Thank you for sharing, otherwise I would never have downloaded a 35B model. The highest I had tried was Gemma 4 26B Q4, and that was already slow enough and spilled the VRAM usage.
Cool-Cap2509@reddit
I believe I am using the same model.
.\llama-bench.exe -m q35.gguf -ngl 99 --n-cpu-moe 20 -fa 1 -ctk q4_0 -ctv q4_0 -b 2048 -ub 2048 -p 2048 -n 128 -r 3
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 16375 MiB):
Device 0: NVIDIA GeForce RTX 4080 SUPER, compute capability 8.9, VMM: yes, VRAM: 16375 MiB
load_backend: loaded CUDA backend from C:\ai\llama\ggml-cuda.dll
load_backend: loaded RPC backend from C:\ai\llama\ggml-rpc.dll
load_backend: loaded CPU backend from C:\ai\llama\ggml-cpu-zen4.dll
| model | size | params | backend | ngl | n_cpu_moe | n_ubatch | type_k | type_v | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | -------: | -----: | -----: | -: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium | 20.60 GiB | 34.66 B | CUDA | 99 | 20 | 2048 | q4_0 | q4_0 | 1 | pp2048 | 2456.42 ± 139.48 |
| qwen35moe 35B.A3B Q4_K - Medium | 20.60 GiB | 34.66 B | CUDA | 99 | 20 | 2048 | q4_0 | q4_0 | 1 | tg128 | 77.47 ± 1.05 |
build: 23b8cc499 (8838)
BuildDevv@reddit
As a new player for local llm’s, scrolling through the comments, this community is very supportive. Thanks for the tip y’all!
BP041@reddit
Wow, 79 t/s with Qwen3.6-35B-A3B on consumer hardware is fantastic! This is exactly the kind of optimization that pushes local LLM development forward. At NTU, and in my work with CanMarket's AI infrastructure, we're always looking for ways to maximize inference speed and context handling on diverse hardware setups. Could you elaborate on how much --n-cpu-moe impacted performance for you, or if you encountered any specific bottlenecks you had to tune around?
jadbox@reddit
This is amazing... but why can't Llama do this all automatically for us?
nextgenpotato@reddit
I have the exact same hardware as you do. Trying to run your final final command, I am getting OOM errors. What am I missing? I am on Ubuntu 26.04 and a noob when it comes to llama.cpp
Potential-Leg-639@reddit
A recent Fedora kernel upgrade made my system unstable and I also had strange issues in llama.cpp. With the kernel version from before, everything is fine again.
FatheredPuma81@reddit
That certainly does sound like Opus with its training data being from pretty much around the time they switched from specifying the tensors to the much better --n-cpu-moe command.
fucking_cuntbag@reddit
Thanks for this - I have the same setup and was struggling to get a reasonable tps. I had switched to lower quants, but with this config I can get 80 tps on IQ4.
bdsmmaster007@reddit
I've not fumbled around with local hosting in quite a while, but Qwen intrigues me. Though I'm on AMD and not sure how the support is looking. Can anybody estimate how much I could get on a 7600X and an RX 6800?
New_Spray_7886@reddit
I get 25 t/s with a 6700xt +32gb ram when setting aside vram for full context (prefill is ~300 t/s), so you should be quite workably higher than that. This qwen-3.6 moe is quite a bit more performant than even gemma-4 on these older consumer setups
deRTIST@reddit
I'm at 12 t/s with a 6800 XT and 64GB of RAM, would you mind sharing settings? Anything different than OP?
New_Spray_7886@reddit
Here are two settings.
I use this when I want more ram available to use the computer at the same time (i.e. web-browsing), it is like OPs. Qwen3.6-35B-A3B-IQ4_XS (Bartowski) is 24-25 t/s here @ no context, 22 t/s @ 20% context (50k or so). Q4_K_M is a little slower at 20.5 t/s @ 20% context. I have many quants left to try but I like IQ4_XS so far.
export HSA_OVERRIDE_GFX_VERSION=10.3.0
llama-server \
 --model /path/to/Qwen_Qwen3.6-35B-A3B-IQ4_XS.gguf \
 --jinja \
 -c 0 \
 -ngl 99 \
 --no-mmap \
 --cpu-moe \
 --n-cpu-moe 186 \
 --min-p 0.0 \
 --top-p 0.95 \
 --top-k 20 \
 --temp 0.6 \
 --parallel 1 \
 --chat-template-kwargs '{"preserve_thinking":true}' \
 --host 127.0.0.1 \
 --port 8033 2>&1 | tee /path/to/log.txt
I use this if I'm not also using the computer - maybe this will be agents running overnight soon. Llama.cpp maximizes the performance by doing the fitting for you, so it is easier than testing how many layers you can offload. ~27.4 tok/s at no context as above.
export HSA_OVERRIDE_GFX_VERSION=10.3.0
llama-server \
 --model /path/to/Qwen_Qwen3.6-35B-A3B-IQ4_XS.gguf \
 --jinja \
 -c 0 \
 --no-mmap \
 --fit on \
 --min-p 0.0 \
 --top-p 0.95 \
 --top-k 20 \
 --temp 0.6 \
 --parallel 1 \
 --chat-template-kwargs '{"preserve_thinking":true}' \
 --host 127.0.0.1 \
 --port 8033 2>&1 | tee /path/to/log.txt
The logging is nice as if it runs slower than you want you can just ask the LLM to calculate how many --n-cpu-moe layers you should offload by uploading that file & your server start-up command. I tested smaller context sizes and the speeds were very minimally different on my setup so I'm keeping the max currently.
deRTIST@reddit
Oh, you're on XS, that makes sense, I'll try a lower quant then. Right now I'm at 16 tk/s after a couple of hours of hammering at it (apparently for my use case OP was right, n-cpu-moe was actually the way).
Thanks for the tips, I'll try it tomorrow.
zkstx@reddit
I'm getting 60-70 tps TG / 1300 tps PP, with up to 55k context window (100k+ @ Q8 KV) for Qwen3.5 35B IQ3_XXS on my RX6800 XT running llama.cpp compiled for rocm. It handles most things I care about pretty well. Will switch to 3.6 very soon
deRTIST@reddit
Yeah, I think I might need to get a smaller quant honestly; Q4_K_M at the speeds I'm talking about is way too sluggish. Quality is good, but it takes 1h to fully execute a (by the end of it) 65k-token task.
bdsmmaster007@reddit
thanks for the specific numbers :0, motivates me to fumble around
DefNattyBoii@reddit
I think the problem you will face is more with prompt processing speed. I'd recommend checking out the latest Vulkan build of llama.cpp, that's the easiest to get started with. Above 10 tokens/s is usable but not perfect; it depends on the length of your context.
bdsmmaster007@reddit
Thx for the pointers to start, will look into it ^^!
SeriousPanic34@reddit
I'm getting 40 tk/s with 7900 gre 16gb + 32gb ram on core ultra 7 265k with OPs config
bdsmmaster007@reddit
oh damn giving me hope :0, thx for the reply
Artistic_Okra7288@reddit
I’ve been getting about 30tps tg at 1M context on my M5 Max 128GB with q4_0 kv and using unsloth Q8_K_XL gguf.
Nnyan@reddit
Thank you for this I’m just starting my local LLM project and have a few GPU options similar to yours.
Horror-Veterinarian4@reddit
16GB VRAM, nice. I know what my next move is: I want to test and see how Gemma 4 26B (e123abc, whatever the fuck it is) runs compared to this one on my ancient E5-2697 v2 and V100 16GB VRAM.
nikolaiownz@reddit
Almost the same setup I have. I get 72ish tk/s
Thanks for this good thread. I am going to mess around with it next week. From what I saw just tinkering around with it and opencode - this is very good.
MysticOrbit7@reddit
I got 59 tok/s with the Edit 3 config on a 5060 Ti 16GB + 9950X + 128GB DDR6. Has anyone gotten better on this chip?
Late_Session7298@reddit
I’m using oMLX on an M2 Pro Max with 32 gigs, at 128K context with 35 t/s speed.
The simplest setup ever!
konohrik@reddit
Why not use exl2 instead of gguf?
marlang@reddit (OP)
because 22gb model > 16gb vram
quick maths
omidmatin@reddit
Can you guys help me with the best config for my RTX 3090 + 5800X3D? I need a large context window (at least 512K and preferably 1M). I think it's possible with this MoE model on TurboQuant?!
Ill-Stand-6678@reddit
Where did you find a llama.cpp project with working TurboQuant?
omidmatin@reddit
I asked Gemini and just copy-pasted it. Currently I've got it working with 256K context and turbo3, but I used the Q4_UD_S version. It's running at 100 t/s. I can run Q4_UD_XL with 192K context at around 95 t/s, but then I also have to offload the mmproj to RAM.
BassAzayda@reddit
I use the 3 fits, so --fit on --fit-ctx 128000 --fit-target 512. MoE and dense, works a treat every time.
marlang@reddit (OP)
Thank you! my startup scripts get better and better with every comment in here
minceShowercap@reddit
Are you editing OP with the latest update to your startup script?
I've found a few comments confirming that someone has helped and you have updated, but it's hard to know which is latest without you updating OP.
Great thread though. Exactly what I need because I want to try this model later for local coding.
Have you found it to be strong at local coding?
ecompanda@reddit
Yeah, updated the OP twice now. Dropped all the manual layer flags and am just using --fit on instead. Numbers are basically the same, but the config is way cleaner.
marlang@reddit (OP)
I've been comparing it mainly against Gemma 4 and, subjectively, Qwen3.6-35B-A3B is clearly better for coding, and for agentic coding it's miles ahead.
PaceZealousideal6091@reddit
If you have already set the context at 128K, why do you need --fit-ctx 128000? Any reason? Can you explain how these 3 together help?
vialoh@reddit
Does the `--n-cpu-moe` matter for those of us on Apple silicon? I suppose I could just ask AI... 😅
pefman@reddit
Good findings!
Objective-Stranger99@reddit
Your recommendation for 8 GB is wrong. I found --cpu-moe to be up to 20% slower on my GTX 1080 compared to tuning --n-cpu-moe
andy2na@reddit
what is the benefit of N=8 on a 24gb VRAM GPU for Qwen3.6-35B? With q8/q8 cache, you can already fit 256k context with the IQ4_NL quant, and likely still close to that with the Q4_K_M fully on GPU
My llama-swap config:
KptEmreU@reddit
Commenting to save. Great experiment
inquam@reddit
I managed to just squeeze Q5_S with 260K context into my 5090 entirely in VRAM when using Q8_0 for the KV cache.
I was on Qwen 3 Coder a long time and then Qwen Coder Next for a bit. And also a stint on 3.5. But 3.6 seems pretty solid so far.
ecompanda@reddit
79 t/s at 128K is genuinely impressive for a 35B model on consumer hardware. The interesting part is what happens as you actually fill that context. MoE attention at max context can be unpredictable, and some models drop to 30-40 t/s by 100K tokens in. Did you observe any speed drop as the conversation grew, or did it hold steady?
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
mr_Owner@reddit
You missed the preserve_thinking flag though. Also play with ubatch size: start at 4096 and drop lower. Ubatch impacts the PP speed and VRAM size.
altdotboy@reddit
I have spent the last week building my own harness. This has proven to be the most important test for my rig.
11118888888855 → 118885 | 79999775555 → 99755 | AAABBBYUDD → ? Solve the pattern, and put your final answer within \boxed{}.
My system would get it correct maybe 1 in 10 times. I had to tune my system settings and prompt to get it right at least 3 times in a row.
What this test exposes is the delicate sensitivity of MoE router gates. Simply put: are your prompts going to the correct experts?
Dense models have an easier time with the question. Give this a shot and see if your system gets it correct 3 times in a row with fresh context each time.
Quantization, incorrect settings, and poor system prompts will hurt your MoE model.
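If you want to run this check against a llama.cpp server rather than by hand, here is a minimal harness sketch. It assumes the OpenAI-compatible /v1/chat/completions endpoint on localhost:8080 (adjust the URL and sampling to your setup); it just prints the \boxed{} answer from a few fresh-context runs so you can judge consistency yourself.

```python
# Send the pattern prompt several times with a fresh context and report the
# extracted \boxed{...} answer from each run.
import re
import requests

PROMPT = ("11118888888855 → 118885 | 79999775555 → 99755 | AAABBBYUDD → ? "
          "Solve the pattern, and put your final answer within \\boxed{}.")

for i in range(3):
    resp = requests.post("http://localhost:8080/v1/chat/completions", json={
        "messages": [{"role": "user", "content": PROMPT}],
        "temperature": 0.6,
    }, timeout=600)
    text = resp.json()["choices"][0]["message"]["content"]
    boxed = re.search(r"\\boxed\{([^}]*)\}", text)
    print(f"run {i + 1}: {boxed.group(1) if boxed else '(no boxed answer)'}")
```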
cesaqui89@reddit
Is it possible to apply this tuning in Ollama?
mrgreatheart@reddit
Thank you. I have a very similar system to yours, and it’s great to know I can run 3.6 so well on it.
Does --fit-ctx 128000 mean a 128K context window in system RAM?
HockeyDadNinja@reddit
I'm running a 5060 ti 16G and 4060 ti 16G with 64G system ram here. A couple days ago I finally started tuning. I've added things from your post and now I'm running Qwen3.6-35B-A3B at Q8. 98k context, a small overflow to CPU.
I'm using opencode and it's doing really well. I can code with this! 27 t/s at the moment. That used 3090 is looking really good right now.
slippery@reddit
You just kicked off my next project. Thanks for the detailed write up!
I'm going to try to get it running on a 12 GB 4070Ti.
AcrobaticChain1846@reddit
hey, I'm trying to run the `unsloth/qwen3.6-35b-a3b UD Q2_K_XL`
I also have same hardware as yours
5070 Ti
64 GB RAM
9950x3d
Can you help me with setting things up?
https://www.reddit.com/r/LocalLLaMA/comments/1soqtry/comment/oguxrw1/
Also I'm getting really slow prompt processing...
I want to know which model you are using like is it q4_k_xl or something else?
Thank you :)
dreamai87@reddit
Okay I saw your comment that brought me here.
Looking at the settings: move the slider to 50% first, then check the performance. Bringing it to 100% puts all experts on CPU, which also reduces performance, but it's still better than what you are getting. So first check at 50%, then look at GPU usage in Task Manager (10/12 or 8/12 GB).
Then reduce from 50% toward 20% and see where you get the best GPU usage; find the balance around 10/12 GB, assuming an RTX 5070 with 12GB VRAM.
I replied on your thread as well
AcrobaticChain1846@reddit
That did the trick.
Now I'm running `qwen3.6-35b-a3b@q5_k_m` with the following settings.
Based on LM Studio logs I'm getting 45-50 tk/s and 300-500 pp/s.
I think this can be further fine-tuned, as I can see my CPU usage is higher compared to GPU, but I'm not sure which parameters to play with.
CPU utilization 62%
GPU Utilization 50% sometimes it spikes to 70-80%
dreamai87@reddit
First eject the model, then bring the slider from 28 down to 22 or somewhere around 20. See if GPU usage is around 10GB of the 12GB (about 90%); leave ~2GB of space for the KV cache.
AcrobaticChain1846@reddit
I have 5070ti so 16GB VRAM + 64GB RAM
MoE on CPU: 22 - my graphics driver froze.
MoE on CPU: 24 - Getting 5tk/s almost 10x performance decrease
MoE on CPU: 25 - Getting 50tk/s and \~80% GPU and \~50% CPU
MoE on CPU: 26 - Getting 16tk/s and \~90-100% GPU and \~40-50% CPU
Note: Evaluation Batch Size was set to 4096 in all scenarios.
I generally use 2048, but I was getting slow prompt processing speeds, so I set it to 4096.
Hope these stats are useful, considering how a single MoE layer more or less on CPU causes such drastic changes in token generation speed.
PiotreksMusztarda@reddit
Confirming on Linux (Ubuntu 26.04, 5070 Ti, CUDA 12.4 with sm_89 PTX fallback), 76 tok/s with your --fit config, and heads up: if you load the vision mmproj, add --no-mmproj-offload or it OOMs right after model load.
kisiel02@reddit
I only get like 15 t/s with an RTX 5070 (19 layers on GPU) and DDR4 RAM sadly. And when compressing the KV cache to q8 I get 25 t/s. Seems like too big of a boost; I have like 10/12GB VRAM and 26/32GB RAM taken.
Several_Newspaper808@reddit
Hey, great info, thanks! I wonder though, how much of the perf is from the DDR5 RAM and whatever PCIe bus speed you have on your motherboard?
frozenYogurtLover2@reddit
Anyone else getting crashes and segfaults (error 139) with prompt cache enabled?
met_MY_verse@reddit
I’m running a smaller quant with less context entirely in VRAM, I’m assuming this is faster than offloading any experts at all?
AncientGrief@reddit
Nice work. Did some testing myself too now. RTX 4090 with 131K context size. Used opencode to create a C# Snake clone with SFML 3.0 ... 75% context used (it had to actually look up the NuGet specs for SFML 3.0 to fix some errors it produced automatically, it's a rather new release afaik) ... works pretty well and was about done in < 5 minutes.
One-shotted it easily.
~159.9 tok/s
With:
And opencode.json:
TurnUpThe4D3D3D3@reddit
Solid numbers! The 4090's 24GB VRAM really flexes here — almost double my gen speed. Makes sense with --fit-target 1536 giving you way more headroom to pack MoE layers on GPU compared to my 256.
One thing I noticed: you're using --split-mode none + --main-gpu 0 with --gpu-layers auto. That's clean for single-GPU. But since the 4090 can likely fit nearly all MoE layers, you might want to check what --fit actually landed on for N (it should log it at startup). Curious if you're hitting the full GPU or still offloading some experts to CPU.
Also, one-shotting a Snake clone with SFML 3.0 lookups at 75% context is a great real-world test. The model handling live package spec retrieval and auto-correcting from it is impressive for a 35B MoE.
^(This comment was generated by GLM-5.1)
o0genesis0o@reddit
I used to do this test by hand with the previous 30B A3B model. Managed to bring tg from 20-ish to 40-ish on my 4060 Ti with 64K max context by playing around with n-cpu-moe.
Ill_Initiative_4233@reddit
Tried this combination on my PC: i7-14700KF, 32GB RAM, 5070 Ti GPU. 3312 tokens, 53s, 61.76 t/s.
Ranmark@reddit
IIRC you can drop your top_p, presence_penalty, and reasoning_budget args, as they have these values by default: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md
FriendlyTitan@reddit
Have you tested higher batch and ubatch numbers? I notice that for myself, giving up more experts to cpu and giving vram to batch improves prefill speed massively. Set -b and -ub to 4096 or even higher if you want to experiment.
truthputer@reddit
In my testing context 256k (44 t/s) was slightly faster than context 128k (35 t/s). But my hardware is weird and heavily leans on the CPU with that context size.
Commenting here to remind me to try your config and will update this comment later.
Jackw78@reddit
The prefill speed is either inaccurate due to cold startup, or something is very wrong with the setup. Should be 1K t/s minimum for a 5070 Ti.
marlang@reddit (OP)
You were right. I went back with llama-bench.exe, the right tool instead of a short completion test, and got:
- pp512: 927 t/s
- pp2048: 1068 t/s
- tg128: 82 t/s