Qwen3.6 35B MoE on 8GB VRAM — working llama-server config + a max_tokens / thinking trap I ran into
Posted by Antonio_Sammarzano@reddit | LocalLLaMA | View on Reddit | 34 comments
Hi all,
I wanted to share a setup that’s working for me with Qwen3.6-35B-A3B on a laptop RTX 4060 (8GB VRAM) + 96GB RAM.
This is not an interactive chat setup. I’m using it as a coding subagent inside an agentic pipeline, so some of the choices below are specific to that use case.
Hardware / runtime
- GPU: RTX 4060 Laptop, 8GB VRAM
- RAM: 96GB DDR5
- Runtime: llama-server
- Model: Qwen3.6-35B-A3B GGUF
- Use case: coding subagent / structured pipeline work
Current server command
llama-server \
-m Qwen3.6-35B-A3B-Q4_K_M.gguf \
-ngl 99 \
--n-cpu-moe 99 \
-c 50000 \
-np 1 \
-fa on \
--cache-type-k q8_0 \
--cache-type-v turbo2 \
--no-mmap \
--mlock \
--ctx-checkpoints 1 \
--cache-ram 0 \
--jinja \
--reasoning on \
--reasoning-budget -1 \
-b 2048 \
-ub 2048
PowerShell env:
$env:LLAMA_SET_ROWS = "1"
$env:LLAMA_CHAT_TEMPLATE_KWARGS = '{"preserve_thinking":true}'
Notes on the non-obvious choices
- `--n-cpu-moe 99`: on 8GB VRAM, I’m currently pushing MoE layers to CPU. This is partly based on my own constraints and partly on community tuning discussions, not on official guidance.
- `-np 1`: this is a single-user / single-agent setup, so I don’t want extra slots wasting RAM.
- `-b 2048 -ub 2048`: in my tests this gave noticeably better prefill on prompts above ~2K tokens than lower defaults.
- `LLAMA_SET_ROWS=1`: community tip, easy to try, seems worth keeping.
- `preserve_thinking: true`: I’m using this because Qwen3.6 explicitly supports it, and for agent workflows it helps keep prior reasoning in cache instead of re-deriving everything every turn.
Important distinction: official vs empirical
A few things here are officially documented for Qwen3.6:
- `enable_thinking`
- `preserve_thinking`
- thinking mode being on by default
- recommended sampling presets for coding / thinking / non-thinking use
Other parts of this config are just my current best empirical setup or community-derived tuning, especially around MoE placement, KV config, and batch / ubatch choices.
So I’m posting this as “working setup + observations”, not as a universal best config.
The trap I ran into: thinking can eat the whole output budget
What initially looked like a weird bug turned out to be a budgeting issue.
I’m calling llama-server through the OpenAI-compatible API with chat.completions.create, and I was setting max_tokens per request.
With:
- `--reasoning on`
- `--reasoning-budget -1`
- moderately large prompts
- coding tasks that invite long internal reasoning
…the model could spend the entire output budget on thinking and return no useful visible answer.
In practice I saw cases like this:
| max_tokens | thinking | finish_reason | visible code output | elapsed |
|---|---|---|---|---|
| 6000 | ON | length | empty / unusable | ~190s |
| 10000 | ON | length | empty / unusable | ~330s |
| 5000 | OFF | stop | ~3750 tokens of clean code | ~126s |
So for some coding tasks, the model wasn’t “failing” in the classic sense. It was just burning the whole budget on reasoning.
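For context, here is a minimal sketch of the request pattern where I hit this, using the standard `openai` Python client pointed at llama-server (the base URL, model name, and prompt are placeholders, not my exact pipeline code):

```python
# Minimal sketch of the call pattern that hits the trap, assuming the standard
# openai client talking to a local llama-server. URL/model/prompt are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="Qwen3.6-35B-A3B",   # placeholder; llama-server serves whatever it loaded
    messages=[{"role": "user", "content": "Refactor this module ..."}],
    max_tokens=6000,            # shared budget: thinking AND visible output
)

choice = resp.choices[0]
# With --reasoning on and --reasoning-budget -1, long thinking can consume all
# 6000 tokens: finish_reason == "length" and the visible content comes back empty.
if choice.finish_reason == "length" and not (choice.message.content or "").strip():
    print("Budget spent on thinking, no usable output")
```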
The useful part: there is a per-request fix
I originally thought reasoning budget might only be controllable server-side.
But llama-server supports a per-request field:
{
"thinking_budget_tokens": 1500
}
As I understand it, this works if you did not already fix the reasoning budget via CLI.
So the cleaner approach for my use case is probably:
- don’t hardcode a global reasoning budget if I want request-level control
- disable thinking for straightforward refactors
- use bounded thinking for tasks that genuinely benefit from it
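A minimal sketch of what that request-level control could look like from my agent code, assuming the server was launched without a fixed reasoning budget and that `thinking_budget_tokens` is accepted as an extra request field (I still need to verify this end to end; everything else in the snippet is a placeholder):

```python
# Sketch of bounded thinking per request via extra_body. Assumes llama-server
# accepts "thinking_budget_tokens" as a request-level field, as described above,
# and that no global reasoning budget was hardcoded at startup.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="Qwen3.6-35B-A3B",
    messages=[{"role": "user", "content": "Design the caching layer for ..."}],
    max_tokens=5000,
    extra_body={"thinking_budget_tokens": 1500},  # cap thinking, keep room for output
)
print(resp.choices[0].message.content)
```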
My current rule of thumb
Right now I’m leaning toward:
| Task type | Thinking | My current view |
|---|---|---|
| Clear refactor from precise spec | OFF | better throughput, less token waste |
| Moderately ambiguous coding | ON, but bounded | probably best with request-level budget |
| Architecture / design tradeoffs | ON | worth the cost |
| Fixed-schema extraction / structured transforms | OFF | schema does most of the work |
One more thing: soft switching thinking
For Qwen3.6, I would not rely on /think or /nothink style prompting as if it were the official control surface.
The documented path is chat_template_kwargs, especially enable_thinking: false when you want non-thinking mode.
So my current plan is to switch modes that way instead of prompt-hacking it.
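As a sketch, switching per request would look roughly like this, assuming llama-server forwards `chat_template_kwargs` from the request body into the Jinja template (which is how I read the documented path; the rest of the snippet is placeholder):

```python
# Sketch of the soft switch: pass chat_template_kwargs per request instead of
# /think-style prompt hacks. Assumes the server was started with --jinja and
# forwards chat_template_kwargs from the request to the template.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="Qwen3.6-35B-A3B",
    messages=[{"role": "user", "content": "Rename field foo -> bar across this file ..."}],
    max_tokens=4000,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # non-thinking mode
)
print(resp.choices[0].message.content)
```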
What I’d love feedback on
- `--n-cpu-moe` on 8GB VRAM: has anyone found a better split than “just shove everything to CPU” on this class of hardware?
- `-b` / `-ub` tuning for very long prompts: 2048 looks good for me so far, but I’d love data points from people pushing 50K+ context regularly.
- KV cache config with Qwen3.6 in practice: I’m using `turbo2` right now based on community findings and testing. Curious what others ended up with.
- Thinking policy for agentic coding: if you use Qwen3.6 locally as a coding worker, when do you keep thinking on vs force it off?
Happy to share more details if useful. This is part of a local knowledge-compiler / project-memory pipeline, so I care a lot more about reliable structured output than about chat UX.
DhammaKS@reddit
I am a beginner for trying to run local llm on my laptop. I have 64GB ram ddr5 and blackwell 2000 8GB. Any recommendation for the configuration?
Antonio_Sammarzano@reddit (OP)
Hi,
Your setup is similar. Try this: copy my post into a Markdown file, feed it to an LLM together with a link to the TurboQuant+ repository, and you’ll see something good come out of it.
J3loodRuby@reddit
3060 Ti (8GB) + 32GB DDR4 3200. Setup: 200K ctx, n-cpu-moe 38, ubatch 1024, q8_0 for K/V cache, no-mmproj. pp hits 700–750 tok/s on a 50K prompt and stays above 500. Generation runs at ~33 tok/s early on and around ~22 tok/s when the context fills up to 200K.
Antonio_Sammarzano@reddit (OP)
That’s a very useful datapoint, thanks.
Your `n-cpu-moe 38` on 8GB is exactly the kind of comparison I was hoping to get, because I’m currently just brute-forcing it with `99` and treating the whole thing as “CPU MoE + RAM spill”.

A couple of questions if you don’t mind:

- q8_0 for K/V: are you using `q8_0` for both cache K and cache V?
- Did you try a different `ubatch` than 1024, or was that your sweet spot at 200K ctx?

Your prefill numbers are way better than I expected on 8GB, so now I’m wondering whether `99` is just the lazy/safe config and a partial split is actually better even this low on VRAM.

J3loodRuby@reddit
I use Unsloth Q6_K + mainline llama.cpp + q8_0 for both K and V. Tried ubatch 2048 and got ~1k pp. Ubatch 512 caused pp to drop to 3xx.
Antonio_Sammarzano@reddit (OP)
That's a very tight fit — Q6_K at 29GB on 32GB DDR4 + 7.6GB VRAM is basically zero headroom. Impressive that it runs stable.
Would you mind sharing your full server command? Specifically curious about:
- the `-ngl` value you’re using (with 7.6GB VRAM, how many layers actually land on GPU?)

Asking because I’m on a similar VRAM budget (RTX 4060 Laptop 8GB) but with 96GB DDR5 and Q4_K_M, and I’m stuck at ~10–12 tok/s with n-cpu-moe 99. Your 33 tok/s suggests the partial split is doing real work.
synw_@reddit
1050 Ti 4GB VRAM here: for MoE models I recommend always setting n-cpu-moe manually by trial and error. It’s not fun to do, but it beats fit every time. This can make the difference between unusable and merely slow for the GPU poor: with fit only, Qwen 35B is unusable; with n-cpu-moe set (35 here) it works at around 10 tps for a Q3 quant.
RelicDerelict@reddit
Are you squeezing the KV on GPU memory or you run bigger KV on system RAM?
synw_@reddit
I use the default --kv-offload enabled. I should try --no-kv-offload for very small models, good idea thanks
ea_man@reddit
As it is a MoE can't you just use --fit or --fit-context?
EggDroppedSoup@reddit
isn't that 29.3gb? how does that fit on your system, unless you use it as a server and don't run an operating system?
J3loodRuby@reddit
Haha yes i run only llamacpp on fedora without GUI.
hedsht@reddit
thanks a lot for your settings! i tried your settings and with the newer llama.cpp you should try the new speculative decoding.
i was looking for a sidekick for my 5090 qwen3.5 27b and bc i do have another pc with a 4060 it's perfect for qwen3.6 35b with your settings. the model itself is quick, so it fits perfectly with the lower vram. i had to lower ctx to 128k, because otherwise i would swap very hard, running arch linux headless with 32gb ram as well.
i'm at ~33.55 tokens/sec now, here are my settings:
YourNightmar31@reddit
Hey man i got a 3060 12GB and 128GB DDR4 3200, pretty similar setup to yours. Also running Qwen3.6 35B A3B at Q6 with max context set to 128k k/v at q8. I used `--fit on` because as far as i've read it should do the `n-cpu-moe` thing automatically? Not sure. My `-u` and `-ub` are both set to 4096.
For a 50k prompt i get about 850 tok/s and like 19 tok/s generation.
I don't know much about it so do you know if i'd benefit from using `n-cpu-moe` instead of `-fit on`? Any other tips for that setup?
Antonio_Sammarzano@reddit (OP)
That’s a very relevant comparison point.
My current view is:
- `-fit` should be treated as the automatic baseline
- `--n-cpu-moe` tuning is worth testing only if you want to push past that baseline

On my machine, the brute-force config was clearly bad:

- `--n-cpu-moe 99` → ~10–12 tok/s generation
- `--n-cpu-moe 38` → ~36.6 tok/s generation

So manual tuning definitely mattered there.

What I don’t know yet is whether `38` beats your current `-fit` placement on your exact setup, because I haven’t done a clean A/B against `-fit` yet. That’s the next benchmark I need to run properly.

So my practical advice would be:

- keep your current `-fit` config as baseline
- test a manual split against it (`38`, maybe `30`/`45`)

If `-fit` is already placing experts near the sweet spot, manual tuning may do little. If not, it can matter a lot.

ea_man@reddit
Aye, I set --fit-target 60 and that's about as good as it can get.
J3loodRuby@reddit
Using fit gives me the same pp and tg speed. Manually setting n-cpu-moe saves a little VRAM in my case. Also, I build llama.cpp from source.
Bootes-sphere@reddit
Nice setup. Quick heads-up on the max_tokens trap you mentioned—it's more common than people realize. The issue is that `max_tokens` in llama-server doesn't always account for KV cache overhead the way you'd expect. With MoE models especially, you're burning VRAM on the active expert weights *plus* the full KV cache for every token you generate.
Have you tried constraining it lower than your theoretical max and measuring actual memory usage under load? I'd bet you're leaving tokens on the table because the server is conservatively reserving space.
Also—35B MoE on 8GB is tight. Are you offloading to system RAM aggressively, or keeping the full model in VRAM? The latency difference matters for agent loops.
Antonio_Sammarzano@reddit (OP)
Thanks for the heads-up, but I think there's a slight misread of what I was seeing.
The max_tokens issue isn’t memory — it’s token budget accounting. With `--reasoning-budget -1`, the thinking trace runs inside the same `max_tokens` window as the output. So if I set `max_tokens=6000` and the model decides to think for 6000 tokens, I get `finish_reason: length` with zero output content. No VRAM pressure, just the model spending its entire runway on internal reasoning before ever writing a line.

On the MoE/VRAM question: with `--n-cpu-moe 99`, all expert weights live in system RAM (~18GB of the 19.7GB model). VRAM holds only the attention layers (~1.9GB) plus KV cache (fixed at startup, ~600MB for 50K ctx with q8_0). Latency for generation is acceptable for agent loops (~36 tok/s TG) because the bottleneck is RAM bandwidth, not VRAM. Prefill is slower (CPU-bound for expert layers) but that’s the tradeoff on 8GB.

The real open question is still whether there’s a `thinking_budget_tokens` parameter I can set per-request via the API to cap thinking at ~1500–2000 tokens without disabling it entirely.

keyboardhack@reddit
For anyone else looking into `LLAMA_SET_ROWS`: the flag no longer exists, so you don’t need to set it.

It was enabled by default (`LLAMA_SET_ROWS=1`) on the 2nd of August 2025. https://github.com/ggml-org/llama.cpp/pull/14959

The flag itself was removed on the 28th of August 2025. https://github.com/ggml-org/llama.cpp/pull/15505
EggDroppedSoup@reddit
Full GPU offloading (`--gpu-layers 99`) is mandatory for viable speed, but you can slash VRAM usage and smooth out compute spikes without losing throughput with the KV cache set to `q4_0`, batch sizes at `2048/512`, and CPU threads locked to 16 (adjust to your CPU). That combo consistently hit ~40 tok/s on my machine while running noticeably lighter and more stable, about 13 tokens per second faster than my past experience. If you’re using a similar CUDA stack, write a ps1 file to serve, stop, and benchmark at different params, and then a ps1 to set different configs to be tested with autorestart if it crashes (THIS IS IMPORTANT).
From my understanding you should use `-fit` (which is enabled by default) instead of manually setting `--n-cpu-moe` and the other parameters.

Anyway, my current setup is a GTX 1060 6GB + 48GB RAM, using `Qwen3.6-35B-A3B-UD-IQ4_NL` by Unsloth. I’ve been going back and forth with these parameters and I’m sure I can get better results, still need to figure things out.

With these settings, I can get ~12 t/s. It’s not pretty but it still works.
Antonio_Sammarzano@reddit (OP)
Interesting datapoint, thanks.
I should test `-fit` more systematically instead of assuming manual `--n-cpu-moe` tuning is always better.

That said, on my machine I did get a very large gain from moving away from the brute-force config:

- `--n-cpu-moe 99` → generation ~10–12 tok/s
- `--n-cpu-moe 38` → generation ~36.6 tok/s

So at least in my case, manual split clearly beat the “push everything to CPU” approach.

I also tested a run with speculative / ngram-style settings, but I don’t think it was a fair benchmark yet:

So basically no real gain vs the `36.6 tok/s` baseline.

My current interpretation is that ngram draft didn’t help because the output was too short. With only ~700 generated tokens, there isn’t enough warmup for the cache to build useful draft predictions. I’d expect the benefit to show up much more on long outputs (3000+ tokens), especially repetitive ones like tests / boilerplate-heavy code.

So right now my take is:

- `-fit` is probably the right auto baseline to compare against

If you’ve compared your current config directly against `-fit`, I’d be very interested in the delta.
You have to do binary search to find the right value for n-cpu-moe.
Run llama-bench with n-cpu-moe as 30. Note the tg value. Double it to 60. If performance improves, keep adding 10 at a time. If not, find the mid-point which is 45. Run again.
BTW llama-bench doesn’t accept all of llama-server’s command line options. Carry over as many options as it supports from your llama-server command.
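A rough sketch of that sweep as a script (Python only for convenience; it assumes a llama-bench build that accepts `--n-cpu-moe`, and the model path, candidate values, and the loose output parsing are all illustrative):

```python
# Rough sketch of the n-cpu-moe sweep (a coarse grid rather than a strict binary
# search): run llama-bench at a few values and keep the one with the best tg speed.
import re
import subprocess

MODEL = "Qwen3.6-35B-A3B-Q4_K_M.gguf"   # illustrative path
CANDIDATES = [30, 38, 45, 60, 99]        # coarse grid; refine around the winner

def tg_speed(n_cpu_moe: int) -> float:
    """Run llama-bench and return the reported tg tok/s (parsed loosely)."""
    out = subprocess.run(
        ["llama-bench", "-m", MODEL, "-ngl", "99",
         "--n-cpu-moe", str(n_cpu_moe), "-p", "0", "-n", "128"],
        capture_output=True, text=True, check=True,
    ).stdout
    # llama-bench prints a table; grab the last "<number> ± <number>" speed cell.
    matches = re.findall(r"([\d.]+)\s*±", out)
    return float(matches[-1]) if matches else 0.0

results = {n: tg_speed(n) for n in CANDIDATES}
for n, tps in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"--n-cpu-moe {n}: {tps:.1f} tok/s")
```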
puncia@reddit
Yes, you only start seeing the benefits of spec decoding after iterating. Let it generate a file, for example; once you allow the model to make edits, the benefits show up, since it has to repeat many lines of code in most cases.

`-fit` is enabled by default if you omit it. In fact, when I run my script, this is the output right after it starts:

Antonio_Sammarzano@reddit (OP)
That log is very helpful, thanks — this is exactly the kind of evidence I wanted.
So yes, agreed:
`-fit` is not just “some suggestion”, it’s actively doing the placement work automatically and should be the real baseline before manual tuning.

At this point my updated view is:

- `-fit` = default auto baseline
- `--n-cpu-moe` = something to test against that baseline, not instead of measurement

My `38` result is still real on my setup, but I now want to compare it properly against plain `-fit`, not against my old brute-force `99`.
--fit takes a big default safe headroom, you got little VRAM so use
Antonio_Sammarzano@reddit (OP)
Real test result from my side:
The key point: the model did not fail on code quality. It failed on runway.
The structure was correct, typing was correct, merge/open-loops logic was correct. The only breakage was at the very end: last line got truncated because thinking had already consumed half the output budget.
So my current practical rule is:
`enable_thinking = false` + larger `max_tokens`

At least on my setup, the limiting factor here is token budgeting, not model quality.
D2OQZG8l5BI1S06@reddit
Like you I had much better perf increasing the -ub than trying to fit one or two layers on the GPU.
Awwtifishal@reddit
For the reasoning budget you also need a message, something like:
--reasoning-budget-message "\n\nConsidering the limited time by the user, I have to give the solution based on the thinking directly now."

Also try ik_llama with ubergarm quants, you may squeeze some more performance out of it.
Antonio_Sammarzano@reddit (OP)
Interesting, thanks — I hadn’t tested `--reasoning-budget-message` yet.

That sounds useful as a behavioral nudge to force the model to exit thinking and actually produce an answer, even if it’s not the same thing as a hard per-request budget.

So in practice there may be 3 different levers here:

- `thinking_budget_tokens` = actual budget control
- `--reasoning-budget-message` = “stop thinking and answer now” nudge
- `enable_thinking: false` = full disable for clear coding tasks

I’ll test that next.

Also thanks for the `ik_llama` / ubergarm quants suggestion — I’ve been focused mostly on the current llama-server path, but if there’s extra performance headroom on this hardware tier, that’s worth checking.

If you’ve compared them directly on Qwen3.6 35B A3B, I’d be very interested in:
Awwtifishal@reddit
The reasoning budget message is only used when the budget is spent, and qwen 3.x have been trained with it in mind. If the budget is not used, the message is not inserted. I haven't used ik_llama just yet, but from what people say, it seems it excels at lower quants (i.e. you can use a slightly lower quant with the same quality as in mainline llama.cpp) in addition to having some CPU optimizations.
AvidCyclist250@reddit
+1 to thinking eating the budget. It’s a problem.