Qwen3.6 35B MoE on 8GB VRAM — working llama-server config + a max_tokens / thinking trap I ran into
Posted by Antonio_Sammarzano@reddit | LocalLLaMA | View on Reddit | 34 comments
Hi all,
I wanted to share a setup that’s working for me with Qwen3.6-35B-A3B on a laptop RTX 4060 (8GB VRAM) + 96GB RAM.
This is not an interactive chat setup. I’m using it as a coding subagent inside an agentic pipeline, so some of the choices below are specific to that use case.
Hardware / runtime
- GPU: RTX 4060 Laptop, 8GB VRAM
- RAM: 96GB DDR5
- Runtime: llama-server
- Model: Qwen3.6-35B-A3B GGUF
- Use case: coding subagent / structured pipeline work
Current server command
llama-server \
-m Qwen3.6-35B-A3B-Q4_K_M.gguf \
-ngl 99 \
--n-cpu-moe 99 \
-c 50000 \
-np 1 \
-fa on \
--cache-type-k q8_0 \
--cache-type-v turbo2 \
--no-mmap \
--mlock \
--ctx-checkpoints 1 \
--cache-ram 0 \
--jinja \
--reasoning on \
--reasoning-budget -1 \
-b 2048 \
-ub 2048
PowerShell env:
$env:LLAMA_SET_ROWS = "1"
$env:LLAMA_CHAT_TEMPLATE_KWARGS = '{"preserve_thinking":true}'
Notes on the non-obvious choices
- `--n-cpu-moe 99`: on 8GB VRAM, I’m currently pushing MoE layers to CPU. This is partly based on my own constraints and partly on community tuning discussions, not on official guidance.
- `-np 1`: this is a single-user / single-agent setup, so I don’t want extra slots wasting RAM.
- `-b 2048 -ub 2048`: in my tests this gave noticeably better prefill on prompts above ~2K tokens than lower defaults.
- `LLAMA_SET_ROWS=1`: community tip, easy to try, seems worth keeping.
- `preserve_thinking: true`: I’m using this because Qwen3.6 explicitly supports it, and for agent workflows it helps keep prior reasoning in cache instead of re-deriving everything every turn.
Important distinction: official vs empirical
A few things here are officially documented for Qwen3.6:
- `enable_thinking`
- `preserve_thinking`
- thinking mode being on by default
- recommended sampling presets for coding / thinking / non-thinking use
Other parts of this config are just my current best empirical setup or community-derived tuning, especially around MoE placement, KV config, and batch / ubatch choices.
So I’m posting this as “working setup + observations”, not as a universal best config.
The trap I ran into: thinking can eat the whole output budget
What initially looked like a weird bug turned out to be a budgeting issue.
I’m calling llama-server through the OpenAI-compatible API with chat.completions.create, and I was setting max_tokens per request.
With:
- `--reasoning on`
- `--reasoning-budget -1`
- moderately large prompts
- coding tasks that invite long internal reasoning
…the model could spend the entire output budget on thinking and return no useful visible answer.
In practice I saw cases like this:
| max_tokens | thinking | finish_reason | visible code output | elapsed |
|---|---|---|---|---|
| 6000 | ON | length | empty / unusable | ~190s |
| 10000 | ON | length | empty / unusable | ~330s |
| 5000 | OFF | stop | ~3750 tokens of clean code | ~126s |
So for some coding tasks, the model wasn’t “failing” in the classic sense. It was just burning the whole budget on reasoning.
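For context, here is a minimal sketch of the request pattern where I hit this, using the standard `openai` Python client pointed at llama-server (the base URL, model name, and prompt are placeholders, not my exact pipeline code):

```python
# Minimal sketch of the call pattern that hits the trap, assuming the standard
# openai client talking to a local llama-server. URL/model/prompt are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="Qwen3.6-35B-A3B",   # placeholder; llama-server serves whatever it loaded
    messages=[{"role": "user", "content": "Refactor this module ..."}],
    max_tokens=6000,            # shared budget: thinking AND visible output
)

choice = resp.choices[0]
# With --reasoning on and --reasoning-budget -1, long thinking can consume all
# 6000 tokens: finish_reason == "length" and the visible content comes back empty.
if choice.finish_reason == "length" and not (choice.message.content or "").strip():
    print("Budget spent on thinking, no usable output")
```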
The useful part: there is a per-request fix
I originally thought reasoning budget might only be controllable server-side.
But llama-server supports a per-request field:
{
"thinking_budget_tokens": 1500
}
As I understand it, this works if you did not already fix the reasoning budget via CLI.
So the cleaner approach for my use case is probably:
- don’t hardcode a global reasoning budget if I want request-level control
- disable thinking for straightforward refactors
- use bounded thinking for tasks that genuinely benefit from it
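A minimal sketch of what that request-level control could look like from my agent code, assuming the server was launched without a fixed reasoning budget and that `thinking_budget_tokens` is accepted as an extra request field (I still need to verify this end to end; everything else in the snippet is a placeholder):

```python
# Sketch of bounded thinking per request via extra_body. Assumes llama-server
# accepts "thinking_budget_tokens" as a request-level field, as described above,
# and that no global reasoning budget was hardcoded at startup.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="Qwen3.6-35B-A3B",
    messages=[{"role": "user", "content": "Design the caching layer for ..."}],
    max_tokens=5000,
    extra_body={"thinking_budget_tokens": 1500},  # cap thinking, keep room for output
)
print(resp.choices[0].message.content)
```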
My current rule of thumb
Right now I’m leaning toward:
| Task type | Thinking | My current view |
|---|---|---|
| Clear refactor from precise spec | OFF | better throughput, less token waste |
| Moderately ambiguous coding | ON, but bounded | probably best with request-level budget |
| Architecture / design tradeoffs | ON | worth the cost |
| Fixed-schema extraction / structured transforms | OFF | schema does most of the work |
One more thing: soft switching thinking
For Qwen3.6, I would not rely on /think or /nothink style prompting as if it were the official control surface.
The documented path is chat_template_kwargs, especially enable_thinking: false when you want non-thinking mode.
So my current plan is to switch modes that way instead of prompt-hacking it.
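As a sketch, switching per request would look roughly like this, assuming llama-server forwards `chat_template_kwargs` from the request body into the Jinja template (which is how I read the documented path; the rest of the snippet is placeholder):

```python
# Sketch of the soft switch: pass chat_template_kwargs per request instead of
# /think-style prompt hacks. Assumes the server was started with --jinja and
# forwards chat_template_kwargs from the request to the template.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="Qwen3.6-35B-A3B",
    messages=[{"role": "user", "content": "Rename field foo -> bar across this file ..."}],
    max_tokens=4000,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # non-thinking mode
)
print(resp.choices[0].message.content)
```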
What I’d love feedback on
- `--n-cpu-moe` on 8GB VRAM: has anyone found a better split than “just shove everything to CPU” on this class of hardware?
- `-b` / `-ub` tuning for very long prompts: 2048 looks good for me so far, but I’d love data points from people pushing 50K+ context regularly.
- KV cache config with Qwen3.6 in practice: I’m using `turbo2` right now based on community findings and testing. Curious what others ended up with.
- Thinking policy for agentic coding: if you use Qwen3.6 locally as a coding worker, when do you keep thinking on vs force it off?
Happy to share more details if useful. This is part of a local knowledge-compiler / project-memory pipeline, so I care a lot more about reliable structured output than about chat UX.
DhammaKS@reddit
I am a beginner for trying to run local llm on my laptop. I have 64GB ram ddr5 and blackwell 2000 8GB. Any recommendation for the configuration?
Antonio_Sammarzano@reddit (OP)
Hi,
Your setup is similar. Try this: copy my post into a Markdown file, feed it to an LLM together with a link to the TurboQuant+ repository, and you’ll see something good come out of it.
J3loodRuby@reddit
3060 Ti (8GB) + 32GB DDR4 3200. Setup: 200K ctx, n-cpu-moe 38, ubatch 1024, q8_0 for K/V cache, no-mmproj. pp hits 700–750 tok/s on a 50K prompt and stays above 500. Generation runs at ~33 tok/s early on and around ~22 tok/s when the context fills up to 200K.
Antonio_Sammarzano@reddit (OP)
That’s a very useful datapoint, thanks.
Your `n-cpu-moe 38` on 8GB is exactly the kind of comparison I was hoping to get, because I’m currently just brute-forcing it with `99` and treating the whole thing as “CPU MoE + RAM spill”.

A couple of questions if you don’t mind:

- q8_0 for K/V: are you using `q8_0` for both cache K and cache V?
- Did you try a different `ubatch` than 1024, or was that your sweet spot at 200K ctx?

Your prefill numbers are way better than I expected on 8GB, so now I’m wondering whether `99` is just the lazy/safe config and a partial split is actually better even this low on VRAM.

J3loodRuby@reddit
I use Unsloth Q6_K + mainline llama.cpp + q8_0 for both K and V. Tried ubatch 2048 and got ~1k pp. Ubatch 512 caused pp to drop to 3xx.
Antonio_Sammarzano@reddit (OP)
That's a very tight fit — Q6_K at 29GB on 32GB DDR4 + 7.6GB VRAM is basically zero headroom. Impressive that it runs stable.
Would you mind sharing your full server command? Specifically curious about:
- the `-ngl` value you’re using (with 7.6GB VRAM, how many layers actually land on GPU?)

Asking because I’m on a similar VRAM budget (RTX 4060 Laptop 8GB) but with 96GB DDR5 and Q4_K_M, and I’m stuck at ~10–12 tok/s with n-cpu-moe 99. Your 33 tok/s suggests the partial split is doing real work.
synw_@reddit
1050 Ti 4GB VRAM here: for MoE models I recommend always setting n-cpu-moe manually by trial and error. It’s not fun to do, but it beats fit every time. This can make the difference between unusable and merely slow for the GPU poor: with fit only, Qwen 35B is unusable; with n-cpu-moe set (35 here) it works at around 10 tps for a Q3 quant.
RelicDerelict@reddit
Are you squeezing the KV on GPU memory or you run bigger KV on system RAM?
synw_@reddit
I use the default --kv-offload enabled. I should try --no-kv-offload for very small models, good idea thanks
ea_man@reddit
As it is a MoE can't you just use --fit or --fit-context?
EggDroppedSoup@reddit
isn't that 29.3gb? how does that fit on your system, unless you use it as a server and don't run an operating system?
J3loodRuby@reddit
Haha yes i run only llamacpp on fedora without GUI.
hedsht@reddit
thanks a lot for your settings! i tried your settings and with the newer llama.cpp you should try the new speculative decoding.
i was looking for a sidekick for my 5090 qwen3.5 27b and bc i do have another pc with a 4060 it's perfect for qwen3.6 35b with your settings. the model itself is quick, so it fits perfectly with the lower vram. i had to lower ctx to 128k, because otherwise i would swap very hard, running arch linux headless with 32gb ram as well.
i'm at ~33.55 tokens/sec now, here are my settings:
YourNightmar31@reddit
Hey man i got a 3060 12GB and 128GB DDR4 3200, pretty similar setup to yours. Also running Qwen3.6 35B A3B at Q6 with max context set to 128k k/v at q8. I used `--fit on` because as far as i've read it should do the `n-cpu-moe` thing automatically? Not sure. My `-u` and `-ub` are both set to 4096.
For a 50k prompt i get about 850 tok/s and like 19 tok/s generation.
I don't know much about it so do you know if i'd benefit from using `n-cpu-moe` instead of `-fit on`? Any other tips for that setup?
Antonio_Sammarzano@reddit (OP)
That’s a very relevant comparison point.
My current view is:
- `-fit` should be treated as the automatic baseline
- `--n-cpu-moe` tuning is worth testing only if you want to push past that baseline

On my machine, the brute-force config was clearly bad:

- `--n-cpu-moe 99` → ~10–12 tok/s generation
- `--n-cpu-moe 38` → ~36.6 tok/s generation

So manual tuning definitely mattered there.

What I don’t know yet is whether `38` beats your current `-fit` placement on your exact setup, because I haven’t done a clean A/B against `-fit` yet. That’s the next benchmark I need to run properly.

So my practical advice would be:

- keep your current `-fit` config as baseline
- test a manual split against it (`38`, maybe `30`/`45`)

If `-fit` is already placing experts near the sweet spot, manual tuning may do little. If not, it can matter a lot.

ea_man@reddit
Aye, I set --fit-target 60 and that's about as good as it can get.
J3loodRuby@reddit
Using fit gives me the same pp and tg speed. Manually setting n-cpu-moe saves a little VRAM in my case. Also, I build llama.cpp from source.
Bootes-sphere@reddit
Nice setup. Quick heads-up on the max_tokens trap you mentioned—it's more common than people realize. The issue is that `max_tokens` in llama-server doesn't always account for KV cache overhead the way you'd expect. With MoE models especially, you're burning VRAM on the active expert weights *plus* the full KV cache for every token you generate.
Have you tried constraining it lower than your theoretical max and measuring actual memory usage under load? I'd bet you're leaving tokens on the table because the server is conservatively reserving space.
Also—35B MoE on 8GB is tight. Are you offloading to system RAM aggressively, or keeping the full model in VRAM? The latency difference matters for agent loops.
Antonio_Sammarzano@reddit (OP)
Thanks for the heads-up, but I think there's a slight misread of what I was seeing.
The max_tokens issue isn’t memory — it’s token budget accounting. With `--reasoning-budget -1`, the thinking trace runs inside the same `max_tokens` window as the output. So if I set `max_tokens=6000` and the model decides to think for 6000 tokens, I get `finish_reason: length` with zero output content. No VRAM pressure, just the model spending its entire runway on internal reasoning before ever writing a line.

On the MoE/VRAM question: with `--n-cpu-moe 99`, all expert weights live in system RAM (~18GB of the 19.7GB model). VRAM holds only the attention layers (~1.9GB) plus KV cache (fixed at startup, ~600MB for 50K ctx with q8_0). Latency for generation is acceptable for agent loops (~36 tok/s TG) because the bottleneck is RAM bandwidth, not VRAM. Prefill is slower (CPU-bound for expert layers) but that’s the tradeoff on 8GB.

The real open question is still whether there’s a `thinking_budget_tokens` parameter I can set per-request via the API to cap thinking at ~1500–2000 tokens without disabling it entirely.

keyboardhack@reddit
For anyone else looking into `LLAMA_SET_ROWS`: the flag no longer exists, so you don’t need to set it.

It was enabled by default (`LLAMA_SET_ROWS=1`) on the 2nd of August 2025. https://github.com/ggml-org/llama.cpp/pull/14959

The flag itself was removed on the 28th of August 2025. https://github.com/ggml-org/llama.cpp/pull/15505
EggDroppedSoup@reddit
Full GPU offloading (`--gpu-layers 99`) is mandatory for viable speed, but you can slash VRAM usage and smooth out compute spikes without losing throughput with the KV cache set to `q4_0`, batch sizes at `2048/512`, and CPU threads locked to 16 (adjust to your CPU). That combo consistently hit ~40 tok/s on my machine while running noticeably lighter and more stable, about 13 tokens per second faster than my past experience. If you’re using a similar CUDA stack, write a ps1 file to serve, stop, and benchmark at different params, and then a ps1 to set different configs to be tested with autorestart if it crashes (THIS IS IMPORTANT).
From my understanding you should use `-fit` (which is enabled by default) instead of manually setting `--n-cpu-moe` and the other parameters.

Anyway, my current setup is a GTX 1060 6GB + 48GB RAM, using `Qwen3.6-35B-A3B-UD-IQ4_NL` by Unsloth. I’ve been going back and forth with these parameters and I’m sure I can get better results, still need to figure things out.

With these settings, I can get ~12 t/s. It’s not pretty but it still works.
Antonio_Sammarzano@reddit (OP)
Interesting datapoint, thanks.
I should test `-fit` more systematically instead of assuming manual `--n-cpu-moe` tuning is always better.

That said, on my machine I did get a very large gain from moving away from the brute-force config:

- `--n-cpu-moe 99` → generation ~10–12 tok/s
- `--n-cpu-moe 38` → generation ~36.6 tok/s

So at least in my case, manual split clearly beat the “push everything to CPU” approach.

I also tested a run with speculative / ngram-style settings, but I don’t think it was a fair benchmark yet:

So basically no real gain vs the `36.6 tok/s` baseline.

My current interpretation is that ngram draft didn’t help because the output was too short. With only ~700 generated tokens, there isn’t enough warmup for the cache to build useful draft predictions. I’d expect the benefit to show up much more on long outputs (3000+ tokens), especially repetitive ones like tests / boilerplate-heavy code.

So right now my take is:

- `-fit` is probably the right auto baseline to compare against

If you’ve compared your current config directly against `-fit`, I’d be very interested in the delta.
You have to do binary search to find the right value for n-cpu-moe.
Run llama-bench with n-cpu-moe as 30. Note the tg value. Double it to 60. If performance improves, keep adding 10 at a time. If not, find the mid-point which is 45. Run again.
BTW llama-bench doesn’t accept all of llama-server’s command line options. Carry over as many options as it supports from your llama-server command.
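A rough sketch of that sweep as a script (Python only for convenience; it assumes a llama-bench build that accepts `--n-cpu-moe`, and the model path, candidate values, and the loose output parsing are all illustrative):

```python
# Rough sketch of the n-cpu-moe sweep (a coarse grid rather than a strict binary
# search): run llama-bench at a few values and keep the one with the best tg speed.
import re
import subprocess

MODEL = "Qwen3.6-35B-A3B-Q4_K_M.gguf"   # illustrative path
CANDIDATES = [30, 38, 45, 60, 99]        # coarse grid; refine around the winner

def tg_speed(n_cpu_moe: int) -> float:
    """Run llama-bench and return the reported tg tok/s (parsed loosely)."""
    out = subprocess.run(
        ["llama-bench", "-m", MODEL, "-ngl", "99",
         "--n-cpu-moe", str(n_cpu_moe), "-p", "0", "-n", "128"],
        capture_output=True, text=True, check=True,
    ).stdout
    # llama-bench prints a table; grab the last "<number> ± <number>" speed cell.
    matches = re.findall(r"([\d.]+)\s*±", out)
    return float(matches[-1]) if matches else 0.0

results = {n: tg_speed(n) for n in CANDIDATES}
for n, tps in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"--n-cpu-moe {n}: {tps:.1f} tok/s")
```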
puncia@reddit
Yes, you only start seeing the benefits of spec decoding after iterating. Let it generate a file, for example; once you allow the model to make edits, the benefits show up, since it has to repeat many lines of code in most cases.

`-fit` is enabled by default if you omit it. In fact, when I run my script, this is the output right after it starts:

Antonio_Sammarzano@reddit (OP)
That log is very helpful, thanks — this is exactly the kind of evidence I wanted.
So yes, agreed:
`-fit` is not just “some suggestion”, it’s actively doing the placement work automatically and should be the real baseline before manual tuning.

At this point my updated view is:

- `-fit` = default auto baseline
- `--n-cpu-moe` = something to test against that baseline, not instead of measurement

My `38` result is still real on my setup, but I now want to compare it properly against plain `-fit`, not against my old brute-force `99`.
--fit takes a big default safe headroom, you got little VRAM so use
Antonio_Sammarzano@reddit (OP)
Real test result from my side:
The key point: the model did not fail on code quality. It failed on runway.
The structure was correct, typing was correct, merge/open-loops logic was correct. The only breakage was at the very end: last line got truncated because thinking had already consumed half the output budget.
So my current practical rule is:
`enable_thinking = false` + larger `max_tokens`

At least on my setup, the limiting factor here is token budgeting, not model quality.
D2OQZG8l5BI1S06@reddit
Like you I had much better perf increasing the -ub than trying to fit one or two layers on the GPU.
Awwtifishal@reddit
For the reasoning budget you also need a message, something like:
--reasoning-budget-message "\n\nConsidering the limited time by the user, I have to give the solution based on the thinking directly now."

Also try ik_llama with ubergarm quants, you may squeeze some more performance out of it.
Antonio_Sammarzano@reddit (OP)
Interesting, thanks — I hadn’t tested `--reasoning-budget-message` yet.

That sounds useful as a behavioral nudge to force the model to exit thinking and actually produce an answer, even if it’s not the same thing as a hard per-request budget.

So in practice there may be 3 different levers here:

- `thinking_budget_tokens` = actual budget control
- `--reasoning-budget-message` = “stop thinking and answer now” nudge
- `enable_thinking: false` = full disable for clear coding tasks

I’ll test that next.

Also thanks for the `ik_llama` / ubergarm quants suggestion — I’ve been focused mostly on the current llama-server path, but if there’s extra performance headroom on this hardware tier, that’s worth checking.

If you’ve compared them directly on Qwen3.6 35B A3B, I’d be very interested in:
Awwtifishal@reddit
The reasoning budget message is only used when the budget is spent, and qwen 3.x have been trained with it in mind. If the budget is not used, the message is not inserted. I haven't used ik_llama just yet, but from what people say, it seems it excels at lower quants (i.e. you can use a slightly lower quant with the same quality as in mainline llama.cpp) in addition to having some CPU optimizations.
AvidCyclist250@reddit
+1 to thinking eating the budget. It’s a problem.