Qwen3.6 35B MoE on 8GB VRAM — working llama-server config + a max_tokens / thinking trap I ran into

Posted by Antonio_Sammarzano@reddit | LocalLLaMA

Hi all,

I wanted to share a setup that’s working for me with Qwen3.6-35B-A3B on a laptop RTX 4060 (8GB VRAM) + 96GB RAM.

This is not an interactive chat setup. I’m using it as a coding subagent inside an agentic pipeline, so some of the choices below are specific to that use case.

Hardware / runtime

Laptop RTX 4060 (8GB VRAM), 96GB system RAM, running llama.cpp’s llama-server.

Current server command

llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -ngl 99 \
  --n-cpu-moe 99 \
  -c 50000 \
  -np 1 \
  -fa on \
  --cache-type-k q8_0 \
  --cache-type-v turbo2 \
  --no-mmap \
  --mlock \
  --ctx-checkpoints 1 \
  --cache-ram 0 \
  --jinja \
  --reasoning on \
  --reasoning-budget -1 \
  -b 2048 \
  -ub 2048

PowerShell env:

$env:LLAMA_SET_ROWS = "1"
$env:LLAMA_CHAT_TEMPLATE_KWARGS = '{"preserve_thinking":true}'

Notes on the non-obvious choices

Important distinction: official vs empirical

A few things here are officially documented for Qwen3.6: mainly the thinking-mode controls, i.e. switching thinking on/off via chat_template_kwargs (enable_thinking), which I come back to below.
Other parts of this config are just my current best empirical setup or community-derived tuning, especially around MoE placement, KV config, and batch / ubatch choices.

So I’m posting this as “working setup + observations”, not as a universal best config.

The trap I ran into: thinking can eat the whole output budget

What initially looked like a weird bug turned out to be a budgeting issue.

I’m calling llama-server through the OpenAI-compatible API with chat.completions.create, and I was setting max_tokens per request.
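For reference, the call shape is just the standard OpenAI Python client pointed at llama-server. The port and model name below are placeholders for my setup:

from openai import OpenAI

# llama-server exposes an OpenAI-compatible endpoint on its HTTP port
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen3.6-35b-a3b",  # placeholder; llama-server serves whatever it loaded
    messages=[{"role": "user", "content": "Refactor this function to ..."}],
    max_tokens=6000,  # caps the WHOLE output, thinking tokens included
)
print(resp.choices[0].finish_reason)  # "length" = budget exhausted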

With thinking enabled and a hard max_tokens cap per request, the model could spend the entire output budget on thinking and return no useful visible answer.

In practice I saw cases like this:

max_tokens | thinking | finish_reason | visible code output         | elapsed
6000       | ON       | length        | empty / unusable            | ~190s
10000      | ON       | length        | empty / unusable            | ~330s
5000       | OFF      | stop          | ~3750 tokens of clean code  | ~126s

So for some coding tasks, the model wasn’t “failing” in the classic sense. It was just burning the whole budget on reasoning.

The useful part: there is a per-request fix

I originally thought reasoning budget might only be controllable server-side.

But llama-server supports a per-request field:

{
  "thinking_budget_tokens": 1500
}

As I understand it, this works as long as you haven’t already pinned the reasoning budget server-side (my command uses --reasoning-budget -1, i.e. unlimited).
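With the Python client, the non-standard field has to go through extra_body, since the SDK won’t accept it as a keyword argument (same client as above; whether the server honors the field depends on your build and CLI flags):

# cap thinking per request; visible output can still use the rest of max_tokens
resp = client.chat.completions.create(
    model="qwen3.6-35b-a3b",
    messages=[{"role": "user", "content": "Implement the parser described below ..."}],
    max_tokens=5000,
    extra_body={"thinking_budget_tokens": 1500},
)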

So the cleaner approach for my use case is probably: keep the server-side budget unlimited (--reasoning-budget -1) and cap thinking per request with thinking_budget_tokens, sized to the task.

My current rule of thumb

Right now I’m leaning toward:

Task type                                       | Thinking        | My current view
Clear refactor from precise spec                | OFF             | better throughput, less token waste
Moderately ambiguous coding                     | ON, but bounded | probably best with a request-level budget
Architecture / design tradeoffs                 | ON              | worth the cost
Fixed-schema extraction / structured transforms | OFF             | the schema does most of the work
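In pipeline terms that table is just a lookup. A sketch of how I wire it, with hypothetical task labels, using the chat_template_kwargs switch covered in the next section:

# hypothetical task labels from my pipeline; budgets are the rough numbers above
THINKING_POLICY = {
    "refactor":   {"enabled": False, "budget": None},
    "coding":     {"enabled": True,  "budget": 1500},
    "design":     {"enabled": True,  "budget": None},  # unbounded
    "extraction": {"enabled": False, "budget": None},
}

def thinking_kwargs(task_type: str) -> dict:
    # build the extra_body dict for chat.completions.create
    policy = THINKING_POLICY[task_type]
    extra = {"chat_template_kwargs": {"enable_thinking": policy["enabled"]}}
    if policy["enabled"] and policy["budget"] is not None:
        extra["thinking_budget_tokens"] = policy["budget"]
    return {"extra_body": extra}

Each call then becomes client.chat.completions.create(..., **thinking_kwargs(task_type)).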

One more thing: soft switching thinking

For Qwen3.6, I would not rely on /think or /nothink style prompting as if it were the official control surface.

The documented path is chat_template_kwargs, especially enable_thinking: false when you want non-thinking mode.

So my current plan is to switch modes that way instead of prompt-hacking it.
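On recent llama-server builds I’ve seen, chat_template_kwargs is also accepted per request, so with the Python client it again rides in extra_body:

# force non-thinking mode for this request via the documented template switch
resp = client.chat.completions.create(
    model="qwen3.6-35b-a3b",
    messages=[{"role": "user", "content": "Apply this exact renaming across the file ..."}],
    max_tokens=5000,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)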

What I’d love feedback on

  1. --n-cpu-moe on 8GB VRAM: has anyone found a better split than “just shove everything to CPU” on this class of hardware?
  2. -b / -ub tuning for very long prompts: 2048 looks good for me so far, but I’d love data points from people pushing 50K+ context regularly.
  3. KV config with Qwen3.6 in practice: I’m using turbo2 right now based on community findings and testing. Curious what others ended up with.
  4. Thinking policy for agentic coding: if you use Qwen3.6 locally as a coding worker, when do you keep thinking on vs force it off?

Happy to share more details if useful. This is part of a local knowledge-compiler / project-memory pipeline, so I care a lot more about reliable structured output than about chat UX.