Qwen3.5-27B, Qwen3.5-122B, and Qwen3.6-35B on 4x RTX 3090 — MoEs struggle with strict global rules
Posted by DehydratedWater_@reddit | LocalLLaMA | View on Reddit | 55 comments
Long-time lurker, first-time poster. Ran three Qwen models through 20+ sessions of live agentic work each on 4x RTX 3090 — Qwen3.5-27B dense, Qwen3.5-122B-A10B MoE, Qwen3.6-35B-A3B MoE. Numbers below parsed from vLLM logs under constant organic load, not synthetic benchmarks.
Workload context that matters for every number in this post: the harness is a multi-agent orchestrator running 1-6 concurrent OpenCode sessions with 30-60k-token prompts, and it enforces a tight bash allow-list — exact `uv run scripts/<name>.py` patterns per tool, no shell decorators (`| head`, `| tail`, `timeout`, `2>&1`), no absolute paths on Read, no `cd && ...` chains. That makes rule-following measurably different from a looser harness where those shapes go through.
All three routed MoEs are systematically worse than the dense 27B at holding those strict global rules — size, active-param count, and fine-tune target don't change it much. Speed numbers first for context, rule-following gap afterward.
Models and quants, each picked to maximise quality while fitting 262k context on 4x24GB:
- Qwen3.5-27B dense — INT8 (AWQ-BF16-INT8) weights, FP8 KV, MTP speculative decoding
- Qwen3.5-122B-A10B MoE — AWQ-INT4 weights, FP8 KV. Q4 is the only way it fits alongside 262k context
- Qwen3.6-35B-A3B MoE — FP8 weights, FP16 KV (FP8 KV was unstable on this model)
Smaller models get all the precision they can use, bigger models get only as much as fits. Tables below are at 250W (sweet spot from testing 200/250/300W). vLLM v0.19.0.
How the data is collected: vLLM emits Avg prompt throughput, Avg generation throughput, and Running: N reqs every 10s. Each cell is the mean of windows at that concurrency — n=6 ≈ 60s of wall time at that state. Idle windows count; this is sustained throughput, not peak.
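For anyone reproducing this, a minimal sketch of the bucketing in Python (illustrative, not the exact script I ran; the regex assumes the three fields above appear on one metrics line and may need adjusting for your vLLM version):

```python
import re
from collections import defaultdict

# Minimal sketch of the bucketing described above (illustrative, not the exact
# analysis script). Assumes each 10-second metrics line carries all three fields;
# adjust the regex to whatever your vLLM version actually prints.
METRICS = re.compile(
    r"Avg prompt throughput: ([\d.]+) tokens/s.*"
    r"Avg generation throughput: ([\d.]+) tokens/s.*"
    r"Running: (\d+) reqs"
)

def bucket_by_concurrency(log_path, active_prefill_only=False):
    """One 10-second window = one sample; returns per-concurrency averages and n."""
    windows = defaultdict(list)
    with open(log_path) as f:
        for line in f:
            m = METRICS.search(line)
            if not m:
                continue
            prefill, gen, running = float(m[1]), float(m[2]), int(m[3])
            if running == 0:
                continue  # nothing running, no concurrency state to attribute
            if active_prefill_only and prefill == 0.0:
                continue  # drop pure-generation windows (used for the second prefill table)
            windows[running].append((prefill, gen))
    return {
        c: {
            "prefill_tps": sum(p for p, _ in w) / len(w),
            "gen_tps": sum(g for _, g in w) / len(w),
            "n": len(w),
        }
        for c, w in sorted(windows.items())
    }
```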

Generation throughput by concurrency (250W, avg t/s)
n in parentheses is the sample count (number of 10-second windows).
| Concurrent reqs | Qwen3.5-27B (n) | Qwen3.5-122B (n) | Qwen3.6-35B (n) |
|---|---|---|---|
| 1 | 85 (8) | 74 (21) | 122 (90) |
| 2 | 97 (28) | 48 (13) | 174 (34) |
| 3 | 133 (36) | 111 (9) | 215 (16) |
| 4 | 112 (19) | 123 (9) | 288 (8) |
| 5 | 68 (34) | 138 (17) | 348 (4) |
| 6 | 98 (16) | 33 (3) | 296 (5) |
The 3.6-35B runs away with generation at every level. The 122B is uneven (c=2 dip to 48 t/s, c=6 drop to 33 at n=3) but internally coherent across c=3-5. The 27B sits between the two, and is the tightest of the three across the concurrency range — its variance per cell is the smallest, even where its average is below the 122B at c=4-5.
Prefill throughput by concurrency (250W, avg t/s)
Same n convention as the generation table above (each cell's n is the same for both tables — one window = one data point with both prefill and generation values). Prefill is averaged over all windows at that concurrency, including ones where the engine spent the window purely generating (prefill=0). That's the more honest representation of sustained prefill throughput at that concurrency state. 122B c=6 at n=3 is noise-dominated.
| Concurrent reqs | Qwen3.5-27B (n) | Qwen3.5-122B (n) | Qwen3.6-35B (n) |
|---|---|---|---|
| 1 | 926 (8) | 573 (21) | 626 (90) |
| 2 | 553 (28) | 2343 (13) | 1589 (34) |
| 3 | 364 (36) | 1849 (9) | 1799 (16) |
| 4 | 726 (19) | 2499 (9) | 1856 (8) |
| 5 | 1001 (34) | 1754 (17) | 1896 (4) |
| 6 | 1427 (16) | 2480 (3) | 2983 (5) |
Aggregate sustained averages (c=1-6, all windows at 250W): Qwen3.5-27B ~756 t/s, Qwen3.5-122B ~1651 t/s, Qwen3.6-35B ~1124 t/s. The 122B still wins prefill by roughly 2x. With prefix caching handling most of the 30-60k tokens on any given turn, the uncached tail is only a few thousand tokens per turn, so the 122B lead matters less in practice than on paper.
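A quick back-of-the-envelope with the sustained averages above, assuming roughly 3k uncached tokens per turn (an assumption consistent with "a few thousand", not a measured value):

```python
# Back-of-the-envelope for the "matters less in practice" point. The ~3k uncached
# tokens per turn is an assumed figure, not measured per-turn data.
uncached_tokens = 3_000
sustained_prefill_tps = {"Qwen3.5-27B": 756, "Qwen3.5-122B": 1651, "Qwen3.6-35B": 1124}

for model, tps in sustained_prefill_tps.items():
    print(f"{model}: ~{uncached_tokens / tps:.1f} s of prefill per turn")
# roughly 4.0 s vs 1.8 s vs 2.7 s per turn: a couple of seconds of difference,
# small next to generation time on a long agent turn.
```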
Prefill throughput when actively prefilling (zero-prefill windows excluded)
If you want "when the engine is actually processing a prompt, how fast does it go?" instead of the sustained average, the numbers below drop all windows where prefill=0 from each cell's average. n in parens is the count of prefill-active windows in each cell, so it varies per cell.
| Concurrent reqs | Qwen3.5-27B (n) | Qwen3.5-122B (n) | Qwen3.6-35B (n) |
|---|---|---|---|
| 1 | 1235 (6) | 669 (18) | 751 (75) |
| 2 | 860 (18) | 2769 (11) | 1743 (31) |
| 3 | 505 (26) | 2377 (7) | 1799 (16) |
| 4 | 985 (14) | 3213 (7) | 1856 (8) |
| 5 | 1260 (27) | 1987 (15) | 1896 (4) |
| 6 | 1757 (13) | 3720 (2) | 2983 (5) |
Aggregate active-only: Qwen3.5-27B ~1025 t/s, Qwen3.5-122B ~2155 t/s, Qwen3.6-35B ~1124 t/s. The sustained table above is closer to what an agent pipeline actually experiences averaged across its concurrency states; this table is closer to what vLLM can deliver when it's actually prefilling. Pick based on whether you care about "what does my agent stack do" or "what is this model capable of".
Completed requests per minute (250W)
Token rates are one thing; how many actual tasks finish per minute is another. Counted by tallying `POST /v1/chat/completions HTTP/1.1" 200` log lines per 10-second window and bucketing by the concurrency at that window (a minimal sketch of the tally follows the table). Mixed-task (short and long responses both count as 1), so this is a functional-throughput metric for the workload mix, not a per-task latency figure.
| Concurrent reqs | Qwen3.5-27B | Qwen3.5-122B | Qwen3.6-35B |
|---|---|---|---|
| 1 | 8.2/min | 9.1/min | 14.9/min |
| 2 | 6.6/min | 9.7/min | 23.1/min |
| 3 | 6.7/min | 10.0/min | 26.6/min |
| 4 | 7.3/min | 10.0/min | 36.8/min |
| 5 | 7.8/min | 8.8/min | 27.0/min |
| 6 | 13.9/min | 12.0/min | 45.6/min |
3.6-35B finishes 2-4x more requests per minute than either sibling across most concurrency levels (the gap is smallest at c=1, biggest around c=4). The 27B holds a flat ~7/min across c=1-5 (slow-but-steady). The 122B saturates at ~9-10/min from c=2 onward — adding concurrency past 2 doesn't help it finish more work, it just spreads across more queued requests.
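For reference, a minimal sketch of the tally described above the table (illustrative; it assumes you've already counted the 200-response lines per 10-second window and looked up each window's concurrency, which is the fiddly part):

```python
from collections import defaultdict

# Illustrative tally for the completions/min table. `windows` is a list of
# (concurrency, completions_in_window) pairs, one per 10-second window, built by
# counting `POST /v1/chat/completions HTTP/1.1" 200` access-log lines per window.
def completions_per_minute(windows):
    per_level = defaultdict(lambda: [0, 0])  # concurrency -> [completions, window count]
    for concurrency, completed in windows:
        per_level[concurrency][0] += completed
        per_level[concurrency][1] += 1
    # six 10-second windows per minute
    return {c: 6 * done / n for c, (done, n) in sorted(per_level.items())}
```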
The rule-following gap
Oranges-to-oranges across ~20 sessions of comparable workloads (same task types, never the exact same query twice):
| Model | Sessions | Tool calls | Errors | Err/tool |
|---|---|---|---|---|
| qwen3.5-27b (dense) | 21 | 161 | 9 | 5.6% |
| qwen3.5-122b-a10b (MoE) | 17 | 128 | 13 | 10.2% |
| qwen3.6-35b-a3b (MoE) | 20 | 158 | 19 | 12.0% |
The dense 27B makes about half the tool-call errors of either MoE. I added Qwen3.5-35B-A3B as a control — same architecture as the 3.6-35B (identical 35B total / 3B active / 256 experts top-8), only the fine-tune differs. It landed at 11.3%. Three routed MoEs spanning 3B to 10B active parameters, 8M to 20M per-expert capacity, and completely different fine-tune targets — all sit in a narrow 10-12% error band. The architecture caps the rate; post-training only moves which kinds of errors happen, not how often.
How the models fail matters more than how often. On a long multi-stage research task where each stage ends with a 3-call state handshake, the 3.6-35B could not finish a single stage. It kept retrying denied bash variants (`ls scripts/ | grep -E "search|web"`, `curl -s 'https://...'`, invented flags like `--no-agent`, hallucinated scripts like `youtube_fetcher.py`) and burned its turn budget without emitting the state transition. The 27B later picked up the exact task instance the 3.6-35B had stalled on and finished it cleanly — it pivoted to a different allowed script on the first denial.
The pattern holds across all three MoEs: retry variants of the same blocked shape (`| head -5` → `| head -10` → `| tail -3`) rather than change strategy. The dense pivots. My reading: routing loses rule specificity — each token activates a small slice, and context-specified rules compete with pretraining priors for "what bash looks like". Shell idioms have a dense prior, custom allow-lists don't, and post-training changes which idioms leak, not whether they leak.
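To make the mechanism concrete, here's a minimal sketch of the kind of allow-list check the harness applies. The regex, banned shapes, and script names are illustrative, not the real harness rules:

```python
import re

# Minimal sketch of the kind of allow-list check described above. The real harness
# rules are per-tool and stricter; the pattern and script names here are illustrative.
ALLOWED = re.compile(r"uv run scripts/[\w\-]+\.py(\s+[\w\-./=]+)*")
BANNED_SHAPES = ("|", "2>&1", "&&", "||", "timeout ", "cd ")

def check_bash(command: str) -> tuple[bool, str]:
    cmd = command.strip()
    for shape in BANNED_SHAPES:
        if shape in cmd:
            return False, f"denied: shell decorator/chain {shape!r} not allowed"
    if not ALLOWED.fullmatch(cmd):
        return False, "denied: only `uv run scripts/<name>.py ...` is allowed"
    return True, "ok"

# The failure mode in the post: a denial like the first one below tends to make the
# MoEs retry `| head -10`, `| tail -3`, ... instead of switching to an allowed script.
print(check_bash("ls scripts/ | grep -E 'search|web'"))    # (False, "denied: ...")
print(check_bash("uv run scripts/web_search.py --q foo"))  # (True, "ok")
```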
Configs
Hardware context that explains the flags: 4x RTX 3090, two NVLinked + two PCI-only, all undervolted and pinned at 250W each. --disable-custom-all-reduce works around vLLM's topology confusion on the mixed-link setup. -O3 is worth the coldstart + extra VRAM for the throughput it buys on both prefill and generation.
Two Qwen3-specific flag notes before the configs, in case anyone copy-pastes onto a different family: --reasoning-parser qwen3 only applies to Qwen3 thinking models (will fail on non-thinking variants); the qwen3_next_mtp speculative decoding method in the 27B config is Qwen3.5-Next-specific and won't work on other model families.
Qwen3.5-27B (my daily driver)
name: vllm-thinking
services:
vllm:
image: vllm/vllm-openai:v0.19.0
restart: unless-stopped
runtime: nvidia
shm_size: 8gb
ipc: host
environment:
- NVIDIA_VISIBLE_DEVICES=0,2,3,4
- CUDA_DEVICE_ORDER=PCI_BUS_ID
- RAY_memory_monitor_refresh_ms=0
- NCCL_CUMEM_ENABLE=0
- NCCL_NVLINK_DISABLE=0
- VLLM_ENABLE_CUDAGRAPH_GC=1
- VLLM_USE_FLASHINFER_SAMPLER=1
- PYTORCH_ALLOC_CONF=expandable_segments:True
volumes:
- "/mnt/ssd-4tb/ai_models/models/hub:/root/.cache/huggingface/hub"
ports:
- "8082:8000"
command: >
--model cyankiwi/Qwen3.5-27B-AWQ-BF16-INT8
--served-model-name cyankiwi/Qwen3.5-27B-AWQ-BF16-INT8
--quantization compressed-tensors
--port 8000
--host 0.0.0.0
--tensor-parallel-size 4
-O3
--max-model-len 262144
--gpu-memory-utilization 0.9
--dtype auto
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--reasoning-parser qwen3
--limit-mm-per-prompt '{"image":10,"video":2}'
--enable-prefix-caching
--disable-custom-all-reduce
--kv-cache-dtype fp8
--max-num-seqs 12
--max-num-batched-tokens 8192
--compilation-config '{"cudagraph_capture_sizes":[1,2,4,8,12]}'
--trust-remote-code
--no-use-tqdm-on-load
--generation-config auto
--attention-backend FLASHINFER
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
--override-generation-config '{"temperature":1.0,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":1.5,"repetition_penalty":1.0}'
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 300s
Sampling is the "general thinking" preset (temperature 1.0, top_p 0.95, top_k 20, presence_penalty 1.5). The coding-thinking preset had agents looping or repeating the same action, worse on MoEs. --max-num-seqs 12 matches the cudagraph capture sizes. MTP with 2 speculative tokens is stable; 3+ starts causing random crashes.
Qwen3.5-122B-A10B (when I want raw prefill)
name: vllm-thinking
services:
vllm:
image: vllm/vllm-openai:v0.19.0
restart: unless-stopped
runtime: nvidia
shm_size: 8gb
ipc: host
environment:
- NVIDIA_VISIBLE_DEVICES=0,2,3,4
- CUDA_DEVICE_ORDER=PCI_BUS_ID
- RAY_memory_monitor_refresh_ms=0
- NCCL_CUMEM_ENABLE=0
- NCCL_NVLINK_DISABLE=0
- VLLM_ENABLE_CUDAGRAPH_GC=1
- VLLM_USE_FLASHINFER_SAMPLER=1
- PYTORCH_ALLOC_CONF=expandable_segments:True
volumes:
- "/mnt/ssd-4tb/ai_models/models/hub:/root/.cache/huggingface/hub"
ports:
- "8082:8000"
command: >
--model QuantTrio/Qwen3.5-122B-A10B-AWQ
--served-model-name QuantTrio/Qwen3.5-122B-A10B-AWQ
--port 8000
--host 0.0.0.0
--tensor-parallel-size 4
--enable-expert-parallel
-O3
--max-model-len 262144
--gpu-memory-utilization 0.94
--kv-cache-dtype fp8
--dtype auto
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--reasoning-parser qwen3
--limit-mm-per-prompt '{"image":10,"video":2}'
--enable-prefix-caching
--disable-custom-all-reduce
--max-num-seqs 8
--max-num-batched-tokens 8192
--compilation-config '{"cudagraph_capture_sizes":[1,2,4,8]}'
--trust-remote-code
--quantization awq_marlin
--attention-backend FLASHINFER
--no-use-tqdm-on-load
--generation-config auto
--override-generation-config '{"temperature":1.0,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":1.5,"repetition_penalty":1.0}'
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 600s
--enable-expert-parallel is the MoE-specific addition. --max-num-seqs 8 because at AWQ-INT4 weights + FP8 KV + 262k context that's the largest cudagraph batch size that fits across 4x24GB without OOM during startup. In practice per-request throughput collapses past 3-4 concurrent on long prompts anyway; 8 is for handling bursts of small tool calls.
Qwen3.6-35B-A3B (speed king, coding-tuned)
name: vllm-thinking
services:
vllm:
image: vllm/vllm-openai:v0.19.0
restart: unless-stopped
runtime: nvidia
shm_size: 8gb
ipc: host
environment:
- NVIDIA_VISIBLE_DEVICES=0,2,3,4
- CUDA_DEVICE_ORDER=PCI_BUS_ID
- RAY_memory_monitor_refresh_ms=0
- NCCL_CUMEM_ENABLE=0
- NCCL_NVLINK_DISABLE=0
- VLLM_ENABLE_CUDAGRAPH_GC=1
- VLLM_USE_FLASHINFER_SAMPLER=1
- PYTORCH_ALLOC_CONF=expandable_segments:True
volumes:
- "/mnt/ssd-4tb/ai_models/models/hub:/root/.cache/huggingface/hub"
ports:
- "8082:8000"
command: >
--model Qwen/Qwen3.6-35B-A3B-FP8
--served-model-name Qwen/Qwen3.6-35B-A3B-FP8
--port 8000
--host 0.0.0.0
--tensor-parallel-size 4
--enable-expert-parallel
-O3
--max-model-len 262144
--gpu-memory-utilization 0.94
--dtype auto
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--reasoning-parser qwen3
--limit-mm-per-prompt '{"image":10,"video":2}'
--enable-prefix-caching
--disable-custom-all-reduce
--max-num-seqs 8
--max-num-batched-tokens 8192
--compilation-config '{"cudagraph_capture_sizes":[1,2,4,8]}'
--trust-remote-code
--no-use-tqdm-on-load
--attention-backend FLASHINFER
--generation-config auto
--override-generation-config '{"temperature":1.0,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":1.5,"repetition_penalty":1.0}'
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 300s
No --kv-cache-dtype fp8 — 3.6-35B is unstable with FP8 KV, runs on default FP16 KV instead.
Takeaways
- MoEs leak pretraining shell habits when the harness bans them. All three routed Qwen MoEs sat in a 10-12% tool-call error band vs 5.6% for the dense 27B; fine-tune target doesn't close it. This is the post's actual news; everything else is operational detail.
- MoEs are great for throughput-bound work and coding agents whose harnesses allow the shell idioms they reach for (`| head`, `timeout`, `2>&1`, `&&`/`||` chains). If your harness denies those, you'll fight the model all day.
- Per-request generation throughput drops off past 3-4 concurrent on all three. Keep concurrency low if per-agent latency matters.
- 250W is the sweet spot for the 27B. The 3.6-35B actually scales with power (300W gives 74% more generation than 250W). The 122B scales monotonically too (200W: 59 → 250W: 84 → 300W: 98 t/s aggregate), though per-cell variance stays wider than the 27B at any power.
- Quantization matters more for MoEs. INT8 on the dense 27B is clean; AWQ-INT4 on the 122B produces garbled tool calls that never happened on the dense model.
More details
- Full writeup with per-power tables, per-request throughput, tokens-per-watt, and the failure-class breakdown by model: https://dehydratedwater.dev/blog/qwen35-4x3090-optimal-agentic-inteligence
- Hypothesis for why the MoE rule-following ceiling looks structural (four-Qwen analysis, confounds ruled out): https://dehydratedwater.dev/blog/moe-rule-binding-hypothesis
Curious if anyone else running MoEs against strict allow-lists has seen similar rule-following patterns — or whether my harness is just unusually strict. Also happy to answer config questions.
ai_guy_nerd@reddit
The observation that MoEs struggle with strict global rules compared to dense models is fascinating. It suggests that the routing mechanism might be bypassing some of the critical instruction-following neurons that a dense model hits every time. When running an agentic harness with a tight bash allow-list, that consistency is everything.
One way to mitigate this is to use a very small, dense guard model to validate the output of the MoE before it hits the shell. It adds a bit of latency but prevents the agent from drifting into forbidden patterns. This kind of verification layer is often necessary when the cost of a shell failure is high.
A similar approach to orchestration is used in OpenClaw to ensure tool-calls remain within safe bounds. It is interesting to see the performance gap persist even with the massive parameter count of the 122B MoE.
DehydratedWater_@reddit (OP)
Yeah, I was also wondering if the problem can be mitigated, e.g., by wrapping Bash tool calls into native JSON-based tools. That would remove bash priors that may be overfitted for programming harnesses, forcing the model to use different paths.
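Roughly what I mean, as a sketch (the tool and script names are made up for illustration): the model only ever emits JSON arguments, and the harness owns the command shape, so there's no bash surface left for the priors to leak into.

```python
import subprocess

# Sketch of the "wrap bash into native JSON tools" idea. Tool/script names are
# illustrative; the model emits only JSON arguments and the harness builds the
# actual `uv run scripts/...` command itself.
WEB_SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web (maps to scripts/web_search.py).",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "max_results": {"type": "integer", "default": 5},
            },
            "required": ["query"],
        },
    },
}

def run_web_search(args: dict) -> str:
    # The harness owns the command shape, so shell decorators can never appear.
    cmd = [
        "uv", "run", "scripts/web_search.py",
        "--query", args["query"],
        "--max-results", str(args.get("max_results", 5)),
    ]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=120).stdout
```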
Another interesting direction would be testing that against the Gemma 4 26B A4B. It has a shared expert that's always active alongside the 8 routed experts (it may work in practice kinda like the small dense model you describe, but integrated directly into the architecture), so it would be interesting to check if that makes any difference. That said, Gemma 4 also uses interleaved sliding-window and global attention, which could introduce a whole set of other uncontrolled variables.
jinnyjuice@reddit
Aren't these on by default anyway?
The documentation isn't very helpful in explaining them. What do these do for you? Why 10 and 2?
DehydratedWater_@reddit (OP)
FlashInfer should give slightly faster attention on Ampere under concurrent load (on Hopper it's probably slightly slower vs FA3), but the difference won't be massive either way vs FlashAttention2. The sampler can be tuned for batch sizes, and here I have small batch sizes.
Yes, the prefix cache default is true. I'm not sure what the limit-mm default is, but I usually test with and without mm support, so it's easier to have it on. (It means I can send 10 images and 2 videos with one prompt.) Custom all-reduce is enabled by default, but it gets disabled at runtime anyway unless you have P2P drivers, hence the explicit --disable-custom-all-reduce. It's easier for me to see flags explicitly stated than to rely on default values.
altdotboy@reddit
You have a good test setup. I have been working on something similar for the past few weeks and will share a few lessons I learned.
Any quantization on MoE models is bad for serious production environments. Why? MoE models use a gating strategy to choose which experts to use. The gating system is very sensitive, and quantization blurs it by producing less confident connections to the correct expert. In short, using a quantized model will either get you the wrong experts or you'll be focusing too heavily on one expert. Note: most MoE quants on Hugging Face are not done well. Check gate precisions and you will see.
You can't prompt an MoE model the same way you do a dense model if you want to activate all the correct experts. Your prompt should be created in a way that it doesn't focus on one expert only.
Using the above strategies will give you better quality results on complex tasks and fewer model loops and repetition.
Most MoE quants are good for fun chat, email, and light tasks, but if you want serious production work you should use bf16 or fp16. The most heavily quantized models are actually broken and will not work well. I learned this the hard way.
Velocita84@reddit
I've never seen a gguf with quantized gating tensors, is AWQ that naive?
DehydratedWater_@reddit (OP)
That tracks. I briefly tried that workflow with Qwen3.5-35B in full precision, with unquantized context, but it still wasn't very reliable. It would probably require more substantial changes to the prompts and loosening the harness a bit. For the benchmark on Qwen3.6-35B, I got the official FP8 quant directly from Qwen and was running it without quantized memory. But based on my previous experience with Qwen3.5, running it in full precision wouldn't help much here.
EstarriolOfTheEast@reddit
My humble suggestion for the experimental design is to try more models, prompt variations, and different hyperparameter settings. There are GLM 4.5 Air, gpt-oss-120B, Stepfun flash, devstral small, and gemma4's dense and MoE models, all recent and within your range, for a start. You've been very thorough on what you did test, but the rule you posit requires a significantly broader test set.
Also, even holding the prompt fixed, consider using an API to check whether any MoEs consistently pass in your setup, as it's not likely to hold for all MoEs.
DehydratedWater_@reddit (OP)
Ah yes, I've hosted all of them except for gemma4 dense. I just found it curious how systematically the MoE versions of Qwen fail by ignoring global rules, and the failure rate stayed very similar no matter the model size or quant. Only the failure mode changed, i.e., what type of tool misuse was detected. That pointed me toward a more fundamental architectural difference in the model. I actually did a more controlled experiment comparing Qwen3.6-35B vs. Qwen3.5-35B, and the lack of adherence was statistically identical, but the distribution of errors changed. Here it is in more detail: https://dehydratedwater.dev/blog/moe-rule-binding-hypothesis/
So this was more of a controlled experiment I came upon by accident rather than a value judgment on the models themselves.
EstarriolOfTheEast@reddit
I see--it's just that your title seems to be about MoEs in general and not just Qwen models. Perhaps make the distinction clearer as "Qwen MoEs struggle with strict global rules". Have you also tried tuning the prompts and observing what difference it makes? Tool use adherence to me seems a generalization issue in Qwen models due to a lack of sufficient variation during training. Smaller MoEs especially, are more susceptible to overfitting on training regimes.
DehydratedWater_@reddit (OP)
Well, I do list only Qwen models in the title after all. But sure, it could have been more precise. This can also be expanded and verified across different variables/dimensions. Regarding the prompts, the details on how these are generated, along with tool rule examples and failure modes, are actually included in the articles. There is just a lot of it.
The scope here is narrow: only one variable changes, and it is the model. The prompt that actually works for Qwen3.5-27b (even though it was initially tuned for GLM-4.7) starts producing a consistent number of errors on MoEs from the Qwen family, all connected to the way they use the terminal. The particular MoE variant only changes the type of error made, not the number. All tested MoEs from this family, no matter the size or quant, generated a similar number of errors (which is interesting in itself). I also tested GLM-4.5-AIR, but it generated even more tool errors on the same prompt, so it was not fair to compare it directly, especially as there is no dense variant. Only testing it on the Gemma family would be a relevant expansion.
But I plan to switch the prompt around to check if there are any repeatable patterns that could help reinforce the MoEs' rule adherence. Maybe the whole trick of duplicating the prompt would be enough to fix this. Who knows.
Spare_Newspaper_9662@reddit
Great post.
Opteron67@reddit
Did read your blog, I should try that -O3 stuff
DehydratedWater_@reddit (OP)
Glad to hear that. Hope the CUDA gods are more forgiving with your setup. At least my 3090 doesn't seem to like custom all-reduce very much.
Opteron67@reddit
p2p enabled driver ?
DehydratedWater_@reddit (OP)
Ok, I see, I need this custom fork as the default driver does not support it -> https://github.com/tinygrad/open-gpu-kernel-modules
Opteron67@reddit
https://github.com/aikitoria/open-gpu-kernel-modules
DehydratedWater_@reddit (OP)
Ok, this seems very promising, but also has a non-zero chance of transforming into a quick 15-min debugging adventure that will last more than a day, and I would rather have physical access to my machine for that (it's in a different city), so I'll pin this for later, update the drivers when I'm back, and probably do another benchmark comparing the speedup.
What kind of speedup were you able to achieve with that?
Opteron67@reddit
--disable-custom-all-reduce
😭😭😭😭 noooo
DehydratedWater_@reddit (OP)
Unfortunately, it was randomly stopping and freezing while loading vLLM with custom all-reduce on; maybe I'll try that again on the next stable vLLM version, but it seems to not like NVLink too much.
Opteron67@reddit
VLLM_SKIP_P2P_CHECK=1 \ CUDA_VISIBLE_DEVICES=0,1 \ VLLM_NCCL_SO_PATH=/home/nicolas/nccl/build/lib/libnccl.so.2.30.3 \ NCCL_P2P_LEVEL=SYS \ vllm serve ...
in our case i would do: DP=2 TP=2 NCCL_P2P_LEVEL=NVL and enable expert parallelism
DehydratedWater_@reddit (OP)
Ok, and is this setup for Qwen3.6-35B? Or are you running Qwen-27b in Q4 but with limited context? I don't think 122B would fit with DP=2 unless we are talking about something stronger than a 3090?
Opteron67@reddit
DP=2 TP=2 --enable-expert-parallel for maximum batch throughput with the 35B should be awesome
DehydratedWater_@reddit (OP)
Ok, sure, I'll test that out
DehydratedWater_@reddit (OP)
Well, I've added:
And removed:
and it won't be that easy; vLLM decided on its own that custom all-reduce is not for me:
(Worker pid=509) WARNING 04-20 22:12:26 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=489) WARNING 04-20 22:12:26 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=475) WARNING 04-20 22:12:26 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=465) WARNING 04-20 22:12:26 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
I'll try to pin only 2 gpus with NVLINK this time
DehydratedWater_@reddit (OP)
Ah, I've forgotten about DP=2
DehydratedWater_@reddit (OP)
vLLM still doesn't like me; with TP=2 and DP=2 it fails
(Worker pid=990) WARNING 04-20 22:18:32 [custom_all_reduce.py:165] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=989) WARNING 04-20 22:18:32 [custom_all_reduce.py:165] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
but I'll try that just for the 2 NVLinked cards
DehydratedWater_@reddit (OP)
Looked promising, and then it crashed:
(Worker_TP1_EP1 pid=466) INFO 04-20 22:27:00 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=8
(Worker_TP0_EP0 pid=465) INFO 04-20 22:27:00 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=8
(Worker_TP1_EP1 pid=466) INFO 04-20 22:27:00 [gpu_model_runner.py:5876] Profiling CUDA graph memory: PIECEWISE=4 (largest=8), FULL=4 (largest=8)
(Worker_TP0_EP0 pid=465) INFO 04-20 22:27:00 [gpu_model_runner.py:5876] Profiling CUDA graph memory: PIECEWISE=4 (largest=8), FULL=4 (largest=8)
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:455 'invalid argument'
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:455 'invalid argument'
(EngineCore pid=324) ERROR 04-20 22:27:03 [multiproc_executor.py:273] Worker proc VllmWorker-0 died unexpectedly, shutting down executor.
(EngineCore pid=324) ERROR 04-20 22:27:03 [core.py:1108] EngineCore failed to start.
(EngineCore pid=324) ERROR 04-20 22:27:03 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=324) ERROR 04-20 22:27:03 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=324) ERROR 04-20 22:27:03 [core.py:1108] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=324) ERROR 04-20 22:27:03 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=324) ERROR 04-20 22:27:03 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
That's probably some quirk of the RTX 3090 intersecting with vLLM; the RTX 5090 probably gets more love from NVIDIA at the moment.
Opteron67@reddit
I do FP8 on dual 5090s with the P2P driver
Opteron67@reddit
NCCL_P2P_LEVEL=SYS needs a P2P-enabled driver, but in your case go NVL
vex_humanssucks@reddit
The MoE global rule observation matches what I see too. My theory is that the expert routing activates different specialised subnetworks per token, so instructions that need to be applied globally across a long generation get inconsistently weighted depending on which experts fire. Dense models keep the full residual stream in play throughout, so a rule stated at turn 1 stays accessible. Has your testing shown whether the failure is more about rule recall or rule application? i.e. if you probe mid-generation does the model seem to "know" the rule exists but ignore it, or does it actually drop it from context?
DehydratedWater_@reddit (OP)
Haven't tested that manually. The system is fully autonomous and messages sent by me are each consumed in a separate OpenCode session that has access to chat history (the system can basically decide what to do with my message, respond, do some tasks, search for something, etc.). It can also trigger messages on its own and has multiple background loops that may trigger interactions.
But the failures seem to be distributed throughout the sessions I've previewed, not concentrated at any particular point: they occur at the beginning, in the middle, and at the end. So the distance from the rule to the failed tool use doesn't seem to be the dominant factor, but I haven't tested that methodically.
DehydratedWater_@reddit (OP)
Some tools can return guidance on error, and OpenCode by default also returns tool permissions on errors, but this doesn't seem to pull the model back to the correct approach once it has already started looping.
DangerousString4435@reddit
Really great work! I learned a lot from your post and article.
But I'm just curious about your decision to test with just 1 harness. My learnings with local LLMs are that you must conform to the model somewhat to get good results. So if you customized the prompts and harness to each model (in a reasonably automated way; I have a prompt for each agent that takes an input prompt and outputs a model-customized prompt), would the results for the 35B model look better?
I've had pretty decent results with 35B so far, but I have put some work into the harness as well.
DehydratedWater_@reddit (OP)
Yeah, originally the harness was built for GLM-4.5-Air (and GLM-4.7, GLM-5, as I later bought Z.ai Max for like $256 for a year), but then new Qwen models dropped and I was able to cleanly switch to Qwen3.5-27B without any real issues. So the lack of adherence to the global rule is more of a happy accidental discovery that seems to be repeatable between MoE models, rather than a value judgment that MoEs aren't worth running.
The only cases where raw intelligence and rule adherence seem to matter more are the workloads that are difficult by design, like RALF on the whole agent suite in my case. For the rest, I'm sure I'd be able to decompose them to work on Qwen3.6-35B, but that would probably require me to break current workloads into simpler steps and tinker with prompts until I got reliable results. At least for now, Qwen3.5-27B gives me good results without prompt/harness tweaking. But I think just allowing `| head` would fix half of the problems Qwen3.6 has. Maybe I'll test that just 4fun
DehydratedWater_@reddit (OP)
But this system is like the n-th iteration. One of my past systems ran on multiple intersecting loops of uncensored Llama-3.2-3B and it was surprisingly capable, but most of the agent was embedded in the harness around the model.
DangerousString4435@reddit
Thanks for all the detail! I'm also mid development on my...6th? 8th? 20th? agent harness and workflow implementation. Maybe this will finally be the one that solves all problems haha!
I'm really enjoying building a harness/workflows around the local models this time because they expose the weaknesses in the system very quickly. I need to really think about decomposition, and even with the conversation itself I can't just let the agent run like Claude. Fun stuff.
DehydratedWater_@reddit (OP)
Trueee, pushing small LLMs far beyond their reasonable abilities is a sport in itself. And harness building seems to have some overlap with 3D printing in that way. For some people the point is just to print stuff, while for others the point of 3D printing is improving the printer itself to expand what's possible to print. Local model harnesses seem to have the same quality.
DehydratedWater_@reddit (OP)
For prompt tuning I usually use Claude Code plus tests, expressed either as unit tests or as a textual list of requirements it tries to maximize each agent for. But it takes a while to optimise the whole suite and run integration tests, so for most models I don't bother.
tmvr@reddit
The concurrency 1 results for Qwen3.6 35B seem very low, especially the prefill; isn't there a better version you can use? The 3090 has no native FP8 support, so an INT8 version would probably be faster? Even with that, the performance for running on 4 cards with tensor parallel seems very slow. I can't replicate this because I only have 1x 4090, but based on the sizes it would do about 80 tok/s decode/tg (I guess someone with a modded 48GB one could check/confirm). As for prefill, I still get over 2000 tok/s with 200K context.
DehydratedWater_@reddit (OP)
True, INT8 would be faster, but I added it to the benchmark only about 2h after it dropped, and at that time only the INT4 version was available. Even so, prefill peaks at around 4k tok/s at c=1 and around 8k tok/s at c=6; there are more detailed diagrams in the blog post. But what I'm measuring here is the actual average workload for OpenCode sessions, with lots of small requests coming and going. Most of the prompts hit the prefix cache, so there isn't even that much to parse. So this is more of a benchmark of how the system behaves under organic load over an extended period of time than of how fast a particular request is.
tmvr@reddit
OK, thanks for the explanation!
Equivalent_Bit_461@reddit
Impressive
I kneel
Makers7886@reddit
Awesome work - I've been doing similar tests on a 4x3090 system with 2 NVLinked vs 8x3090s with 122b fp8 as a baseline, and your numbers are very similar to what I am getting. 122b indeed sharpens up at higher quants; I did not encounter the issues you did with the 122b at int4. The differences I see in capabilities have been coming down to nuances that I need to review myself. Otherwise all 3 models at high precision have been within a standard deviation in my own capability-gap-finding benches.
A pattern emerging is q3.6 35b being more diverse with the same prompt/settings, with 27b being the most consistent and 122b a notch under 27b in consistency. I'm still early in testing, but I'm seeing some pretty good results by leveraging 3 concurrent 35b agents to do the same research/planning/diagnosing task and then feeding it to 27b or 122b to judge/review/consolidate.
Also, I found MTP/Dflash greatly speed up high-frequency short-context tasks but hurt cache hits and actually slow down/hurt performance in high-context situations. I now run neither MTP nor Dflash, as it's not worth the increased latency from the lack of cache hits.
DehydratedWater_@reddit (OP)
Ok, I've checked the raw data from the benchmark for KV cache. All three were measured during the same "spinning up tasks" section, so the workload shouldn't explain the difference. Qwen3.5-27B sits around ~73% prefix hit rate while Qwen3.6-35B floats around ~89%. This gap may be explained by MTP blocks slowly mutating generated output, so that when that output is reintroduced, it no longer matches/expands the prefix cache.
For Qwen3.5-122B on this setup, it looks like there simply isn't enough memory to fit all the prefixes, so more and more blocks fall into the active requests' KV cache as concurrency grows. So at that scale, MTP seems to hurt me less than the lack of cache memory itself, at least on this workload at 4x3090 scale.
Makers7886@reddit
Agreed on your MTP-mutating-output hypothesis. Also, after reading the link/article you posted (awesome job), I think your harness is less prone to it than a typical coding harness. The gains are large/well worth it for most of my project endpoints, but it's painfully obvious in something like qwen-code or even hermes agent.
Loved the article and think many could benefit from learning about "min/maxing"
DehydratedWater_@reddit (OP)
Appreciate it, glad it landed well
Medium_Chemist_4032@reddit
Is there an improvement in fp8 over int4? I'm at 4x3090 and considering going 8x. I have been using this model in OpenCode on legacy code and it's been a huge token server for me
Makers7886@reddit
Yes, I see an improvement across the board when going to 8-bit or bf16 on the smaller models. It's hard to quantify, but the benches I'm running are meant to find capability gaps between the frontier api/397b tier and 122b/27b. Just as a rough idea, 397b scores 9.6 avg and 122b fp8 was around 9.3 avg, with int4 122b dropping to the 8.9-ish range.
Imo right now 4x3090s are the most powerful they have ever been, running 27b/new 35b at full weights + vllm throughput. If the q3.6 122b drops and we see a fraction of the gains that the old 35b to new 35b showed, then I think going for 8x3090s will be worth it.
sleepy_quant@reddit
Running a similar multi-agent setup on M1 Max 64GB with A3B Q8, and the retry-instead-of-pivot behavior you're describing is exactly what I've been seeing too. I assumed my allow-list was just too aggressive. Good to know it might be architectural. Curious on the prefix caching — with sessions diverging per agent, are you actually getting cache hits past the static system prompt/tool list, or is that where the benefit stops?
DehydratedWater_@reddit (OP)
Unfortunately it depends. I've posted more on caching in the other comment thread, but generally for dense Qwen27B and Qwen3.6-35B caching seems to work fine. For 122B there probably just isn't enough free VRAM to keep it.
https://www.reddit.com/r/LocalLLaMA/comments/1sqspgy/comment/ohagye9/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
ShaneBowen@reddit
Newbie question: is splitting a model across 4 GPUs something exclusive to CUDA? Is there a reason someone doesn't just wire up 4x RX 580s to get an effective 32GB card?
DehydratedWater_@reddit (OP)
Welll, CUDA just makes it easy. Technically you can split a model however you want, between VRAM, RAM, disk, or even across different machines connected over Ethernet, but the price you pay is ease of use and speed.
For AMD cards there's a CUDA alternative called ROCm, which is basically CUDA but less reliable and not as well supported. AMD is also less willing to support old hardware with new updates, so you can get stuck on an old ROCm version and be limited to a smaller subset of models and quantizations. For the RX 580 it seems AMD has dropped new ROCm support, so you're basically limited to what works now on ROCm or the Vulkan backend. You'd also be stuck with llama.cpp or Ollama in general. Also splitting a model across GPUs is not neutral, the memory does not just sum up, so realistically some of the memory would be lost just to distribute the model across the cards.
I have a mini PC with Strix Halo, and even a brand new AMD machine is a bit annoying to set up in a way that has good throughput and doesn't crash.
But if you have time, cheap electricity, a good deal on the cards, and you like to tinker, there may be a way to run it on such a stack. But in general I wouldn't recommend it.
Potential-Leg-639@reddit
Nice one, thanks! Great setup 👍🏻
Medium_Chemist_4032@reddit
Oh dang. That's a gold nugget for the nx3090 gang. Thanks!