Qwen3.5-27B, Qwen3.5-122B, and Qwen3.6-35B on 4x RTX 3090 — MoEs struggle with strict global rules
Posted by DehydratedWater_@reddit | LocalLLaMA | View on Reddit | 55 comments
Long-time lurker, first-time poster. Ran three Qwen models through 20+ sessions of live agentic work each on 4x RTX 3090 — Qwen3.5-27B dense, Qwen3.5-122B-A10B MoE, Qwen3.6-35B-A3B MoE. Numbers below parsed from vLLM logs under constant organic load, not synthetic benchmarks.
Workload context that matters for every number in this post: the harness is a multi-agent orchestrator running 1-6 concurrent OpenCode sessions with 30-60k-token prompts, and it enforces a tight bash allow-list — exact `uv run scripts/<name>.py` patterns per tool, no shell decorators (`| head`, `| tail`, `timeout`, `2>&1`), no absolute paths on Read, no `cd && ...` chains. That makes rule-following measurably different from a looser harness where those shapes go through.
All three routed MoEs are systematically worse than the dense 27B at holding those strict global rules — size, active-param count, and fine-tune target don't change it much. Speed numbers first for context, rule-following gap afterward.
Models and quants, each picked to maximise quality while fitting 262k context on 4x24GB:
- Qwen3.5-27B dense — INT8 (AWQ-BF16-INT8) weights, FP8 KV, MTP speculative decoding
- Qwen3.5-122B-A10B MoE — AWQ-INT4 weights, FP8 KV. Q4 is the only way it fits alongside 262k context
- Qwen3.6-35B-A3B MoE — FP8 weights, FP16 KV (FP8 KV was unstable on this model)
Smaller models get all the precision they can use, bigger models get only as much as fits. Tables below are at 250W (sweet spot from testing 200/250/300W). vLLM v0.19.0.
How the data is collected: vLLM emits Avg prompt throughput, Avg generation throughput, and Running: N reqs every 10s. Each cell is the mean of windows at that concurrency — n=6 ≈ 60s of wall time at that state. Idle windows count; this is sustained throughput, not peak.
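For anyone reproducing this, a minimal sketch of the bucketing in Python (illustrative, not the exact script I ran; the regex assumes the three fields above appear on one metrics line and may need adjusting for your vLLM version):

```python
import re
from collections import defaultdict

# Minimal sketch of the bucketing described above (illustrative, not the exact
# analysis script). Assumes each 10-second metrics line carries all three fields;
# adjust the regex to whatever your vLLM version actually prints.
METRICS = re.compile(
    r"Avg prompt throughput: ([\d.]+) tokens/s.*"
    r"Avg generation throughput: ([\d.]+) tokens/s.*"
    r"Running: (\d+) reqs"
)

def bucket_by_concurrency(log_path, active_prefill_only=False):
    """One 10-second window = one sample; returns per-concurrency averages and n."""
    windows = defaultdict(list)
    with open(log_path) as f:
        for line in f:
            m = METRICS.search(line)
            if not m:
                continue
            prefill, gen, running = float(m[1]), float(m[2]), int(m[3])
            if running == 0:
                continue  # nothing running, no concurrency state to attribute
            if active_prefill_only and prefill == 0.0:
                continue  # drop pure-generation windows (used for the second prefill table)
            windows[running].append((prefill, gen))
    return {
        c: {
            "prefill_tps": sum(p for p, _ in w) / len(w),
            "gen_tps": sum(g for _, g in w) / len(w),
            "n": len(w),
        }
        for c, w in sorted(windows.items())
    }
```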

Generation throughput by concurrency (250W, avg t/s)
n in parentheses is the sample count (number of 10-second windows).
| Concurrent reqs | Qwen3.5-27B (n) | Qwen3.5-122B (n) | Qwen3.6-35B (n) |
|---|---|---|---|
| 1 | 85 (8) | 74 (21) | 122 (90) |
| 2 | 97 (28) | 48 (13) | 174 (34) |
| 3 | 133 (36) | 111 (9) | 215 (16) |
| 4 | 112 (19) | 123 (9) | 288 (8) |
| 5 | 68 (34) | 138 (17) | 348 (4) |
| 6 | 98 (16) | 33 (3) | 296 (5) |
The 3.6-35B runs away with generation at every level. The 122B is uneven (c=2 dip to 48 t/s, c=6 drop to 33 at n=3) but internally coherent across c=3-5. The 27B sits between the two, and is the tightest of the three across the concurrency range — its variance per cell is the smallest, even where its average is below the 122B at c=4-5.
Prefill throughput by concurrency (250W, avg t/s)
Same n convention as the generation table above (each cell's n is the same for both tables — one window = one data point with both prefill and generation values). Prefill is averaged over all windows at that concurrency, including ones where the engine spent the window purely generating (prefill=0). That's the more honest representation of sustained prefill throughput at that concurrency state. 122B c=6 at n=3 is noise-dominated.
| Concurrent reqs | Qwen3.5-27B (n) | Qwen3.5-122B (n) | Qwen3.6-35B (n) |
|---|---|---|---|
| 1 | 926 (8) | 573 (21) | 626 (90) |
| 2 | 553 (28) | 2343 (13) | 1589 (34) |
| 3 | 364 (36) | 1849 (9) | 1799 (16) |
| 4 | 726 (19) | 2499 (9) | 1856 (8) |
| 5 | 1001 (34) | 1754 (17) | 1896 (4) |
| 6 | 1427 (16) | 2480 (3) | 2983 (5) |
Aggregate sustained averages (c=1-6, all windows at 250W): Qwen3.5-27B ~756 t/s, Qwen3.5-122B ~1651 t/s, Qwen3.6-35B ~1124 t/s. The 122B still wins prefill by roughly 2x. With prefix caching handling most of the 30-60k tokens on any given turn, the uncached tail is only a few thousand tokens per turn, so the 122B lead matters less in practice than on paper.
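A quick back-of-the-envelope with the sustained averages above, assuming roughly 3k uncached tokens per turn (an assumption consistent with "a few thousand", not a measured value):

```python
# Back-of-the-envelope for the "matters less in practice" point. The ~3k uncached
# tokens per turn is an assumed figure, not measured per-turn data.
uncached_tokens = 3_000
sustained_prefill_tps = {"Qwen3.5-27B": 756, "Qwen3.5-122B": 1651, "Qwen3.6-35B": 1124}

for model, tps in sustained_prefill_tps.items():
    print(f"{model}: ~{uncached_tokens / tps:.1f} s of prefill per turn")
# roughly 4.0 s vs 1.8 s vs 2.7 s per turn: a couple of seconds of difference,
# small next to generation time on a long agent turn.
```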
Prefill throughput when actively prefilling (zero-prefill windows excluded)
If you want "when the engine is actually processing a prompt, how fast does it go?" instead of the sustained average, the numbers below drop all windows where prefill=0 from each cell's average. n in parens is the count of prefill-active windows in each cell, so it varies per cell.
| Concurrent reqs | Qwen3.5-27B (n) | Qwen3.5-122B (n) | Qwen3.6-35B (n) |
|---|---|---|---|
| 1 | 1235 (6) | 669 (18) | 751 (75) |
| 2 | 860 (18) | 2769 (11) | 1743 (31) |
| 3 | 505 (26) | 2377 (7) | 1799 (16) |
| 4 | 985 (14) | 3213 (7) | 1856 (8) |
| 5 | 1260 (27) | 1987 (15) | 1896 (4) |
| 6 | 1757 (13) | 3720 (2) | 2983 (5) |
Aggregate active-only: Qwen3.5-27B ~1025 t/s, Qwen3.5-122B ~2155 t/s, Qwen3.6-35B ~1124 t/s. The sustained table above is closer to what an agent pipeline actually experiences averaged across its concurrency states; this table is closer to what vLLM can deliver when it's actually prefilling. Pick based on whether you care about "what does my agent stack do" or "what is this model capable of".
Completed requests per minute (250W)
Token rates are one thing; how many actual tasks finish per minute is another. Counted by tallying `POST /v1/chat/completions HTTP/1.1" 200` log lines per 10-second window and bucketing by the concurrency at that window (a minimal sketch of the tally follows the table). Mixed-task (short and long responses both count as 1), so this is a functional-throughput metric for the workload mix, not a per-task latency figure.
| Concurrent reqs | Qwen3.5-27B | Qwen3.5-122B | Qwen3.6-35B |
|---|---|---|---|
| 1 | 8.2/min | 9.1/min | 14.9/min |
| 2 | 6.6/min | 9.7/min | 23.1/min |
| 3 | 6.7/min | 10.0/min | 26.6/min |
| 4 | 7.3/min | 10.0/min | 36.8/min |
| 5 | 7.8/min | 8.8/min | 27.0/min |
| 6 | 13.9/min | 12.0/min | 45.6/min |
3.6-35B finishes 2-4x more requests per minute than either sibling across most concurrency levels (the gap is smallest at c=1, biggest around c=4). The 27B holds a flat ~7/min across c=1-5 (slow-but-steady). The 122B saturates at ~9-10/min from c=2 onward — adding concurrency past 2 doesn't help it finish more work, it just spreads across more queued requests.
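For reference, a minimal sketch of the tally described above the table (illustrative; it assumes you've already counted the 200-response lines per 10-second window and looked up each window's concurrency, which is the fiddly part):

```python
from collections import defaultdict

# Illustrative tally for the completions/min table. `windows` is a list of
# (concurrency, completions_in_window) pairs, one per 10-second window, built by
# counting `POST /v1/chat/completions HTTP/1.1" 200` access-log lines per window.
def completions_per_minute(windows):
    per_level = defaultdict(lambda: [0, 0])  # concurrency -> [completions, window count]
    for concurrency, completed in windows:
        per_level[concurrency][0] += completed
        per_level[concurrency][1] += 1
    # six 10-second windows per minute
    return {c: 6 * done / n for c, (done, n) in sorted(per_level.items())}
```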
The rule-following gap
Oranges-to-oranges across ~20 sessions of comparable workloads (same task types, never the exact same query twice):
| Model | Sessions | Tool calls | Errors | Err/tool |
|---|---|---|---|---|
| qwen3.5-27b (dense) | 21 | 161 | 9 | 5.6% |
| qwen3.5-122b-a10b (MoE) | 17 | 128 | 13 | 10.2% |
| qwen3.6-35b-a3b (MoE) | 20 | 158 | 19 | 12.0% |
The dense 27B makes about half the tool-call errors of either MoE. I added Qwen3.5-35B-A3B as a control — same architecture as the 3.6-35B (identical 35B total / 3B active / 256 experts top-8), only the fine-tune differs. It landed at 11.3%. Three routed MoEs spanning 3B to 10B active parameters, 8M to 20M per-expert capacity, and completely different fine-tune targets — all sit in a narrow 10-12% error band. The architecture caps the rate; post-training only moves which kinds of errors happen, not how often.
How the models fail matters more than how often. On a long multi-stage research task where each stage ends with a 3-call state handshake, the 3.6-35B could not finish a single stage. It kept retrying denied bash variants (`ls scripts/ | grep -E "search|web"`, `curl -s 'https://...'`, invented flags like `--no-agent`, hallucinated scripts like `youtube_fetcher.py`) and burned its turn budget without emitting the state transition. The 27B later picked up the exact task instance the 3.6-35B had stalled on and finished it cleanly — it pivoted to a different allowed script on the first denial.
The pattern holds across all three MoEs: retry variants of the same blocked shape (`| head -5` → `| head -10` → `| tail -3`) rather than change strategy. The dense pivots. My reading: routing loses rule specificity — each token activates a small slice, and context-specified rules compete with pretraining priors for "what bash looks like". Shell idioms have a dense prior, custom allow-lists don't, and post-training changes which idioms leak, not whether they leak.
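To make the mechanism concrete, here's a minimal sketch of the kind of allow-list check the harness applies. The regex, banned shapes, and script names are illustrative, not the real harness rules:

```python
import re

# Minimal sketch of the kind of allow-list check described above. The real harness
# rules are per-tool and stricter; the pattern and script names here are illustrative.
ALLOWED = re.compile(r"uv run scripts/[\w\-]+\.py(\s+[\w\-./=]+)*")
BANNED_SHAPES = ("|", "2>&1", "&&", "||", "timeout ", "cd ")

def check_bash(command: str) -> tuple[bool, str]:
    cmd = command.strip()
    for shape in BANNED_SHAPES:
        if shape in cmd:
            return False, f"denied: shell decorator/chain {shape!r} not allowed"
    if not ALLOWED.fullmatch(cmd):
        return False, "denied: only `uv run scripts/<name>.py ...` is allowed"
    return True, "ok"

# The failure mode in the post: a denial like the first one below tends to make the
# MoEs retry `| head -10`, `| tail -3`, ... instead of switching to an allowed script.
print(check_bash("ls scripts/ | grep -E 'search|web'"))    # (False, "denied: ...")
print(check_bash("uv run scripts/web_search.py --q foo"))  # (True, "ok")
```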
Configs
Hardware context that explains the flags: 4x RTX 3090, two NVLinked + two PCI-only, all undervolted and pinned at 250W each. --disable-custom-all-reduce works around vLLM's topology confusion on the mixed-link setup. -O3 is worth the coldstart + extra VRAM for the throughput it buys on both prefill and generation.
Two Qwen3-specific flag notes before the configs, in case anyone copy-pastes onto a different family: --reasoning-parser qwen3 only applies to Qwen3 thinking models (will fail on non-thinking variants); the qwen3_next_mtp speculative decoding method in the 27B config is Qwen3.5-Next-specific and won't work on other model families.
Qwen3.5-27B (my daily driver)
name: vllm-thinking
services:
vllm:
image: vllm/vllm-openai:v0.19.0
restart: unless-stopped
runtime: nvidia
shm_size: 8gb
ipc: host
environment:
- NVIDIA_VISIBLE_DEVICES=0,2,3,4
- CUDA_DEVICE_ORDER=PCI_BUS_ID
- RAY_memory_monitor_refresh_ms=0
- NCCL_CUMEM_ENABLE=0
- NCCL_NVLINK_DISABLE=0
- VLLM_ENABLE_CUDAGRAPH_GC=1
- VLLM_USE_FLASHINFER_SAMPLER=1
- PYTORCH_ALLOC_CONF=expandable_segments:True
volumes:
- "/mnt/ssd-4tb/ai_models/models/hub:/root/.cache/huggingface/hub"
ports:
- "8082:8000"
command: >
--model cyankiwi/Qwen3.5-27B-AWQ-BF16-INT8
--served-model-name cyankiwi/Qwen3.5-27B-AWQ-BF16-INT8
--quantization compressed-tensors
--port 8000
--host 0.0.0.0
--tensor-parallel-size 4
-O3
--max-model-len 262144
--gpu-memory-utilization 0.9
--dtype auto
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--reasoning-parser qwen3
--limit-mm-per-prompt '{"image":10,"video":2}'
--enable-prefix-caching
--disable-custom-all-reduce
--kv-cache-dtype fp8
--max-num-seqs 12
--max-num-batched-tokens 8192
--compilation-config '{"cudagraph_capture_sizes":[1,2,4,8,12]}'
--trust-remote-code
--no-use-tqdm-on-load
--generation-config auto
--attention-backend FLASHINFER
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
--override-generation-config '{"temperature":1.0,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":1.5,"repetition_penalty":1.0}'
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 300s
Sampling is the "general thinking" preset (temperature 1.0, top_p 0.95, top_k 20, presence_penalty 1.5). The coding-thinking preset had agents looping or repeating the same action, worse on MoEs. --max-num-seqs 12 matches the cudagraph capture sizes. MTP with 2 speculative tokens is stable; 3+ starts causing random crashes.
Qwen3.5-122B-A10B (when I want raw prefill)
name: vllm-thinking
services:
vllm:
image: vllm/vllm-openai:v0.19.0
restart: unless-stopped
runtime: nvidia
shm_size: 8gb
ipc: host
environment:
- NVIDIA_VISIBLE_DEVICES=0,2,3,4
- CUDA_DEVICE_ORDER=PCI_BUS_ID
- RAY_memory_monitor_refresh_ms=0
- NCCL_CUMEM_ENABLE=0
- NCCL_NVLINK_DISABLE=0
- VLLM_ENABLE_CUDAGRAPH_GC=1
- VLLM_USE_FLASHINFER_SAMPLER=1
- PYTORCH_ALLOC_CONF=expandable_segments:True
volumes:
- "/mnt/ssd-4tb/ai_models/models/hub:/root/.cache/huggingface/hub"
ports:
- "8082:8000"
command: >
--model QuantTrio/Qwen3.5-122B-A10B-AWQ
--served-model-name QuantTrio/Qwen3.5-122B-A10B-AWQ
--port 8000
--host 0.0.0.0
--tensor-parallel-size 4
--enable-expert-parallel
-O3
--max-model-len 262144
--gpu-memory-utilization 0.94
--kv-cache-dtype fp8
--dtype auto
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--reasoning-parser qwen3
--limit-mm-per-prompt '{"image":10,"video":2}'
--enable-prefix-caching
--disable-custom-all-reduce
--max-num-seqs 8
--max-num-batched-tokens 8192
--compilation-config '{"cudagraph_capture_sizes":[1,2,4,8]}'
--trust-remote-code
--quantization awq_marlin
--attention-backend FLASHINFER
--no-use-tqdm-on-load
--generation-config auto
--override-generation-config '{"temperature":1.0,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":1.5,"repetition_penalty":1.0}'
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 600s
--enable-expert-parallel is the MoE-specific addition. --max-num-seqs 8 because at AWQ-INT4 weights + FP8 KV + 262k context that's the largest cudagraph batch size that fits across 4x24GB without OOM during startup. In practice per-request throughput collapses past 3-4 concurrent on long prompts anyway; 8 is for handling bursts of small tool calls.
Qwen3.6-35B-A3B (speed king, coding-tuned)
name: vllm-thinking
services:
vllm:
image: vllm/vllm-openai:v0.19.0
restart: unless-stopped
runtime: nvidia
shm_size: 8gb
ipc: host
environment:
- NVIDIA_VISIBLE_DEVICES=0,2,3,4
- CUDA_DEVICE_ORDER=PCI_BUS_ID
- RAY_memory_monitor_refresh_ms=0
- NCCL_CUMEM_ENABLE=0
- NCCL_NVLINK_DISABLE=0
- VLLM_ENABLE_CUDAGRAPH_GC=1
- VLLM_USE_FLASHINFER_SAMPLER=1
- PYTORCH_ALLOC_CONF=expandable_segments:True
volumes:
- "/mnt/ssd-4tb/ai_models/models/hub:/root/.cache/huggingface/hub"
ports:
- "8082:8000"
command: >
--model Qwen/Qwen3.6-35B-A3B-FP8
--served-model-name Qwen/Qwen3.6-35B-A3B-FP8
--port 8000
--host 0.0.0.0
--tensor-parallel-size 4
--enable-expert-parallel
-O3
--max-model-len 262144
--gpu-memory-utilization 0.94
--dtype auto
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--reasoning-parser qwen3
--limit-mm-per-prompt '{"image":10,"video":2}'
--enable-prefix-caching
--disable-custom-all-reduce
--max-num-seqs 8
--max-num-batched-tokens 8192
--compilation-config '{"cudagraph_capture_sizes":[1,2,4,8]}'
--trust-remote-code
--no-use-tqdm-on-load
--attention-backend FLASHINFER
--generation-config auto
--override-generation-config '{"temperature":1.0,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":1.5,"repetition_penalty":1.0}'
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 300s
No --kv-cache-dtype fp8 — 3.6-35B is unstable with FP8 KV, runs on default FP16 KV instead.
Takeaways
- MoEs leak pretraining shell habits when the harness bans them. All three routed Qwen MoEs sat in a 10-12% tool-call error band vs 5.6% for the dense 27B; fine-tune target doesn't close it. This is the post's actual news; everything else is operational detail.
- MoEs are great for throughput-bound work and coding agents whose harnesses allow the shell idioms they reach for (`| head`, `timeout`, `2>&1`, `&&`/`||` chains). If your harness denies those, you'll fight the model all day.
- Per-request generation throughput drops off past 3-4 concurrent on all three. Keep concurrency low if per-agent latency matters.
- 250W is the sweet spot for the 27B. The 3.6-35B actually scales with power (300W gives 74% more generation than 250W). The 122B scales monotonically too (200W: 59 → 250W: 84 → 300W: 98 t/s aggregate), though per-cell variance stays wider than the 27B at any power.
- Quantization matters more for MoEs. INT8 on the dense 27B is clean; AWQ-INT4 on the 122B produces garbled tool calls that never happened on the dense model.
More details
- Full writeup with per-power tables, per-request throughput, tokens-per-watt, and the failure-class breakdown by model: https://dehydratedwater.dev/blog/qwen35-4x3090-optimal-agentic-inteligence
- Hypothesis for why the MoE rule-following ceiling looks structural (four-Qwen analysis, confounds ruled out): https://dehydratedwater.dev/blog/moe-rule-binding-hypothesis
Curious if anyone else running MoEs against strict allow-lists has seen similar rule-following patterns — or whether my harness is just unusually strict. Also happy to answer config questions.
ai_guy_nerd@reddit
The observation that MoEs struggle with strict global rules compared to dense models is fascinating. It suggests that the routing mechanism might be bypassing some of the critical instruction-following neurons that a dense model hits every time. When running an agentic harness with a tight bash allow-list, that consistency is everything.
One way to mitigate this is to use a very small, dense guard model to validate the output of the MoE before it hits the shell. It adds a bit of latency but prevents the agent from drifting into forbidden patterns. This kind of verification layer is often necessary when the cost of a shell failure is high.
A similar approach to orchestration is used in OpenClaw to ensure tool-calls remain within safe bounds. It is interesting to see the performance gap persist even with the massive parameter count of the 122B MoE.
DehydratedWater_@reddit (OP)
Yeah, I was also wondering if the problem can be mitigated, e.g., by wrapping Bash tool calls into native JSON-based tools. That would remove bash priors that may be overfitted for programming harnesses, forcing the model to use different paths.
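Roughly what I mean, as a sketch (the tool and script names are made up for illustration): the model only ever emits JSON arguments, and the harness owns the command shape, so there's no bash surface left for the priors to leak into.

```python
import subprocess

# Sketch of the "wrap bash into native JSON tools" idea. Tool/script names are
# illustrative; the model emits only JSON arguments and the harness builds the
# actual `uv run scripts/...` command itself.
WEB_SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web (maps to scripts/web_search.py).",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "max_results": {"type": "integer", "default": 5},
            },
            "required": ["query"],
        },
    },
}

def run_web_search(args: dict) -> str:
    # The harness owns the command shape, so shell decorators can never appear.
    cmd = [
        "uv", "run", "scripts/web_search.py",
        "--query", args["query"],
        "--max-results", str(args.get("max_results", 5)),
    ]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=120).stdout
```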
Another interesting direction would be testing that against the Gemma 4 26B A4B. It has a shared expert that's always active alongside the 8 routed experts (it may work in practice kinda like the small dense model you describe, but integrated directly into the architecture), so it would be interesting to check if that makes any difference. That said, Gemma 4 also uses interleaved sliding-window and global attention, which could introduce a whole set of other uncontrolled variables.
jinnyjuice@reddit
Aren't these on by default anyway?
The documentation isn't very helpful in explaining them. What do these do for you? Why 10 and 2?
DehydratedWater_@reddit (OP)
FlashInfer should give slightly faster attention on Ampere under concurrent load (on Hopper it's probably slightly slower vs FA3), but the difference won't be massive either way vs FlashAttention2. The sampler can be tuned for batch sizes, and here I have small batch sizes.
Yes, the prefix cache default is true. I'm not sure what the limit-mm default is, but I usually test with and without mm support, so it's easier to have it on. (It means I can send 10 images and 2 videos with one prompt.) Custom all-reduce is enabled by default, but it gets disabled at runtime anyway unless you have P2P drivers, hence the explicit --disable-custom-all-reduce. It's easier for me to see flags explicitly stated than to rely on default values.
altdotboy@reddit
You have a good test setup. I have been working on something similar for the past few weeks and will share a few lessons I learned.
Any quantization on MoE models is bad for serious production environments. Why? MoE models use a gating strategy to choose which experts to use. The gating system is very sensitive, and quantization blurs it by producing less confident connections to the correct expert. In short, using a quantized model will either get you the wrong experts or you'll be focusing too heavily on one expert. Note: most MoE quants on Hugging Face are not done well. Check gate precisions and you will see.
You can't prompt an MoE model the same way you do a dense model if you want to activate all the correct experts. Your prompt should be created in a way that it doesn't focus on one expert only.
Using the above strategies will give you better quality results on complex tasks and fewer model loops and repetition.
Most MoE quants are good for fun chat, email, and light tasks, but if you want serious production work you should use bf16 or fp16. The most heavily quantized models are actually broken and will not work well. I learned this the hard way.
Velocita84@reddit
I've never seen a gguf with quantized gating tensors, is AWQ that naive?
DehydratedWater_@reddit (OP)
That tracks. I briefly tried that workflow with Qwen3.5-35B in full precision, with unquantized context, but it still wasn't very reliable. It would probably require more substantial changes to the prompts and loosening the harness a bit. For the benchmark on Qwen3.6-35B, I got the official FP8 quant directly from Qwen and was running it without quantized memory. But based on my previous experience with Qwen3.5, running it in full precision wouldn't help much here.
EstarriolOfTheEast@reddit
My humble suggestion for the experimental design is to try more models, prompt variations, and different hyperparameter settings. There are GLM 4.5 Air, gpt-oss-120B, Stepfun flash, devstral small, and gemma4's dense and MoE models, all recent and within your range, for a start. You've been very thorough on what you did test, but the rule you posit requires a significantly broader test set.
Also, even holding the prompt fixed, consider using an API to check whether any MoEs consistently pass in your setup, as it's not likely to hold for all MoEs.
DehydratedWater_@reddit (OP)
Ah yes, I've hosted all of them except for gemma4 dense. I just found it curious how systematically the MoE versions of Qwen fail by ignoring global rules, and the failure rate stayed very similar no matter the model size or quant. Only the failure mode changed, i.e., what type of tool misuse was detected. That pointed me toward a more fundamental architectural difference in the model. I actually did a more controlled experiment comparing Qwen3.6-35B vs. Qwen3.5-35B, and the lack of adherence was statistically identical, but the distribution of errors changed. Here it is in more detail: https://dehydratedwater.dev/blog/moe-rule-binding-hypothesis/
So this was more of a controlled experiment I came upon by accident rather than a value judgment on the models themselves.
EstarriolOfTheEast@reddit
I see--it's just that your title seems to be about MoEs in general and not just Qwen models. Perhaps make the distinction clearer as "Qwen MoEs struggle with strict global rules". Have you also tried tuning the prompts and observing what difference it makes? Tool use adherence to me seems a generalization issue in Qwen models due to a lack of sufficient variation during training. Smaller MoEs especially, are more susceptible to overfitting on training regimes.
DehydratedWater_@reddit (OP)
Well, I do list only Qwen models in the title after all. But sure, it could have been more precise. This can also be expanded and verified across different variables/dimensions. Regarding the prompts, the details on how these are generated, along with tool rule examples and failure modes, are actually included in the articles. There is just a lot of it.
The scope here is narrow: only one variable changes, and it is the model. The prompt that actually works for Qwen3.5-27b (even though it was initially tuned for GLM-4.7) starts producing a consistent number of errors on MoEs from the Qwen family, all connected to the way they use the terminal. The particular MoE variant only changes the type of error made, not the number. All tested MoEs from this family, no matter the size or quant, generated a similar number of errors (which is interesting in itself). I also tested GLM-4.5-AIR, but it generated even more tool errors on the same prompt, so it was not fair to compare it directly, especially as there is no dense variant. Only testing it on the Gemma family would be a relevant expansion.
But I plan to switch the prompt around to check if there are any repeatable patterns that could help reinforce the MoEs' rule adherence. Maybe the whole trick of duplicating the prompt would be enough to fix this. Who knows.
Spare_Newspaper_9662@reddit
Great post.
Opteron67@reddit
Did read your blog, I should try that -O3 stuff
DehydratedWater_@reddit (OP)
Glad to hear that. Hope the CUDA gods are more forgiving with your setup. At least my 3090 doesn't seem to like custom all-reduce very much.
Opteron67@reddit
p2p enabled driver ?
DehydratedWater_@reddit (OP)
Ok, I see, I need this custom fork as the default driver does not support it -> https://github.com/tinygrad/open-gpu-kernel-modules
Opteron67@reddit
https://github.com/aikitoria/open-gpu-kernel-modules
DehydratedWater_@reddit (OP)
Ok, this seems very promising, but also has a non-zero chance of transforming into a quick 15-min debugging adventure that will last more than a day, and I would rather have physical access to my machine for that (it's in a different city), so I'll pin this for later, update the drivers when I'm back, and probably do another benchmark comparing the speedup.
What kind of speedup were you able to achieve with that?
Opteron67@reddit
--disable-custom-all-reduce
😭😭😭😭 noooo
DehydratedWater_@reddit (OP)
Unfortunately, it was randomly stopping and freezing while loading vLLM with custom all-reduce on; maybe I'll try that again on the next stable vLLM version, but it seems to not like NVLink too much.
Opteron67@reddit
VLLM_SKIP_P2P_CHECK=1 \ CUDA_VISIBLE_DEVICES=0,1 \ VLLM_NCCL_SO_PATH=/home/nicolas/nccl/build/lib/libnccl.so.2.30.3 \ NCCL_P2P_LEVEL=SYS \ vllm serve ...
in our case i would do: DP=2 TP=2 NCCL_P2P_LEVEL=NVL and enable expert parallelism
DehydratedWater_@reddit (OP)
Ok, and is this setup for Qwen3.6-35B? Or are you running Qwen-27b in Q4 but with limited context? I don't think 122B would fit with DP=2 unless we are talking about something stronger than a 3090?
Opteron67@reddit
DP=2 TP=2 --enable-expert-parallel for maximum batch throughput with the 35B should be awesome
DehydratedWater_@reddit (OP)
Ok, sure, I'll test that out
DehydratedWater_@reddit (OP)
Well, I've added:
And removed:
and it won't be that easy; vLLM decided on its own that custom all-reduce is not for me:
(Worker pid=509) WARNING 04-20 22:12:26 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=489) WARNING 04-20 22:12:26 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=475) WARNING 04-20 22:12:26 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=465) WARNING 04-20 22:12:26 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
I'll try to pin only 2 gpus with NVLINK this time
DehydratedWater_@reddit (OP)
Ah, I've forgotten about DP=2
DehydratedWater_@reddit (OP)
vLLM still doesn't like me; with TP=2 and DP=2 it fails
(Worker pid=990) WARNING 04-20 22:18:32 [custom_all_reduce.py:165] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=989) WARNING 04-20 22:18:32 [custom_all_reduce.py:165] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
but I'll try that just for the 2 NVLinked cards
DehydratedWater_@reddit (OP)
Looked promising, and then it crashed:
(Worker_TP1_EP1 pid=466) INFO 04-20 22:27:00 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=8
(Worker_TP0_EP0 pid=465) INFO 04-20 22:27:00 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=8
(Worker_TP1_EP1 pid=466) INFO 04-20 22:27:00 [gpu_model_runner.py:5876] Profiling CUDA graph memory: PIECEWISE=4 (largest=8), FULL=4 (largest=8)
(Worker_TP0_EP0 pid=465) INFO 04-20 22:27:00 [gpu_model_runner.py:5876] Profiling CUDA graph memory: PIECEWISE=4 (largest=8), FULL=4 (largest=8)
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:455 'invalid argument'
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:455 'invalid argument'
(EngineCore pid=324) ERROR 04-20 22:27:03 [multiproc_executor.py:273] Worker proc VllmWorker-0 died unexpectedly, shutting down executor.
(EngineCore pid=324) ERROR 04-20 22:27:03 [core.py:1108] EngineCore failed to start.
(EngineCore pid=324) ERROR 04-20 22:27:03 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=324) ERROR 04-20 22:27:03 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=324) ERROR 04-20 22:27:03 [core.py:1108] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=324) ERROR 04-20 22:27:03 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=324) ERROR 04-20 22:27:03 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
That's probably some quirk of the RTX 3090 intersecting with vLLM; the RTX 5090 probably gets more love from NVIDIA at the moment.
Opteron67@reddit
I do FP8 on dual 5090s with the P2P driver
Opteron67@reddit
NCCL_P2P_LEVEL=SYS needs a P2P-enabled driver, but in your case go NVL
vex_humanssucks@reddit
The MoE global rule observation matches what I see too. My theory is that the expert routing activates different specialised subnetworks per token, so instructions that need to be applied globally across a long generation get inconsistently weighted depending on which experts fire. Dense models keep the full residual stream in play throughout, so a rule stated at turn 1 stays accessible. Has your testing shown whether the failure is more about rule recall or rule application? i.e. if you probe mid-generation does the model seem to "know" the rule exists but ignore it, or does it actually drop it from context?
DehydratedWater_@reddit (OP)
Haven't tested that manually. The system is fully autonomous and messages sent by me are each consumed in a separate OpenCode session that has access to chat history (the system can basically decide what to do with my message, respond, do some tasks, search for something, etc.). It can also trigger messages on its own and has multiple background loops that may trigger interactions.
But the failures seem to be distributed throughout the sessions I've previewed, not concentrated at any particular point: they occur at the beginning, in the middle, and at the end. So the distance from the rule to the failed tool use doesn't seem to be the dominant factor, but I haven't tested that methodically.
DehydratedWater_@reddit (OP)
Some tools can return guidance on error, and OpenCode by default also returns tool permissions on errors, but this doesn't seem to pull the model back to the correct approach once it has already started looping.
DangerousString4435@reddit
Really great work! I learned a lot from your post and article.
But I'm just curious about your decision to test with just 1 harness. My learnings with local LLMs are that you must conform to the model somewhat to get good results. So if you customized the prompts and harness to each model (in a reasonably automated way; I have a prompt for each agent that takes an input prompt and outputs a model-customized prompt), would the results for the 35B model look better?
I've had pretty decent results with 35B so far, but I have put some work into the harness as well.
DehydratedWater_@reddit (OP)
Yeah, originally the harness was built for GLM-4.5-Air (and GLM-4.7, GLM-5, as I later bought Z.ai Max for like $256 for a year), but then new Qwen models dropped and I was able to cleanly switch to Qwen3.5-27B without any real issues. So the lack of adherence to the global rule is more of a happy accidental discovery that seems to be repeatable between MoE models, rather than a value judgment that MoEs aren't worth running.
The only cases where raw intelligence and rule adherence seem to matter more are the workloads that are difficult by design, like RALF on the whole agent suite in my case. For the rest, I'm sure I'd be able to decompose them to work on Qwen3.6-35B, but that would probably require me to break current workloads into simpler steps and tinker with prompts until I got reliable results. At least for now, Qwen3.5-27B gives me good results without prompt/harness tweaking. But I think just allowing `| head` would fix half of the problems Qwen3.6 has. Maybe I'll test that just 4fun
DehydratedWater_@reddit (OP)
But this system is like the n-th iteration. One of my past systems ran on multiple intersecting loops of uncensored Llama-3.2-3B and it was surprisingly capable, but most of the agent was embedded in the harness around the model.
DangerousString4435@reddit
Thanks for all the detail! I'm also mid development on my...6th? 8th? 20th? agent harness and workflow implementation. Maybe this will finally be the one that solves all problems haha!
I'm really enjoying building a harness/workflows around the local models this time because they expose the weaknesses in the system very quickly. I need to really think about decomposition, and even with the conversation itself I can't just let the agent run like Claude. Fun stuff.
DehydratedWater_@reddit (OP)
Trueee, pushing small LLMs far beyond their reasonable abilities is a sport in itself. And harness building seems to have some overlap with 3D printing in that way. For some people the point is just to print stuff, while for others the point of 3D printing is improving the printer itself to expand what's possible to print. Local model harnesses seem to have the same quality.
DehydratedWater_@reddit (OP)
For prompt tuning I usually use Claude Code plus tests, expressed either as unit tests or as a textual list of requirements it tries to maximize each agent for. But it takes a while to optimise the whole suite and run integration tests, so for most models I don't bother.
tmvr@reddit
The concurrency 1 results for Qwen3.6 35B seem very low, especially the prefill; isn't there a better version you can use? The 3090 has no native FP8 support, so an INT8 version would probably be faster? Even with that, the performance for running on 4 cards with tensor parallel seems very slow. I can't replicate this because I only have 1x 4090, but based on the sizes it would do about 80 tok/s decode/tg (I guess someone with a modded 48GB one could check/confirm). As for prefill, I still get over 2000 tok/s with 200K context.
DehydratedWater_@reddit (OP)
True, INT8 would be faster, but I added it to the benchmark only about 2h after it dropped, and at that time only the INT4 version was available. Even so, prefill peaks at around 4k tok/s at c=1 and around 8k tok/s at c=6; there are more detailed diagrams in the blog post. But what I'm measuring here is the actual average workload for OpenCode sessions, with lots of small requests coming and going. Most of the prompts hit the prefix cache, so there isn't even that much to parse. So this is more of a benchmark of how the system behaves under organic load over an extended period of time than of how fast a particular request is.
tmvr@reddit
OK, thanks for the explanation!
Equivalent_Bit_461@reddit
Impressive
I kneel
Makers7886@reddit
Awesome work - I've been doing similar tests on a 4x3090 system with 2 NVLinked vs 8x3090s with 122b fp8 as a baseline, and your numbers are very similar to what I am getting. 122b indeed sharpens up at higher quants; I did not encounter the issues you did with the 122b at int4. The differences I see in capabilities have been coming down to nuances that I need to review myself. Otherwise all 3 models at high precision have been within a standard deviation in my own capability-gap-finding benches.
A pattern emerging is q3.6 35b being more diverse with the same prompt/settings, with 27b being the most consistent and 122b a notch under 27b in consistency. I'm still early in testing, but I'm seeing some pretty good results by leveraging 3 concurrent 35b agents to do the same research/planning/diagnosing task and then feeding it to 27b or 122b to judge/review/consolidate.
Also, I found MTP/Dflash greatly speed up high-frequency short-context tasks but hurt cache hits and actually slow down/hurt performance in high-context situations. I now run neither MTP nor Dflash, as it's not worth the increased latency from the lack of cache hits.
DehydratedWater_@reddit (OP)
Ok, I've checked the raw data from the benchmark for KV cache. All three were measured during the same "spinning up tasks" section, so the workload shouldn't explain the difference. Qwen3.5-27B sits around ~73% prefix hit rate while Qwen3.6-35B floats around ~89%. This gap may be explained by MTP blocks slowly mutating generated output, so that when that output is reintroduced, it no longer matches/expands the prefix cache.
For Qwen3.5-122B on this setup, it looks like there simply isn't enough memory to fit all the prefixes, so more and more blocks fall into the active requests' KV cache as concurrency grows. So at that scale, MTP seems to hurt me less than the lack of cache memory itself, at least on this workload at 4x3090 scale.
Makers7886@reddit
Agreed on your MTP-mutating-output hypothesis. Also, after reading the link/article you posted (awesome job), I think your harness is less prone to it than a typical coding harness. The gains are large/well worth it for most of my project endpoints, but it's painfully obvious in something like qwen-code or even hermes agent.
Loved the article and think many could benefit from learning about "min/maxing"
DehydratedWater_@reddit (OP)
Appreciate it, glad it landed well
Medium_Chemist_4032@reddit
Is there an improvement in fp8 over int4? I'm at 4x3090 and considering going 8x. I have been using this model in OpenCode on legacy code and it's been a huge token server for me
Makers7886@reddit
Yes, I see an improvement across the board when going to 8-bit or bf16 on the smaller models. It's hard to quantify, but the benches I'm running are meant to find capability gaps between the frontier api/397b tier and 122b/27b. Just as a rough idea, 397b scores 9.6 avg and 122b fp8 was around 9.3 avg, with int4 122b dropping to the 8.9-ish range.
Imo right now 4x3090s are the most powerful they have ever been, running 27b/new 35b at full weights + vllm throughput. If the q3.6 122b drops and we see a fraction of the gains that the old 35b to new 35b showed, then I think going for 8x3090s will be worth it.
sleepy_quant@reddit
Running a similar multi-agent setup on M1 Max 64GB with A3B Q8, and the retry-instead-of-pivot behavior you're describing is exactly what I've been seeing too. I assumed my allow-list was just too aggressive. Good to know it might be architectural. Curious on the prefix caching — with sessions diverging per agent, are you actually getting cache hits past the static system prompt/tool list, or is that where the benefit stops?
DehydratedWater_@reddit (OP)
Unfortunately it depends. I've posted more on caching in the other comment thread, but generally for dense Qwen27B and Qwen3.6-35B caching seems to work fine. For 122B there probably just isn't enough free VRAM to keep it.
https://www.reddit.com/r/LocalLLaMA/comments/1sqspgy/comment/ohagye9/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
ShaneBowen@reddit
Newbie question: is splitting a model across 4 GPUs something exclusive to CUDA? Is there a reason someone doesn't just wire up 4x RX 580s to get an effective 32GB card?
DehydratedWater_@reddit (OP)
Welll, CUDA just makes it easy. Technically you can split a model however you want, between VRAM, RAM, disk, or even across different machines connected over Ethernet, but the price you pay is ease of use and speed.
For AMD cards there's a CUDA alternative called ROCm, which is basically CUDA but less reliable and not as well supported. AMD is also less willing to support old hardware with new updates, so you can get stuck on an old ROCm version and be limited to a smaller subset of models and quantizations. For the RX 580 it seems AMD has dropped new ROCm support, so you're basically limited to what works now on ROCm or the Vulkan backend. You'd also be stuck with llama.cpp or Ollama in general. Also splitting a model across GPUs is not neutral, the memory does not just sum up, so realistically some of the memory would be lost just to distribute the model across the cards.
I have a mini PC with Strix Halo, and even a brand new AMD machine is a bit annoying to set up in a way that has good throughput and doesn't crash.
But if you have time, cheap electricity, a good deal on the cards, and you like to tinker, there may be a way to run it on such a stack. But in general I wouldn't recommend it.
Potential-Leg-639@reddit
Nice one, thanks! Great setup 👍🏻
Medium_Chemist_4032@reddit
Oh dang. That's a gold nugget for the nx3090 gang. Thanks!