Qwen 3.6-35B-A3B Apple Silicon benchmarks: speed parity with 3.5, NVFP4 crushes, qx86-hi Deckard tested on real code-gen

Posted by the_real_druide67@reddit | LocalLLaMA | View on Reddit | 0 comments

Qwen released Qwen3.6-35B-A3B last week (HF upload April 15, officially announced April 16, Apache 2.0, 35B total / 3B active, same qwen3_5_moe architecture as 3.5). I benchmarked 4 quantizations on a Mac Mini M4 Pro 64GB to answer: is it worth swapping from 3.5 in production?

[Image: tok/s across 4 Qwen 3.6 quantizations on Mac Mini M4 Pro 64GB. Measurements: asiai (https://asiai.dev, OSS Python CLI, stdlib-only core). Raw JSON on request.]

TL;DR:

Setup / methodology

Results — Qwen 3.6-35B-A3B: three quantizations tested (standard 256-token generation)

All three rows below are Qwen 3.6, different quantizations. The 3.5 baseline appears separately in the next table.

| Setup (all Qwen 3.6) | tok/s | TTFT | VRAM | RSS peak | Power | tok/s/W |
|---|---|---|---|---|---|---|
| Ollama qwen3.6:35b-a3b-nvfp4 | 68.5 | 0.09s | 19.6 GB | 20.5 GB | 14.2 W | 4.81 |
| LM Studio nightmedia/Qwen3.6-35B-A3B-qx86-hi-mlx | 55.5 | 0.33s | 36.8 GB | 38.2 GB | 13.0 W | 4.25 |
| Ollama qwen3.6:35b-a3b-mxfp8 | 50.6 | 0.14s | 34.2 GB | 35.2 GB | 13.9 W | 3.63 |

All stable (CV < 0.3%), thermal nominal throughout.

Results — 3.5 vs 3.6 parity (both on Ollama MLX NVFP4)

| Context | 3.5 tok/s | 3.6 tok/s | VRAM (both) | TTFT |
|---|---|---|---|---|
| Standard | 68.7 | 68.5 | 19.6 GB | 0.10s |
| 16k fill | 61.7 | 61.8 | 19.6 GB | 4.3s |
| 64k fill | 49.5 | 49.4 | 19.6 GB | 23.2s |

Perfect parity (<1% delta on all metrics). VRAM stays flat at 19.6 GB even at 64k context — the DeltaNet + MoE architecture really does avoid KV cache explosion. For comparison, Gemma 4 31B (dense) hits 46 GB RSS at similar prompts on the same machine.

The interesting finding — qx86-hi vs mxfp8

Both are ~37 GB on disk, both 8-bit on MLX. I expected Ollama's officially supported mxfp8 to beat LM Studio. Got the opposite:

nightmedia's Deckard (qx) schema preserves critical layers (attention, embeddings) at higher precision while keeping the bulk at 8-bit. The resulting non-uniform layout means less compute per token on average than a flat 8-bit quant, which likely explains the speed win. If you want 8-bit quality on M-series, skip Ollama's 8-bit tags and use nightmedia/Qwen3.6-35B-A3B-qx86-hi-mlx via LM Studio or mlx-lm.

Minor note: nightmedia's README for the 3.6 qx86-hi claims it's faster than 3.6 qx64-hi (+4% tok/s, 1474 vs 1414). They're measuring prompt-processing throughput, not generation. On my gen bench vs the 3.5 qx64-hi baseline (not 3.6, since no 3.6 qx64-hi exists yet), qx86-hi is -11% tok/s for +30% boolq (their self-reported quality metric).

Bonus — Real-world code quality (3.5 vs 3.6, n=10)

One concrete coding task, 10 runs per model, same prompt (temperature 0.2, think: false, num_predict=2048):

"Write a production-ready Python class for a thread-safe LRU cache. O(1) get/put (doubly-linked list + hash map, not OrderedDict shortcut). threading.Lock. 3 unittest tests. Code only, no explanations."

Automated checks (ast.parse + string/AST patterns on the extracted code block):

| Check | Qwen 3.5 coding-nvfp4 | Qwen 3.6 nvfp4 |
|---|---|---|
| Python syntax parses | 10/10 | 10/10 |
| threading.Lock used | 10/10 | 10/10 |
| Doubly-linked list (not just OrderedDict) | 10/10 | 10/10 |
| 3 test_* methods | 10/10 | 10/10 |
| import unittest present (required because code uses unittest.TestCase) | 1/10 | 10/10 |
| Actually runnable end-to-end (no missing imports) | 1/10 (10%) | 10/10 (100%) |
| Avg output lines | 141 | 201 (+43%) |
| Avg output tokens | 1188 | 1661 (+40%) |
| Avg wall time | 18.2s | 25.3s (+39%) |

Key finding: on this exact prompt, 3.5 coding silently drops the import unittest line on 9/10 runs. The code parses (test class just references unittest.TestCase by name, syntactically fine), so CI with python -m py_compile would miss it. The test file then crashes at runtime with NameError: name 'unittest' is not defined. 3.6 got the import right on every single run of a 10-run set.
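A check along these lines catches what py_compile misses — the code parses fine, but the name `unittest` is never bound. This is a sketch of the kind of AST pattern check described above, not my exact harness:

```python
import ast


def check_runnable(source: str) -> dict:
    """Sketch of a static hygiene check: 'parses' is not 'runnable'.
    Only detects a plain `import unittest`; a fuller harness would also
    handle `from unittest import ...` and aliased imports."""
    tree = ast.parse(source)  # raises SyntaxError if the code doesn't parse
    uses_unittest = any(
        isinstance(n, ast.Attribute)
        and isinstance(n.value, ast.Name)
        and n.value.id == "unittest"          # e.g. unittest.TestCase
        for n in ast.walk(tree)
    )
    imports_unittest = any(
        isinstance(n, ast.Import)
        and any(alias.name == "unittest" for alias in n.names)
        for n in ast.walk(tree)
    )
    return {
        "uses": uses_unittest,
        "imports": imports_unittest,
        # Runnable w.r.t. this check: either the import is there,
        # or the name is never referenced at all.
        "runnable": imports_unittest or not uses_unittest,
    }
```

Running this over each extracted code block is what separates the 1/10 from the 10/10 rows above, without ever executing model output.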

I also ran 3.6 with num_predict=4096 (5 additional runs) to check the earlier "EOS mid-comment" failure mode — 5/5 still 100% runnable, avg 1688 tokens actually used. The earlier truncation on an n=5 pilot was an outlier, not a systemic issue at 2048.

Trade-off: 3.6 outputs +40% more code / tokens / wall time, mostly due to docstrings (every method has Args/Returns documented) and extra helper methods (_move_to_head, _pop_tail). If you prefer terse code, this will feel verbose. If you want runnable code by default, it's an improvement.

Caveat: single prompt, single temperature, Ollama NVFP4 only. n=10 gives a strong signal (10% vs 100% executability), but one coding task doesn't generalize.

Bonus — Second code test: token bucket rate limiter with traps (n=10)

Second coding prompt, designed with 4 classic traps:

"Implement TokenBucketRateLimiter(rate, capacity) with consume(n) -> bool. Thread-safe. Continuous refill. Monotonic clock. 3 unittest tests covering burst, refill, and concurrent access. Code only."

Traps checked automatically:

Results (10 runs per model, num_predict=2048, think: false):

| Check | Qwen 3.5 coding | Qwen 3.6 |
|---|---|---|
| Syntax parses | 10/10 | 10/10 |
| consume() returns bool | 10/10 | 10/10 |
| Uses threading.Lock | 10/10 | 10/10 |
| Uses time.monotonic (not time.time) | 10/10 | 10/10 |
| tokens as float accumulator | 10/10 | 10/10 |
| No time.sleep inside lock block | 10/10 | 10/10 |
| ≥3 test_* methods | 10/10 | 10/10 |
| Runnable (import unittest ok) | 10/10 | 10/10 |
| Avg output tokens | 663 | 1045 (+58%) |
| Avg wall time | 10.6s | 15.8s (+49%) |

Result: both models ace this test — every single run avoids every trap. Qwen 3.5 coding is clearly trained on classic concurrency patterns; Qwen 3.6 base inherits the same knowledge. This is good news for infrastructure code.

Combined interpretation of the two coding tests:

Bonus — Third code test: deep merge with circular references (n=10)

Third prompt, specifically designed around less-classic traps:

"Write deep_merge(a, b) merging two nested dicts. Dicts → recurse. Lists → concatenate. Otherwise → b wins. Handle circular references without infinite recursion. Must NOT mutate inputs. Include unittest with ≥4 test methods."

Dynamic checks (import the generated module in a subprocess with SIGALRM timeout, run 4 scenarios):

| Scenario | Qwen 3.5 coding | Qwen 3.6 |
|---|---|---|
| Simple nested merge works | 4/10 | 7/10 |
| List concatenation on conflict | 4/10 | 6/10 |
| Circular reference handled (no infinite recursion) | 3/10 | 7/10 |
| Inputs not mutated | 4/10 | 7/10 |
| deep_merge function extractable (syntax OK + function present) | 4/10 | 8/10 |
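The subprocess-with-SIGALRM pattern the dynamic checks rely on can be sketched like this (the exit-code convention and the 2-second cap are my own choices, not the exact harness; Unix-only, which is fine for Apple Silicon):

```python
import subprocess
import sys

# Child-process template: SIGALRM kills a scenario that hangs (e.g. an
# unguarded loop on a circular structure) without taking down the harness.
# Exit code 3 is an arbitrary marker meaning "timed out".
RUNNER = r"""
import signal, sys

def _timed_out(signum, frame):
    sys.exit(3)

signal.signal(signal.SIGALRM, _timed_out)
signal.alarm(2)  # hard per-scenario cap, in seconds

# Execute the generated file in a fresh module-style namespace.
exec(open(sys.argv[1]).read(), {"__name__": "generated"})
"""


def run_scenario(path: str) -> str:
    """Run one generated file in a subprocess; classify the outcome."""
    proc = subprocess.run(
        [sys.executable, "-c", RUNNER, path],
        capture_output=True,
        timeout=10,  # outer belt-and-braces timeout
    )
    if proc.returncode == 0:
        return "pass"
    if proc.returncode == 3:
        return "timeout"
    return "error"  # tracebacks, RecursionError, missing imports, etc.
```

Note that an unguarded circular-reference merge usually dies fast with RecursionError (classified "error"), while a genuine infinite loop is what the alarm catches; either way the run counts as a failure.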

Additional observation: 6/10 of Qwen 3.5's outputs have unclosed markdown code blocks (```python on line 1, never closed). 3.6 only left 1/10 unclosed. This is the third hygiene axis after import unittest on the LRU test — 3.5 `coding` has a consistent tendency toward "plan out loud, don't wrap properly". A couple of 3.5's outputs were literally just commentary thinking about the problem ("Wait, what if 'b' has a reference to 'a'... let me try another approach...") with no actual function definition, hitting the token budget mid-reasoning.

Qwen 3.6 handled the circular-reference trap correctly in 7/10 runs using seen/memo sets with id() tracking, or copy.deepcopy which has that protection built in. 3.5 only managed 3/10.

Summary of the 3 coding tests:

Does LMS qx86-hi do better than Ollama NVFP4 on quality? (spoiler: no, and here's why)

Fair question since the whole point of the qx86-hi rec was "quality ceiling". Ran the same 3 coding prompts on qwen3.6-35b-a3b-qx86-hi-mlx via LM Studio 0.4.9, 3 runs each, max_tokens=4096, temperature=0.2:

| Prompt | Ollama NVFP4 (n=10) | LMS qx86-hi (n=3) |
|---|---|---|
| LRU — runnable | 10/10 | 3/3 |
| LRU — import unittest present | 10/10 | 2/3 |
| Rate limiter — all traps avoided | 10/10 | 1/3 |
| Rate limiter — time.monotonic | 10/10 | 3/3 |
| Deep merge — content field non-empty | 10/10 | 0/3 |

The deep merge result is the key tell: all 3 runs returned empty message.content. The actual response was in message.reasoning_content — LMS splits thinking output into a separate field when a template has preserve_thinking=true hardcoded (the qx86-hi config.json has this baked in). With the deep merge prompt (the most complex one), 4096 tokens gets entirely consumed by reasoning before any code emits, and standard OpenAI-compat clients just see an empty response.
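A client-side workaround is to fall back to the reasoning field whenever content comes back empty. Field names follow the LMS response shape described above, so treat this as backend-specific:

```python
def extract_answer(message: dict) -> str:
    """Return usable text from an OpenAI-compat chat message.

    Some backends (here: LM Studio with a preserve_thinking template) put
    the entire response in a separate reasoning field and leave content
    empty; vanilla clients then see "" and silently fail."""
    content = message.get("content") or ""
    if content.strip():
        return content
    # Fallback: at least surface whatever landed in the reasoning field.
    return message.get("reasoning_content") or ""
```

This doesn't fix the underlying problem (the token budget still gets eaten by reasoning), but it stops the "empty response" failure mode from propagating into downstream tooling.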

Rate limiter at qx86-hi scored 1/3 "all traps avoided" vs NVFP4's 10/10 — the failing runs forgot the float accumulator (used integer tokens, which drifts at low rates). Possibly a sampling variance issue at n=3, but even giving it the benefit of the doubt it's not better than NVFP4 here.

Also one HTTP 400 crash on LMS during these 9 runs — consistent with the concurrency instability I saw earlier.

Conclusion on the "quality ceiling" claim: on my 3 coding prompts, qx86-hi doesn't outperform NVFP4 in practice. The Deckard 8-bit schema may still pay off on benchmarks that nightmedia measures (+30% boolq, logic), but for Python code-gen under a real API, the preserve_thinking baked into qx86-hi makes it operationally awkward. Either set max_tokens aggressively (6000+), disable thinking via chat_template_kwargs (LMS 0.4.9 often returns HTTP 400 on that — didn't find a working config), or pick a quant without hardcoded thinking.

Caveat: n=3 is small, LMS-specific. Would be very interested in concurrent data from anyone running qx86-hi through mlx-lm directly (no LMS layer).

Bonus — Concurrency (1x / 2x / 4x parallel on single model instance)

Tested with a ~50-line stdlib Python script (ThreadPoolExecutor + urllib, one HTTP request per thread, same prompt, same model instance). Key params: num_predict=256, stream=false, 5s cooldown between sequential batches. The full script plus the 3 code-quality harnesses (LRU, rate limiter, and deep merge with subprocess + SIGALRM timeout for circular-ref safety) are ~500 LOC total, pure stdlib — happy to drop them in comments or DM.
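A stripped-down version of that probe looks roughly like this. Endpoint and payload follow Ollama's /api/generate API; the request function is injectable so the timing logic can be exercised without a live server:

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:11434/api/generate"


def one_request(prompt: str) -> dict:
    """One non-streaming Ollama generate call, matching the bench settings."""
    body = json.dumps({
        "model": "qwen3.6:35b-a3b-nvfp4",
        "prompt": prompt,
        "stream": False,
        "think": False,
        "options": {"num_predict": 256},
    }).encode()
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())


def run_batch(n: int, prompt: str, request_fn=one_request) -> dict:
    """Fire n identical requests in parallel; report wall time + aggregate tok/s."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(request_fn, [prompt] * n))
    wall = time.monotonic() - start
    tokens = sum(r.get("eval_count", 0) for r in results)
    return {
        "n": n,
        "wall_s": round(wall, 2),
        "aggregate_tok_s": round(tokens / wall, 1),
    }
```

If the engine truly parallelized, `wall_s` at n=4 would stay near the n=1 value; the ≈ 4× scaling in the table below is the signature of a serialized queue.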

| n | Ollama NVFP4 wall_total | Ollama aggregate tok/s | LMS qx86-hi wall_total | LMS aggregate tok/s |
|---|---|---|---|---|
| 1 | 3.8s | 67 | 4.6s | 55 |
| 2 | 7.7s (≈ 2× 1x) | 66 | 9.7s, 1/2 failed | 26 (partial) |
| 4 | 15.3s (≈ 4× 1x) | 67 | 18.7s, 1/4 failed / full batch crash | 40 (partial) |

Ollama: fully serializes requests. Aggregate tok/s stays at ~67 regardless of N — adding parallel clients just queues them. No errors.

LM Studio 0.4.9: tries to parallelize but fails under load — HTTP 400/500 errors at N≥2 and full crashes at N=4. When requests succeed, throughput is also serialized. Whether this is a known MLX backend limitation in LMS 0.4.9 or a queueing bug, I can't tell — would love confirmation from others.

Takeaway for multi-agent / swarm users: on a single instance of a single model, neither engine gives real parallelism on Apple Silicon. If your swarm orchestrator expects N subagents to run simultaneously, either (a) run multiple instances on different ports (N × 20 GB VRAM for NVFP4 — a 64 GB Mac can only host ~3), or (b) accept queue wait times.

Bonus — Tool call stability (Merlin-style delegation, 12 runs)

4 orchestrator-style tool-use prompts, 3 runs each:

| Prompt type | Ollama NVFP4 | LMS qx86-hi |
|---|---|---|
| Simple delegation | 3/3 pass | 3/3 pass |
| Multi-step (create_milestone + spawn ×2) | 0/3 fail (finish=length) | 0/3 fail (finish=length) |
| Web search delegation | 3/3 pass | 3/3 pass |
| Routing trap (project-owner vs domain) | 3/3 pass | 3/3 pass |
| Total | 9/12 (75%) | 9/12 (75%) |

Identical on both quantizations — qx86-hi gave no measurable tool-call advantage here. The multi-step prompt fails systematically: 3.6 spends its 512-token budget explaining the plan in natural language before emitting the first tool_call. Bumping max_tokens to 1024+ likely fixes it. This is new behavior vs 3.5, which called tools more eagerly.

Routing correctness: the "Leantime ticket #42: add a temp sensor to the bedroom" prompt was correctly routed to the domotic-agent (domain-based), not the Leantime-project-agent (ownership-based), 3/3 on both setups.

A note on Qwen 3.6's thinking mode

Qwen 3.6's chat template has thinking enabled by default. In practice on Ollama 0.20.2:

Thinking Preservation (the big new 3.6 feature — <think> content from turn N carried into turn N+1 context): I tried a multi-turn test but turn 2 came back empty on LMS. Needs a proper protocol. If anyone has a working multi-turn template-kwargs config for LMS 0.4.9, I'd love to see it.

For smaller Macs (16-32 GB)

I didn't benchmark 2-bit or Q3 variants on this machine, but Unsloth published GGUF quants including Q2_K_XL (~12 GB) and Q4_K_M (~21 GB). A 16 GB M1/M2 Air should handle Q2_K_XL, and 24 GB can fit Q4_K_M. Since the architecture is MoE with 3B active params, quality degradation from aggressive quantization should be smaller than for dense models at equivalent bits. Would like to see numbers from Air users.

Caveats

Takeaways

  1. Swap 3.5 → 3.6 on the same quantization is safe (performance parity confirmed on 256/16k/64k prompts).
  2. Daily driver / speed: Ollama qwen3.6:35b-a3b-nvfp4. 20 GB RSS, 68 tok/s, TTFT 90 ms, simple.
  3. Quality ceiling (theoretical): nightmedia/Qwen3.6-35B-A3B-qx86-hi-mlx. The 8-bit Deckard schema claims +30% boolq (nightmedia self-reported). In my tests (code-gen + tool-call on LMS 0.4.9), it did NOT outperform NVFP4 — hardcoded preserve_thinking in the template causes empty content responses on complex prompts unless you bump max_tokens to 6000+. Use at your own risk until mlx-lm native becomes the primary path.
  4. Skip mxfp8 — no win over qx86-hi on Apple Silicon.
  5. Multi-agent / swarm: one model instance = serialization on both engines. Plan for multiple instances.
  6. Tool calling with complex orchestration: bump max_tokens to 1024+ — 3.6 plans aloud before calling.
  7. Thinking mode eats response budget if enabled — pass think: false in Ollama for plain code-gen use.
  8. Give 3.6 more tokens for code: 2048 is enough on simple prompts, but bump to 3072-4096 if you have a complex multi-step task or you're running with think: true.
  9. 3.5 coding remains competitive on well-known algorithmic traps (rate limiter: parity 10/10). On less-classic traps (deep merge with circular refs), 3.6 is 1.75× more reliable. The gain from 3.6 is on hygiene + robustness on non-standard problems, not raw algorithm knowledge on canonical ones.

Full JSON exports, raw code samples, and the concurrency/tool-call scripts: happy to share on request. Would especially love concurrent-load data from someone who can run multiple Ollama instances on a 128 GB+ Mac.