Qwen 3.6-35B-A3B Apple Silicon benchmarks: speed parity with 3.5, NVFP4 crushes, qx86-hi Deckard tested on real code-gen

Posted by the_real_druide67@reddit | LocalLLaMA | View on Reddit | 0 comments

Qwen released Qwen3.6-35B-A3B last week (HF upload April 15, officially announced April 16, Apache 2.0, 35B total / 3B active, same qwen3_5_moe architecture as 3.5). I benchmarked 4 quantizations on a Mac Mini M4 Pro 64GB to answer: is it worth swapping from 3.5 in production?

[Image: tok/s across 4 Qwen 3.6 quantizations on Mac Mini M4 Pro 64GB. Measurements: asiai (https://asiai.dev, OSS Python CLI, stdlib-only core). Raw JSON on request.]

TL;DR:

Setup / methodology

Results — Qwen 3.6-35B-A3B: three quantizations tested (standard 256-token generation)

All three rows below are Qwen 3.6, different quantizations. The 3.5 baseline appears separately in the next table.

| Setup (all Qwen 3.6) | tok/s | TTFT | VRAM | RSS peak | Power | tok/s/W |
|---|---|---|---|---|---|---|
| Ollama qwen3.6:35b-a3b-nvfp4 | 68.5 | 0.09s | 19.6 GB | 20.5 GB | 14.2 W | 4.81 |
| LM Studio nightmedia/Qwen3.6-35B-A3B-qx86-hi-mlx | 55.5 | 0.33s | 36.8 GB | 38.2 GB | 13.0 W | 4.25 |
| Ollama qwen3.6:35b-a3b-mxfp8 | 50.6 | 0.14s | 34.2 GB | 35.2 GB | 13.9 W | 3.63 |

All stable (CV < 0.3%), thermal nominal throughout.

Results — 3.5 vs 3.6 parity (both on Ollama MLX NVFP4)

| Context | 3.5 tok/s | 3.6 tok/s | VRAM (both) | TTFT |
|---|---|---|---|---|
| Standard | 68.7 | 68.5 | 19.6 GB | 0.10s |
| 16k fill | 61.7 | 61.8 | 19.6 GB | 4.3s |
| 64k fill | 49.5 | 49.4 | 19.6 GB | 23.2s |

Perfect parity (<1% delta on all metrics). VRAM stays flat at 19.6 GB even at 64k context — the DeltaNet + MoE architecture really does avoid KV cache explosion. For comparison, Gemma 4 31B (dense) hits 46 GB RSS at similar prompts on the same machine.

The interesting finding — qx86-hi vs mxfp8

Both are ~37 GB on disk, both 8-bit on MLX. I expected Ollama's officially supported mxfp8 to beat LM Studio. Got the opposite:

nightmedia's Deckard (qx) schema preserves critical layers (attention, embeddings) at higher precision while keeping the bulk at 8-bit. The resulting non-uniform layout means less compute per token on average than a flat 8-bit quant, which likely explains the speed win. If you want 8-bit quality on M-series, skip Ollama's 8-bit tags and use nightmedia/Qwen3.6-35B-A3B-qx86-hi-mlx via LM Studio or mlx-lm.

Minor note: nightmedia's README for the 3.6 qx86-hi claims it's faster than 3.6 qx64-hi (+4% tok/s, 1474 vs 1414). They're measuring prompt-processing throughput, not generation. On my gen bench vs the 3.5 qx64-hi baseline (not 3.6, since no 3.6 qx64-hi exists yet), qx86-hi is -11% tok/s for +30% boolq (their self-reported quality metric).

Bonus — Real-world code quality (3.5 vs 3.6, n=10)

One concrete coding task, 10 runs per model, same prompt (temperature 0.2, think: false, num_predict=2048):

"Write a production-ready Python class for a thread-safe LRU cache. O(1) get/put (doubly-linked list + hash map, not OrderedDict shortcut). threading.Lock. 3 unittest tests. Code only, no explanations."

Automated checks (ast.parse + string/AST patterns on the extracted code block):

| Check | Qwen 3.5 coding-nvfp4 | Qwen 3.6 nvfp4 |
|---|---|---|
| Python syntax parses | 10/10 | 10/10 |
| threading.Lock used | 10/10 | 10/10 |
| Doubly-linked list (not just OrderedDict) | 10/10 | 10/10 |
| 3 test_* methods | 10/10 | 10/10 |
| import unittest present (required because code uses unittest.TestCase) | 1/10 | 10/10 |
| Actually runnable end-to-end (no missing imports) | 1/10 (10%) | 10/10 (100%) |
| Avg output lines | 141 | 201 (+43%) |
| Avg output tokens | 1188 | 1661 (+40%) |
| Avg wall time | 18.2s | 25.3s (+39%) |

Key finding: on this exact prompt, 3.5 coding silently drops the import unittest line on 9/10 runs. The code parses (test class just references unittest.TestCase by name, syntactically fine), so CI with python -m py_compile would miss it. The test file then crashes at runtime with NameError: name 'unittest' is not defined. 3.6 got the import right on every single run of a 10-run set.
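A check along these lines catches what py_compile misses — the code parses fine, but the name `unittest` is never bound. This is a sketch of the kind of AST pattern check described above, not my exact harness:

```python
import ast


def check_runnable(source: str) -> dict:
    """Sketch of a static hygiene check: 'parses' is not 'runnable'.
    Only detects a plain `import unittest`; a fuller harness would also
    handle `from unittest import ...` and aliased imports."""
    tree = ast.parse(source)  # raises SyntaxError if the code doesn't parse
    uses_unittest = any(
        isinstance(n, ast.Attribute)
        and isinstance(n.value, ast.Name)
        and n.value.id == "unittest"          # e.g. unittest.TestCase
        for n in ast.walk(tree)
    )
    imports_unittest = any(
        isinstance(n, ast.Import)
        and any(alias.name == "unittest" for alias in n.names)
        for n in ast.walk(tree)
    )
    return {
        "uses": uses_unittest,
        "imports": imports_unittest,
        # Runnable w.r.t. this check: either the import is there,
        # or the name is never referenced at all.
        "runnable": imports_unittest or not uses_unittest,
    }
```

Running this over each extracted code block is what separates the 1/10 from the 10/10 rows above, without ever executing model output.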

I also ran 3.6 with num_predict=4096 (5 additional runs) to check the earlier "EOS mid-comment" failure mode — 5/5 still 100% runnable, avg 1688 tokens actually used. The earlier truncation on an n=5 pilot was an outlier, not a systemic issue at 2048.

Trade-off: 3.6 outputs +40% more code / tokens / wall time, mostly due to docstrings (every method has Args/Returns documented) and extra helper methods (_move_to_head, _pop_tail). If you prefer terse code, this will feel verbose. If you want runnable code by default, it's an improvement.

Caveat: single prompt, single temperature, Ollama NVFP4 only. n=10 gives a strong signal (10% vs 100% executability), but one coding task doesn't generalize.

Bonus — Second code test: token bucket rate limiter with traps (n=10)

Second coding prompt, designed with 4 classic traps:

"Implement TokenBucketRateLimiter(rate, capacity) with consume(n) -> bool. Thread-safe. Continuous refill. Monotonic clock. 3 unittest tests covering burst, refill, and concurrent access. Code only."

Traps checked automatically:

Results (10 runs per model, num_predict=2048, think: false):

| Check | Qwen 3.5 coding | Qwen 3.6 |
|---|---|---|
| Syntax parses | 10/10 | 10/10 |
| consume() returns bool | 10/10 | 10/10 |
| Uses threading.Lock | 10/10 | 10/10 |
| Uses time.monotonic (not time.time) | 10/10 | 10/10 |
| tokens as float accumulator | 10/10 | 10/10 |
| No time.sleep inside lock block | 10/10 | 10/10 |
| ≥3 test_* methods | 10/10 | 10/10 |
| Runnable (import unittest ok) | 10/10 | 10/10 |
| Avg output tokens | 663 | 1045 (+58%) |
| Avg wall time | 10.6s | 15.8s (+49%) |

Result: both models ace this test — every single run avoids every trap. Qwen 3.5 coding is clearly trained on classic concurrency patterns; Qwen 3.6 base inherits the same knowledge. This is good news for infrastructure code.

Combined interpretation of the two coding tests:

Bonus — Third code test: deep merge with circular references (n=10)

Third prompt, specifically designed around less-classic traps:

"Write deep_merge(a, b) merging two nested dicts. Dicts → recurse. Lists → concatenate. Otherwise → b wins. Handle circular references without infinite recursion. Must NOT mutate inputs. Include unittest with ≥4 test methods."

Dynamic checks (import the generated module in a subprocess with SIGALRM timeout, run 4 scenarios):

| Scenario | Qwen 3.5 coding | Qwen 3.6 |
|---|---|---|
| Simple nested merge works | 4/10 | 7/10 |
| List concatenation on conflict | 4/10 | 6/10 |
| Circular reference handled (no infinite recursion) | 3/10 | 7/10 |
| Inputs not mutated | 4/10 | 7/10 |
| deep_merge function extractable (syntax OK + function present) | 4/10 | 8/10 |
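The subprocess-with-SIGALRM pattern the dynamic checks rely on can be sketched like this (the exit-code convention and the 2-second cap are my own choices, not the exact harness; Unix-only, which is fine for Apple Silicon):

```python
import subprocess
import sys

# Child-process template: SIGALRM kills a scenario that hangs (e.g. an
# unguarded loop on a circular structure) without taking down the harness.
# Exit code 3 is an arbitrary marker meaning "timed out".
RUNNER = r"""
import signal, sys

def _timed_out(signum, frame):
    sys.exit(3)

signal.signal(signal.SIGALRM, _timed_out)
signal.alarm(2)  # hard per-scenario cap, in seconds

# Execute the generated file in a fresh module-style namespace.
exec(open(sys.argv[1]).read(), {"__name__": "generated"})
"""


def run_scenario(path: str) -> str:
    """Run one generated file in a subprocess; classify the outcome."""
    proc = subprocess.run(
        [sys.executable, "-c", RUNNER, path],
        capture_output=True,
        timeout=10,  # outer belt-and-braces timeout
    )
    if proc.returncode == 0:
        return "pass"
    if proc.returncode == 3:
        return "timeout"
    return "error"  # tracebacks, RecursionError, missing imports, etc.
```

Note that an unguarded circular-reference merge usually dies fast with RecursionError (classified "error"), while a genuine infinite loop is what the alarm catches; either way the run counts as a failure.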

Additional observation: 6/10 of Qwen 3.5's outputs have unclosed markdown code blocks (```python on line 1, never closed). 3.6 only left 1/10 unclosed. This is the third hygiene axis after import unittest on the LRU test — 3.5 `coding` has a consistent tendency toward "plan out loud, don't wrap properly". A couple of 3.5's outputs were literally just commentary thinking about the problem ("Wait, what if 'b' has a reference to 'a'... let me try another approach...") with no actual function definition, hitting the token budget mid-reasoning.

Qwen 3.6 handled the circular-reference trap correctly in 7/10 runs using seen/memo sets with id() tracking, or copy.deepcopy which has that protection built in. 3.5 only managed 3/10.

Summary of the 3 coding tests:

Does LMS qx86-hi do better than Ollama NVFP4 on quality? (spoiler: no, and here's why)

Fair question since the whole point of the qx86-hi rec was "quality ceiling". Ran the same 3 coding prompts on qwen3.6-35b-a3b-qx86-hi-mlx via LM Studio 0.4.9, 3 runs each, max_tokens=4096, temperature=0.2:

| Prompt | Ollama NVFP4 (n=10) | LMS qx86-hi (n=3) |
|---|---|---|
| LRU — runnable | 10/10 | 3/3 |
| LRU — import unittest present | 10/10 | 2/3 |
| Rate limiter — all traps avoided | 10/10 | 1/3 |
| Rate limiter — time.monotonic | 10/10 | 3/3 |
| Deep merge — content field non-empty | 10/10 | 0/3 |

The deep merge result is the key tell: all 3 runs returned empty message.content. The actual response was in message.reasoning_content — LMS splits thinking output into a separate field when a template has preserve_thinking=true hardcoded (the qx86-hi config.json has this baked in). With the deep merge prompt (the most complex one), 4096 tokens gets entirely consumed by reasoning before any code emits, and standard OpenAI-compat clients just see an empty response.
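A client-side workaround is to fall back to the reasoning field whenever content comes back empty. Field names follow the LMS response shape described above, so treat this as backend-specific:

```python
def extract_answer(message: dict) -> str:
    """Return usable text from an OpenAI-compat chat message.

    Some backends (here: LM Studio with a preserve_thinking template) put
    the entire response in a separate reasoning field and leave content
    empty; vanilla clients then see "" and silently fail."""
    content = message.get("content") or ""
    if content.strip():
        return content
    # Fallback: at least surface whatever landed in the reasoning field.
    return message.get("reasoning_content") or ""
```

This doesn't fix the underlying problem (the token budget still gets eaten by reasoning), but it stops the "empty response" failure mode from propagating into downstream tooling.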

Rate limiter at qx86-hi scored 1/3 "all traps avoided" vs NVFP4's 10/10 — the failing runs forgot the float accumulator (used integer tokens, which drifts at low rates). Possibly a sampling variance issue at n=3, but even giving it the benefit of the doubt it's not better than NVFP4 here.

Also one HTTP 400 crash on LMS during these 9 runs — consistent with the concurrency instability I saw earlier.

Conclusion on the "quality ceiling" claim: on my 3 coding prompts, qx86-hi doesn't outperform NVFP4 in practice. The Deckard 8-bit schema may still pay off on benchmarks that nightmedia measures (+30% boolq, logic), but for Python code-gen under a real API, the preserve_thinking baked into qx86-hi makes it operationally awkward. Either set max_tokens aggressively (6000+), disable thinking via chat_template_kwargs (LMS 0.4.9 often returns HTTP 400 on that — didn't find a working config), or pick a quant without hardcoded thinking.

Caveat: n=3 is small, LMS-specific. Would be very interested in concurrent data from anyone running qx86-hi through mlx-lm directly (no LMS layer).

Bonus — Concurrency (1x / 2x / 4x parallel on single model instance)

Tested with a ~50-line stdlib Python script (ThreadPoolExecutor + urllib, one HTTP request per thread, same prompt, same model instance). Key params: num_predict=256, stream=false, 5s cooldown between sequential batches. The full script plus the 3 code-quality harnesses (LRU, rate limiter, and deep merge with subprocess + SIGALRM timeout for circular-ref safety) are ~500 LOC total, pure stdlib — happy to drop them in comments or DM.
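A stripped-down version of that probe looks roughly like this. Endpoint and payload follow Ollama's /api/generate API; the request function is injectable so the timing logic can be exercised without a live server:

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:11434/api/generate"


def one_request(prompt: str) -> dict:
    """One non-streaming Ollama generate call, matching the bench settings."""
    body = json.dumps({
        "model": "qwen3.6:35b-a3b-nvfp4",
        "prompt": prompt,
        "stream": False,
        "think": False,
        "options": {"num_predict": 256},
    }).encode()
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())


def run_batch(n: int, prompt: str, request_fn=one_request) -> dict:
    """Fire n identical requests in parallel; report wall time + aggregate tok/s."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(request_fn, [prompt] * n))
    wall = time.monotonic() - start
    tokens = sum(r.get("eval_count", 0) for r in results)
    return {
        "n": n,
        "wall_s": round(wall, 2),
        "aggregate_tok_s": round(tokens / wall, 1),
    }
```

If the engine truly parallelized, `wall_s` at n=4 would stay near the n=1 value; the ≈ 4× scaling in the table below is the signature of a serialized queue.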

| n | Ollama NVFP4 wall_total | Ollama aggregate tok/s | LMS qx86-hi wall_total | LMS aggregate tok/s |
|---|---|---|---|---|
| 1 | 3.8s | 67 | 4.6s | 55 |
| 2 | 7.7s (≈ 2× 1x) | 66 | 9.7s, 1/2 failed | 26 (partial) |
| 4 | 15.3s (≈ 4× 1x) | 67 | 18.7s, 1/4 failed / full batch crash | 40 (partial) |

Ollama: fully serializes requests. Aggregate tok/s stays at ~67 regardless of N — adding parallel clients just queues them. No errors.

LM Studio 0.4.9: tries to parallelize but fails under load — HTTP 400/500 errors at N≥2 and full crashes at N=4. When requests succeed, throughput is also serialized. Whether this is a known MLX backend limitation in LMS 0.4.9 or a queueing bug, I can't tell — would love confirmation from others.

Takeaway for multi-agent / swarm users: on a single instance of a single model, neither engine gives real parallelism on Apple Silicon. If your swarm orchestrator expects N subagents to run simultaneously, either (a) run multiple instances on different ports (N × 20 GB VRAM for NVFP4 — a 64 GB Mac can only host ~3), or (b) accept queue wait times.

Bonus — Tool call stability (Merlin-style delegation, 12 runs)

4 orchestrator-style tool-use prompts, 3 runs each:

| Prompt type | Ollama NVFP4 | LMS qx86-hi |
|---|---|---|
| Simple delegation | 3/3 pass | 3/3 pass |
| Multi-step (create_milestone + spawn ×2) | 0/3 fail (finish=length) | 0/3 fail (finish=length) |
| Web search delegation | 3/3 pass | 3/3 pass |
| Routing trap (project-owner vs domain) | 3/3 pass | 3/3 pass |
| Total | 9/12 (75%) | 9/12 (75%) |

Identical on both quantizations — qx86-hi gave no measurable tool-call advantage here. The multi-step prompt fails systematically: 3.6 spends its 512-token budget explaining the plan in natural language before emitting the first tool_call. Bumping max_tokens to 1024+ likely fixes it. This is new behavior vs 3.5, which called tools more eagerly.

Routing correctness: the "Leantime ticket #42: add a temp sensor to the bedroom" prompt was correctly routed to the domotic-agent (domain-based), not the Leantime-project-agent (ownership-based), 3/3 on both setups.

A note on Qwen 3.6's thinking mode

Qwen 3.6's chat template has thinking enabled by default. In practice on Ollama 0.20.2:

Thinking Preservation (the big new 3.6 feature — <think> content from turn N carried into turn N+1 context): I tried a multi-turn test but turn 2 came back empty on LMS. Needs a proper protocol. If anyone has a working multi-turn template-kwargs config for LMS 0.4.9, I'd love to see it.

For smaller Macs (16-32 GB)

I didn't benchmark 2-bit or Q3 variants on this machine, but Unsloth published GGUF quants including Q2_K_XL (~12 GB) and Q4_K_M (~21 GB). A 16 GB M1/M2 Air should handle Q2_K_XL, and 24 GB can fit Q4_K_M. Since the architecture is MoE with 3B active params, quality degradation from aggressive quantization should be smaller than for dense models at equivalent bits. Would like to see numbers from Air users.

Caveats

Takeaways

  1. Swap 3.5 → 3.6 on the same quantization is safe (performance parity confirmed on 256/16k/64k prompts).
  2. Daily driver / speed: Ollama qwen3.6:35b-a3b-nvfp4. 20 GB RSS, 68 tok/s, TTFT 90 ms, simple.
  3. Quality ceiling (theoretical): nightmedia/Qwen3.6-35B-A3B-qx86-hi-mlx. The 8-bit Deckard schema claims +30% boolq (nightmedia self-reported). In my tests (code-gen + tool-call on LMS 0.4.9), it did NOT outperform NVFP4 — hardcoded preserve_thinking in the template causes empty content responses on complex prompts unless you bump max_tokens to 6000+. Use at your own risk until mlx-lm native becomes the primary path.
  4. Skip mxfp8 — no win over qx86-hi on Apple Silicon.
  5. Multi-agent / swarm: one model instance = serialization on both engines. Plan for multiple instances.
  6. Tool calling with complex orchestration: bump max_tokens to 1024+ — 3.6 plans aloud before calling.
  7. Thinking mode eats response budget if enabled — pass think: false in Ollama for plain code-gen use.
  8. Give 3.6 more tokens for code: 2048 is enough on simple prompts, but bump to 3072-4096 if you have a complex multi-step task or you're running with think: true.
  9. 3.5 coding remains competitive on well-known algorithmic traps (rate limiter: parity 10/10). On less-classic traps (deep merge with circular refs), 3.6 is 1.75× more reliable. The gain from 3.6 is on hygiene + robustness on non-standard problems, not raw algorithm knowledge on canonical ones.

Full JSON exports, raw code samples, and the concurrency/tool-call scripts: happy to share on request. Would especially love concurrent-load data from someone who can run multiple Ollama instances on a 128 GB+ Mac.