Qwen 3.5 35B on LocalAI (Strix Halo): Vulkan / ROCm
Posted by pipould@reddit | LocalLLaMA | 2 comments
Hey everyone! 👋
Just finished running a bunch of benchmarks on the new Qwen 3.5 35B models using LocalAI and figured I'd share the results. I was curious how Vulkan and ROCm backends stack up against each other for these two different quant/source variants.
Two model variants, each on both Vulkan and ROCm:
| Model | Type | Quant | Source |
|---|---|---|---|
| mudler/Qwen3.5-35B-A3B-APEX-GGUF:Qwen3.5-35B-A3B-APEX-I-Quality.gguf | MoE (3B active) | APEX | mudler |
| unsloth/Qwen3.5-35B-A3B-GGUF:Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf | MoE (3B active) | GGUF | unsloth |
Tool: llama-benchy (via uvx), with prefix caching enabled, generation latency mode, adaptive prompts.
Context depths tested: 0, 4K, 8K, 16K, 32K, 65K, 100K, and up to 200K tokens.
System Environment
Lemonade Version: 10.1.0
OS: Linux-6.19.10-061910-generic (Ubuntu 25.10)
CPU: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
Shared GPU memory: 118.1 GB
TDP: 85W
Vulkan build: b8681
ROCm build: b1232
CPU build: b8681
The results
1. Qwen3.5-35B-A3B-APEX-I-Quality (mudler)
(See charts 1 & 2)
On token generation, Vulkan is the clear winner here, consistently outperforming ROCm. At zero context, Vulkan hits ~57.5 t/s compared to ROCm's ~50.0 t/s. As context grows to 100K, Vulkan maintains a healthy ~38.6 t/s while ROCm drops to ~35.7 t/s.
Prompt processing is where ROCm shows its strength. At 4K context, ROCm hits ~885 t/s while Vulkan manages ~759 t/s, and the gap persists at higher context depths.
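As a sanity check, here's a quick calculation of the relative gaps from the figures quoted above (APEX variant; treat the inputs as approximate chart readings):

```python
# Relative backend gaps for Qwen3.5-35B-A3B-APEX, using the numbers in the post.
# tg = token generation (t/s), pp = prompt processing at 4K context (t/s).
vulkan = {"tg_0ctx": 57.5, "tg_100k": 38.6, "pp_4k": 759.0}
rocm   = {"tg_0ctx": 50.0, "tg_100k": 35.7, "pp_4k": 885.0}

def gap(a, b):
    """Percent advantage of speed a over speed b."""
    return (a / b - 1.0) * 100.0

print(f"Vulkan gen advantage @ 0 ctx:  {gap(vulkan['tg_0ctx'], rocm['tg_0ctx']):.1f}%")  # ~15.0%
print(f"Vulkan gen advantage @ 100K:   {gap(vulkan['tg_100k'], rocm['tg_100k']):.1f}%")  # ~8.1%
print(f"ROCm prompt advantage @ 4K:    {gap(rocm['pp_4k'], vulkan['pp_4k']):.1f}%")      # ~16.6%
```

So the generation gap narrows as context grows, while the prompt-processing gap stays in ROCm's favor.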
2. Qwen3.5-35B-A3B-ThinkingCoder (unsloth)
(See charts 3 & 4)
This variant follows a very similar pattern. On token generation, Vulkan again takes the lead, starting at ~53.3 t/s (vs ROCm's ~46.6 t/s) and maintaining a lead even at 100K context.
Prompt processing is notably faster on ROCm, hitting ~1052 t/s at 2K context, whereas Vulkan is around ~798 t/s.
| Model | Gen Speed Winner | Prompt Processing Winner |
|---|---|---|
| APEX-I-Quality | Vulkan | ROCm |
| ThinkingCoder | Vulkan | ROCm |
Big picture:
- 🔧 Vulkan favors generation speed, ROCm favors prompt processing.
- 🚀 Both backends scale well into very large context windows (up to 200K).
- 🎯 Vulkan provides a consistent ~10-15% boost in generation throughput for these Qwen 3.5 MoE models.
- 🧊 Prefix caching was on for all tests, helping maintain performance at higher depths.
For day-to-day use, if you want the fastest response time per token, Vulkan is the way to go. If you are processing massive amounts of text in a single prompt, ROCm might give you the edge.
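To make that tradeoff concrete, here's a back-of-the-envelope latency model using the APEX figures quoted above (pp at 4K context, tg at zero context; real runs will differ, and this ignores scheduling overhead):

```python
# Rough end-to-end latency: prompt_tokens / pp_speed + gen_tokens / tg_speed.
# Speeds (t/s) are the APEX numbers from the post; treat them as ballpark.
BACKENDS = {
    "vulkan": {"pp": 759.0, "tg": 57.5},
    "rocm":   {"pp": 885.0, "tg": 50.0},
}

def latency_s(backend, prompt_tokens, gen_tokens):
    b = BACKENDS[backend]
    return prompt_tokens / b["pp"] + gen_tokens / b["tg"]

# Chat-style: short prompt, long answer -> generation speed dominates.
chat = {b: latency_s(b, 512, 500) for b in BACKENDS}
# Summarization-style: huge prompt, short answer -> prompt processing dominates.
summarize = {b: latency_s(b, 16384, 200) for b in BACKENDS}

print("chat:", {b: round(t, 2) for b, t in chat.items()})            # Vulkan wins
print("summarize:", {b: round(t, 2) for b, t in summarize.items()})  # ROCm wins
```

The crossover point depends on the prompt/generation ratio of your workload, which is why neither backend is a universal winner.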
*Benchmarks done with llama-benchy.*
crowtain@reddit
The speed per active param still seems lower than the old Qwen3. Is there any hope it will improve over time?
At Q8 it's nearly as slow as Minimax Q3_K_L.
VoiceApprehensive893@reddit
ROCm for me likes to GPU-hang a lot on prompt processing (llama.cpp)