Qwen 3.5 35B on LocalAI (Strix Halo): Vulkan / ROCm
Posted by pipould@reddit | LocalLLaMA | 2 comments
Hey everyone! 👋
Just finished running a bunch of benchmarks on the new Qwen 3.5 35B models using LocalAI and figured I'd share the results. I was curious how Vulkan and ROCm backends stack up against each other for these two different quant/source variants.
Two model variants, each on both Vulkan and ROCm:
| Model | Type | Quant | Source |
|---|---|---|---|
| mudler/Qwen3.5-35B-A3B-APEX-GGUF:Qwen3.5-35B-A3B-APEX-I-Quality.gguf | MoE (3B active) | APEX | mudler |
| unsloth/Qwen3.5-35B-A3B-GGUF:Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf | MoE (3B active) | GGUF | unsloth |
Tool: llama-benchy (via uvx), with prefix caching enabled, generation latency mode, adaptive prompts.
Context depths tested: 0, 4K, 8K, 16K, 32K, 65K, 100K, and up to 200K tokens.
System Environment
Lemonade Version: 10.1.0
OS: Linux-6.19.10-061910-generic (Ubuntu 25.10)
CPU: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
Shared GPU memory: 118.1 GB
TDP: 85W
Vulkan build: b8681
ROCm build: b1232
CPU build: b8681
The results
1. Qwen3.5-35B-A3B-APEX-I-Quality (mudler)
(See charts 1 & 2)
On token generation, Vulkan is the clear winner here, consistently outperforming ROCm. At zero context, Vulkan hits ~57.5 t/s compared to ROCm's ~50.0 t/s. As context grows to 100K, Vulkan maintains a healthy ~38.6 t/s while ROCm drops to ~35.7 t/s.
Prompt processing is where ROCm shows its strength. At 4K context, ROCm hits ~885 t/s while Vulkan manages ~759 t/s, and the gap persists at higher context depths.
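As a sanity check, here's a quick calculation of the relative gaps from the figures quoted above (APEX variant; treat the inputs as approximate chart readings):

```python
# Relative backend gaps for Qwen3.5-35B-A3B-APEX, using the numbers in the post.
# tg = token generation (t/s), pp = prompt processing at 4K context (t/s).
vulkan = {"tg_0ctx": 57.5, "tg_100k": 38.6, "pp_4k": 759.0}
rocm   = {"tg_0ctx": 50.0, "tg_100k": 35.7, "pp_4k": 885.0}

def gap(a, b):
    """Percent advantage of speed a over speed b."""
    return (a / b - 1.0) * 100.0

print(f"Vulkan gen advantage @ 0 ctx:  {gap(vulkan['tg_0ctx'], rocm['tg_0ctx']):.1f}%")  # ~15.0%
print(f"Vulkan gen advantage @ 100K:   {gap(vulkan['tg_100k'], rocm['tg_100k']):.1f}%")  # ~8.1%
print(f"ROCm prompt advantage @ 4K:    {gap(rocm['pp_4k'], vulkan['pp_4k']):.1f}%")      # ~16.6%
```

So the generation gap narrows as context grows, while the prompt-processing gap stays in ROCm's favor.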
2. Qwen3.5-35B-A3B-ThinkingCoder (unsloth)
(See charts 3 & 4)
This variant follows a very similar pattern. On token generation, Vulkan again takes the lead, starting at ~53.3 t/s (vs ROCm's ~46.6 t/s) and maintaining a lead even at 100K context.
Prompt processing is notably faster on ROCm, hitting ~1052 t/s at 2K context, whereas Vulkan is around ~798 t/s.
| Model | Gen Speed Winner | Prompt Processing Winner |
|---|---|---|
| APEX-I-Quality | Vulkan | ROCm |
| ThinkingCoder | Vulkan | ROCm |
Big picture:
- 🔧 Vulkan favors generation speed, ROCm favors prompt processing.
- 🚀 Both backends scale well into very large context windows (up to 200K).
- 🎯 Vulkan provides a consistent ~10-15% boost in generation throughput for these Qwen 3.5 MoE models.
- 🧊 Prefix caching was on for all tests, helping maintain performance at higher depths.
For day-to-day use, if you want the fastest response time per token, Vulkan is the way to go. If you are processing massive amounts of text in a single prompt, ROCm might give you the edge.
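To make that tradeoff concrete, here's a back-of-the-envelope latency model using the APEX figures quoted above (pp at 4K context, tg at zero context; real runs will differ, and this ignores scheduling overhead):

```python
# Rough end-to-end latency: prompt_tokens / pp_speed + gen_tokens / tg_speed.
# Speeds (t/s) are the APEX numbers from the post; treat them as ballpark.
BACKENDS = {
    "vulkan": {"pp": 759.0, "tg": 57.5},
    "rocm":   {"pp": 885.0, "tg": 50.0},
}

def latency_s(backend, prompt_tokens, gen_tokens):
    b = BACKENDS[backend]
    return prompt_tokens / b["pp"] + gen_tokens / b["tg"]

# Chat-style: short prompt, long answer -> generation speed dominates.
chat = {b: latency_s(b, 512, 500) for b in BACKENDS}
# Summarization-style: huge prompt, short answer -> prompt processing dominates.
summarize = {b: latency_s(b, 16384, 200) for b in BACKENDS}

print("chat:", {b: round(t, 2) for b, t in chat.items()})            # Vulkan wins
print("summarize:", {b: round(t, 2) for b, t in summarize.items()})  # ROCm wins
```

The crossover point depends on the prompt/generation ratio of your workload, which is why neither backend is a universal winner.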
*Benchmarks done with llama-benchy.*
crowtain@reddit
The speed per active param still seems lower than the old Qwen3. Is there any hope it will improve over time?
At Q8 it's nearly as slow as Minimax Q3_K_L.
VoiceApprehensive893@reddit
ROCm for me likes to GPU-hang a lot on prompt processing (llama.cpp)