Qwen 3.6-35B-A3B with 977 tk/s prompt processing and 262k context window on Intel Arc B70 Pro

Posted by Atomynos_Atom@reddit | LocalLLaMA | 32 comments

Llama benchmark results

model	size	params	backend	ngl	threads	type_k	type_v	fa	test	t/s
qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	SYCL	99	1	q8_0	q8_0	1	pp512	977.40 ± 2.02
qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	SYCL	99	1	q8_0	q8_0	1	tg128	70.54 ± 0.12

I've chucked all my notes in an LLM and created an article if you want to recreate the same setup.

I am currently using this with oh my pi and it's very usable. I was able to create a well-designed poker game without it going into a loop or hanging/crashing.

I've also tried Intel's vLLM before but couldn't get it to this kind of performance for a single request. I see that there are some updates, so I will give that another shot when I have the time.

Would love to hear if anyone's running a similar setup with any optimizations I'm missing, or anything in there that's actually doing nothing? Always looking to squeeze out more.

Also massive thanks to the llama.cpp contributors and everyone working to make local inferencing viable. The fact that I can do this kind of inferencing locally is only possible because of the people building and maintaining this stuff.

GGML_VK_VISIBLE_DEVICES=1 llama-bench -m /home/wuff/Downloads/Qwen3.5-35B-A3B-Q4_K_M.gguf -ngl 99 -b 1024 -t 6 -fa 1 -ub 512 -ctv q8_0 -ctk q8_0

WARNING: radv is not a conformant Vulkan implementation, testing use only.
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon AI PRO R9700 (RADV GFX1201) (radv) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

| model | size | params | backend | ngl | threads | n_batch | type_k | type_v | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -----: | -----: | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium | 20.49 GiB | 34.66 B | Vulkan | 99 | 6 | 1024 | q8_0 | q8_0 | 1 | pp512 | 3084.28 ± 23.29 |
| qwen35moe 35B.A3B Q4_K - Medium | 20.49 GiB | 34.66 B | Vulkan | 99 | 6 | 1024 | q8_0 | q8_0 | 1 | tg128 | 117.02 ± 0.12 |

build: 5aa3a6459 (9463)
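For anyone who wants the two llama-bench tables above as ratios, here is a minimal sketch. The numbers are copied from the tables; the dictionary and function names are made up for illustration. The runs are not like-for-like (SYCL with 1 thread on the B70, Vulkan with 6 threads and -b 1024 on the R9700, and model files of slightly different size), so treat the result as a rough indication only.

```python
# Throughput figures (tokens/s) copied from the two llama-bench tables.
# Variable names are illustrative; the runs use different backends and settings.
b70_sycl = {"pp512": 977.40, "tg128": 70.54}
r9700_vulkan = {"pp512": 3084.28, "tg128": 117.02}


def ratio(a, b):
    """Return a[test] / b[test] for every test present in both results."""
    return {test: a[test] / b[test] for test in a if test in b}


ratios = ratio(r9700_vulkan, b70_sycl)
for test, value in ratios.items():
    print(f"{test}: R9700 is {value:.2f}x the B70 result")
```

The gap is much wider for prompt processing than for token generation, which is why the comparison depends heavily on whether the workload is prefill-heavy or decode-heavy.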


MikePounce@reddit

Thanks for the write-up, it makes me consider Intel's offering in a good light. I pointed Opus 4.8 at it, copy/pasted "any optimizations I'm missing, or anything in there that's actually doing nothing? Always looking to squeeze out more." and this is what it had to say. Obviously, take it with a grain of salt.

Doing nothing / counterproductive:

--cache-ram 0: this is the big one. It doesn't "pin KV to VRAM" (KV is in VRAM regardless). It sets the host-RAM budget for the prompt cache, and context checkpoints are stored in that same budget. 0 = disabled. So this flag disables the exact checkpoint mechanism the seq_rm patch produces and that the post's own logs claim to restore. Drop the flag (default 8192, or -1 for unlimited).

SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1: inert. Deprecated PI name (live one is UR_L0_USE_IMMEDIATE_COMMANDLISTS), and on B-series + 2025.3 you're on the L0 v2 adapter, which only supports immediate command lists. You're forcing on something already mandatory.

privileged: true gives zero perf benefit and is security overkill. /dev/dri + render/video group is enough. Also makes the explicit devices: mapping redundant.

--min-p 0.0 / --presence-penalty 0.0: just defaults, no-ops.

Actual gains being left on the table:

Right-size --ctx-size. 256K is almost certainly overkill; KV + checkpoints scale with it. Drop to what you use (32–64K), reclaim VRAM for bigger batches.

AOT compile: cmake is JIT-only, no -DGGML_SYCL_DEVICE_ARCH for Battlemage. Setting it kills JIT warmup and can improve codegen.

Bump --ubatch-size to 1024–2048. Won't change gen tok/s but speeds up prefill/TTFT.

UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1: keep this one, it's correct and relevant (4 GB single-alloc cap vs large KV buffer).
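To put a number on "KV scales with --ctx-size", here is a rough sketch of the arithmetic for a q8_0 K/V cache. The q8_0 block layout (32 values plus one fp16 scale, 34 bytes) is the standard GGML one; the layer count, KV head count and head dimension below are hypothetical placeholders, not the real architecture of this model, so substitute the values from the GGUF metadata. Context checkpoints are not modelled.

```python
# Rough KV-cache size estimate for a q8_0 K/V cache.
# NOTE: the architecture numbers in HYPOTHETICAL are placeholders for
# illustration only; read the real ones from the model's GGUF metadata.

Q8_0_BLOCK_ELEMS = 32  # q8_0 stores values in blocks of 32
Q8_0_BLOCK_BYTES = 34  # 32 int8 values + one fp16 scale per block


def kv_cache_bytes(n_ctx, n_attn_layers, n_kv_heads, head_dim):
    """Bytes needed to hold K and V for n_ctx tokens at q8_0."""
    elems = 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx  # 2 = K and V
    return elems * Q8_0_BLOCK_BYTES // Q8_0_BLOCK_ELEMS


HYPOTHETICAL = dict(n_attn_layers=10, n_kv_heads=2, head_dim=256)

for n_ctx in (32_768, 65_536, 262_144):
    gib = kv_cache_bytes(n_ctx, **HYPOTHETICAL) / 1024**3
    print(f"ctx={n_ctx:>7}: ~{gib:.2f} GiB KV cache")
```

Whatever the real architecture numbers are, the cost is linear in context length, so going from 256K to 64K cuts the KV allocation to a quarter.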


Jester14@reddit

Lmfao there's a whole section about context cache in his "article" and he has it disabled.

xlltt@reddit

R9700 is like ~15-20% more expensive at most and you don't have to do finicky stuff to make it work. No point in buying Intel unless they stabilize everything.

Tai9ch@reddit

The "finicky stuff" is installing the dev tools and then building llama.cpp.

Not for llama.cpp or older workflows. You are forced to use SYCL under vLLM.

I've got a B70. Works fine on llama.cpp with both SYCL and Vulkan.

Fine as in the same perf numbers as the R9700, or even what this post is about?

Formal-Exam-8767@reddit

I dug some 3090 results for comparison:

/r/LocalLLaMA/comments/1rqljv4/benchmarked_all_unsloth_qwen3535ba3b_q4_models_on/

esw123@reddit

40% of the processing speed and 60% of the generation speed for 175% of the price of a used 3090?

Where are you finding 3090s cheaper than B70s?

B70s are 1200-1300+ euro and used 3090s go for 700-900 euro where I live. For 1500-1600 euro you can get a pair of 3090s.

sn2006gy@reddit

B70s are $950 USD and 3090s are $1100 USD.

Seems the price is different in Europe, and we include VAT. 3090s above 900 euro sit for too long on the marketplace; something in the 700-800 euro range is more common.

Time to collect some free arbitrage money shipping those across the pond.

They're not.

tecneeq@reddit

Can you do the same with 27b?

Maleficent-Ad5999@reddit

PP: 80tps TG: 8tps lol

totosse17@reddit

Isn't it kinda slow for how much memory bandwidth the card has?

2str8_njag@reddit

in what world does this card have a big memory bandwidth? it's gddr6

SexyAlienHotTubWater@reddit

Unbelievably low quality thing to add to the discussion, come on man. Reddit is the absolute worst for figuring out a way to misinterpret what you said and then dunk on you for it.

In our regular world. According to the spec published by Intel it has 608 GB/s memory bandwidth.

Just for comparison, a Spark with 273 GB/s gets 93-100 tps decoding on 4-bit.
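The bandwidth argument can be made concrete with a back-of-the-envelope ceiling: token generation has to read the active weights once per token, so bandwidth divided by bytes read per token bounds tokens per second. A minimal sketch, using the file size, parameter count and tg128 result from the post's table and the 608 GB/s figure quoted above. The 3e9 active-parameter count is an assumption taken from the "A3B" name, and KV-cache reads and per-tensor quantization differences are ignored, so this is only an optimistic upper bound.

```python
# Back-of-the-envelope decode ceiling from memory bandwidth.
# ASSUMPTIONS: ~3e9 active parameters per token (from the "A3B" name),
# uniform bits per weight, and no KV-cache or activation traffic.

GIB = 1024**3


def bits_per_weight(file_size_gib, total_params):
    """Average bits per weight implied by the model file size."""
    return file_size_gib * GIB * 8 / total_params


def decode_ceiling_tps(bandwidth_bytes_per_s, active_params, bpw):
    """Upper bound on tokens/s if every active weight is read once per token."""
    bytes_per_token = active_params * bpw / 8
    return bandwidth_bytes_per_s / bytes_per_token


bpw = bits_per_weight(20.81, 34.66e9)            # from the llama-bench table
ceiling = decode_ceiling_tps(608e9, 3e9, bpw)    # 608 GB/s as quoted above
measured = 70.54                                  # tg128 from the post

print(f"~{bpw:.2f} bits/weight, ceiling ~{ceiling:.0f} t/s, "
      f"measured is {measured / ceiling:.0%} of that")
```

Under these assumptions the measured tg128 sits well below the bandwidth ceiling, which is consistent with the suspicion that there is headroom left in the software stack.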

Long_comment_san@reddit

Hopefully this brings R9700 price down to something a bit more sane. Geeks can play with Intel and I don't mind but I just want internet waifu and occasional SQL scripts.

Unfortunately, with each week they get performance gains, the Intel cards just go up in price.

Intel just released a new version of its vLLM that was specifically a performance fix for Qwen 3.6. Give that a shot.

quantum_splicer@reddit

Tell me more, only because I may invest in a new card soon.

https://github.com/intel/llm-scaler

sampdoria_supporter@reddit

Great write up. Thank you!

CoolConfusion434@reddit

Ooh a B70 post 😄 Thank you!

In my case, I have been using -DGGML_SYCL_AOT=ON during cmake with the hope it would avoid JIT compilation later on.

I could be wrong but if you --cache-ram 0, then llama can't cache your prompt and restore it as you need it. Llama cache management is great to have as it skips having to reprocess what it already did.

I am definitely going to try your setup because in my case, generic Vulkan on Windows (yep) is faster than SYCL (or even Vulkan) on Linux. However, it dives off a cliff as the context window fills up. With 65K cx, it starts at about 125 t/s and ends up at ~45ish t/s with a full window. The prompt processing dive is even more dramatic, as it starts at >1700 t/s and finishes at 70.

Thanks again!

Edenar@reddit

Well, I have the same number with the q8_l_xl quant on Strix Halo, so I think you can get more out of the Intel card since it has more compute and memory bandwidth!

Momsbestboy@reddit

Intel SYCL got on my nerves, because I need to run the source /opt/.... script every time I open a console, adding an annoying 2s delay until I see the prompt.

Also t/s was always lower than the R9700. So I sold the B70, bought an R9700 and now run it with -50mV CPU voltage and the power limit set to 210W. Very quiet and cool, and yes, much slower than a hyped 5090, but also only drawing around 1/3 of the power.

Quickly pulled Qwen3.5 A3B Q4_K_M unsloth from Hugging Face, and ran llama-bench:

codeltd@reddit

I am running vLLM on my DGX Spark. Tried llama.cpp, but if you want more parallel request handling I think vLLM is better. (I can use 32 parallel requests now.) But I will try these with llama.cpp to see...