Qwen 3.6-35B-A3B with 977 tk/s prompt processing and 262k context window on Intel Arc B70 Pro
Posted by Atomynos_Atom@reddit | LocalLLaMA | View on Reddit | 32 comments
Llama benchmark results
| model | size | params | backend | ngl | threads | type_k | type_v | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|---|---|
| qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 1 | q8_0 | q8_0 | 1 | pp512 | 977.40 ± 2.02 |
| qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 1 | q8_0 | q8_0 | 1 | tg128 | 70.54 ± 0.12 |
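To put the two rows in practical terms, here is a rough sketch that converts the measured rates into wall-clock time. It assumes "262k" means 262,144 tokens and that throughput stays constant at the pp512/tg128 rates. Real throughput usually drops as the context fills, so treat these as optimistic lower bounds.

```python
# Rough time estimates derived from the llama-bench rows above.
PP_TOKENS_PER_S = 977.40   # pp512 result from the table
TG_TOKENS_PER_S = 70.54    # tg128 result from the table
CONTEXT_TOKENS = 262_144   # assumption: "262k" in the title means 256 * 1024 tokens


def seconds_for(tokens: int, tokens_per_s: float) -> float:
    """Time to process `tokens` at a constant rate (an optimistic simplification)."""
    return tokens / tokens_per_s


prefill_s = seconds_for(CONTEXT_TOKENS, PP_TOKENS_PER_S)
reply_s = seconds_for(1_000, TG_TOKENS_PER_S)  # 1,000-token reply is a hypothetical size

print(f"full-context prefill: ~{prefill_s:.0f} s (optimistic)")
print(f"1000-token reply:     ~{reply_s:.1f} s")
```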
I've chucked all my notes in an LLM and created an article if you want to recreate the same setup.
I am currently using this with oh my pi and it's very usable. I was able to create a well-designed poker game without it going into a loop or hanging/crashing.
I've also tried Intel's vLLM before but couldn't get it to this kind of performance for a single request. I see that there are some updates, so I will give that another shot when I have the time.
Would love to hear if anyone's running a similar setup with any optimizations I'm missing, or anything in there that's actually doing nothing? Always looking to squeeze out more.
Also massive thanks to the llama.cpp contributors and everyone working to make local inferencing viable. The fact that I can do this kind of inferencing locally is only possible because of the people building and maintaining this stuff.
xlltt@reddit
The R9700 is ~15-20% more expensive at most, and you don't have to do finicky stuff to make it work. There's no point in buying Intel unless they stabilize everything.
Tai9ch@reddit
The "finicky stuff" is installing the dev tools and then building llama.cpp.
xlltt@reddit
Not for llama.cpp or older workflows. You are forced to use SYCL under VLLM
Tai9ch@reddit
I've got a B70. Works fine on llama.cpp with both SYCL and Vulkan.
xlltt@reddit
Fine as in the same perf numbers as the R9700, or even what this post is about?
Formal-Exam-8767@reddit
I dug up some 3090 results for comparison:
/r/LocalLLaMA/comments/1rqljv4/benchmarked_all_unsloth_qwen3535ba3b_q4_models_on/
esw123@reddit
40% of processing speed and 60% of generation speed for 175% price of a used 3090?
Tai9ch@reddit
Where are you finding 3090s cheaper than B70s?
esw123@reddit
B70s are 1200-1300+ euro and used 3090s are 700-900 euro where I live. For 1500-1600 euro you can get a pair of 3090s.
sn2006gy@reddit
B70s are $950 USD and 3090s are $1100 USD.
esw123@reddit
It seems prices are different in Europe, and we include VAT. 3090s above 900 euro sit for too long on the marketplace; something in the 700-800 euro range is more common.
Tai9ch@reddit
Time to collect some free arbitrage money shipping those across the pond.
sn2006gy@reddit
They're not.
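The price disagreement above changes the value picture quite a bit. A small sketch, taking esw123's 40%/60% relative-speed estimates at face value and plugging in the two price ratios quoted in this thread (both are the commenters' numbers, not verified market data):

```python
def perf_per_price(relative_speed: float, relative_price: float) -> float:
    """B70 performance per unit of money, relative to a 3090 (1.0 = equal value)."""
    return relative_speed / relative_price


PP_REL, TG_REL = 0.40, 0.60  # esw123's estimate of B70 speed relative to a 3090

# Price of a B70 divided by price of a 3090, as quoted by each commenter.
price_ratios = {
    "Europe (esw123, 175%)": 1.75,
    "US (sn2006gy, $950 vs $1100)": 950 / 1100,
}

for label, ratio in price_ratios.items():
    print(f"{label}: prompt {perf_per_price(PP_REL, ratio):.2f}x, "
          f"generation {perf_per_price(TG_REL, ratio):.2f}x the 3090's value")
```

Under either set of prices the ratio stays below 1.0, but it roughly doubles with the US figures.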
tecneeq@reddit
Can you do the same with 27b?
Maleficent-Ad5999@reddit
PP: 80tps TG: 8tps lol
totosse17@reddit
Isn't it kinda slow for how much memory bandwidth the card has?
2str8_njag@reddit
in what world does this card have a big memory bandwidth? it's gddr6
SexyAlienHotTubWater@reddit
Unbelievably low quality thing to add to the discussion, come on man. Reddit is the absolute worst for figuring out a way to misinterpret what you said and then dunk on you for it.
totosse17@reddit
In our regular world. According to the spec published by Intel, it has 608 GB/s memory bandwidth.
totosse17@reddit
Just for comparison, Spark with 273 GB/s gets 93-100 tps decoding on 4-bit.
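The bandwidth argument can be made concrete with a back-of-the-envelope ceiling: if decoding is memory-bound, each generated token has to read every active weight at least once. The sketch below assumes ~3B active parameters (inferred from the "A3B" name), a uniform bits-per-weight derived from the table's size and params columns, and ignores KV cache reads and compute limits. The Spark figure uses the low end of the 93-100 tps range quoted above and may be from a different 4-bit quant.

```python
GIB = 1024 ** 3

MODEL_BYTES = 20.81 * GIB  # "size" column of the benchmark table
TOTAL_PARAMS = 34.66e9     # "params" column of the benchmark table
ACTIVE_PARAMS = 3e9        # assumption: "A3B" means ~3B active params per token


def decode_ceiling(bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if every token reads each active weight once."""
    bytes_per_param = MODEL_BYTES / TOTAL_PARAMS      # average over the whole file
    bytes_per_token = ACTIVE_PARAMS * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token


for name, bandwidth, measured in [
    ("B70, 608 GB/s", 608, 70.54),   # tg128 from the table
    ("Spark, 273 GB/s", 273, 93.0),  # low end of the range quoted above
]:
    ceiling = decode_ceiling(bandwidth)
    print(f"{name}: ceiling ~{ceiling:.0f} t/s, measured {measured} t/s "
          f"({measured / ceiling:.0%} of ceiling)")
```

Under these assumptions the B70 lands much further from its ceiling than the Spark does, which is consistent with the suspicion that the bottleneck here is software rather than memory bandwidth.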
Long_comment_san@reddit
Hopefully this brings R9700 price down to something a bit more sane. Geeks can play with Intel and I don't mind but I just want internet waifu and occasional SQL scripts.
sn2006gy@reddit
unfortunately with each week they get performance gains, the intel cards just go up in price.
sn2006gy@reddit
Intel just released a new version of its vllm that specifically was a performance fix for qwen 3.6 - give that a shot.
quantum_splicer@reddit
Tell me more only because I may invest in a new card soon
sn2006gy@reddit
https://github.com/intel/llm-scaler
sampdoria_supporter@reddit
Great write up. Thank you!
CoolConfusion434@reddit
Ooh a B70 post 😄 Thank you!
In my case, I have been using -DGGML_SYCL_AOT=ON during cmake with the hope it would avoid JIT compilation later on.
I could be wrong but if you --cache-ram 0, then llama can't cache your prompt and restore it as you need it. Llama cache management is great to have as it skips having to reprocess what it already did.
I am definitely going to try your setup because in my case, generic Vulkan on Windows (yep) is faster than SYCL (or even Vulkan) on Linux. However, it dives off a cliff in prompt processing as the context window fills up. With 65K cx, it starts at about 125 t/s, and ends up at ~45ish t/s with full window. Prompt processing dive is even more dramatic as it starts at >1700 t/s and finishes at 70.
Thanks again!
Edenar@reddit
Well, I get the same numbers with the q8_l_xl quant on Strix Halo, so I think you can get more out of the Intel card since it has more compute and memory bandwidth!
Momsbestboy@reddit
Intel's SYCL got on my nerves because I need to run the source /opt/.... script every time I open a console, adding an annoying 2s delay until I see the prompt.
Also, t/s was always lower than on the R9700. So I sold the B70, bought an R9700, and now run it with -50mV CPU voltage and the power limit set to 210W. Very quiet and cool, and yes, much slower than a hyped 5090, but also only drawing around 1/3 of the power.
Quickly pulled Qwen3.5 A3B Q4_K_M unsloth from Hugging Face and ran llama-bench:
MikePounce@reddit
Thanks for the write-up, it makes me consider Intel's offering in a good light. I pointed Opus 4.8 to it, copy/pasted "any optimizations I'm missing, or anything in there that's actually doing nothing? Always looking to squeeze out more." and this is what it had to say. Obviously, take it with a grain of salt.
Doing nothing / counterproductive:
- `--cache-ram 0`: this is the big one. It doesn't "pin KV to VRAM" (KV is in VRAM regardless). It sets the host-RAM budget for the prompt cache, and context checkpoints are stored in that same budget. 0 = disabled. So this flag disables the exact checkpoint mechanism the seq_rm patch produces and that the post's own logs claim to restore. Drop the flag (default 8192, or -1 for unlimited).
- `SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1`: inert. Deprecated PI name (live one is `UR_L0_USE_IMMEDIATE_COMMANDLISTS`), and on B-series + 2025.3 you're on the L0 v2 adapter, which only supports immediate command lists. You're forcing on something already mandatory.
- `privileged: true`: zero perf benefit, security overkill. /dev/dri + render/video group is enough. Also makes the explicit `devices:` mapping redundant.
- `--min-p 0.0` / `--presence-penalty 0.0`: just defaults, no-ops.
Actual gains being left on the table:
- Right-size `--ctx-size`. 256K is almost certainly overkill; KV + checkpoints scale with it. Drop to what you use (32–64K), reclaim VRAM for bigger batches.
- AOT compile: cmake is JIT-only, no `-DGGML_SYCL_DEVICE_ARCH` for Battlemage. Setting it kills JIT warmup and can improve codegen.
- Bump `--ubatch-size` to 1024–2048. Won't change gen tok/s but speeds up prefill/TTFT.
- `UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1`: keep this one, it's correct and relevant (4GB single-alloc cap vs large KV buffer).
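To illustrate the "KV scales with context" point, here is a minimal sketch of KV cache sizing with q8_0 K/V types (as in the benchmark's type_k/type_v columns). The layer count, KV head count, and head dimension below are hypothetical placeholders, not the real Qwen config, and checkpoint storage is not modeled. Only the q8_0 storage cost (34 bytes per 32 values) is a known constant.

```python
Q8_0_BYTES_PER_ELEM = 34 / 32  # q8_0 block: 32 int8 values plus one fp16 scale


def kv_cache_bytes(ctx_tokens: int, n_attn_layers: int, n_kv_heads: int,
                   head_dim: int,
                   bytes_per_elem: float = Q8_0_BYTES_PER_ELEM) -> float:
    """Bytes for K and V across all attention layers at a given context size."""
    # Factor of 2: one K entry and one V entry per token, per head, per layer.
    return 2 * n_attn_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem


# Hypothetical model shape for illustration only; substitute the values that
# llama.cpp prints for the actual model at load time.
SHAPE = dict(n_attn_layers=10, n_kv_heads=2, head_dim=256)

for ctx in (32_768, 65_536, 262_144):
    mib = kv_cache_bytes(ctx, **SHAPE) / 2**20
    print(f"ctx={ctx:>7}: ~{mib:,.0f} MiB of KV cache")
```

Whatever the real shape is, the relationship is linear: going from 256K to 64K context cuts the KV allocation to a quarter, which is the VRAM the comment suggests reclaiming for larger batches.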
Jester14@reddit
Lmfao there's a whole section about context cache in his "article" and he has it disabled.
codeltd@reddit
I am running vLLM on my DGX Spark. I tried llama.cpp, but if you want more parallel request handling, I think vLLM is better. (I can use 32 parallel requests now.) But I will try these with llama.cpp to see...