Intel B70: llama.cpp SYCL vs llama.cpp OpenVINO vs LLM-Scaler
Posted by Fmstrat@reddit | LocalLLaMA
In case anyone is interested, I decided to test out llama.cpp's new OpenVINO backend to see how it compares on Intel GPUs. At first glance, it stomps all over the previous best case, SYCL, but lags behind LLM-Scaler (Intel's vLLM fork), likely just due to the hardware optimizations for GPTQ/Int4.
As usual with Intel, model selection is... poor. It took a while to even find a model on the validated OpenVINO list that would not only run properly, but also had a counterpart that was "close enough" for LLM-Scaler.
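For anyone reproducing this, here's roughly how a llama.cpp server gets stood up before pointing llama-benchy at it. This is a sketch, not my exact commands: the SYCL build flags are documented, but the OpenVINO flag and the model path below are assumptions, so check the backend's docs.

```sh
# SYCL build (documented flags; needs the oneAPI compilers):
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
# OpenVINO build -- flag name assumed from the new backend, verify against its docs:
# cmake -B build -DGGML_OPENVINO=ON
cmake --build build --config Release

# Serve the GGUF on the endpoint the benchmarks below target (model path is a placeholder):
./build/bin/llama-server -m DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf --port 8000 -ngl 99
```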
## Llama.cpp OpenVINO
llama-benchy http://localhost:8000/v1 bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---------------------------------------------------|-------:|-----------------:|-------------:|---------------:|---------------:|----------------:|
| bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M | pp2048 | 3845.61 ± 524.73 | | 659.99 ± 56.95 | 489.07 ± 56.95 | 739.42 ± 56.84 |
| bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M | tg512 | 40.89 ± 0.55 | 44.33 ± 1.25 | | | |
## Llama.cpp SYCL
llama-benchy http://localhost:8000/v1 bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---------------------------------------------------|-------:|---------------:|-------------:|----------------:|----------------:|----------------:|
| bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M | pp2048 | 844.64 ± 19.25 | | 2199.90 ± 23.63 | 2178.96 ± 23.63 | 2229.67 ± 24.84 |
| bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M | tg512 | 73.87 ± 1.17 | 78.00 ± 2.16 | | | |
## LLM-Scaler
llama-benchy http://localhost:8000/v1 jakiAJK/DeepSeek-R1-Distill-Llama-8B_GPTQ-int4
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:--------|-------:|-----------------:|-------------:|---------------:|---------------:|----------------:|
| jakiAJK/DeepSeek-R1-Distill-Llama-8B_GPTQ-int4 | pp2048 | 7875.52 ± 642.20 | | 268.09 ± 20.50 | 240.11 ± 20.50 | 268.34 ± 20.45 |
| jakiAJK/DeepSeek-R1-Distill-Llama-8B_GPTQ-int4 | tg512 | 52.75 ± 0.10 | 54.00 ± 0.00 | | | |
TheBlueMatt@reddit
Vulkan was already much better than SYCL. It's also going to get better; see, e.g., https://old.reddit.com/r/LocalLLaMA/comments/1swgwvh/mesa_pr_with_37130_llamacpp_pp_perf_gain_for/
Fmstrat@reddit (OP)
Interesting. Even with the performance increase discussed there, it still seems lower than what these benchmarks show, though?
https://github.com/PMZFX/intel-arc-pro-b70-benchmarks/blob/master/llm-benchmarks.md#sycl-vs-vulkan---same-hardware-same-model
TheBlueMatt@reddit
I don't know where they're getting their data. Locally, SYCL is generally faster in pp but generally slower in tg. That was true before the Mesa patch, though that patch closes some of the gap for pp. E.g., right now on a B60 I see:
| model | size | params | backend | ngl | dev | fa | test | t/s |
|:------------------------|---------:|-------:|--------:|----:|------:|---:|-------:|----------------:|
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | SYCL | 99 | SYCL0 | 0 | pp512 | 1620.49 ± 1.39 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | SYCL | 99 | SYCL0 | 0 | pp2048 | 1605.91 ± 0.32 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | SYCL | 99 | SYCL0 | 0 | tg128 | 30.73 ± 0.01 |
and
| model | size | params | backend | ngl | dev | fa | test | t/s |
|:------------------------|---------:|-------:|--------:|----:|--------:|---:|-------:|----------------:|
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | Vulkan | 99 | Vulkan1 | 0 | pp512 | 1191.04 ± 1.10 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | Vulkan | 99 | Vulkan1 | 0 | pp2048 | 1189.52 ± 0.39 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | Vulkan | 99 | Vulkan1 | 0 | tg128 | 33.26 ± 0.01 |
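(Rows in that shape come straight out of llama-bench; a minimal sketch of the invocation, with the model path as a placeholder:)

```sh
# Produces pp512/pp2048/tg128 rows like the ones above; model path is a placeholder
./build/bin/llama-bench -m qwen3.5-9b-q4_k_m.gguf -ngl 99 -p 512,2048 -n 128
```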
Rabooooo@reddit
It would be nice to see how it compares with the Vulkan backend.
Also, I don't understand: do only some models work with the OpenVINO backend?
And if you have an Intel card and use the Vulkan backend, will all models work?
I've been thinking of buying the B70 because of its low price and high VRAM, but got scared because of all the threads about it working poorly.
Fucnk@reddit
Here is the catch-22 with this card: new transformers that support the latest models do not support all of the features of this card. SYCL does not have XMX support, so your prompt processing runs at a third of the speed it should, but it's able to generate tokens at an okay speed.
There is llm-scaler, which may have all of the support enabled but may not be able to run the latest models.
It's that donut-hole problem repeating itself for each backend.
This card will be as fast as a 3090 on all the models you do not want to run.
TheBlueMatt@reddit
Except on the Vulkan backend? For whatever reason people keep ignoring the Vulkan backend for Intel cards on this sub. It's generally faster than SYCL and is much more actively maintained (supports the latest models at competitive speed).
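(If you want to try it, a minimal sketch of a Vulkan build and serve; the build flag is documented, but the device-selection env var and the model path are best-effort assumptions, so verify against your build:)

```sh
# Vulkan backend build (documented flag):
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

# GGML_VK_VISIBLE_DEVICES picks the GPU when several are present (assumed; check your build).
# Model path is a placeholder.
GGML_VK_VISIBLE_DEVICES=0 ./build/bin/llama-server -m model.gguf --port 8000 -ngl 99
```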
Fucnk@reddit
Not my page, but I've been following Vulkan benchmarks: https://github.com/PMZFX/intel-arc-pro-b70-benchmarks
Rabooooo@reddit
These numbers seem a bit low, no? I get 20-25 tg/s for Qwen 3.6-35B-A3B Q4_K_XL, and that's only partially running on my super old RTX 2080 Ti with a 10-year-old CPU and DDR4 system RAM. With Qwen3-Coder-Next I get around 15 tg/s.
TheBlueMatt@reddit
IME it's highly model-dependent, but Vulkan is often substantially faster.
TheBlueMatt@reddit
It definitely works poorly today. If you want something that just works, it's probably not an ideal perf/$ tradeoff. Some of us are trying to improve it, though.
bigbigmind@reddit
Try ipex-llm
Fmstrat@reddit (OP)
No longer updated, so no new model support. That's why there's a shift to OpenVINO.
RelicDerelict@reddit
Call me ignorant, but didn't Intel just publicly abandon OpenVINO, SYCL, and most of their AI software stack?
fallingdowndizzyvr@reddit
Does it? Look at the tg: it's almost half of SYCL's.
tomByrer@reddit
Thanks for digging!
I looked at the LLM-Scaler (Intel's vLLM fork) repo; Intel seems busy on it.