Intel B70: llama.cpp SYCL vs llama.cpp OpenVINO vs LLM-Scaler
Posted by Fmstrat@reddit | LocalLLaMA
In case anyone is interested, I decided to test out llama.cpp's new OpenVINO backend to see how it compares on Intel GPUs. At first glance, it stomps all over the previous best case, SYCL, but lags behind LLM-Scaler (Intel's vLLM fork), likely just due to the hardware optimizations for GPTQ/Int4.
As usual with Intel, model selection is... poor. It took a while to even find a model on the validated OpenVINO list that would not only run properly, but also had a counterpart that was "close enough" for LLM-Scaler.
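For anyone reproducing this, here's roughly how a llama.cpp server gets stood up before pointing llama-benchy at it. This is a sketch, not my exact commands: the SYCL build flags are documented, but the OpenVINO flag and the model path below are assumptions, so check the backend's docs.

```sh
# SYCL build (documented flags; needs the oneAPI compilers):
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
# OpenVINO build -- flag name assumed from the new backend, verify against its docs:
# cmake -B build -DGGML_OPENVINO=ON
cmake --build build --config Release

# Serve the GGUF on the endpoint the benchmarks below target (model path is a placeholder):
./build/bin/llama-server -m DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf --port 8000 -ngl 99
```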
## Llama.cpp OpenVINO
llama-benchy http://localhost:8000/v1 bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---------------------------------------------------|-------:|-----------------:|-------------:|---------------:|---------------:|----------------:|
| bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M | pp2048 | 3845.61 ± 524.73 | | 659.99 ± 56.95 | 489.07 ± 56.95 | 739.42 ± 56.84 |
| bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M | tg512 | 40.89 ± 0.55 | 44.33 ± 1.25 | | | |
## Llama.cpp SYCL
llama-benchy http://localhost:8000/v1 bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---------------------------------------------------|-------:|---------------:|-------------:|----------------:|----------------:|----------------:|
| bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M | pp2048 | 844.64 ± 19.25 | | 2199.90 ± 23.63 | 2178.96 ± 23.63 | 2229.67 ± 24.84 |
| bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M | tg512 | 73.87 ± 1.17 | 78.00 ± 2.16 | | | |
## LLM-Scaler
llama-benchy http://localhost:8000/v1 jakiAJK/DeepSeek-R1-Distill-Llama-8B_GPTQ-int4
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:--------|-------:|-----------------:|-------------:|---------------:|---------------:|----------------:|
| jakiAJK/DeepSeek-R1-Distill-Llama-8B_GPTQ-int4 | pp2048 | 7875.52 ± 642.20 | | 268.09 ± 20.50 | 240.11 ± 20.50 | 268.34 ± 20.45 |
| jakiAJK/DeepSeek-R1-Distill-Llama-8B_GPTQ-int4 | tg512 | 52.75 ± 0.10 | 54.00 ± 0.00 | | | |
TheBlueMatt@reddit
Vulkan was already much better than SYCL. It's also going to get better; see, e.g., https://old.reddit.com/r/LocalLLaMA/comments/1swgwvh/mesa_pr_with_37130_llamacpp_pp_perf_gain_for/
Fmstrat@reddit (OP)
Interesting. Even with the performance increase discussed there, it still seems lower than what these benchmarks show, though?
https://github.com/PMZFX/intel-arc-pro-b70-benchmarks/blob/master/llm-benchmarks.md#sycl-vs-vulkan---same-hardware-same-model
TheBlueMatt@reddit
I don't know where they're getting their data. Locally, SYCL is generally faster in pp but generally slower in tg. That was true before the Mesa patch, though that patch closes some of the gap for pp. E.g., right now on a B60 I see:
| model | size | params | backend | ngl | dev | fa | test | t/s |
|:------------------------|---------:|-------:|--------:|----:|------:|---:|-------:|----------------:|
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | SYCL | 99 | SYCL0 | 0 | pp512 | 1620.49 ± 1.39 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | SYCL | 99 | SYCL0 | 0 | pp2048 | 1605.91 ± 0.32 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | SYCL | 99 | SYCL0 | 0 | tg128 | 30.73 ± 0.01 |
and
| model | size | params | backend | ngl | dev | fa | test | t/s |
|:------------------------|---------:|-------:|--------:|----:|--------:|---:|-------:|----------------:|
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | Vulkan | 99 | Vulkan1 | 0 | pp512 | 1191.04 ± 1.10 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | Vulkan | 99 | Vulkan1 | 0 | pp2048 | 1189.52 ± 0.39 |
| qwen35 9B Q4_K - Medium | 5.55 GiB | 8.95 B | Vulkan | 99 | Vulkan1 | 0 | tg128 | 33.26 ± 0.01 |
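(Rows in that shape come straight out of llama-bench; a minimal sketch of the invocation, with the model path as a placeholder:)

```sh
# Produces pp512/pp2048/tg128 rows like the ones above; model path is a placeholder
./build/bin/llama-bench -m qwen3.5-9b-q4_k_m.gguf -ngl 99 -p 512,2048 -n 128
```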
Rabooooo@reddit
It would be nice to see how it compares with the Vulkan backend.
Also, I don't understand: do only some models work with the OpenVINO backend?
And if you have an Intel card and use the Vulkan backend, will all models work?
I've been thinking of buying the B70 because of its low price and high VRAM, but got scared because of all the threads about it working poorly.
Fucnk@reddit
Here is the catch-22 with this card: new transformers that support the latest models do not support all of the features of this card. SYCL does not have XMX support, so your prompt processing runs at a third of the speed it should, but it's able to generate tokens at an okay speed.
There is llm-scaler, which may have all of the support enabled but may not be able to run the latest models.
It's that donut-hole problem repeating itself for each backend.
This card will be as fast as a 3090 on all the models you do not want to run.
TheBlueMatt@reddit
Except on the Vulkan backend? For whatever reason people keep ignoring the Vulkan backend for Intel cards on this sub. It's generally faster than SYCL and is much more actively maintained (supports the latest models at competitive speed).
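(If you want to try it, a minimal sketch of a Vulkan build and serve; the build flag is documented, but the device-selection env var and the model path are best-effort assumptions, so verify against your build:)

```sh
# Vulkan backend build (documented flag):
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

# GGML_VK_VISIBLE_DEVICES picks the GPU when several are present (assumed; check your build).
# Model path is a placeholder.
GGML_VK_VISIBLE_DEVICES=0 ./build/bin/llama-server -m model.gguf --port 8000 -ngl 99
```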
Fucnk@reddit
Not my page, but I've been following Vulkan benchmarks: https://github.com/PMZFX/intel-arc-pro-b70-benchmarks
Rabooooo@reddit
These numbers seem a bit low, no? I get 20-25 tg/s for Qwen 3.6-35B-A3B Q4_K_XL, and that's only partially running on my super old RTX 2080 Ti with a 10-year-old CPU and DDR4 system RAM. With Qwen3-Coder-Next I get around 15 tg/s.
TheBlueMatt@reddit
IME it's highly model-dependent, but Vulkan is often substantially faster.
TheBlueMatt@reddit
It definitely works poorly today. If you want something that just works, it's probably not an ideal perf/$ tradeoff. Some of us are trying to improve it, though.
bigbigmind@reddit
Try ipex-llm
Fmstrat@reddit (OP)
No longer updated, so no new model support. That's why there's a shift to OpenVINO.
RelicDerelict@reddit
Call me ignorant, but didn't Intel just publicly abandon OpenVINO, SYCL, and most of their AI software stack?
fallingdowndizzyvr@reddit
Does it? Look at the tg: it's almost half of SYCL's.
tomByrer@reddit
Thanks for digging!
I looked at the LLM-Scaler (Intel's vLLM fork) repo; Intel seems busy on it.