Intel B70: llama.cpp SYCL vs llama.cpp OpenVINO vs LLM-Scaler

Posted by Fmstrat@reddit | LocalLLaMA

In case anyone is interested, I decided to test out llama.cpp's new OpenVINO backend to see how it compares on Intel GPUs. At first glance it stomps all over the previous best case, SYCL, in prompt processing (though SYCL still leads in token generation), but it lags behind LLM-Scaler (Intel's vLLM fork), likely just due to that stack's hardware-optimized GPTQ/Int4 kernels.

As usual with Intel, model selection is... poor. It took a while to find a model on the validated OpenVINO list that would not only run properly, but also had a "close enough" counterpart for LLM-Scaler.
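For anyone wanting to reproduce the SYCL numbers, the build roughly follows llama.cpp's upstream SYCL instructions. This is a sketch, not the exact commands from my runs; the oneAPI install path and port are assumptions you should adjust:

```shell
# Build llama.cpp with the SYCL backend (requires the Intel oneAPI Base Toolkit).
source /opt/intel/oneapi/setvars.sh   # path is an assumption; adjust for your install
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j

# Serve the same quant on port 8000 so llama-benchy can hit the OpenAI-compatible endpoint:
./build/bin/llama-server -hf bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M --port 8000
```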

## Llama.cpp OpenVINO
llama-benchy http://localhost:8000/v1 bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M


| model                                              |   test |              t/s |     peak t/s |      ttfr (ms) |   est_ppt (ms) |   e2e_ttft (ms) |
|:---------------------------------------------------|-------:|-----------------:|-------------:|---------------:|---------------:|----------------:|
| bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M | pp2048 | 3845.61 ± 524.73 |              | 659.99 ± 56.95 | 489.07 ± 56.95 |  739.42 ± 56.84 |
| bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M |  tg512 |     40.89 ± 0.55 | 44.33 ± 1.25 |                |                |                 |


## Llama.cpp SYCL
llama-benchy http://localhost:8000/v1 bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M


| model                                              |   test |            t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:---------------------------------------------------|-------:|---------------:|-------------:|----------------:|----------------:|----------------:|
| bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M | pp2048 | 844.64 ± 19.25 |              | 2199.90 ± 23.63 | 2178.96 ± 23.63 | 2229.67 ± 24.84 |
| bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M |  tg512 |   73.87 ± 1.17 | 78.00 ± 2.16 |                 |                 |                 |


## LLM-Scaler
llama-benchy http://localhost:8000/v1 jakiAJK/DeepSeek-R1-Distill-Llama-8B_GPTQ-int4


| model                                          |   test |              t/s |     peak t/s |      ttfr (ms) |   est_ppt (ms) |   e2e_ttft (ms) |
|:-----------------------------------------------|-------:|-----------------:|-------------:|---------------:|---------------:|----------------:|
| jakiAJK/DeepSeek-R1-Distill-Llama-8B_GPTQ-int4 | pp2048 | 7875.52 ± 642.20 |              | 268.09 ± 20.50 | 240.11 ± 20.50 |  268.34 ± 20.45 |
| jakiAJK/DeepSeek-R1-Distill-Llama-8B_GPTQ-int4 |  tg512 |     52.75 ± 0.10 | 54.00 ± 0.00 |                |                |                 |
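To put the tables above in relative terms, the mean throughputs work out to roughly these ratios (computed from the mean values only, ignoring the ± spread):

```python
# Speedup ratios from the benchmark tables above (mean values only).
openvino_pp, sycl_pp, scaler_pp = 3845.61, 844.64, 7875.52  # pp2048 t/s
openvino_tg, sycl_tg, scaler_tg = 40.89, 73.87, 52.75       # tg512 t/s

print(f"OpenVINO vs SYCL, prompt processing:   {openvino_pp / sycl_pp:.2f}x")   # ~4.55x
print(f"LLM-Scaler vs OpenVINO, prompt proc.:  {scaler_pp / openvino_pp:.2f}x") # ~2.05x
print(f"SYCL vs OpenVINO, token generation:    {sycl_tg / openvino_tg:.2f}x")   # ~1.81x
```

So OpenVINO roughly 4.5x's SYCL on prompt processing, LLM-Scaler doubles OpenVINO again, while SYCL still holds a clear lead on generation speed.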