Intel Arc Pro B70 llama.cpp benchmarks posted

[-]

Noble00_@reddit

As the top comment posted: [https://www.youtube.com/watch?v=MnGLqo5cuGQ](https://www.youtube.com/watch?v=MnGLqo5cuGQ) It seems the SYCL backend doesn't use XMX acceleration, which is why we see pretty poor results compared to vLLM, which is well maintained by Intel. https://preview.redd.it/0n5gmeozt25h1.png?width=1920&format=png&auto=webp&s=eacd7a2251116cdbc2570f22d20cd4bae7d2bb04

Reply

[-]

ImportancePitiful795@reddit

Use vLLM [Intel Arc Pro B70 (32GB) for Local LLMs: llama.cpp (SYCL/Vulkan), vLLM (Intel LLM Scaler) Benchmarks](https://www.youtube.com/watch?v=MnGLqo5cuGQ)

Reply

[-]

In_der_Tat@reddit

So, is it comparably worth it, or do we have to stick with team Green?

Reply

[-]

ImportancePitiful795@reddit

"Team Grean" means 300%-400% more expensive GPU for the same VRAM.

Reply

[-]

In_der_Tat@reddit

Emphasis on *worth it*, i.e. also in terms of performance or efficiency per dollar.

Reply

[-]

JustFinishedBSG@reddit

That’s great and all but compute is not the limiting factor for what we do on this sub

Reply

[-]

In_der_Tat@reddit

I see, so drawing a comparison by something like t/s/$ under a given set of requirements could be a start.

Reply

[-]

ImportancePitiful795@reddit

u/JustFinishedBSG is absolutely right. If you plan to collect "metrics", you need to factor in the cost of VRAM per GB. What's the point of having an 8GB card with a gazillion TFLOPs and speed when it cannot run a model larger than 4B? And this is the case for the 5090 right now: $4000+ for 32GB when the others are $1000-$1300. You also need to look past TK/s after a certain point. Having 90tk/s doesn't mean it's a horrible product compared to 200tk/s. A human can go up to 30-35tk/s on reading speed, so even an agent can easily be extremely fast with 90tk/s. There is also concurrency. We see even on devices like DGX Spark that with concurrency (aka having an agent hooked up to it 2-3 times) the total TK/s keeps growing as more calls happen in parallel, rather than staying capped at the single-stream figure.
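To make the "cost of VRAM per GB" metric above concrete, here is a minimal Python sketch. It uses only the ballpark prices quoted in the comment, and it assumes the "$1000-$1300" cards are 32GB cards like the B70; none of these are verified market prices.

```python
# Rough cost-per-GB-of-VRAM comparison.
# Prices are the approximate figures quoted in the comment above,
# and the "alternative" entries are ASSUMED to be 32GB cards.

def cost_per_gb(price_usd: float, vram_gb: float) -> float:
    """Return the price paid per GB of VRAM."""
    return price_usd / vram_gb

cards = {
    "RTX 5090, 32GB at ~$4000": (4000, 32),
    "32GB alternative at ~$1000": (1000, 32),
    "32GB alternative at ~$1300": (1300, 32),
}

for name, (price, vram) in cards.items():
    print(f"{name}: ${cost_per_gb(price, vram):.2f} per GB")
```

Swap in real local prices before drawing conclusions; the point is only that the metric is a single division per card.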

Reply

[-]

ImportancePitiful795@reddit

TFLOPs mean total sht as a metric when it comes to LLMs. Everything is more "relative". Example: the 3090 doesn't support FP8 and is missing a lot of other goodies found on later gens, so trying to run FP8 models on it tanks performance and TFLOPs go out of the window. The R9700 is missing from the list, btw. Depending on the model, the 5090 can be 250%-300% faster on PP and generally around 50%-55% faster on TG than the R9700 (e.g. Qwen3.5-35B-A3B-UD-Q4\_K\_XL). However, it costs 3 to 4 times more these days, which means you can have 2-3 R9700s/B70s for the cost of a single 5090 (in the USA it's 4x B70s vs 1x 5090 price-wise). So what's "faster"? 128GB VRAM or 32GB VRAM? The answer depends on the size of the model. However, the 32GB card cannot run a 70B dense or 200B+ MoE model. The 128GB VRAM system can. Which is why, if you're looking at LLMs right now, the ONLY NVIDIA card that could be considered is the RTX6000 96GB. It ain't worth paying the exorbitant prices NVIDIA asks for the whole stack below it, as the alternatives are so much cheaper that they can run many times larger models for the same money. Ofc if you want to stick to 27B models, sure. Go for it. Buy a 5090.
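The "does it fit" argument above can be sketched as a back-of-the-envelope check in Python. Everything numeric here is an assumption for illustration: roughly 4.5 bits per weight for a 4-bit quant, a flat 4 GB allowance for KV cache and buffers, and no multi-GPU splitting overhead. It also ignores CPU/RAM offloading, which lets larger models run at reduced speed.

```python
# Back-of-the-envelope VRAM fit check.
# ASSUMPTIONS: ~4.5 bits/weight for a 4-bit quant, 4 GB fixed overhead
# for KV cache and buffers, no multi-GPU overhead, no CPU offloading.

def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float, overhead_gb: float = 4.0) -> bool:
    """True if weights plus the fixed overhead allowance fit in VRAM."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb

BITS = 4.5  # assumed average bits per weight

for params in (27, 70, 200):
    for vram in (32, 128):
        verdict = "fits" if fits(params, BITS, vram) else "does not fit"
        print(f"{params}B at ~{BITS} bpw in {vram} GB: {verdict} "
              f"(weights ~{weights_gb(params, BITS):.1f} GB)")
```

Under these assumptions a 27B model fits in 32 GB, while 70B and 200B models only fit in the 128 GB configuration.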

Reply

[-]

tat_tvam_asshole@reddit

just because a model doesn't fit in vram, doesn't mean it can't be run. lol for example, my legion 5i laptop with 8gb 4070m and 128gb ram can still run 70B models just fine for chatting. moreover, who is running 70B models in this day and age? qwen3.6-27b is more than enough to do meaningful work and can fit in a single 32gb card, let alone even faster qwen3.6-35B-A3

Reply

[-]

jacek2023@reddit (OP)

People on the Internet always ask "is it worth buying..." and I still don't understand what they expect

Reply

[-]

In_der_Tat@reddit

Metrics per dollar.

Reply

[-]

JockY@reddit

That B70 32GB was running Qwen3.6 35B A3B at 65 tokens/sec. My RTX 5000 PRO 48GB runs Qwen3.6-35B-A3B-FP8 at 260 tokens/sec in vLLM, 4x faster than OP's example with the B70 (although I don't know what quant they were running). Commensurately, the 5000 PRO is more than 4x the price of a B70. Worth it? Up to you!

Reply

[-]

ImportancePitiful795@reddit

B70 is all over the place depending on vLLM or llama.cpp, libraries, settings, even models. On vLLM it is much faster than on llama.cpp; with some settings/models you can see way more perf than the R9700, while on others it is less than half, also on vLLM. So if it is slow in one test, that doesn't mean it is that slow as a fact.

Reply

[-]

SnooDingos8194@reddit

Should I buy another 3090? Or another b70? Why? Already have both? Just adding more to the stable.

Reply

[-]

JockY@reddit

Terrible performance, holy shit. How is Intel's stack _still_ this bad?

Reply

[-]

CoolConfusion434@reddit

Another B70 post? Yay! 😄 This is Vulkan/Windows. Starts off strong, dives off a cliff at higher context sizes:

    .\llama-bench.exe `
      -m \Models\unsloth\Qwen3.6-35B-A3B-MTP-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf `
      -ngl 99 `
      -fa on `
      -b 2048 `
      -ub 512 `
      -p 512 `
      -n 128 `
      -d 4096,8192,32768,65536 `
      -r 5 `
      -o md

    load_backend: loaded RPC backend from \llama\llama-b9413-bin-win-vulkan-x64\ggml-rpc.dll
    ggml_vulkan: Found 1 Vulkan devices:
    ggml_vulkan: 0 = Intel(R) Arc(TM) Pro B70 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
    load_backend: loaded Vulkan backend from \llama\llama-b9413-bin-win-vulkan-x64\ggml-vulkan.dll
    load_backend: loaded CPU backend from \llama\llama-b9413-bin-win-vulkan-x64\ggml-cpu-alderlake.dll

| model                           |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------- | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | Vulkan     |  99 |   pp512 @ d4096 |      1766.38 ± 11.77 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | Vulkan     |  99 |   tg128 @ d4096 |         98.98 ± 0.09 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | Vulkan     |  99 |   pp512 @ d8192 |      1659.18 ± 11.02 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | Vulkan     |  99 |   tg128 @ d8192 |         95.15 ± 0.21 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | Vulkan     |  99 |  pp512 @ d32768 |        140.44 ± 0.40 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | Vulkan     |  99 |  tg128 @ d32768 |         78.27 ± 0.11 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | Vulkan     |  99 |  pp512 @ d65536 |         69.50 ± 0.25 |
| qwen35moe 35B.A3B Q4_K - Medium |  21.27 GiB |    35.51 B | Vulkan     |  99 |  tg128 @ d65536 |         47.41 ± 0.06 |

build: 6ed481eea (9413)
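To quantify the "dives off a cliff" observation, here is a small Python sketch that parses llama-bench markdown rows like the ones above and reports each result relative to the d4096 run. The parser is a hypothetical helper written for this table layout (test name in the second-to-last column, t/s in the last), not part of llama.cpp.

```python
import re

# Rows copied from the Vulkan/Windows llama-bench output above.
ROWS = """
| qwen35moe 35B.A3B Q4_K - Medium | 21.27 GiB | 35.51 B | Vulkan | 99 | pp512 @ d4096 | 1766.38 ± 11.77 |
| qwen35moe 35B.A3B Q4_K - Medium | 21.27 GiB | 35.51 B | Vulkan | 99 | tg128 @ d4096 | 98.98 ± 0.09 |
| qwen35moe 35B.A3B Q4_K - Medium | 21.27 GiB | 35.51 B | Vulkan | 99 | pp512 @ d8192 | 1659.18 ± 11.02 |
| qwen35moe 35B.A3B Q4_K - Medium | 21.27 GiB | 35.51 B | Vulkan | 99 | tg128 @ d8192 | 95.15 ± 0.21 |
| qwen35moe 35B.A3B Q4_K - Medium | 21.27 GiB | 35.51 B | Vulkan | 99 | pp512 @ d32768 | 140.44 ± 0.40 |
| qwen35moe 35B.A3B Q4_K - Medium | 21.27 GiB | 35.51 B | Vulkan | 99 | tg128 @ d32768 | 78.27 ± 0.11 |
| qwen35moe 35B.A3B Q4_K - Medium | 21.27 GiB | 35.51 B | Vulkan | 99 | pp512 @ d65536 | 69.50 ± 0.25 |
| qwen35moe 35B.A3B Q4_K - Medium | 21.27 GiB | 35.51 B | Vulkan | 99 | tg128 @ d65536 | 47.41 ± 0.06 |
"""

def parse_bench(md: str):
    """Return a list of (kind, depth, tokens_per_sec) from llama-bench rows."""
    results = []
    for line in md.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 2:
            continue
        # Matches test names such as "pp512 @ d4096" or "tg128 @ d65536".
        m = re.fullmatch(r"(pp|tg)(\d+) @ d(\d+)", cells[-2])
        if not m:
            continue  # header, separator, or a test without a depth
        mean_tps = float(cells[-1].split("±")[0])
        results.append((m.group(1), int(m.group(3)), mean_tps))
    return results

results = parse_bench(ROWS)
# Use the shallowest depth (d4096) as the baseline for each test kind.
base = {kind: tps for kind, depth, tps in results if depth == 4096}

for kind, depth, tps in results:
    print(f"{kind} @ d{depth}: {tps:8.2f} t/s ({tps / base[kind]:.0%} of d4096)")
```

On these numbers, prompt processing at d32768 comes out at under a tenth of the d4096 rate, while token generation degrades far more gradually.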

Reply

[-]

VanagearDevGuy@reddit

If anyone is curious, the B70 also works great with custom nodes in ComfyUI, although you might have to install dependencies for those custom nodes manually, like for SAM3 workflows and GVHMR-based 3D human motion capture.

Reply

[-]

jacek2023@reddit (OP)

do you have some benchmarks for wan or ltx?

Reply

[-]

VanagearDevGuy@reddit

I'm running a test to try and get some numbers for you, but out of everything, video seems to be slow. I think this is due to some of the memory overflowing to that "shared GPU memory pool" (on Windows btw), so I've been waiting like 5 minutes for the first it/s

Reply

[-]

jacek2023@reddit (OP)

I am able to produce the lowest resolution short video (like 100 frames) in about a minute on a 5070 and a 3090, is it slower then?

Reply

[-]

VanagearDevGuy@reddit

It is. But it is much newer. Might last longer than a used 3090 in the long run. I had a 3090 since release and sold it because I was afraid of it failing. Settled for one of these. Would suggest an AMD Pro R9700 for better drivers and software if you can afford it

Reply

[-]

RazzmatazzNo7613@reddit

I have a question: 96GB of tensor-parallel RTX 3090s versus a 96GB Mac Studio, what are the differences when running a model? I don’t understand why you would spend $4k on Nvidia. I have never used it, please someone explain

Reply

[-]

Practical-Collar3063@reddit

The prompt processing speed on Mac is quite slow compared to rtx 3090s, so if you have large prompts you will have to wait quite a bit before it starts answering compared to 3090s

Reply

[-]

StorageHungry8380@reddit

There are two aspects to running an LLM: processing the initial input prompt (prompt processing, pp), and generating the next token of the output (token generation, tg). The latter is entirely memory-bandwidth bound, and the Mac can do reasonably well with it. Not as good as a decent GPU, but enough. However, the former is mainly compute bound, and that's where the Mac falls short. If you have large prompts, for example asking questions about large articles, or agentic coding where the entire source code goes into the prompt, then the Mac will take ages while the GPU is likely much faster.

Consider the benchmark numbers posted in this thread [here](https://www.reddit.com/r/LocalLLaMA/comments/1tuik6o/comment/op9t7gq/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button). On Qwen3.6 27B the best-case prompt processing speed was 173 tokens/second. For comparison, with that model on my 5090 I get around 3400 tokens/second, so 20x faster. If you have a document with 50k tokens, that's almost five minutes to wait with the B70 until it is done processing the prompt and starts generating output, while on my 5090 it's like 15 seconds or so. However, once that's done, the output token generation is only 3x faster on my 5090 compared to the B70.

Going off [these](https://omlx.ai/benchmarks/tdf1ifnv) numbers, a Mac M5 Max will be better than the posted B70, but my 5090 is still 5.9x faster at processing prompts and 2x at token generation.
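The 50k-token wait times above follow from a single division. A minimal Python sketch, assuming the prompt-processing rate stays constant over the whole prompt (the benchmark tables in this thread suggest it actually drops at larger depths, so treat these as optimistic):

```python
# Time to finish prompt processing before the first output token appears.
# ASSUMPTION: pp speed is constant across the prompt, which real runs
# at large context depths do not quite achieve.

def prompt_wait_seconds(prompt_tokens: int, pp_tokens_per_sec: float) -> float:
    """Seconds spent processing the prompt at a fixed pp rate."""
    return prompt_tokens / pp_tokens_per_sec

PROMPT_TOKENS = 50_000
pp_rates = {
    "B70, 173 t/s (best case quoted above)": 173,
    "5090, ~3400 t/s (quoted above)": 3400,
}

for name, rate in pp_rates.items():
    secs = prompt_wait_seconds(PROMPT_TOKENS, rate)
    print(f"{name}: {secs:.0f} s ({secs / 60:.1f} min)")
```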

Reply

[-]

jacek2023@reddit (OP)

Mac is slow

Reply

[-]

suprjami@reddit

The price of a 3090 for one third the performance. (Intel jingle plays)

Reply

[-]

Practical-Collar3063@reddit

This is due to software optimisation. We are still very early in the "Intel GPU for LLM" saga, and the software will get better, similar to how ROCm is much better now than it was 2 years ago. A B70 could be a better future-proof buy: the software will improve, it consumes less power, has 8GB of additional VRAM, supports FP8, and it is much easier to set them up in a cluster. Additionally, performance seems much better inside vLLM than what is shown here. All things considered, this level of performance on such a new card from a "new" GPU manufacturer is not bad at all, and I think this card definitely has its place in the current GPU landscape, especially with more software optimisation.

Reply

[-]

jacek2023@reddit (OP)

I’ve been trying to buy a fourth 3090 for a long time, but prices are rising and availability is very low. At this point, I think buying four B70s would be easier than finding 3090s

Reply

[-]

suprjami@reddit

If you don't want tensor parallel[1] then your best option is a 3080 20G from China.

[1] which you don't because you have three cards and vllm needs a power-of-two number of cards to do TP

If you do want TP then you're already three cards in and have a system which can take 4 cards, which means an expensive motherboard and large power supply, so one more card is not a big percentage of your total spend. Just do it.

Reply

[-]

jacek2023@reddit (OP)

I use -sm tensor with my 3 cards

Reply

[-]

suprjami@reddit

A fellow cultured llama.cpp gentleman. You want a 3090 then. Switching to four B70s is a huge downgrade to me. If you really don't want to pay for a 3090 then four XTXs would be a better option imo. At least ROCm mostly works and is getting better at a decent pace.

Reply

[-]

jacek2023@reddit (OP)

Why do you think I am considering four B70s?

Reply

[-]

suprjami@reddit

Because you literally just said you were two comments ago:

> https://www.reddit.com/r/LocalLLaMA/comments/1tuik6o/comment/opa8zfn/
>
> *I think buying four B70s...*

Reply

[-]

jacek2023@reddit (OP)

You commented on my post, I shared someone’s benchmarks for people considering B70s. I replied to your comment saying that buying B70s might be easier than buying four 3090s. I already have three 3090s.

Reply

[-]

BlackBeardAI@reddit

I get 30 tps with the experts fully offloaded to the CPU on a gtx1070/64gb ddr4... I don't think Intel is accomplishing much here. https://github.com/blackbeardlabs/blackbeard-homelab/blob/main/benchmarks/node-01-gtx1070/llmfan46-qwen36-35b-a3b-heretic-q4km-mtp-llamacpp-30k-direct-prompt01.md Got 30tps with q5 recently (50k ctx), and I'll be posting that benchmark soon.

Reply

[-]

RazzmatazzNo7613@reddit

How ? On a gtx1070 ?

Reply

[-]

BlackBeardAI@reddit

Experts are on the CPU and System ram. Check the benchmarks. I uploaded the q5 bench too. https://github.com/blackbeardlabs/blackbeard-homelab/blob/main/benchmarks/node-01-gtx1070/llmfan46-qwen36-35b-a3b-heretic-q5km-mtp1-llamacpp-50k-cpu-moe-direct-prompt01.md Yeah, it is possible.

Reply

[-]

tecneeq@reddit

I laugh in Strix Halo! https://preview.redd.it/z7do2sy5uu4h1.png?width=502&format=png&auto=webp&s=5c3cf19a7ce26f0f10fc3ef1435a948f01708c01

Reply

[-]

szansky@reddit

63 t/s on Qwen 35B sounds surprisingly solid for Arc

Reply

[-]

Atomynos_Atom@reddit

|Component|Detail|
|:-|:-|
|GPU|Intel Arc Pro B70|
|Backend|SYCL (Level Zero)|
|Build|`354ebac8c` (9468)|

|model|size|params|backend|ngl|threads|type\_k|type\_v|fa|test|t/s|
|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
|qwen35moe 35B.A3B Q4\_K - Medium|20.81 GiB|34.66 B|SYCL|99|1|q8\_0|q8\_0|1|pp512|977.40 ± 2.02|
|qwen35moe 35B.A3B Q4\_K - Medium|20.81 GiB|34.66 B|SYCL|99|1|q8\_0|q8\_0|1|tg128|70.54 ± 0.12|

Reply

[-]

fallingdowndizzyvr@reddit

To put that in perspective, even accounting for the slightly smaller model, K_S instead of K_M, that's about what Strix Halo does.

    ggml_vulkan: 1 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

| model                          |       size |     params | backend    | ngl | fa | dev          | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------ | ---: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | Vulkan     |  99 |  1 | Vulkan1      |    0 |           pp512 |        999.84 ± 5.64 |
| qwen35moe 35B.A3B Q4_K - Small |  19.45 GiB |    34.66 B | Vulkan     |  99 |  1 | Vulkan1      |    0 |           tg128 |         60.98 ± 0.04 |

Reply

[-]

Formal-Exam-8767@reddit

B70 has theoretical memory bandwidth of 608.0 GB/s and this does not even reach 150.0 GB/s if my math is correct?
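One way to sanity-check that estimate is sketched below in Python. It is only a rough model with loud assumptions: it takes the SYCL tg128 figure and file size posted earlier in this thread (70.54 t/s, 20.81 GiB, 34.66 B params), assumes about 3B parameters are read per token (inferred from the "A3B" in the model name), and ignores KV cache reads and any other memory traffic.

```python
# Rough effective-bandwidth estimate for MoE token generation.
# ASSUMPTIONS: ~3B active parameters per token (from the "A3B" name),
# average bytes/param derived from the GGUF file size, KV cache and
# activation traffic ignored. Inputs are the SYCL numbers posted above.

GIB = 2**30

def effective_bandwidth_gbs(file_gib: float, total_params_b: float,
                            active_params_b: float, tg_tokens_per_sec: float) -> float:
    """Approximate GB/s of weight reads implied by a token-generation rate."""
    bytes_per_param = file_gib * GIB / (total_params_b * 1e9)
    active_bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return active_bytes_per_token * tg_tokens_per_sec / 1e9

THEORETICAL_GBS = 608.0  # figure quoted in the comment above

bw = effective_bandwidth_gbs(20.81, 34.66, 3.0, 70.54)
print(f"~{bw:.0f} GB/s, about {bw / THEORETICAL_GBS:.0%} of {THEORETICAL_GBS:.0f} GB/s")
```

Under these assumptions the result lands below 150 GB/s, which is consistent with the math in the comment.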

Reply

[-]

ImportancePitiful795@reddit

The perf is there, but the problem is that the software stack is totally pants and extremely unreliable with the smallest change. 🤷‍♂️ [https://youtu.be/MnGLqo5cuGQ](https://youtu.be/MnGLqo5cuGQ)

Reply

[-]

jacek2023@reddit (OP)

I have no idea, but I see SYCL pull requests in llama.cpp, so I assume the backend is still being improved. These benchmarks at least establish a baseline. The GPU works and it’s much more affordable than a 5090 (to run big models you need VRAM first, and speed is often less crucial)

Reply

[-]

Formal-Exam-8767@reddit

Yes, there appears to be lots of room for improvement.

Reply

[-]

SurpriseOk6927@reddit

ngl intel might be cooking with these arc pro cards. if the SYCL perf keeps going up we could finally have a real alternative for running llama locally. competition in GPU space is long overdue

Reply

[-]

wayofTzu@reddit

    ./llama-bench -m /models/Qwen3.6-27B-UD-Q5_K_XL.gguf,/models/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf -d 2000,20000,50000 -ctk q8_0 -ctv q8_0 -r 3 -mg 1 -n 512 -fa 1

    load_backend: loaded SYCL backend from /app/libggml-sycl.so
    load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so

| model                           |       size |     params | backend    | ngl | type_k | type_v |   main_gpu | fa |            test |                  t/s |
| ------------------------------- | ---------: | ---------: | ---------- | --: | -----: | -----: | ---------: | -: | --------------: | -------------------: |
| qwen35 27B Q5_K - Medium        |  18.94 GiB |    27.32 B | SYCL       |  99 |   q8_0 |   q8_0 |          1 |  1 |   pp512 @ d2000 |       132.76 ± 13.24 |
| qwen35 27B Q5_K - Medium        |  18.94 GiB |    27.32 B | SYCL       |  99 |   q8_0 |   q8_0 |          1 |  1 |   tg512 @ d2000 |         16.94 ± 0.27 |
| qwen35 27B Q5_K - Medium        |  18.94 GiB |    27.32 B | SYCL       |  99 |   q8_0 |   q8_0 |          1 |  1 |  pp512 @ d20000 |        173.67 ± 1.32 |
| qwen35 27B Q5_K - Medium        |  18.94 GiB |    27.32 B | SYCL       |  99 |   q8_0 |   q8_0 |          1 |  1 |  tg512 @ d20000 |         13.33 ± 0.01 |
| qwen35 27B Q5_K - Medium        |  18.94 GiB |    27.32 B | SYCL       |  99 |   q8_0 |   q8_0 |          1 |  1 |  pp512 @ d50000 |        110.62 ± 0.33 |
| qwen35 27B Q5_K - Medium        |  18.94 GiB |    27.32 B | SYCL       |  99 |   q8_0 |   q8_0 |          1 |  1 |  tg512 @ d50000 |          9.02 ± 0.02 |
| qwen35moe 35B.A3B Q5_K - Medium |  25.28 GiB |    35.51 B | SYCL       |  99 |   q8_0 |   q8_0 |          1 |  1 |   pp512 @ d2000 |       359.82 ± 38.79 |
| qwen35moe 35B.A3B Q5_K - Medium |  25.28 GiB |    35.51 B | SYCL       |  99 |   q8_0 |   q8_0 |          1 |  1 |   tg512 @ d2000 |         45.69 ± 1.50 |
| qwen35moe 35B.A3B Q5_K - Medium |  25.28 GiB |    35.51 B | SYCL       |  99 |   q8_0 |   q8_0 |          1 |  1 |  pp512 @ d20000 |       529.75 ± 10.71 |
| qwen35moe 35B.A3B Q5_K - Medium |  25.28 GiB |    35.51 B | SYCL       |  99 |   q8_0 |   q8_0 |          1 |  1 |  tg512 @ d20000 |         30.17 ± 0.11 |
| qwen35moe 35B.A3B Q5_K - Medium |  25.28 GiB |    35.51 B | SYCL       |  99 |   q8_0 |   q8_0 |          1 |  1 |  pp512 @ d50000 |        376.06 ± 2.43 |
| qwen35moe 35B.A3B Q5_K - Medium |  25.28 GiB |    35.51 B | SYCL       |  99 |   q8_0 |   q8_0 |          1 |  1 |  tg512 @ d50000 |         18.19 ± 0.04 |

build: 9777256c3 (9354)

Seems llama.cpp is still missing some important SYCL implementations. Not really an expert here, but I've seen [this suggested](https://github.com/ggml-org/llama.cpp/blob/8f7f3bf141b03779adc8b54616fa342607357e51/ggml/src/ggml-sycl/common.hpp#L105) as an example.

Reply


48 Comments