Nvidia RTX 3090 vs Intel Arc Pro B70 llama.cpp Benchmarks
Posted by tovidagaming@reddit | LocalLLaMA | 36 comments
Just sharing the results from experimenting with the B70 on my setup....
These results compare three llama.cpp execution paths on the same machine:
- RTX 3090 (Vulkan) on NixOS host, using main llama.cpp repo (compiled on 4/21/2026)
- Arc Pro B70 (Vulkan) on NixOS host, using main llama.cpp repo (compiled on 4/21/2026)
- Arc Pro B70 (SYCL) inside an Ubuntu 24.04 Docker container, using a separate SYCL-enabled llama-bench build from the aicss-genai/llama.cpp fork
Prompt processing (pp512)
| model | RTX 3090 (Vulkan) | Arc Pro B70 (Vulkan) | Arc Pro B70 (SYCL) | B70 best vs 3090 | B70 SYCL vs B70 Vulkan |
|---|---|---|---|---|---|
| TheBloke/Llama-2-7B-GGUF:Q4_K_M | 4550.27 ± 10.90 | 1236.65 ± 3.19 | 1178.54 ± 5.74 | -72.8% | -4.7% |
| unsloth/gemma-4-E2B-it-GGUF:Q4_K_XL | 9359.15 ± 168.11 | 2302.80 ± 5.26 | 3462.19 ± 36.07 | -63.0% | +50.3% |
| unsloth/gemma-4-26B-A4B-it-GGUF:Q4_K_M | 3902.28 ± 21.37 | 1126.28 ± 6.17 | 945.89 ± 17.53 | -71.1% | -16.0% |
| unsloth/gemma-4-31B-it-GGUF:Q4_K_XL | 991.47 ± 1.73 | 295.66 ± 0.60 | 268.50 ± 0.65 | -70.2% | -9.2% |
| ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF:Q8_0 | 4740.04 ± 13.78 | 1176.34 ± 1.68 | 1192.99 ± 5.75 | -74.8% | +1.4% |
| ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF:Q8_0 | oom | 990.32 ± 5.34 | 552.37 ± 5.76 | ∞ | -44.2% |
| Qwen/Qwen3-8B-GGUF:Q8_0 | 4195.89 ± 41.31 | 1048.39 ± 2.66 | 1098.90 ± 1.02 | -73.8% | +4.8% |
| unsloth/Qwen3.5-4B-GGUF:Q4_K_XL | 5233.55 ± 8.29 | 1430.72 ± 9.68 | 1767.21 ± 21.27 | -66.2% | +23.5% |
| unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M | 3357.03 ± 18.47 | 886.39 ± 6.14 | 445.56 ± 7.46 | -73.6% | -49.7% |
| unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M | 3417.76 ± 17.84 | 878.15 ± 5.32 | 442.01 ± 6.51 | -74.3% | -49.7% |
| Average (excluding oom) | | | | -71.1% | |
Token generation (tg128)
| model | RTX 3090 (Vulkan) | Arc Pro B70 (Vulkan) | Arc Pro B70 (SYCL) | B70 best vs 3090 | B70 SYCL vs B70 Vulkan |
|---|---|---|---|---|---|
| TheBloke/Llama-2-7B-GGUF:Q4_K_M | 137.92 ± 0.41 | 58.61 ± 0.09 | 92.39 ± 0.30 | -33.0% | +57.6% |
| unsloth/gemma-4-E2B-it-GGUF:Q4_K_XL | 207.21 ± 2.00 | 89.33 ± 0.60 | 70.65 ± 0.84 | -56.9% | -20.9% |
| unsloth/gemma-4-26B-A4B-it-GGUF:Q4_K_M | 131.33 ± 0.14 | 42.00 ± 0.01 | 37.75 ± 0.32 | -68.0% | -10.1% |
| unsloth/gemma-4-31B-it-GGUF:Q4_K_XL | 31.49 ± 0.05 | 14.49 ± 0.04 | 18.30 ± 0.05 | -41.9% | +26.3% |
| ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF:Q8_0 | 98.96 ± 0.56 | 21.30 ± 0.03 | 55.37 ± 0.02 | -44.1% | +160.0% |
| ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF:Q8_0 | oom | 37.69 ± 0.03 | 28.58 ± 0.09 | ∞ | -24.2% |
| Qwen/Qwen3-8B-GGUF:Q8_0 | 92.29 ± 0.17 | 19.78 ± 0.01 | 50.74 ± 0.02 | -45.0% | +156.5% |
| unsloth/Qwen3.5-4B-GGUF:Q4_K_XL | 162.58 ± 0.76 | 60.45 ± 0.06 | 79.09 ± 0.05 | -51.4% | +30.8% |
| unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M | 148.01 ± 0.38 | 43.30 ± 0.05 | 37.93 ± 0.89 | -70.7% | -12.4% |
| unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M | 148.64 ± 0.53 | 43.46 ± 0.02 | 36.87 ± 0.42 | -70.8% | -15.2% |
| Average (excluding oom) | | | | -53.5% | |
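The delta columns in both tables follow the usual percentage-change formula. As a sanity check, here is a small script (numbers taken from the Llama-2-7B row of the pp512 table) that reproduces them:

```python
def pct_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return 100.0 * (new - old) / old

# Llama-2-7B Q4_K_M, pp512 row (t/s)
rtx3090_vulkan = 4550.27
b70_vulkan = 1236.65
b70_sycl = 1178.54

# "B70 best" picks whichever B70 backend was faster for that model.
b70_best = max(b70_vulkan, b70_sycl)
print(f"B70 best vs 3090:       {pct_change(b70_best, rtx3090_vulkan):+.1f}%")
print(f"B70 SYCL vs B70 Vulkan: {pct_change(b70_sycl, b70_vulkan):+.1f}%")
```

This matches the -72.8% and -4.7% entries in the first row.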
Commands used
Host Vulkan runs
For each model, the host benchmark commands were:
llama-bench -hf <MODEL> -dev Vulkan0
llama-bench -hf <MODEL> -dev Vulkan2
Where:
- Vulkan0 = RTX 3090
- Vulkan2 = Arc Pro B70
Container SYCL runs
For each model, the SYCL benchmark was run inside the Docker container with:
./build/bin/llama-bench -hf <MODEL> -dev SYCL0
Where:
- SYCL0 = Arc Pro B70
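For reference, the per-model invocations above could be wrapped in a small loop. This is only a sketch: the model list is a subset of the models in this post, the device names are the ones enumerated on this machine, and `echo` is used so the commands are printed rather than executed:

```shell
#!/bin/sh
# Sweep both host Vulkan devices for each model.
# Vulkan0 = RTX 3090, Vulkan2 = Arc Pro B70 (as enumerated on this machine).
MODELS="TheBloke/Llama-2-7B-GGUF:Q4_K_M Qwen/Qwen3-8B-GGUF:Q8_0"

for model in $MODELS; do
  for dev in Vulkan0 Vulkan2; do
    echo llama-bench -hf "$model" -dev "$dev"
  done
done
```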
Test machine
- CPU: AMD Ryzen Threadripper 2970WX 24-Core Processor
- 24 cores / 48 threads
- 1 socket
- 2.2 GHz min / 3.0 GHz max
- RAM: 128 GiB total
- GPUs:
- NVIDIA GeForce RTX 3090, 24 GiB
- NVIDIA GeForce RTX 3090, 24 GiB
- Intel Arc Pro B70, 32 GiB
ziphnor@reddit
Thank you for posting some actual numbers that can be used for comparison. I just tried running a similar one for my 2x RTX 5060 TI 16gb (standard +3000MHz mem OC applied and tested with cuda_memtest).
On the ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0, I am not sure if it's "cheating" to add -fitt 512? But considering I bought the two 5060s almost new at approximately the same price as a used RTX 3090, which are pretty hard to find in my region (might still buy some), I am not too unhappy. I am, however, happy that I didn't go with a B70 Pro. I guess the software might mature, but a single one of those would have cost more.
Test machine:
- CPU: Intel Core 2 Ultra 235
- RAM: 64 GB (DDR5 6400)
- llama.cpp build: cff8b0dbda (8861), CUDA 13.1.1, Blackwell arch 12.0
Prompt Processing (pp512)
Token Generation (tg128)
Commands Used
Qwen3.6-35B-A3B Q4_K_M (20.60 GiB - fits in 32GB VRAM)
docker run --rm --gpus all --entrypoint /app/llama-bench ik-llama.cpp:latest \
  -hf unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M
Qwen3-Coder-30B-A3B-Instruct Q8_0 (30.25 GiB - requires fit-target to squeeze into 32GB VRAM)
docker run --rm --gpus all --entrypoint /app/llama-bench ik-llama.cpp:latest \
  -hf ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF:Q8_0 -fitt 512
tovidagaming@reddit (OP)
Yeah, that makes sense. Though at that point, I probably would have just gotten another used 3090 from eBay for about the same amount and hoped my luck would strike a third time (I have had no issue with the first two I bought used last year). The R9700 seems like a good, slightly more expensive option. Basically, the same memory bandwidth as the B70, but the ROCm support seems a bit more mature than Intel's.
ziphnor@reddit
There is no arguing the RTX 3090 is awesome, but the RTX 5060 TIs are easily available and easy to bargain for on the used market. Reasonably priced RTX 3090s are hard to find, though. I am in Denmark and got the dual 5060s for ~860€. I have only seen one 3090 available for that price here. The eBay ones are more like ~1000€ and higher with shipping and potential import taxes.
Then there is the higher power consumption (power in Denmark is expensive), and the two 5060s provide 8 GB more VRAM.
Serious_Rub_3674@reddit
Can you try running a sanity test using llama-server or cli and check the actual tokens being generated by the aics branch? I tried building their fork and while the benchmark numbers were great, the actual tokens were unusable. Just gibberish.
tovidagaming@reddit (OP)
Good catch. I hadn't gotten around to using it yet. I tested a few of the models with SYCL, focusing on the ones that were way faster using SYCL vs Vulkan.
TheBloke/Llama-2-7B-GGUF:Q4_K_M - is completely broken.
ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF:Q8_0 - sometimes works just fine, sometimes gets completely lost and goes in loops. It seems something related to the termination of the responses is failing. It can answer technical questions just fine most of the time, but a simple "Hi" breaks it :D!
The rest, including Qwen/Qwen3-8B-GGUF:Q8_0 seem to be working fine. All the reasoning models seem fine too.
TheBlueMatt@reddit
This implies there's some synchronization missing. That doesn't mean that other models are actually fine, only that they happen to be running fast/slow enough that the missing sync isn't breaking them. That also probably means that once the missing sync is added all the models will slow down, even the ones that happened to be working :(
tovidagaming@reddit (OP)
I see... Well, that's a bummer. Isn't synchronization for MoE models more complicated? I would expect at least one of the MoE models to visibly break too in that case. Or I guess it depends on exactly what synchronization is missing...
Queasy-Contract9753@reddit
Thanks for those detailed numbers! I don't see much information about Intel cards. Suppose this means buying an old a770 is a bad idea.
What's your RAM configuration btw? How many sticks do you have?
tovidagaming@reddit (OP)
I have 8 sticks of 16 GB each. I mixed two 64 GB kits because it was what I had. All 8 slots on the X399 DESIGNARE EX motherboard are now populated.
fallingdowndizzyvr@reddit
It depends on how much? For $200, sure why not. For $300, don't be crazy. I have 2 that I pretty much never use anymore.
Here are some A770 numbers for you.
https://www.reddit.com/r/LocalLLaMA/comments/1hf98oy/someone_posted_some_numbers_for_llm_on_the_intel/
PassengerPigeon343@reddit
3090 still the top value play, incredible
bennyb0y@reddit
What about 2 3090!
DefNattyBoii@reddit
With nvlink
tovidagaming@reddit (OP)
My understanding is that NVLink only helps during training/fine-tuning and not so much for inference. I have been keeping an eye out for one, but they are crazy expensive and hard to find. I think I may be able to borrow one from work :D
TheApadayo@reddit
It also helps with prompt processing when using tensor parallel which I think just landed in llama.cpp but is a bit buggy still. I just grabbed an NVLink bridge because Q8 30B models with 128k context hybrid attention comes in right under 48GB for me and prompt processing speed is major for agentic coding workflows.
tovidagaming@reddit (OP)
Oh, cool. I will have to test that if I can get my hands on an NVLink.
a_beautiful_rhind@reddit
You can simply use the P2P driver. Especially for PCIE4.
tovidagaming@reddit (OP)
We expect the 3090 to be at least 50% faster based on memory bandwidth: 936.2 GB/s vs 608.0 GB/s. That would be -33% slower for the B70 in my table. Ignoring Llama-2-7B, which seems to be broken, the closest the rest get is about -50%, so in practice the 3090 is at least twice as fast as the B70 for token generation. The fact that the 3090 is about 4 times faster for prompt processing is more concerning, especially for agentic work. But hopefully we will see backend improvements soon; it has only been a few weeks since release.
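The bandwidth arithmetic can be checked directly (spec-sheet numbers as quoted above; this is only a back-of-the-envelope expectation, since real throughput depends on more than bandwidth):

```python
bw_3090 = 936.2  # GB/s, RTX 3090 spec
bw_b70 = 608.0   # GB/s, Arc Pro B70 spec

# How much faster the 3090 "should" be, and the equivalent B70 deficit.
# The exact deficit is about -35%; the -33% figure quoted above comes
# from rounding the speedup down to "at least 50% faster" (1/1.5 - 1).
speedup = bw_3090 / bw_b70 - 1.0
deficit = bw_b70 / bw_3090 - 1.0
print(f"3090 vs B70: {speedup:+.1%}")
print(f"B70 vs 3090: {deficit:+.1%}")
```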
TheBlueMatt@reddit
On Q4/Q5 models, https://github.com/ggml-org/llama.cpp/pull/21751 should improve Vulkan tg by 4-10%. https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15311 should also materially improve pp (less on Q4 models; it's almost a doubling on BF16/F16 models, but it should improve Q4 models as well). There's just so much room to optimize these things it's crazy; it's so bad right now.
tovidagaming@reddit (OP)
I will check it out if I have energy for more llama.cpp rebuilds lol. I am tired, boss... I guess I knew what I was getting into by buying the B70 instead of another 3090 or the R9700 :D
TheBlueMatt@reddit
I mean give it time. We gotta get a handful of mesa optimizations landed plus probably more in llama.cpp.
LegacyRemaster@reddit
Nice... And RTX 3090 + B70 using Vulkan? That would be 24+24+32. I'm using a 6000 96 GB + W7800 48 GB + W7800 48 GB with profit (Vulkan).
tovidagaming@reddit (OP)
Things slow down significantly if I try to mix the 3090s with the B70 on Vulkan :(. I have a colleague who also recently bought an RTX Pro 6000, and we were joking that with my 2x 3090s, the B70, and even the A2000 I have lying around thrown in, I would still be 4 GB of VRAM short and 400 watts higher than a single Pro 6000. Cue the "Look what they have to do to mimic a fraction of our power" meme lol.
Polaris_debi5@reddit
First of all, thank you for the incredible work with the benchmarks and the time dedicated to them.
The numbers are very interesting; the 3090 is still a beast in terms of pure speed (especially in prompt processing and CUDA maturity). But what fascinates me about the B70 is the context of its 32GB of VRAM versus 24GB. The ability to run models that the 3090 simply can't seems to me to be the best point to consider. That said, the performance in SYCL vs. Vulkan is very uneven; in some cases, SYCL is much faster (+160% in a generation with Qwen2.5-Coder), and in others, it's slower. I understand that Intel is working on several fronts (vllm, NEO, PyTorch, etc.) to compete with its hardware. For now, it's something that depends on the context, but we understand that Vulkan remains "plug and play," although the OpenVino and SYCL backends continue to evolve.
If you have the time and inclination to run more tests, I'm curious about some models that would help provide an even more complete picture (just a friendly suggestion, no pressure):
unsloth/Mistral-Small-3.2-24B-Instruct-2506-UD-Q4_K_XL.gguf = The Mistral architecture itself seems interesting to me for comparing the two GPUs.
unsloth/GLM-4.7-Flash-Q4_K_M.gguf = A key reasoning model. After seeing improvements of up to +160% in SYCL with other models, I'm intrigued to see how Intel handles this architecture compared to CUDA/Vulkan.
unsloth/gpt-oss-20b-Q6_K.gguf = A very efficient MoE that's been around for a while.
unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf = This is the dense model and a midpoint between Qwen3-3.5 and the 35B MoEs you tested. Since SYCL seems to win in the dense models but loses in the MoE, it's very intriguing.
unsloth/Llama-3.1-8B-Instruct-UD-Q8_K_XL.gguf = After the Llama-2 results, I want to see if the 2026 optimizations in SYCL/Vulkan have closed the gap for this architecture.
Anyway, thank you very much for this incredible info :D
tovidagaming@reddit (OP)
Let me see if I can run these soon... Note that Llama 2 on that SYCL build is broken, as u/Serious_Rub_3674 pointed out. Qwen2.5-Coder-7B is a bit dazed and confused, too.
RemarkableGuidance44@reddit
Intel's drivers are still new, and they are updating them weekly. I got 4x B70s; they are great for larger models, a bit slower of course, but the software is still new. Intel is also now going after AI datacenters, so expect better performance down the track.
I have the best of both worlds, dual 5090's and 4 x b70's :D 5090's eat so much power while the b70s just munch bit by bit and keep cool. :)
tovidagaming@reddit (OP)
How are your speeds for 1 vs 2/3/4 B70 GPUs for the same model? I only have one B70 currently, so I can't test it, but on Vulkan, things slow down a lot if I try to mix the B70 with the 3090s.
TheBlueMatt@reddit
Tensor parallelism in llama.cpp is still brand new, and vulkan hasn't landed the backend implementations we need for it to be efficient. For more than 2 GPUs, it probably also makes sense to eventually do PCIe-P2P, which would probably require https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40798 as well. There's just a lot to do to optimize these things...
fallingdowndizzyvr@reddit
No. They are not new. They've been working on them for a couple of years, and for a solid year on Battlemage specifically. The B70 is not that different from the B580; it just has more of what the B580 has. I'm still waiting for the A770 to meet what its paper specs promise.
RemarkableGuidance44@reddit
Intel has stated that they are working on the drivers more than ever. For a $1000 card they are amazing: my 2x 5090s were $4000 each, and I got 4 B70s that can run large models 24/7 for $4000.
TheBlueMatt@reddit
I don't believe LLMs are a priority for Mesa (the open-source drivers the Vulkan backend uses on Linux). They've mostly focused on gaming use cases, and a lot of the work has historically been done by Valve. There's a lot of low-hanging fruit if you are willing to really dive in, e.g. https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15311
AbsoluteHedonn@reddit
NixOS mentioned
tovidagaming@reddit (OP)
Yea, I use NixOS btw
MadGenderScientist@reddit
likewise, upvoted just for NixOS.
jikilan_@reddit
The power of CUDA! cheap card for what? But still thanks for sharing with us the result 🙏
I1lII1l@reddit
Thanks so much for this comparison!