Nvidia RTX 3090 vs Intel Arc Pro B70 llama.cpp Benchmarks
Posted by tovidagaming@reddit | LocalLLaMA | 36 comments
Just sharing the results from experimenting with the B70 on my setup....
These results compare three llama.cpp execution paths on the same machine:
- RTX 3090 (Vulkan) on NixOS host, using main llama.cpp repo (compiled on 4/21/2026)
- Arc Pro B70 (Vulkan) on NixOS host, using main llama.cpp repo (compiled on 4/21/2026)
- Arc Pro B70 (SYCL) inside an Ubuntu 24.04 Docker container, using a separate SYCL-enabled llama-bench build from the aicss-genai/llama.cpp fork
Prompt processing (pp512)
| model | RTX 3090 (Vulkan) | Arc Pro B70 (Vulkan) | Arc Pro B70 (SYCL) | B70 best vs 3090 | B70 SYCL vs B70 Vulkan |
|---|---|---|---|---|---|
| TheBloke/Llama-2-7B-GGUF:Q4_K_M | 4550.27 ± 10.90 | 1236.65 ± 3.19 | 1178.54 ± 5.74 | -72.8% | -4.7% |
| unsloth/gemma-4-E2B-it-GGUF:Q4_K_XL | 9359.15 ± 168.11 | 2302.80 ± 5.26 | 3462.19 ± 36.07 | -63.0% | +50.3% |
| unsloth/gemma-4-26B-A4B-it-GGUF:Q4_K_M | 3902.28 ± 21.37 | 1126.28 ± 6.17 | 945.89 ± 17.53 | -71.1% | -16.0% |
| unsloth/gemma-4-31B-it-GGUF:Q4_K_XL | 991.47 ± 1.73 | 295.66 ± 0.60 | 268.50 ± 0.65 | -70.2% | -9.2% |
| ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF:Q8_0 | 4740.04 ± 13.78 | 1176.34 ± 1.68 | 1192.99 ± 5.75 | -74.8% | +1.4% |
| ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF:Q8_0 | oom | 990.32 ± 5.34 | 552.37 ± 5.76 | ∞ | -44.2% |
| Qwen/Qwen3-8B-GGUF:Q8_0 | 4195.89 ± 41.31 | 1048.39 ± 2.66 | 1098.90 ± 1.02 | -73.8% | +4.8% |
| unsloth/Qwen3.5-4B-GGUF:Q4_K_XL | 5233.55 ± 8.29 | 1430.72 ± 9.68 | 1767.21 ± 21.27 | -66.2% | +23.5% |
| unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M | 3357.03 ± 18.47 | 886.39 ± 6.14 | 445.56 ± 7.46 | -73.6% | -49.7% |
| unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M | 3417.76 ± 17.84 | 878.15 ± 5.32 | 442.01 ± 6.51 | -74.3% | -49.7% |
| Average (excluding oom) | | | | -71.1% | |
Token generation (tg128)
| model | RTX 3090 (Vulkan) | Arc Pro B70 (Vulkan) | Arc Pro B70 (SYCL) | B70 best vs 3090 | B70 SYCL vs B70 Vulkan |
|---|---|---|---|---|---|
| TheBloke/Llama-2-7B-GGUF:Q4_K_M | 137.92 ± 0.41 | 58.61 ± 0.09 | 92.39 ± 0.30 | -33.0% | +57.6% |
| unsloth/gemma-4-E2B-it-GGUF:Q4_K_XL | 207.21 ± 2.00 | 89.33 ± 0.60 | 70.65 ± 0.84 | -56.9% | -20.9% |
| unsloth/gemma-4-26B-A4B-it-GGUF:Q4_K_M | 131.33 ± 0.14 | 42.00 ± 0.01 | 37.75 ± 0.32 | -68.0% | -10.1% |
| unsloth/gemma-4-31B-it-GGUF:Q4_K_XL | 31.49 ± 0.05 | 14.49 ± 0.04 | 18.30 ± 0.05 | -41.9% | +26.3% |
| ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF:Q8_0 | 98.96 ± 0.56 | 21.30 ± 0.03 | 55.37 ± 0.02 | -44.1% | +160.0% |
| ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF:Q8_0 | oom | 37.69 ± 0.03 | 28.58 ± 0.09 | ∞ | -24.2% |
| Qwen/Qwen3-8B-GGUF:Q8_0 | 92.29 ± 0.17 | 19.78 ± 0.01 | 50.74 ± 0.02 | -45.0% | +156.5% |
| unsloth/Qwen3.5-4B-GGUF:Q4_K_XL | 162.58 ± 0.76 | 60.45 ± 0.06 | 79.09 ± 0.05 | -51.4% | +30.8% |
| unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M | 148.01 ± 0.38 | 43.30 ± 0.05 | 37.93 ± 0.89 | -70.7% | -12.4% |
| unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M | 148.64 ± 0.53 | 43.46 ± 0.02 | 36.87 ± 0.42 | -70.8% | -15.2% |
| Average (excluding oom) | | | | -53.5% | |
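The delta columns in both tables follow the usual percentage-change formula. As a sanity check, here is a small script (numbers taken from the Llama-2-7B row of the pp512 table) that reproduces them:

```python
def pct_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return 100.0 * (new - old) / old

# Llama-2-7B Q4_K_M, pp512 row (t/s)
rtx3090_vulkan = 4550.27
b70_vulkan = 1236.65
b70_sycl = 1178.54

# "B70 best" picks whichever B70 backend was faster for that model.
b70_best = max(b70_vulkan, b70_sycl)
print(f"B70 best vs 3090:       {pct_change(b70_best, rtx3090_vulkan):+.1f}%")
print(f"B70 SYCL vs B70 Vulkan: {pct_change(b70_sycl, b70_vulkan):+.1f}%")
```

This matches the -72.8% and -4.7% entries in the first row.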
Commands used
Host Vulkan runs
For each model, the host benchmark commands were:
llama-bench -hf <MODEL> -dev Vulkan0
llama-bench -hf <MODEL> -dev Vulkan2
Where:
- Vulkan0 = RTX 3090
- Vulkan2 = Arc Pro B70
Container SYCL runs
For each model, the SYCL benchmark was run inside the Docker container with:
./build/bin/llama-bench -hf <MODEL> -dev SYCL0
Where:
- SYCL0 = Arc Pro B70
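For reference, the per-model invocations above could be wrapped in a small loop. This is only a sketch: the model list is a subset of the models in this post, the device names are the ones enumerated on this machine, and `echo` is used so the commands are printed rather than executed:

```shell
#!/bin/sh
# Sweep both host Vulkan devices for each model.
# Vulkan0 = RTX 3090, Vulkan2 = Arc Pro B70 (as enumerated on this machine).
MODELS="TheBloke/Llama-2-7B-GGUF:Q4_K_M Qwen/Qwen3-8B-GGUF:Q8_0"

for model in $MODELS; do
  for dev in Vulkan0 Vulkan2; do
    echo llama-bench -hf "$model" -dev "$dev"
  done
done
```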
Test machine
- CPU: AMD Ryzen Threadripper 2970WX 24-Core Processor
- 24 cores / 48 threads
- 1 socket
- 2.2 GHz min / 3.0 GHz max
- RAM: 128 GiB total
- GPUs:
- NVIDIA GeForce RTX 3090, 24 GiB
- NVIDIA GeForce RTX 3090, 24 GiB
- Intel Arc Pro B70, 32 GiB
ziphnor@reddit
Thank you for posting some actual numbers that can be used for comparison. I just tried running a similar one for my 2x RTX 5060 TI 16gb (standard +3000MHz mem OC applied and tested with cuda_memtest).
On the ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0, I am not sure if it's "cheating" to add -fitt 512? But considering I bought the two 5060s almost new at approximately the same price as a used RTX 3090, which are pretty hard to find in my region (might still buy some), I am not too unhappy. I am, however, happy that I didn't go with a B70 Pro. I guess the software might mature, but a single one of those would have cost more.
Test machine:
- CPU: Intel Core 2 Ultra 235
- RAM: 64 GB (DDR5 6400)
- llama.cpp build: cff8b0dbda (8861), CUDA 13.1.1, Blackwell arch 12.0
Prompt Processing (pp512)
Token Generation (tg128)
Commands Used
Qwen3.6-35B-A3B Q4_K_M (20.60 GiB - fits in 32GB VRAM)
docker run --rm --gpus all --entrypoint /app/llama-bench ik-llama.cpp:latest \
  -hf unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M
Qwen3-Coder-30B-A3B-Instruct Q8_0 (30.25 GiB - requires fit-target to squeeze into 32GB VRAM)
docker run --rm --gpus all --entrypoint /app/llama-bench ik-llama.cpp:latest \
  -hf ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF:Q8_0 -fitt 512
tovidagaming@reddit (OP)
Yeah, that makes sense. Though at that point, I probably would have just gotten another used 3090 from eBay for about the same amount and hoped my luck would strike a third time (I have had no issue with the first two I bought used last year). The R9700 seems like a good, slightly more expensive option. Basically, the same memory bandwidth as the B70, but the ROCm support seems a bit more mature than Intel's.
ziphnor@reddit
There is no arguing the RTX 3090 is awesome, but the RTX 5060 TIs are easily available and easy to bargain for on the used market. Reasonably priced RTX 3090s are hard to find, though. I am in Denmark and got the dual 5060s for ~860€. I have only seen one 3090 available for that price here. The eBay ones are more like ~1000€ and higher with shipping and potential import taxes.
Then there is the higher power consumption (power in Denmark is expensive), and the two 5060s provide 8 GB more VRAM.
Serious_Rub_3674@reddit
Can you try running a sanity test using llama-server or cli and check the actual tokens being generated by the aics branch? I tried building their fork and while the benchmark numbers were great, the actual tokens were unusable. Just gibberish.
tovidagaming@reddit (OP)
Good catch. I hadn't gotten around to using it yet. I tested a few of the models with SYCL, focusing on the ones that were way faster using SYCL vs Vulkan.
TheBloke/Llama-2-7B-GGUF:Q4_K_M - is completely broken.
ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF:Q8_0 - sometimes works just fine, sometimes gets completely lost and goes in loops. It seems something related to the termination of the responses is failing. It can answer technical questions just fine most of the time, but a simple "Hi" breaks it :D!
The rest, including Qwen/Qwen3-8B-GGUF:Q8_0 seem to be working fine. All the reasoning models seem fine too.
TheBlueMatt@reddit
This implies there's some synchronization missing. That doesn't mean that other models are actually fine, only that they happen to be running fast/slow enough that the missing sync isn't breaking them. That also probably means that once the missing sync is added all the models will slow down, even the ones that happened to be working :(
tovidagaming@reddit (OP)
I see... Well, that's a bummer. Isn't synchronization for MoE models more complicated? I would expect at least one of the MoE models to visibly break too in that case. Or I guess it depends on exactly what synchronization is missing...
Queasy-Contract9753@reddit
Thanks for those detailed numbers! I don't see much information about Intel cards. Suppose this means buying an old a770 is a bad idea.
What's your RAM configuration btw? How many sticks do you have?
tovidagaming@reddit (OP)
I have 8 sticks of 16 GB each. I mixed two 64 GB kits because it was what I had. All 8 slots on the X399 DESIGNARE EX motherboard are now populated.
fallingdowndizzyvr@reddit
It depends on how much? For $200, sure why not. For $300, don't be crazy. I have 2 that I pretty much never use anymore.
Here are some A770 numbers for you.
https://www.reddit.com/r/LocalLLaMA/comments/1hf98oy/someone_posted_some_numbers_for_llm_on_the_intel/
PassengerPigeon343@reddit
3090 still the top value play, incredible
bennyb0y@reddit
What about 2 3090!
DefNattyBoii@reddit
With nvlink
tovidagaming@reddit (OP)
My understanding is that NVLink only helps during training/fine-tuning and not so much for inference. I have been keeping an eye out for one, but they are crazy expensive and hard to find. I think I may be able to borrow one from work :D
TheApadayo@reddit
It also helps with prompt processing when using tensor parallel which I think just landed in llama.cpp but is a bit buggy still. I just grabbed an NVLink bridge because Q8 30B models with 128k context hybrid attention comes in right under 48GB for me and prompt processing speed is major for agentic coding workflows.
tovidagaming@reddit (OP)
Oh, cool. I will have to test that if I can get my hands on an NVLink.
a_beautiful_rhind@reddit
You can simply use the P2P driver. Especially for PCIE4.
tovidagaming@reddit (OP)
We expect the 3090 to be at least 50% faster based on memory bandwidth: 936.2 GB/s vs 608.0 GB/s. That would be -33% slower for the B70 in my table. Ignoring Llama-2-7B, which seems to be broken, the closest the rest get is about -50%, so in practice the 3090 is at least twice as fast as the B70 for token generation. The fact that the 3090 is about 4 times faster for prompt processing is more concerning, especially for agentic work. But hopefully we will see backend improvements soon; it has only been a few weeks since release.
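The bandwidth arithmetic can be checked directly (spec-sheet numbers as quoted above; this is only a back-of-the-envelope expectation, since real throughput depends on more than bandwidth):

```python
bw_3090 = 936.2  # GB/s, RTX 3090 spec
bw_b70 = 608.0   # GB/s, Arc Pro B70 spec

# How much faster the 3090 "should" be, and the equivalent B70 deficit.
# The exact deficit is about -35%; the -33% figure quoted above comes
# from rounding the speedup down to "at least 50% faster" (1/1.5 - 1).
speedup = bw_3090 / bw_b70 - 1.0
deficit = bw_b70 / bw_3090 - 1.0
print(f"3090 vs B70: {speedup:+.1%}")
print(f"B70 vs 3090: {deficit:+.1%}")
```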
TheBlueMatt@reddit
On Q4/Q5 models, https://github.com/ggml-org/llama.cpp/pull/21751 should improve Vulkan tg by 4-10%. https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15311 should also materially improve pp (less on Q4 models; it's almost a doubling on BF16/F16 models, but it should improve Q4 models as well). There's just so much room to optimize these things it's crazy; it's so bad right now.
tovidagaming@reddit (OP)
I will check it out if I have energy for more llama.cpp rebuilds lol. I am tired, boss... I guess I knew what I was getting into by buying the B70 instead of another 3090 or the R9700 :D
TheBlueMatt@reddit
I mean give it time. We gotta get a handful of mesa optimizations landed plus probably more in llama.cpp.
LegacyRemaster@reddit
Nice... And RTX 3090 + B70 using Vulkan? That would be 24+24+32. I'm using a 6000 96 GB + W7800 48 GB + W7800 48 GB with profit (Vulkan).
tovidagaming@reddit (OP)
Things slow down significantly if I try to mix the 3090s with the B70 on Vulkan :(. I have a colleague who also recently bought an RTX Pro 6000, and we were joking that with my 2x 3090s, the B70, and even the A2000 I have lying around thrown in, I would still be 4 GB of VRAM short and 400 watts higher than a single Pro 6000. Cue the "Look what they have to do to mimic a fraction of our power" meme lol.
Polaris_debi5@reddit
First of all, thank you for the incredible work with the benchmarks and the time dedicated to them.
The numbers are very interesting; the 3090 is still a beast in terms of pure speed (especially in prompt processing and CUDA maturity). But what fascinates me about the B70 is the context of its 32GB of VRAM versus 24GB. The ability to run models that the 3090 simply can't seems to me to be the best point to consider. That said, the performance in SYCL vs. Vulkan is very uneven; in some cases, SYCL is much faster (+160% in a generation with Qwen2.5-Coder), and in others, it's slower. I understand that Intel is working on several fronts (vllm, NEO, PyTorch, etc.) to compete with its hardware. For now, it's something that depends on the context, but we understand that Vulkan remains "plug and play," although the OpenVino and SYCL backends continue to evolve.
If you have the time and inclination to run more tests, I'm curious about some models that would help provide an even more complete picture (just a friendly suggestion, no pressure):
unsloth/Mistral-Small-3.2-24B-Instruct-2506-UD-Q4_K_XL.gguf = The Mistral architecture itself seems interesting to me for comparing the two GPUs.
unsloth/GLM-4.7-Flash-Q4_K_M.gguf = A key reasoning model. After seeing improvements of up to +160% in SYCL with other models, I'm intrigued to see how Intel handles this architecture compared to CUDA/Vulkan.
unsloth/gpt-oss-20b-Q6_K.gguf = A very efficient MoE that's been around for a while.
unsloth/Qwen3.6-27B-UD-Q4_K_XL.gguf = This is the dense model and a midpoint between Qwen3-3.5 and the 35B MoEs you tested. Since SYCL seems to win in the dense models but loses in the MoE, it's very intriguing.
unsloth/Llama-3.1-8B-Instruct-UD-Q8_K_XL.gguf = After the Llama-2 results, I want to see if the 2026 optimizations in SYCL/Vulkan have closed the gap for this architecture.
Anyway, thank you very much for this incredible info :D
tovidagaming@reddit (OP)
Let me see if I can run these soon... Note that Llama 2 on that SYCL build is broken, as u/Serious_Rub_3674 pointed out. Qwen2.5-Coder-7B is a bit dazed and confused, too.
RemarkableGuidance44@reddit
Intel's drivers are still new, and they are updating them weekly. I got 4x B70s; they are great for larger models, a bit slower of course, but the software is still new. Intel is also now going after AI datacenters, so expect better performance down the track.
I have the best of both worlds, dual 5090's and 4 x b70's :D 5090's eat so much power while the b70s just munch bit by bit and keep cool. :)
tovidagaming@reddit (OP)
How are your speeds for 1 vs 2/3/4 B70 GPUs for the same model? I only have one B70 currently, so I can't test it, but on Vulkan, things slow down a lot if I try to mix the B70 with the 3090s.
TheBlueMatt@reddit
Tensor parallelism in llama.cpp is still brand new, and vulkan hasn't landed the backend implementations we need for it to be efficient. For more than 2 GPUs, it probably also makes sense to eventually do PCIe-P2P, which would probably require https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40798 as well. There's just a lot to do to optimize these things...
fallingdowndizzyvr@reddit
No. They are not new. They've been working on them for a couple of years, and for a solid year on Battlemage specifically. The B70 is not that different from the B580; it just has more of what the B580 has. I'm still waiting for the A770 to meet what its paper specs promise.
RemarkableGuidance44@reddit
Intel has stated that they are working on the drivers more than ever. For a $1000 card they are amazing: my 2x 5090s were $4000 each, and I got 4 B70s that can run large models 24/7 for $4000.
TheBlueMatt@reddit
I don't believe LLMs are a priority for Mesa (the open-source drivers the Vulkan backend uses on Linux). They've mostly focused on gaming use cases, and a lot of the work has historically been done by Valve. There's a lot of low-hanging fruit if you are willing to really dive in, e.g. https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15311
AbsoluteHedonn@reddit
NixOS mentioned
tovidagaming@reddit (OP)
Yea, I use NixOS btw
MadGenderScientist@reddit
likewise, upvoted just for NixOS.
jikilan_@reddit
The power of CUDA! cheap card for what? But still thanks for sharing with us the result 🙏
I1lII1l@reddit
Thanks so much for this comparison!