My experience with the Intel Arc Pro B70 for local LLMs: Fast, but a complete mess (for now)
Posted by Icy_Gur6890@reddit | LocalLLaMA | View on Reddit | 54 comments
Full disclaimer: using AI to help clean up my mess of thoughts. I have a tendency to lose coherence once I get too many words out.
TL;DR: Bought a B70 on launch day. Achieved an impressive 235 t/s with Gemma 3 27B on vLLM (100 concurrent requests), but the software stack is a nightmare. MoE is barely supported, quantizing new architectures is incredibly fragile, and you will fight the environment every step of the way. Definitely not for the faint of heart.
Hey everyone,
I ordered the Intel Arc Pro B70 on the 27th right when it released. I’ve previously wrestled with ROCm on my 7840HS, so my thought process was, "How much worse could it really be?" Turns out, it can be a complete mess.
To be totally fair, I have to admit that a good chunk of my pain is entirely self-inflicted. I used this hardware upgrade as an excuse to completely overhaul my environment:
OS: Moved from Ubuntu 25.10 (with a GUI) to Fedora 43 Server.
Engine: Transitioned from Ollama -> llama.cpp -> vLLM. (Intel is heavily supporting vLLM, and I’m optimizing for request density, so this seemed like a no-brainer).
Deployment: Moved everything over to containers and IaC.
I figured going the container/IaC route would make things more stable and repeatable. I’ve even been cheating my way through some of it by utilizing Claude Code to help build out my containers. But at every turn, running new models has been a massive headache.
The Good
When it actually works, the throughput is fantastic. I was able to run a Gemma 3 27B Intel AutoRound quant. Running a vLLM benchmark, I managed to generate 235 t/s across 100 requests. For a local deployment prioritizing request density, those numbers are exactly what I was hoping for.
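For anyone wanting to reproduce a run like this, vLLM ships a serving benchmark. A rough sketch against an already-running OpenAI-compatible endpoint; the model repo name is illustrative (not my exact quant), and flag names can differ between vLLM versions and Intel's XPU builds, so check `vllm bench serve --help` on yours:

```shell
# Sketch only: benchmark an already-running vLLM server with 100 prompts.
# Model name is a placeholder for an Intel AutoRound int4 quant of Gemma 3 27B.
vllm bench serve \
  --host localhost --port 8000 \
  --model Intel/gemma-3-27b-it-int4-AutoRound \
  --dataset-name random \
  --random-input-len 256 --random-output-len 256 \
  --num-prompts 100
```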
The Bad & The Gotchas
The ecosystem just isn't ready for a frictionless experience yet:
MoE Support: Mixture of Experts models are still only partially supported and incredibly finicky.
Quantization Nightmares: I'm currently trying to run a quant through AutoRound for Gemma 4 26B. I’ve watched it blow up at least 30 times. The new architecture and dynamic attention heads just do not play nicely with the current tooling.
Container Friction: I've run into at least 7 distinct "gotchas" just trying to get the Intel drivers and vLLM to play nicely inside containerized environments.
I haven't even tried spinning up llama.cpp on this card yet, but based on the vLLM experience, I'm bracing myself.
Final Thoughts
My background is as a Cloud Engineer. I’ve spent a lot of time hosting SaaS apps across Windows and Linux environments, so while I'm not a pure developer, I am very comfortable with dev-adjacent workflows and troubleshooting infrastructure. Even with that background, getting this B70 to do what I want has been an uphill battle.
If you are looking for a plug-and-play experience, stay far away. But if you have the patience to fight the stack, the raw performance metrics are definitely there hiding under the bugs.
Momsbestboy@reddit
I don't understand the complaints. It is a new card, and support will improve over the next weeks, with more people who bought one giving feedback or improving driver support.
If you don't like it, buy a used 3090, which might have cooked for years in a mining rig and is sold on eBay because chances are high it will die within the next months.
And if you do speed comparisons when deciding, don't use small models which fit into a 3090 or 5060; use one which requires 32 GB. Then check how fast the hyped green cards are after offloading larger parts to RAM.
This thing is new. Either risk it or buy used, overpriced green cards. Your choice.
Icy_Gur6890@reddit (OP)
Didn't mean to come across as complaining. I'm fine with having a tough time, and I agree that support for the card will mature over the next couple of weeks. I meant to share that it surprised me how many workarounds I had to implement to get models running. I've been running models on AMD Ryzen APUs like the 7840HS and had to do some workarounds there, but the number of workarounds to get a supported path working here surprised me.
I've found that the 32GB of VRAM isn't really that substantial yet, as the majority of users out there are still primarily targeting 3090s. However, I'm sure over the next 2 years that 32GB will become more and more critical.
giant3@reddit
Did you try Vulkan or SYCL backend on llama.cpp?
Always compile locally. Don't use the pre-compiled versions as library version mismatch might introduce bugs.
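For reference, local builds for both backends look roughly like this. Flag names follow llama.cpp's own SYCL/Vulkan build docs; the oneAPI path will vary by install, so treat this as a sketch:

```shell
# Sketch: local llama.cpp builds for Intel GPUs.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# SYCL backend (requires the Intel oneAPI Base Toolkit; adjust the path)
source /opt/intel/oneapi/setvars.sh
cmake -B build-sycl -DGGML_SYCL=ON \
      -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build-sycl --config Release -j

# Vulkan backend (requires the Vulkan SDK / headers)
cmake -B build-vulkan -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release -j
```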
Icy_Gur6890@reddit (OP)
Yeah, haven't gotten a chance to try it yet, but that is next on my list. Still trying to get Gemma 4 and Qwen3.5 working on vLLM, but I might surrender that for the week and get to running llama.cpp.
giant3@reddit
It takes about 5 minutes to compile once you install the dependencies.
Icy_Gur6890@reddit (OP)
I just ran a Vulkan test with llama.cpp on Gemma 4 31B Q4. It was pretty bad:
PP512: 219 t/s, TG128: 9.27 t/s
Seems like no KV caching, though. Will try a couple of other slightly older models to see if it's a Gemma 4 support thing, and then hop to SYCL testing.
giant3@reddit
Try these options.
-t 8 -fa on -ctk q8_0 -ctv q8_0 --gpu-layers 99
Icy_Gur6890@reddit (OP)
This didn't end up well on Vulkan; I ended up with 0.45 t/s. SYCL seems to have handled it well. Looking to have my benchmark file out in the next hour. Tested 3 models, all dense: Mistral Small 3.2 24B Q8, Qwen3.5 27B Q4, and Gemma 4 31B Q4. Let's just say Vulkan performed terribly; SYCL performed okay.
giant3@reddit
Did it print at the start?
ggml_vulkan: Found 1 Vulkan devices:
Icy_Gur6890@reddit (OP)
I posted a comment in this post with all my performance findings. When I reran it, it did work, with slight performance drops.
Icy_Gur6890@reddit (OP)
I gotta double check; I reran with only -t 8 and it worked fine. I agree something went terribly wrong.
I'll go back and attempt to rerun. I'm running a couple of other tests right now, and as soon as they are done I'll retest.
higglesworth@reddit
I'm currently in B70 Proxmox hell lol. After spending basically the whole day yesterday trying to get it to work with the help of Claude, I had to start over today, and now have the GPU passthrough working into my LXC container. Now just to get an engine running... vLLM has been a massive pain in the ass.
Icy_Gur6890@reddit (OP)
Pretty much the same experience I had. I chose Fedora because I have another node running Proxmox and expected driver hell if I had gone that way. My take: launch a VM using Ubuntu or Fedora (or your preferred flavor of Linux) with a very recent kernel, pass through the GPU, and then play with it there.
This_Maintenance_834@reddit
llama.cpp is very easy to spin up on a B70 if you just want to run a prompt. A plain stock Ubuntu installation with LM Studio works right out of the box. The vLLM Intel fork is a nightmare.
Icy_Gur6890@reddit (OP)
Yeah, can definitely agree with this. However, at the same time, right now it feels like squeezing out every possible drop of performance is a necessity.
Accomplished_Code141@reddit
How about the OpenVINO backend for llama.cpp? B70s are cheap VRAM, but it looks like the software is a mess. I have 3 MI50s and a Radeon PRO W5800, and the speeds are pretty bad right now using Vulkan / Mesa drivers. Intel seemed like a good alternative to get cheap VRAM; I guess not in the current state.
Icy_Gur6890@reddit (OP)
I will have to test it. Just got through testing llama.cpp with SYCL and Vulkan.
silou07@reddit
How is performance between llama.cpp and vLLM? I run an A380 through llama.cpp and Vulkan and would be interested in switching to vLLM if it performs better for Intel GPUs.
Icy_Gur6890@reddit (OP)
After all my testing today: the model support in llama.cpp is far superior. I think performance is relative and about the same, maybe a few tokens better at this time. I think time will tell though, and I'm hoping the vLLM ecosystem improves greatly.
hp1337@reddit
What about pp?
Icy_Gur6890@reddit (OP)
I created a comment on the post showing all my testing today.
Icy_Gur6890@reddit (OP)
Didn't seem to record prompt processing. I'll see if I can get some PP numbers.
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 83.19
Total input tokens: 23885
Total generated tokens: 20639
Request throughput (req/s): 1.20
Output token throughput (tok/s): 248.09
Peak output token throughput (tok/s): 710.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 535.20
---------------Time to First Token----------------
Mean TTFT (ms): 9738.96
Median TTFT (ms): 9261.16
P99 TTFT (ms): 20037.72
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 326.51
Median TPOT (ms): 175.33
P99 TPOT (ms): 1516.61
---------------Inter-token Latency----------------
Mean ITL (ms): 124.45
Median ITL (ms): 85.78
P99 ITL (ms): 1315.77
===========================================
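As a sanity check, the headline figures above are internally consistent; this recomputes the derived rates from the raw fields (tiny rounding differences are expected, since the printed duration is itself truncated to two decimals):

```shell
# Cross-check the reported throughput figures from the raw benchmark fields.
duration=83.19
output_toks=20639
reqs=100
out_tps=$(awk -v o="$output_toks" -v d="$duration" 'BEGIN { printf "%.2f", o / d }')
req_ps=$(awk -v r="$reqs" -v d="$duration" 'BEGIN { printf "%.2f", r / d }')
echo "output tok/s: $out_tps"   # matches the reported 248.09
echo "requests/s:   $req_ps"    # matches the reported 1.20
```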
Final-Rush759@reddit
Does it run Gemma 4?
Icy_Gur6890@reddit (OP)
Haven't been able to run Gemma 4 on vLLM. Just got it running on llama.cpp.
JaredsBored@reddit
Phoronix put out some llama.cpp numbers with the b70 and Vulkan backend that look so bad that I don't believe they're real. It's hard to fuck up a llama.cpp Vulkan build, so I'd be curious to see if you can replicate their results.
And if you're up for a real challenge, benchmarking llama.cpp with the SYCL backend would be very, very interesting.
Phoronix review in question: https://www.phoronix.com/review/intel-arc-pro-b70-linux/3
This_Maintenance_834@reddit
the llama.cpp result is real. it is bad. i have llama.cpp driving my B70s.
JaredsBored@reddit
Have you tried building with SYCL? I'd think it would be faster (if it's working/stable)
This_Maintenance_834@reddit
On my list; it takes time to get there. It is hard to justify spending time on Intel when I also have a stable, running RTX PRO 4500 on the side.
Icy_Gur6890@reddit (OP)
I'll try running this later today. My goal is still to get Qwen3.5 and Gemma 4 running. My first target is seeing if optimized Intel quants lead to performance gains; if so, I can live with some 30+ hour quant job. I'll still give llama.cpp with Vulkan a run and see what performance it leads to.
Ford_GT@reddit
I spent this entire evening trying to get qwen3.5 working with vllm on my B70 but I couldn't figure it out. Please keep us posted if you're able to get either qwen3.5 or gemma4 going.
This_Maintenance_834@reddit
Same here. I've spent a week now.
Icy_Gur6890@reddit (OP)
I've been in the same boat. Trying to make an AutoRound quant to see if that will work.
xspider2000@reddit
Strix Halo > B70, according to the numbers
Eyelbee@reddit
What's so bad about it? Seems quite normal to me.
reto-wyss@reddit
Why are people so obsessed trying to run it with llama.cpp?
Intel has their vllm fork and they are now partnering with vllm.
It's like, they are serving a soup and people keep complaining because they can't eat it with a fork, while the soup came with a free spoon.
Woof9000@reddit
Because Intel makes their own forks of stuff, maintains them halfheartedly for a few "promotional" months, then abandons them entirely by next Christmas.
JaredsBored@reddit
Llama.cpp is also the best way of running models with some layers in system RAM. While vLLM does support it, it's not nearly as common or (afaik) as performant as it is in llama.cpp. With a single 32GB GPU and some decent system RAM bandwidth, you can run 120B-class MoEs at decent speeds.
I've got stepflash 3.5 199b running at 120 t/s PP and 22 t/s TG on a single Mi50 32GB and 128GB DDR4 Epyc system at Q4. It's not an agentic coding powerhouse, but for chat it'll more than do! Qwen 3.5 122B is even faster (see my profile for benchmarks).
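As a sketch of the setup described above: recent llama.cpp builds can keep the MoE expert tensors in system RAM while everything else stays on the GPU. The model path, layer count, and context size here are illustrative, and `--n-cpu-moe` only exists in newer builds, so check `--help` on yours:

```shell
# Sketch: run a large MoE with attention/dense weights on the GPU and
# expert tensors in system RAM. All values below are placeholders.
./llama-server \
    -m ./models/some-120b-moe-q4_k_m.gguf \
    --gpu-layers 99 \
    --n-cpu-moe 40 \
    -fa on -ctk q8_0 -ctv q8_0 \
    -c 16384 -t 8
```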
Icy_Gur6890@reddit (OP)
Well, they abandoned the fork with the release of the B70, upstreaming XPU support into vLLM.
I find that my biggest gripe is actually model availability at supported quants, which I guess is the appeal of running llama.cpp.
I'm pretty sure in a couple of weeks the XPU support in vLLM will improve greatly... or so I hope.
DeepOrangeSky@reddit
Well, Elon just threw like 25 billion dollars at them today, so, maybe they can spend a few of those bucks on getting their stuff a bit more polished and conveniently usable. I mean, for some reason I'm not holding my breath, but, a man can dream.
InternationalNebula7@reddit
I assumed that was for fab production of custom ASICs? Not necessarily utilization of the existing stack or GPUs?
ea_man@reddit
I mean, it's a 2-year-old arch, the VRAM is videogame grade (similar to AMD 9070), and efficiency is pretty bad: they can't really afford to leave some 50% of performance on the table. I get that the price is low, but a used 3090...
Excellent_Spell1677@reddit
Sadly, if it worked it would cost $4000... and be green. No one is going to make a GPU that has a ton of VRAM, works great, and is cheap... for now.
Return it, and buy two 5060ti, amazing.
Vicar_of_Wibbly@reddit
Does prefix caching work in vLLM or does it still need to be disabled?
reto-wyss@reddit
I assume that was the Int-4 quant from Intel's HF?
Would you mind running a benchmark for image generation?
Z-Image-Turbo (cfg = 0, steps = 8) and Flux.2-klein-4b (cfg = 1, steps = 4) at 1024x1024; these should be supported with vllm-omni, and you don't need to quant with 32GB of VRAM.
Icy_Gur6890@reddit (OP)
Yes, it was the int-4 quant from HF.
Sure, let me see if I can give it a go. I'll post back in a bit.
Icy_Gur6890@reddit (OP)
============================================================
FLUX.2-klein-4B Benchmark
cfg=1, steps=4, size=1024x1024
============================================================
Warmup (1 image)...
Warmup: 2.82s OK
--- Concurrency: 1 (10 images) ---
Successful: 10/10
Failed: 0
Wall time: 24.86s
Throughput: 0.402 img/s
Mean latency: 2.48s
Median: 2.49s
Min: 2.36s
Max: 2.68s
Stdev: 0.09s
--- Concurrency: 2 (10 images) ---
Successful: 10/10
Failed: 0
Wall time: 21.96s
Throughput: 0.455 img/s
Mean latency: 4.17s
Median: 4.34s
Min: 2.41s
Max: 4.67s
Stdev: 0.64s
--- Concurrency: 4 (10 images) ---
Successful: 10/10
Failed: 0
Wall time: 21.95s
Throughput: 0.456 img/s
Mean latency: 7.49s
Median: 8.58s
Min: 2.60s
Max: 9.09s
Stdev: 2.19s
============================================================
reto-wyss@reddit
A 5090 is about 1s for Flux so this is a very good result for the B70.
This job is primarily compute-bound on the DiT, not the text encoder (Qwen3).
Icy_Gur6890@reddit (OP)
one of my sample images generated
audioen@reddit
My experience with vllm and python is that it doesn't work, whereas you can probably just build llama.cpp with Vulkan and it will work straight away. Performance might not be what you're hoping for -- I don't know how well this system scales. I noticed that you said 235 tok/s across 100 concurrent requests, so only 2.35 tok/s per actual inferer? I think this kind of extreme scaling is not very realistic and I do doubt that 2 tokens per second is usable.
I would like to know whether vllm can genuinely parallelize well. I'm unsure about how well llama.cpp parallelizes, as out of the box it enables 4 parallel streams. My impression is that it might be stopping all inference during prompt processing, but might actually be scheduling token generation in parallel once the prompts have been done. As you may be aware, prompt processing is completely compute bound and saturates the underlying hardware even from single inferring task, whereas token generation can be severely bandwidth limited and leaves the math units on the GPU sitting idle, unless the task has huge degree of parallelism. My expectation is that you should have around 30-40 parallel streams for token generation on GB10, for instance.
pfn0@reddit
Planning on running 100 concurrent sub-agents? each one chugging 2-3t/s?