Is 2× Intel Arc Pro B70 worth it for local agentic LLMs, or should I stay with NVIDIA?
Posted by Zuck7980@reddit | buildapc | 15 comments
I’m planning a home workstation that I can access remotely from an iPad through Jupyter/SSH/Tailscale. My goal is to run local agentic workflows using Hermes Agent/OpenWebUI/Ollama/vLLM-style tooling, mostly to avoid relying on cloud models.
The idea I’m considering:
- 2× Intel Arc Pro B70 32GB internally
- RTX 5070 externally through a Sonnet eGPU box for gaming
- Windows as my main OS, possibly Linux dual boot for AI workloads
- 128GB+ system RAM
My concern is software ecosystem maturity. I know the B70 hardware/VRAM is attractive, but most local LLM serving and agent frameworks seem more mature around NVIDIA/CUDA. I’m not sure whether multi-Intel-dGPU serving with vLLM/OpenVINO/OVMS is practical enough for daily use.
Questions:
- Would you buy 2× B70 for local LLM/agent work today?
- Is Intel Arc multi-GPU serving mature enough, or is this still experimental?
- Would I be better off with a single NVIDIA GPU with less VRAM but better software support?
- Does anyone here actually run local agents on multiple Intel Arc dGPU hardware?
- Should I wait for RTX 5080 Super / higher-VRAM NVIDIA options instead?
CompellingBytes@reddit
Maybe use OpenArc/OpenVINO. OpenVINO does wonders with Battlemage silicon. The big issue with OpenVINO is that there is lag time for newest model support, unless you know what you're doing and can convert models to OpenVINO format yourself (I think that's an option).
There's also the OpenVINO Model Server (OVMS), which is sort of an enterprise solution, or coding your own harness (or vibe-coding one).
Zuck7980@reddit (OP)
OVMS sucks, trust me. You can’t even change the attention backend: the default is paged attention, so if a model (liquid.ai’s, for example) requires SDPA instead, you can load the model but you won’t be able to serve it. But my main issue here is that with OpenVINO I cannot serve one big model spread across multiple dGPUs, which is a big issue. As far as I know, that capability will be introduced next year. This is what it would need to look like:
1) layer/tensor shards distributed across GPU.0 and GPU.1
2) KV cache handled correctly
3) generation loop coordinated across both devices.
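To make the sharding requirement above a bit more concrete, here is a toy sketch of column-wise tensor sharding in plain Python. It is only an illustration under loud assumptions: the "devices" are ordinary Python lists rather than GPU.0 and GPU.1, the function names are hypothetical, and it says nothing about how OpenVINO or vLLM implement this internally. It only shows that splitting a weight matrix by columns, multiplying each shard separately, and concatenating the partial results gives the same answer as the unsharded multiply.

```python
# Toy illustration only: "devices" are plain Python lists, not real GPUs.
# All function names here are hypothetical and exist only for this sketch.

def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, parts):
    """Split matrix w column-wise into `parts` equally sized shards."""
    cols = len(w[0])
    assert cols % parts == 0, "column count must divide evenly across devices"
    step = cols // parts
    return [[row[k * step:(k + 1) * step] for row in w] for k in range(parts)]

def tensor_parallel_matmul(x, w, parts=2):
    """Compute x @ w by giving each pretend device one column shard of w."""
    shards = split_columns(w, parts)                 # one shard per device
    partials = [matmul(x, shard) for shard in shards]  # each device works alone
    return [value for partial in partials for value in partial]  # gather step

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

full = matmul(x, w)                               # single-device reference
sharded = tensor_parallel_matmul(x, w, parts=2)   # two-shard version
print(full, sharded)
```

In a real serving stack the gather step is a cross-device communication, and the KV cache and generation loop have to be coordinated on top of it, which is the part the list above says is missing.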
For instance, on NVIDIA with vLLM it is quite simple to set the tensor parallel size to 2 and distribute one big model across multiple dGPUs. The same does not exist once the model has been converted through OpenVINO, that is, once it is in IR format.
Also, what are your views on the Sonnet eGPU thingy, and should I wait for the 5080 Super, which will possibly have 24 GB VRAM?
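For reference, the vLLM side of that comparison can be sketched as a small helper that builds the `vllm serve` command line with `--tensor-parallel-size 2`. The model name `your-org/your-model` and the port are placeholders, and the helper function itself is hypothetical; only the `vllm serve` subcommand and the two flags are taken from vLLM's CLI. The snippet just builds and prints the command, it does not launch anything.

```python
import shlex

def vllm_serve_command(model, tensor_parallel_size=2, port=8000):
    """Build the argv list for serving one model sharded across several GPUs.

    `model` is a placeholder here; substitute a model you actually have.
    """
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--port", str(port),
    ]

cmd = vllm_serve_command("your-org/your-model")  # placeholder model name
print(shlex.join(cmd))
```

To actually start the server you would pass `cmd` to something like `subprocess.run(cmd, check=True)` on a machine with vLLM installed and two supported GPUs.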
CompellingBytes@reddit
Oh sorry, last night and today was/is hectic, I missed the part where you said all of that stuff about OpenVINO.
There's always an Achilles' heel with Intel's stuff. Multi-GPU with OpenVINO should be one of their top priorities.
Zuck7980@reddit (OP)
No need to apologize, it’s all good. Thank you so much for replying.
This_Maintenance_834@reddit
I ordered two on day 1. I decided to return them after a few weeks. Software support was lacking, so it was difficult to make them work.
Zuck7980@reddit (OP)
That is exactly why I want an eGPU with at least one 5080. Don’t you think that’s a good idea?
This_Maintenance_834@reddit
NVIDIA is good, but the 5080 does not have 32GB VRAM. Running qwen3.6-27b is tight.
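A rough back-of-the-envelope check of why a 27B model is tight: weight memory is roughly parameter count times bits per weight divided by 8. The sketch below assumes 1 GB = 10^9 bytes, counts weights only (no KV cache, activations, or runtime overhead, all of which add more), and assumes 16 GB for the RTX 5080 and 32 GB per B70 as mentioned in the thread. The function name is made up for this example.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight memory in GB (10^9 bytes), weights only."""
    return params_billion * bits_per_weight / 8

PARAMS_B = 27  # a 27B-parameter model, as in the comment above

# Assumed VRAM figures for this sketch.
cards_gb = {"RTX 5080": 16, "1x B70": 32, "2x B70": 64}

for bits in (16, 8, 4):
    need = weight_gb(PARAMS_B, bits)
    fits = [name for name, vram in cards_gb.items() if vram > need]
    print(f"{bits}-bit weights: about {need} GB, fits (weights only) on: {fits}")

# Headroom left on a 16 GB card after 4-bit weights, before any KV cache.
headroom_5080_int4 = cards_gb["RTX 5080"] - weight_gb(PARAMS_B, 4)
print(f"16 GB card headroom at 4-bit: {headroom_5080_int4} GB")
```

Under these assumptions only the 4-bit version fits on a 16 GB card at all, and the few GB left over have to cover the KV cache and everything else, which is why long agentic contexts get uncomfortable there.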
9okm@reddit
I've heard a few times now that getting anything to run well on Arc is a PITA.
Zuck7980@reddit (OP)
Sorry but what’s a PITA?
This_Maintenance_834@reddit
If you follow their official instructions word for word, nothing works. If you don’t follow them, it still does not work.
9okm@reddit
You should probably ask in an LLM sub, like r/LocalLLM
Zuck7980@reddit (OP)
Will do, thank you
Zuck7980@reddit (OP)
I see. I’m pretty good at it, tbh. I usually use the OpenVINO framework to quantize large models, which helps me run them quite efficiently. The issue is serving: apparently I cannot serve them across multiple dGPUs. I have to rely on PyTorch XPU, which won’t run the model I optimized/converted through OpenVINO and instead relies on the original model precision/safetensors.
9okm@reddit
Right but have you tried doing any of this on Arc GPUs? Single or otherwise?
9okm@reddit
Pain in the ass.