Is 2× Intel Arc Pro B70 worth it for local agentic LLMs, or should I stay with NVIDIA?

Posted by Zuck7980@reddit | buildapc | 15 comments

I’m planning a home workstation that I can access remotely from an iPad through Jupyter/SSH/Tailscale. My goal is to run local agentic workflows using Hermes Agent/OpenWebUI/Ollama/vLLM-style tooling, mostly to avoid relying on cloud models.

The idea I’m considering:

- 2× Intel Arc Pro B70 32GB internally
- RTX 5070 externally through a Sonnet eGPU box for gaming
- Windows as my main OS, possibly Linux dual boot for AI workloads
- 128GB+ system RAM

My concern is software ecosystem maturity. I know the B70 hardware/VRAM is attractive, but most local LLM serving and agent frameworks seem more mature around NVIDIA/CUDA. I’m not sure whether multi-Intel-dGPU serving with vLLM/OpenVINO/OVMS is practical enough for daily use.

Questions:

- Would you buy 2× B70 for local LLM/agent work today?
- Is Intel Arc multi-GPU serving mature enough, or is this still experimental?
- Would I be better off with a single NVIDIA GPU with less VRAM but better software support?
- Does anyone here actually run local agents on multiple Intel Arc dGPU hardware?
- Should I wait for RTX 5080 Super / higher-VRAM NVIDIA options instead?

CompellingBytes@reddit

Maybe use OpenArc/OpenVINO. OpenVINO does wonders with Battlemage silicon. The big issue with OpenVINO is that there is a lag before the newest models are supported, unless you know what you're doing and can convert models to OpenVINO format yourself (I think that's an option).

There's also the OpenVINO Model Server (which is sorta an enterprise solution), or coding your own sort of harness (or vibecoding one).

Zuck7980@reddit (OP)

OVMS sucks, trust me. You can't even change the attention backend: the default is PagedAttention, so if a model (liquid.ai) requires SDPA as its attention implementation, you can load the model but you won't be able to serve it. But my main issue here is that with OpenVINO I cannot serve a single model across multiple dGPUs, which is a big deal. As far as I know, that capability will be introduced next year. What I want is one big model spread across multiple GPUs, which would look like this:

1) Layer/tensor shards distributed across GPU.0 and GPU.1
2) KV cache handled correctly
3) Generation loop coordinated across both devices
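To make point 1 concrete, here is a toy sketch of column-wise tensor sharding in plain Python. It is only an illustration of the idea, not how OpenVINO or vLLM implement it: `GPU.0` and `GPU.1` are just labels, and the helper names (`split_columns`, `sharded_matmul`) are made up for this example.

```python
# Toy illustration of tensor parallelism in pure Python (no GPU involved).
# "GPU.0" and "GPU.1" are only labels; a real runtime shards device memory.

def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, parts):
    """Split w column-wise into `parts` equal shards, one per device."""
    cols = len(w[0])
    assert cols % parts == 0, "columns must divide evenly across devices"
    step = cols // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

def sharded_matmul(x, w, devices=("GPU.0", "GPU.1")):
    """Each device computes its slice of the output; slices are concatenated."""
    shards = split_columns(w, len(devices))
    partials = {dev: matmul(x, shard) for dev, shard in zip(devices, shards)}
    out = []
    for dev in devices:
        out.extend(partials[dev])
    return out

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

full = matmul(x, w)             # single-device result
sharded = sharded_matmul(x, w)  # same result, computed in two halves
print(full == sharded)
```

Each device only ever holds half of the weight matrix, which is the whole point when the model does not fit in one card's VRAM. The hard parts in practice are points 2 and 3, which this sketch does not cover.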

For instance, on NVIDIA with vLLM it is quite simple to set the tensor parallel size to 2 and serve one big model distributed across multiple dGPUs. Nothing equivalent exists once the model has been converted to OpenVINO IR format.
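For reference, the vLLM side of that comparison comes down to one flag, `--tensor-parallel-size`. A minimal sketch that just builds the launch command (the model name `your-org/your-model` is a placeholder, and the helper `vllm_serve_command` is hypothetical, not part of vLLM):

```python
import shlex

def vllm_serve_command(model, tensor_parallel_size=2):
    """Build the argv for `vllm serve` with the model split across GPUs."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]

# Placeholder model id; substitute whatever you actually serve.
cmd = vllm_serve_command("your-org/your-model")
print(shlex.join(cmd))
```

You would run the printed command on a machine with vLLM installed and two supported GPUs; the sketch itself only assembles the arguments.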

Also, what are your views on the Sonnet eGPU box, and should I wait for the 5080 Super, which will possibly have 24 GB of VRAM?

CompellingBytes@reddit

Oh sorry, last night and today was/is hectic, I missed the part where you said all of that stuff about OpenVINO.

Pain in the ass.