About to build a 6× Arc B70 LLM rig, want to talk to someone experienced first
Posted by somesayitssick@reddit | LocalLLaMA | View on Reddit | 41 comments
Hello, I’m preparing to build a rig with six Intel Arc B70s, but before I move forward, I’d like to speak with someone who has experience building similar systems (no Arc-specific knowledge required), particularly with llama.cpp and vLLM.
In my initial tests on a 5090 machine and a 128GB unified-memory system, I’ve been seeing some interesting results. I have several questions and would really value the opportunity to discuss them with someone experienced so I can make informed decisions and set things up correctly from the start.
I’m open to paying for your time; however, depending on the rate, I would appreciate seeing some evidence of relevant experience.
Thanks!
putrasherni@reddit
spend 2K more and build 6 R9700, much better
alphatrad@reddit
I just bought 3 of those to replace my dual RX 7900 XTX setup.
legit_split_@reddit
Can you explain your reasoning? I currently see used 7900 XTXs going for 750€ but a new R9700 is 1300€.
ImportancePitiful795@reddit
32GB VRAM and, most importantly, native FP8, optimised INT4, plus several other enhancements (e.g. sparsity, which RDNA3 doesn't support).
The 7900 XTX is good for LoRAs etc. thanks to brute-force bandwidth.
But when it comes to LLMs and getting the most out of them, the R9700 is miles better.
alphatrad@reddit
First card showed up today. NVIDIA's the basic white girl of local AI, unless you got sucked in by the Claw hype & wasted your cash on a Mac. But me, I'm a total contrarian with my AMD setup.
My reasoning is basically what u/ImportancePitiful795 said. Also, the ASRock Phantoms I have are HUGE. My board has 3 PCIe slots at x16, and I can't fit a 3rd card; these things take up 3 slots. The R9700's are WAAAY smaller, and a little lower on power.
I can write off the difference as a business expense and resell my RX 7900 XTX's on ebay. The irony is - both cards are used, from eBay. The R9700 is the first card I bought new. But that's ok.
When I'm done I'll have 96GB of memory. If I could somehow fit 3 of the 7900's I'd only have 72GB.
I read a post in this group, or maybe r/LocalLLaMA, about someone running two of these.
He posted about getting 153 t/s on Qwen 3.6 35B with MXFP4 quantization on 2x R9700
That's not just "good" that's insane for a $2,600 dual-GPU setup.
And they're getting this with:
- vLLM
- MXFP4 quantization
- MTP speculative decoding enabled
- Tensor parallelism actually working efficiently
That's the power of RDNA4.
The memory bandwidth is only slightly slower than the RX 7900 XTX's. But that 6.7% bandwidth difference doesn't matter when you have hardware-accelerated quantization.
Either way, other than 3090's I think AMD is a good alternative for low cost high speed.
Rocwo@reddit
I get 68 tokens/s on an RTX 3090 ($800), where the 936GB/s VRAM bandwidth is the limiting factor.
I can't run the Q8_0 version due to my 24GB VRAM, but from experience I would lose about 10% of the tokens/s, dropping to around 60 tokens/s.
153 tokens/s on 2x AI Pro R9700 with 1152GB/s is insane; it means that with a MoE, the setup is well enough optimized to stay limited by VRAM bandwidth even with the model split across 2 GPUs.
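For a back-of-envelope sanity check on the bandwidth-limited claim: in single-stream decode, every generated token has to stream the active weights from VRAM once, so tokens/s is capped at roughly bandwidth divided by active weight bytes. A minimal sketch — the parameter values below are illustrative, not measurements, and the model ignores KV-cache reads and kernel overhead, so real numbers land below the ceiling:

```python
# Rough memory-bandwidth ceiling for single-stream LLM decode.
# Each generated token streams the active weights from VRAM once;
# KV-cache traffic and kernel overhead are ignored, so this is an
# upper bound, not a prediction.

def decode_ceiling_tps(bandwidth_gbs: float, active_params_b: float,
                       bytes_per_param: float) -> float:
    """Tokens/s ceiling = bandwidth (GB/s) / active weight size (GB)."""
    return bandwidth_gbs / (active_params_b * bytes_per_param)

# MoE with ~3B active params at 1 byte/param on a 936 GB/s card
print(decode_ceiling_tps(936, 3, 1.0))    # 312.0
# Dense ~24B model at ~4-bit (0.5 bytes/param) on the same card
print(decode_ceiling_tps(936, 24, 0.5))   # 78.0
```

This is also why MoE models decode so much faster than dense models of the same total size: only the active experts get streamed per token.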
JaredsBored@reddit
Yeah, this is not an advisable first-timer build. B70 performance in llama.cpp has been bad under SYCL and Vulkan (though there has been very recent work done to improve things), and because 6 isn't a standard 1/2/4/8 GPU count, vLLM is going to be tough.
Gesha24@reddit
Was there anything solid for vLLM besides benchmarks? Their Docker containers aren't production-ready, and building from source is another level of nightmare...
AnonsAnonAnonagain@reddit
IIRC, you're supposed to use llm-scaler
https://github.com/intel/llm-scaler
Gesha24@reddit
That's precisely the container that I have used. Building regular vllm is nearly impossible for B70
JaredsBored@reddit
I haven't seen anything encouraging since the launch vLLM benchmarks, and I'm surely not dropping a grand on a card to beta test. I have to assume Intel partnering with vLLM means that software stack won't suck (eventually)
ImportancePitiful795@reddit
Anyone trying to use llama.cpp on Intel is asking for trouble.
vLLM only.
putrasherni@reddit
I think OP is trolling, so we return troll :)
texasdude11@reddit
Don't need any money, but I can assist some. Arc will be interesting; I have experience with Nvidia only. Take a look at my 2x 6000 Pro build: https://youtu.be/e23kbKH9Dmk
Ender_PK@reddit
dude! I rock 4x mi50 32gb + 2x v100 32gb and want to move to 16-32x (in 1 system) mi50 32gb. If you like junk stuff like this, pm me, I'm sure you'll like my builds.
somesayitssick@reddit (OP)
Awesome, thanks for reaching out! I just took a look at your video, and it would be great to chat. I’ll send you a PM tomorrow. Thanks!
Green-Dress-113@reddit
vLLM tensor parallel works with 2, 4, or 8 GPUs, not 6.
Travnewmatic@reddit
very interesting! this is what qwen3.6-36b-a3b had to say about that:
Not accurate. vLLM's tensor parallelism can work with 6 GPUs — it's not limited to powers of 2.
The real constraint is divisibility, not powers of 2:
Attention heads must be divisible by tensor_parallel_size
Hidden size must be divisible by tensor_parallel_size
So with TP=6, you need a model where both the number of attention heads and the hidden dimension are divisible by 6. Many popular models happen to have configs that work with 2/4/8 but not 6 (e.g., 32 attention heads isn't divisible by 6, but 48 would be). That's likely the source of the misconception.
The AMD vLLM MoE playbook explicitly lists non-power-of-2 counts (3, 5, 6, 7, 9, etc.) as deployable, noting that pipeline parallelism is actually the better strategy when you can't find a clean TP split.
If you're planning a 6-GPU setup, check your model's architecture:
- Does num_attention_heads % 6 == 0?
- Does hidden_size % 6 == 0?
If yes, it'll work. If no, you'd need to use pipeline parallelism instead.
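That divisibility check is a one-liner to script. A minimal sketch — the field names follow the usual Hugging Face config.json convention, and the sample numbers are illustrative rather than tied to any specific model:

```python
# Check whether a model's dimensions split evenly across a given
# tensor-parallel degree. Both the attention-head count and the hidden
# size must be divisible by the TP size for vLLM-style tensor parallelism.

def tp_compatible(num_attention_heads: int, hidden_size: int, tp_size: int) -> bool:
    """True if both dimensions divide evenly across tp_size GPUs."""
    return num_attention_heads % tp_size == 0 and hidden_size % tp_size == 0

print(tp_compatible(32, 4096, 6))  # False: 32 heads don't split 6 ways
print(tp_compatible(48, 6144, 6))  # True: both dimensions divide by 6
print(tp_compatible(32, 4096, 4))  # True: the classic power-of-2 case
```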
Gesha24@reddit
If you get it working, I (and I think many others) would greatly appreciate you sharing your experience. I was not able to get any decent performance from a single Arc B70 with either vLLM or llama.cpp. vLLM did have solid prompt-processing performance with Qwen3.5-27B, but was terrible at token generation. It also had extreme issues with tool calling, to the point of really only being useful as a web chat. Qwen3.6 didn't run on it at all. llama.cpp was working and had OK-ish performance, could run Qwen3.6, but would grind to a halt (we're talking 50 t/s pp and 5 t/s generation) at prompts above 100K context.
After spending a few days I came to the conclusion that I don't want a $950 heater that maybe will eventually get sufficient updates and optimizations to be called a GPU, so I swapped it for a Radeon AI Pro R9700, which gave me much better and more reliable performance. I did struggle to get Qwen3.6 running with vLLM at any decent speed, but llama.cpp with ROCm was quite solid, doing 1000 t/s pp and 50 t/s generation even at contexts well above 100K.
superloser48@reddit
Given the comparable price point, do you think it's better to get 2x Nvidia 5060 Ti? Do you think pp and tg will be better than the AMD R9700?
Gesha24@reddit
Believe it or not, you are not the first one to think of this: https://www.reddit.com/r/LocalLLaMA/s/BFJj7uYnQk TL;DR: a single R9700 is better and simpler
superloser48@reddit
Can you share your experience with the R9700 & vLLM? Did you figure out the root cause?
Gesha24@reddit
I tried to build vLLM and ran into issues (though it's possible I was still trying on Ubuntu 25.10, which is not officially supported; I did reinstall 24.04 LTS, I just don't remember if I tried building vLLM locally on it). Then I used a nightly vLLM container with Docker and it did work, but performance was very bad. I figured the problem could easily be not just vLLM but also the INT4 quant of Qwen3.6 (it had literally just been released), so I decided to stick with llama.cpp for now. I'll try again, probably in a week or two, once more people have had a chance to look at it and I have more time myself.
superloser48@reddit
If you need a decent vLLM quant for this, I'm currently using this quant of the same model on 2x 3090: https://huggingface.co/QuantTrio/Qwen3.6-35B-A3B-AWQ
somesayitssick@reddit (OP)
Yeah, there is quite a bit of risk involved, for sure. That’s why I’m hoping to talk with an “expert” so I can at least minimize risk in one area. I would still be taking a risk by betting on quick software tooling updates from Intel.
No_You3985@reddit
Depends on the models you're planning to run. VRAM bandwidth on the B70 is only 608 GB/s. AFAIK these GPUs don't have an interconnect, so you're relying on PCIe lanes for communication. Not ideal; latency may add up with 6 GPUs in tensor-parallel mode. Don't forget you need plenty of fast PCIe lanes on your mobo for this, and such a HEDT/server mobo + CPU can cost as much as a couple of B70s. If you can bump the budget, just get two RTX 6000 Pro Blackwells. That gives you the same VRAM but much more VRAM bandwidth and better compute, compatibility with all CUDA-optimized algorithms, a cheaper mobo and CPU, etc.
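To put a rough number on the latency concern: Megatron-style tensor parallelism does two all-reduces per transformer layer, and without a direct GPU interconnect every one of them is a synchronization across PCIe. A hedged sketch — the 50 µs per all-reduce and the 48-layer count are assumptions for illustration, not measured B70 figures:

```python
# Estimate per-token synchronization overhead for tensor parallelism
# over PCIe. Standard Megatron-style TP performs two all-reduces per
# transformer layer (one after attention, one after the MLP), and each
# is a blocking sync across all GPUs during decode.

def tp_sync_ms_per_token(num_layers: int, allreduce_latency_us: float) -> float:
    """Milliseconds of all-reduce latency incurred per generated token."""
    return 2 * num_layers * allreduce_latency_us / 1000

overhead = tp_sync_ms_per_token(48, 50)   # 48 layers, assumed 50 us per all-reduce
print(overhead)                           # 4.8 ms/token of pure sync latency
```

Even before any bandwidth limit, 4.8 ms of sync per token alone would cap single-stream decode near 208 t/s, and the assumed latency only grows with more GPUs hanging off PCIe.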
ryfromoz@reddit
I've got four B60s going, it wasn't easy!
andy_potato@reddit
I might remember wrong, but didn't they say you can't combine more than 4 of these?
Mantikos804@reddit
The connection between GPUs is the bottleneck on Intels.
Long_comment_san@reddit
So...did you ever build anything? /J
ImportancePitiful795@reddit
I assume you have a motherboard with 6 PCIe slots and you're not splitting the slots, as that's asking for trouble.
Second, vLLM is the only way with Intel Arc, even if support is still primitive. Compared to llama.cpp it's miles ahead.
Do NOT try to use 6 cards with bifurcation.
At the cost of 6 B70s + workstation/server motherboard + CPU + 128GB RAM + storage + PSUs, maybe check if a single RTX 6000 96GB does the job, or check how 2 DGX boxes work together.
B70s are great in pairs or quads on existing old systems, e.g. X570 (2) or X299/X399 (4) workstations, as they give life to those systems and don't break the bank. If you're building a new system, especially for production, don't.
Medium_Chemist_4032@reddit
> Do NOT try to use 6 cards with bifurcation.
Why?
ImportancePitiful795@reddit
Because support is unstable as it is right now. Asking for trouble.
Medium_Chemist_4032@reddit
Oh, you mean because of tensor parallelism? Cool, I thought bifurcation alone wasn't recommended.
ImportancePitiful795@reddit
It's a combination of both in the case of Intel Arc. :)
rpkarma@reddit
I offer you no advice, just encouragement coz I wanna see what it’s like haha do iiiiit
alphatrad@reddit
Why do you want to go that direction? Gonna be slow man.
MentalStatusCode410@reddit
Not worth it, as it stands.
Until AMD and Intel have support for FP4, I wouldn't bother.
Puzzleheaded_Base302@reddit
It'll cost you more in electricity alone than API calls. Unless Intel fixes their driver performance, it doesn't make financial sense.
The high-cost argument applies to the single-user use case; multi-user concurrency changes the narrative a bit.
HopePupal@reddit
this should be interesting, hope to see some numbers and pictures a few weeks hence. good luck OP
somesayitssick@reddit (OP)
The cards arrive tomorrow, but I think I have some knowledge gaps to fill before I feel comfortable moving forward. Hope I have some pics/numbers to show off soon!