About to build a 6× Arc B70 LLM rig, want to talk to someone experienced first
Posted by somesayitssick@reddit | LocalLLaMA | View on Reddit | 41 comments
Hello, I’m preparing to build a rig with six Intel Arc B70s, but before I move forward, I’d like to speak with someone who has experience building similar systems (no Arc-specific knowledge required), particularly with llama.cpp and vLLM.
In my initial tests on a 5090 machine and a 128GB unified-memory system, I’ve been seeing some interesting results. I have several questions and would really value the opportunity to discuss them with someone experienced so I can make informed decisions and set things up correctly from the start.
I’m open to paying for your time; however, depending on the rate, I would appreciate seeing some evidence of relevant experience.
Thanks!
putrasherni@reddit
spend 2K more and build 6 R9700, much better
alphatrad@reddit
I just bought 3 of those to replace my dual RX 7900 XTX setup.
legit_split_@reddit
Can you explain your reasoning? I currently see used 7900 XTXs going for 750€ but a new R9700 is 1300€.
ImportancePitiful795@reddit
32GB VRAM and, most importantly, native FP8, optimised INT4, plus several other enhancements (e.g. sparsity, which RDNA3 doesn't support).
The 7900 XTX is good for LoRAs etc. thanks to brute-force bandwidth.
But when it comes to LLMs and getting the most out of them, the R9700 is miles better.
alphatrad@reddit
First card showed up today. NVIDIA's the basic white girl of local AI, unless you got sucked in by the Claw hype & wasted your cash on a Mac. But me, I'm a total contrarian with my AMD setup.
My reasoning is basically what u/ImportancePitiful795 said. Also, the ASRock Phantoms I have are HUGE. My board has 3 PCIe slots at x16, and I can't fit a 3rd card; these things take up 3 slots. The R9700's are WAAAY smaller, and a little lower on power.
I can write off the difference as a business expense and resell my RX 7900 XTX's on ebay. The irony is - both cards are used, from eBay. The R9700 is the first card I bought new. But that's ok.
When I'm done I'll have 96GB of memory. If I could somehow fit 3 of the 7900's I'd only have 72GB.
I read a post in this group, or maybe r/LocalLLaMA, about someone running two of these.
He posted about getting 153 t/s on Qwen 3.6 35B with MXFP4 quantization on 2x R9700
That's not just "good" that's insane for a $2,600 dual-GPU setup.
And they're getting this with:
- vLLM
- MXFP4 quantization
- MTP speculative decoding enabled
- Tensor parallelism actually working efficiently
That's the power of RDNA4.
The memory bandwidth is only slightly slower than the RX 7900 XTX's. But that 6.7% bandwidth difference doesn't matter when you have hardware-accelerated quantization.
Either way, other than 3090's I think AMD is a good alternative for low cost high speed.
Rocwo@reddit
I get 68 tokens/s on an RTX 3090 ($800), where the 936GB/s VRAM bandwidth is the limiting factor.
I can't run the Q8_0 version due to my 24GB VRAM, but from experience I would lose about 10% of the tokens/s, dropping to around 60 tokens/s.
153 tokens/s on 2x AI Pro R9700 with 1152GB/s is insane; it means that with a MoE, the setup is well enough optimized to stay limited by VRAM bandwidth even with the model split across 2 GPUs.
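For a back-of-envelope sanity check on the bandwidth-limited claim: in single-stream decode, every generated token has to stream the active weights from VRAM once, so tokens/s is capped at roughly bandwidth divided by active weight bytes. A minimal sketch — the parameter values below are illustrative, not measurements, and the model ignores KV-cache reads and kernel overhead, so real numbers land below the ceiling:

```python
# Rough memory-bandwidth ceiling for single-stream LLM decode.
# Each generated token streams the active weights from VRAM once;
# KV-cache traffic and kernel overhead are ignored, so this is an
# upper bound, not a prediction.

def decode_ceiling_tps(bandwidth_gbs: float, active_params_b: float,
                       bytes_per_param: float) -> float:
    """Tokens/s ceiling = bandwidth (GB/s) / active weight size (GB)."""
    return bandwidth_gbs / (active_params_b * bytes_per_param)

# MoE with ~3B active params at 1 byte/param on a 936 GB/s card
print(decode_ceiling_tps(936, 3, 1.0))    # 312.0
# Dense ~24B model at ~4-bit (0.5 bytes/param) on the same card
print(decode_ceiling_tps(936, 24, 0.5))   # 78.0
```

This is also why MoE models decode so much faster than dense models of the same total size: only the active experts get streamed per token.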
JaredsBored@reddit
Yeah, this is not an advisable first-timer build. B70 performance in llama.cpp has been bad under SYCL and Vulkan (though there has been very recent work done to improve things), and because 6 isn't a standard 1/2/4/8 GPU count, vLLM is going to be tough.
Gesha24@reddit
Was there anything solid for vLLM besides benchmarks? Their Docker containers aren't production-ready, and building from source is another level of nightmare...
AnonsAnonAnonagain@reddit
IIRC, you're supposed to use llm-scaler
https://github.com/intel/llm-scaler
Gesha24@reddit
That's precisely the container that I have used. Building regular vllm is nearly impossible for B70
JaredsBored@reddit
I haven't seen anything encouraging since the launch vLLM benchmarks, and I'm surely not dropping a grand on a card to beta test. I have to assume Intel partnering with vLLM means that software stack won't suck (eventually)
ImportancePitiful795@reddit
Anyone trying to use llama.cpp on Intel is asking for trouble.
vLLM only.
putrasherni@reddit
I think OP is trolling, so we return troll :)
texasdude11@reddit
Don't need any money, but I can assist some. Arc will be interesting; I have experience with Nvidia only. Take a look at my 2x 6000 Pro build: https://youtu.be/e23kbKH9Dmk
Ender_PK@reddit
dude! I rock 4x mi50 32gb + 2x v100 32gb and want to move to 16-32x (in 1 system) mi50 32gb. If you like junk stuff like this, pm me, I'm sure you'll like my builds.
somesayitssick@reddit (OP)
Awesome, thanks for reaching out! I just took a look at your video, and it would be great to chat. I’ll send you a PM tomorrow. Thanks!
Green-Dress-113@reddit
vLLM tensor parallel works with 2, 4, or 8 GPUs, not 6.
Travnewmatic@reddit
very interesting! this is what qwen3.6-36b-a3b had to say about that:
Not accurate. vLLM's tensor parallelism can work with 6 GPUs — it's not limited to powers of 2.
The real constraint is divisibility, not powers of 2:
Attention heads must be divisible by tensor_parallel_size
Hidden size must be divisible by tensor_parallel_size
So with TP=6, you need a model where both the number of attention heads and the hidden dimension are divisible by 6. Many popular models happen to have configs that work with 2/4/8 but not 6 (e.g., 32 attention heads isn't divisible by 6, but 48 would be). That's likely the source of the misconception.
The AMD vLLM MoE playbook explicitly lists non-power-of-2 counts (3, 5, 6, 7, 9, etc.) as deployable, noting that pipeline parallelism is actually the better strategy when you can't find a clean TP split.
If you're planning a 6-GPU setup, check your model's architecture:
- Does num_attention_heads % 6 == 0?
- Does hidden_size % 6 == 0?
If yes, it'll work. If no, you'd need to use pipeline parallelism instead.
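That divisibility check is a one-liner to script. A minimal sketch — the field names follow the usual Hugging Face config.json convention, and the sample numbers are illustrative rather than tied to any specific model:

```python
# Check whether a model's dimensions split evenly across a given
# tensor-parallel degree. Both the attention-head count and the hidden
# size must be divisible by the TP size for vLLM-style tensor parallelism.

def tp_compatible(num_attention_heads: int, hidden_size: int, tp_size: int) -> bool:
    """True if both dimensions divide evenly across tp_size GPUs."""
    return num_attention_heads % tp_size == 0 and hidden_size % tp_size == 0

print(tp_compatible(32, 4096, 6))  # False: 32 heads don't split 6 ways
print(tp_compatible(48, 6144, 6))  # True: both dimensions divide by 6
print(tp_compatible(32, 4096, 4))  # True: the classic power-of-2 case
```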
Gesha24@reddit
If you get it working, I (and I think many others) would greatly appreciate you sharing your experience. I was not able to get any decent performance from a single Arc B70 with either vLLM or llama.cpp. vLLM did have solid prompt-processing performance with Qwen3.5-27B, but was terrible at token generation. It also had extreme issues with tool calling, to the point of really only being useful as a web chat. Qwen3.6 didn't run on it at all. llama.cpp was working and had OK-ish performance, could run Qwen3.6, but would grind to a halt (we're talking 50 t/s pp and 5 t/s generation) at prompts above 100K context.
After spending a few days I came to the conclusion that I don't want a $950 heater that maybe will eventually get sufficient updates and optimizations to be called a GPU, so I swapped it for a Radeon AI Pro R9700, which gave me much better and more reliable performance. I did struggle to get Qwen3.6 running with vLLM at any decent speed, but llama.cpp with ROCm was quite solid, doing 1000 t/s pp and 50 t/s generation even at contexts well above 100K.
superloser48@reddit
Given the comparable price point, do you think it's better to get 2x Nvidia 5060 Ti? Do you think pp and tg will be better than the AMD R9700?
Gesha24@reddit
Believe it or not, you are not the first one to think of this: https://www.reddit.com/r/LocalLLaMA/s/BFJj7uYnQk TL;DR: a single R9700 is better and simpler
superloser48@reddit
Can you share your experience with the R9700 & vLLM? Did you figure out the root cause?
Gesha24@reddit
I tried to build vLLM and ran into issues (though it's possible I was still trying on Ubuntu 25.10, which is not officially supported; I did reinstall 24.04 LTS, I just don't remember if I tried building vLLM locally on it). Then I used a nightly vLLM container with Docker and it did work, but performance was very bad. I figured the problem could easily be not just vLLM but also the INT4 quant of Qwen3.6 (it had literally just been released), so I decided to stick with llama.cpp for now. I'll try again, probably in a week or two, once more people have had a chance to look at it and I have more time myself.
superloser48@reddit
If you need a decent vLLM quant for this, I'm currently using this quant of the same model on 2x 3090: https://huggingface.co/QuantTrio/Qwen3.6-35B-A3B-AWQ
somesayitssick@reddit (OP)
Yeah, there is quite a bit of risk involved, for sure. That’s why I’m hoping to talk with an “expert” so I can at least minimize risk in one area. I would still be taking a risk by betting on quick software tooling updates from Intel.
No_You3985@reddit
Depends on the models you're planning to run. VRAM bandwidth on the B70 is only 608 GB/s. AFAIK these GPUs don't have an interconnect, so you're relying on PCIe lanes for communication. Not ideal; latency may add up with 6 GPUs in tensor-parallel mode. Don't forget you need plenty of fast PCIe lanes on your mobo for this, and such a HEDT/server mobo + CPU can cost as much as a couple of B70s. If you can bump the budget, just get two RTX 6000 Pro Blackwells. That gives you the same VRAM but much more VRAM bandwidth and better compute, compatibility with all CUDA-optimized algorithms, a cheaper mobo and CPU, etc.
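To put a rough number on the latency concern: Megatron-style tensor parallelism does two all-reduces per transformer layer, and without a direct GPU interconnect every one of them is a synchronization across PCIe. A hedged sketch — the 50 µs per all-reduce and the 48-layer count are assumptions for illustration, not measured B70 figures:

```python
# Estimate per-token synchronization overhead for tensor parallelism
# over PCIe. Standard Megatron-style TP performs two all-reduces per
# transformer layer (one after attention, one after the MLP), and each
# is a blocking sync across all GPUs during decode.

def tp_sync_ms_per_token(num_layers: int, allreduce_latency_us: float) -> float:
    """Milliseconds of all-reduce latency incurred per generated token."""
    return 2 * num_layers * allreduce_latency_us / 1000

overhead = tp_sync_ms_per_token(48, 50)   # 48 layers, assumed 50 us per all-reduce
print(overhead)                           # 4.8 ms/token of pure sync latency
```

Even before any bandwidth limit, 4.8 ms of sync per token alone would cap single-stream decode near 208 t/s, and the assumed latency only grows with more GPUs hanging off PCIe.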
ryfromoz@reddit
I've got four B60s going, it wasn't easy!
andy_potato@reddit
I might remember wrong, but didn't they say you can't combine more than 4 of these?
Mantikos804@reddit
The connection between GPUs is the bottleneck on Intels.
Long_comment_san@reddit
So...did you ever build anything? /J
ImportancePitiful795@reddit
I assume you have a motherboard with 6 PCIe slots and you're not splitting the slots, as that's asking for trouble.
Second, vLLM is the only way with Intel Arc, even if support is still primitive. Compared to llama.cpp it's miles ahead.
Do NOT try to use 6 cards with bifurcation.
At the cost of 6 B70s + workstation/server motherboard + CPU + 128GB RAM + storage + PSUs, maybe check if a single RTX 6000 96GB does the job, or check how 2 DGX boxes work together.
B70s are great in pairs or quads on existing old systems, e.g. X570 (2) or X299/X399 (4) workstations, as they give life to those systems and don't break the bank. If you're building a new system, especially for production, don't.
Medium_Chemist_4032@reddit
> Do NOT try to use 6 cards with bifurcation.
Why?
ImportancePitiful795@reddit
Because support is unstable as it is right now. Asking for trouble.
Medium_Chemist_4032@reddit
Oh, you mean because of tensor parallelism? Cool, I thought bifurcation alone wasn't recommended.
ImportancePitiful795@reddit
It's a combination of both in the case of Intel Arc. :)
rpkarma@reddit
I offer you no advice, just encouragement coz I wanna see what it’s like haha do iiiiit
alphatrad@reddit
Why do you want to go that direction? Gonna be slow man.
MentalStatusCode410@reddit
Not worth it, as it stands.
Until AMD and Intel have support for FP4, I wouldn't bother.
Puzzleheaded_Base302@reddit
It'll cost you more in electricity alone than API calls. Unless Intel fixes their driver performance, it doesn't make financial sense.
The high-cost argument applies to the single-user use case; multi-user concurrency changes the narrative a bit.
HopePupal@reddit
this should be interesting, hope to see some numbers and pictures a few weeks hence. good luck OP
somesayitssick@reddit (OP)
The cards arrive tomorrow, but I think I have some knowledge gaps to fill before I feel comfortable moving forward. Hope I have some pics/numbers to show off soon!