Running Four Intel Graphics Cards Under Linux On Ubuntu 26.04
Posted by anh0516@reddit | linux | View on Reddit | 32 comments
ClickLeafChick@reddit
but why
Zyphixor@reddit
AI or cracking hashes
Timely-Degree7739@reddit
How do you do AI with GPUs? My LLMs still stink. But there's a huge improvement in graphics in all applications and the interface, mpv, obviously games, etc.
Qwen30bEnjoyer@reddit
Qwen 3.6 27b or Qwen 3.6 35b a3b is where it's at. Experiment with different quantizations to find the balance you want between speed and quality. Try to get it to fit in VRAM for the dense model, but offloading is fine for the MoE.
Timely-Degree7739@reddit
I have a 4GB Nvidia GPU, so hardly the latest. Does that mean you should have/do something specific in terms of LLMs?
Zyphixor@reddit
4 GB of VRAM is barely enough for AI. I'd say 16 GB at the least is what you should have for LLMs.
Qwen30bEnjoyer@reddit
You can get away with less VRAM using CPU offloading, and still get decent speeds.
Speeds of roughly 200 TPS prompt processing and 20 TPS text generation are doable, but only if you're using TheTom's Llama.cpp fork with TQ4 KV cache quantization. TQ4 is a new memory compression technique that reduces the memory pressure from a larger KV cache; with it we get a mildly okay, slow coding experience. Give it a try and let me know how it goes!
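For reference, a minimal sketch of the partial-offload idea using the stock llama-cpp-python bindings (not the fork mentioned above). The model file and layer count are placeholders you'd tune until the offloaded layers fill your VRAM:

```python
# Minimal sketch: partial GPU offload with the stock llama-cpp-python bindings.
# Model path and n_gpu_layers are placeholders; raise n_gpu_layers until VRAM
# is full and let the remaining layers run on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen-moe-q4_k_s.gguf",  # hypothetical GGUF file name
    n_gpu_layers=20,   # layers kept on the GPU; the rest stay on the CPU
    n_ctx=8192,        # context window; larger values grow the KV cache
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what KV cache quantization does."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```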
Timely-Degree7739@reddit
I see maybe that explains it then. Thanks.
Qwen30bEnjoyer@reddit
It's not going to perform the best, but try Qwen 35b a3b at Q4_k_s. You might be able to get barely runnable speeds, bottlenecked by your lack of VRAM and your PCIe version.
Timely-Degree7739@reddit
I would like to feed it source files and documents from the shell, including instructions on what to do: improve the code, look for bugs, append stupid jokes, etc. I then want it to output its comments in a dedicated space and also read whatever it already said. But I can only get the interactive mode going, and as soon as I send stuff it has the memory of a goldfish (none?).
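One way to get that kind of persistent, shell-driven workflow is to keep the conversation in a file yourself and post it to a local OpenAI-compatible endpoint (llama-server exposes one). This is only a hypothetical sketch; the script name, port, and history file are assumptions:

```python
# Hypothetical helper: feed files plus an instruction to a local llama.cpp server
# and persist the conversation in a JSON file so the model keeps its "memory".
# The endpoint URL, model name, and history path are assumptions to adapt.
import json, pathlib, sys

import requests

HISTORY = pathlib.Path("chat_history.json")
URL = "http://localhost:8080/v1/chat/completions"  # adjust to your llama-server port

def main() -> None:
    instruction, *files = sys.argv[1:]
    sources = "\n\n".join(f"### {f}\n{pathlib.Path(f).read_text()}" for f in files)
    messages = json.loads(HISTORY.read_text()) if HISTORY.exists() else []
    messages.append({"role": "user", "content": f"{instruction}\n\n{sources}"})

    reply = requests.post(URL, json={"model": "local", "messages": messages}).json()
    answer = reply["choices"][0]["message"]["content"]
    print(answer)

    messages.append({"role": "assistant", "content": answer})
    HISTORY.write_text(json.dumps(messages, indent=2))

if __name__ == "__main__":
    main()
```

Usage would look like `python review.py "look for bugs and append a stupid joke" main.c utils.c`; delete chat_history.json to start a fresh conversation.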
SoilMassive6850@reddit
I haven't tested it here, but graphics-accelerated VDI is also an option, since these cards do SR-IOV. With a beefy computer to connect them to, you can run a lot of VMs on these. Of course, VDI is usually quite niche, mainly for enterprise use. But I do have to admit I've taken advantage of this functionality for some game bot farming.
Keplair@reddit
A low-cost AI workstation; the Arc Battlemage series is really cheap.
lor_louis@reddit
The AI software landscape is also pretty biased towards Nvidia, so performance is generally just OK, which does not justify the price.
natermer@reddit
Depending on what you need it works fine. It is very application dependent.
A lot of the time memory speed is the bottleneck, not raw GPU performance. Sometimes using the Vulkan API is faster or more stable than using special GPU-specific libraries.
To put the pricing into perspective: the current budget king for 32GB of VRAM is the Radeon AI PRO R9700, and that is about $1400.
A GeForce RTX 5090 with 32GB is about $3400.
The 5090 is going to be faster and it comes with CUDA, so if you are interested in a card for "playing around" or CUDA-only software then it is obviously the one to get.
But if you want something that will work with something specific, like Llama.cpp, then AMD or Intel is fine.
The B70 is very new so I am not sure of pricing. It will probably be around $1000 once things settle down. At least one manufacturer is claiming a 32GB 9600 GPU will be coming out.
It all really just depends on what you are doing.
TripleSecretSquirrel@reddit
Certainly for local AI. The Intel B70s are the cheapest way to get 32GB of VRAM for a brand new card right now, and VRAM is the main bottleneck for local inference.
Depending on exact pricing, you can get four B70s for the price of one NVIDIA RTX 5090, giving you 128GB of VRAM to work with.
halfhearted_skeptic@reddit
Do these cards have integrated cooling? I’ve been looking at some that don’t have a fan built in and require the enclosure to provide all the cooling.
SoilMassive6850@reddit
Just like the B50 and B60, there are both passive and blower designs.
anh0516@reddit (OP)
The TL;DR: Some workloads don't perform any better with 4 vs. 3. Some workloads perform best with 2, and regress with 3 or 4. Some workloads don't scale at all.
There's a lot more work to be done.
Qwen30bEnjoyer@reddit
Try Intel's vLLM fork, and appropriately sized models. With 128GB VRAM you should try Qwen/Qwen3.5-122B-A10B Dynamic Online FP4 and benchmark it. It would be interesting to record wall wattage under a real agentic-coding task using that and Pi Code or Opencode.
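A rough sketch of what that would look like with vLLM's offline API, assuming the model ID above exists on the Hub and all four cards are visible to the runtime; both the name and the tensor-parallel degree are assumptions:

```python
# Rough sketch: shard a large MoE model across four cards with vLLM's offline API.
# The model ID comes from the comment above and tensor_parallel_size=4 assumes
# all four B70s are visible to the runtime; both are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-122B-A10B",  # placeholder model ID
    tensor_parallel_size=4,          # split the weights across the four GPUs
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Write a small CLI that counts lines of code."], params)
print(outputs[0].outputs[0].text)
```

Wall wattage would then just be a matter of logging your power meter while an agent drives the model.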
natermer@reddit
The kinda odd, but understandable, choice is using only small LLMs.
When using llama.cpp it is my experience that if a model can fit into a single card then it will be faster to only use one GPU for it.
Even when you have two GPUs and the model can't quite fit on one card, it is often faster to offload layers to the CPU than to use the second GPU.
It just has to do with how llama.cpp spreads the workload across GPUs, the PCIe bus latency, and all that fun stuff.
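As a hedged sketch of the two strategies via the llama-cpp-python bindings (parameter names mirror the upstream CLI flags; the model file is a placeholder, and in practice you would pick only one of the two):

```python
# Sketch of the two llama.cpp strategies described above, using llama-cpp-python.
# The GGUF path is a placeholder; you would normally choose only one of these.
from llama_cpp import Llama

# Option A: keep the whole model on one card and ignore the second GPU.
single_gpu = Llama(
    model_path="model-q8_0.gguf",  # placeholder
    n_gpu_layers=-1,               # -1 = offload every layer to the GPU
    main_gpu=0,                    # pin everything to the first device
)

# Option B: split the weights across two cards (e.g. 60/40),
# paying PCIe latency on every token.
two_gpu = Llama(
    model_path="model-q8_0.gguf",
    n_gpu_layers=-1,
    tensor_split=[0.6, 0.4],       # fraction of the model placed on each GPU
)
```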
Also, if you have 32GB of VRAM then you wouldn't want to run 4-bit quantization. You'd want to run at least Q8 or even full float16 to maximize accuracy.
The point of having multiple GPUs like that is if you want to run multiple models simultaneously for like a server or multi-agent setup.
Or if you want to run something bigger... like the MiniMax-M2.7 229B model or the Deepseek 158B model. It will be slower spread out over multiple GPUs than it would be on a single gigantic AI module, but at least you can run it locally.
Of course this approach would not be useful for comparisons. So like I said it is understandable.
Purple_Jello_4799@reddit
What about Windows on that setup?
anh0516@reddit (OP)
You know what, that would actually be good to have for reference.
Purple_Jello_4799@reddit
I'd really like to know. I'll be waiting, if you eventually test it.
anh0516@reddit (OP)
Not me. Ask Michael Larabel of Phoronix; he ran these benchmarks.
Purple_Jello_4799@reddit
Uh-oh! Sorry for the confusion.
Sixguns1977@reddit
Would 2 Arc cards help gaming any?
halfhearted_skeptic@reddit
They usually don’t have a frame buffer. I don’t know if you can link them to a card that has one.
SoilMassive6850@reddit
I mean, sure, but I'd also imagine you might run multi-GPU with entirely independent tasks, where scaling will be limited by your machine being able to keep the GPUs fed. Proper multiprocessing with dedicated tasks will likely outperform trying to slap more GPUs onto scaling a single task (and I'd imagine it's more common).
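A minimal sketch of that pattern: one worker process per card, each pinned to its own device before it loads anything. The environment variable is an assumption; use whichever selector your stack honours (e.g. ONEAPI_DEVICE_SELECTOR for oneAPI/SYCL, CUDA_VISIBLE_DEVICES for CUDA):

```python
# Sketch: independent task per GPU, one process per card.
# The device-selector env var is an assumption; set whichever one your runtime uses.
import os
from multiprocessing import Process

def worker(gpu_index: int, job: str) -> None:
    # Restrict this process to a single card *before* loading any GPU runtime.
    os.environ["ONEAPI_DEVICE_SELECTOR"] = f"level_zero:{gpu_index}"
    print(f"GPU {gpu_index}: running {job}")
    # ... load the model / start the dedicated task here ...

if __name__ == "__main__":
    jobs = ["transcode", "embeddings", "chat model", "image model"]
    procs = [Process(target=worker, args=(i, j)) for i, j in enumerate(jobs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```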
aloobhujiyaay@reddit
Intel’s Linux graphics support has quietly become genuinely impressive over the last few years, especially compared to how fragmented multi-GPU Linux setups used to feel.
bawng@reddit
I was really happy when Intel announced they'd get into the discrete GPU business because they have a decent track record on Linux support in general and competition is always good.
Then the actual cards were a bit disappointing compared to their price, so now I don't know.
MidLifeDIY@reddit
I feel like these cards are gonna be popular after they're discontinued and competition keeps getting more expensive. Open drivers will get better and better.
Infinity-of-Thoughts@reddit
Multi-GPU setups aren't even really a thing on Windows.
Outside of AI, I can't see any benefit.