Intel will sell a cheap GPU with 32GB VRAM next week
Posted by happybydefault@reddit | LocalLLaMA | 351 comments
It seems Intel will release a GPU with 32 GB of VRAM on March 31, which they would sell directly for $949.
Bandwidth would be 608 GB/s (a little less than an NVIDIA 5070), and wattage would be 290W.
Probably/hopefully very good for local AI and models like Qwen 3.5 27B at 4 bit quantization.
I'm definitely rooting for Intel, as I have a big percentage of my investment in their stock.
https://www.pcmag.com/news/intel-targets-ai-workstations-with-memory-stuffed-arc-pro-b70-and-b65-gpus
KnownPride@reddit
This is a good choice for Intel. People will buy it just for LLMs.
IntelligentOwnRig@reddit
The price comparison everyone should be making here isn't NVIDIA consumer cards. The only other consumer GPU with 32GB is the RTX 5090, and that goes for $2,200+. So yes, $949 for 32GB is genuinely cheap in that context.
But VRAM capacity is only half the story for inference. Bandwidth determines your tok/s. Here's where the B70 falls in the stack:

- RTX 5090: 1,792 GB/s
- RTX 3090: 936 GB/s
- RTX 4070 Ti Super: 672 GB/s
- Arc Pro B70: 608 GB/s
- Arc Pro B60: 456 GB/s

The B70 lands in the same bandwidth class as the RTX 4070 Ti Super. On a model that fits both cards, like Qwen 3.5 27B at Q4_K_M (needs about 16GB), you'd expect roughly similar tok/s. The B70's real advantage is headroom. You can run Q5_K_M of that same model (19GB) for better output quality, or even Q8_0 (29GB) for near-lossless. The 4070 Ti Super is maxed out at Q4.
Versus a used 3090 at about the same price: the 3090 has 54% more bandwidth (936 vs 608) with full CUDA support, so it will be meaningfully faster on anything that fits 24GB. But the B70 gives you 8GB more VRAM for models and quant levels the 3090 can't touch.
The risk nobody in this thread is talking about enough is software. This is not CUDA. You're on SYCL/oneAPI or Vulkan through llama.cpp. One commenter above is running an R9 7900 AI PRO on Vulkan and says it works, but another says ROCm gave 8x the tok/s on the same AMD hardware. Vulkan leaves a lot on the table. How Intel's SYCL stack actually performs for LLM inference is the open question, and there are zero B70 benchmarks to answer it yet.
My take: if you need 32GB and can't afford a 5090, this is the only game in town at $949. If your models fit 24GB, a used 3090 is faster and cheaper with a mature software stack. If they fit 16GB, a 4070 Ti Super gives you similar bandwidth for $779 with full CUDA.
relmny@reddit
I wonder which is faster: a 16GB GPU with more bandwidth, offloading multiple layers to CPU to fit a way bigger model, or a 32GB GPU with less bandwidth but little or no offloading?
IntelligentOwnRig@reddit
The 32GB card without offloading wins almost every time.
CPU RAM bandwidth is roughly 50-90 GB/s (DDR4/DDR5 dual-channel). GPU VRAM runs 288-1,792 GB/s depending on the card. That's a 6-20x gap. Even offloading a small fraction of a model to CPU creates a bottleneck that wipes out whatever bandwidth advantage the GPU has.
Concrete example: Qwen 3.5 27B at Q5_K_M needs about 19GB.
On an RTX 5080 (16GB, 960 GB/s), you'd offload roughly 3GB to CPU. The GPU churns through its 16GB fast, then waits for CPU RAM to deliver the rest at maybe 90 GB/s. That 3GB offload alone more than doubles your per-token time compared to running entirely on the GPU.
On the B70 (32GB, 608 GB/s), the whole model sits in VRAM. Lower bandwidth, but zero time spent waiting on CPU. Faster overall despite the slower memory.
The only scenario where the 16GB card wins: the model fits entirely in 16GB with no offloading at all. Then it's a pure bandwidth race and the faster card is faster. The moment any layers hit CPU RAM, it's not close.
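A back-of-the-envelope sketch of that math, using the numbers from the example above (theoretical ceilings; real runs add overhead, but the ratio holds):

```python
# Rough per-token latency model for partial CPU offload.
# Each generated token streams every weight once, so:
#   time ~= bytes_on_gpu / gpu_bandwidth + bytes_on_cpu / cpu_bandwidth

def tokens_per_second(model_gb, gpu_vram_gb, gpu_bw_gbs, cpu_bw_gbs=90):
    on_gpu = min(model_gb, gpu_vram_gb)
    on_cpu = model_gb - on_gpu
    per_token_s = on_gpu / gpu_bw_gbs + on_cpu / cpu_bw_gbs
    return 1.0 / per_token_s

# 19GB model (Qwen 3.5 27B at Q5_K_M, per the example above)
print(tokens_per_second(19, 16, 960))  # RTX 5080, 3GB offloaded: ~20 tok/s
print(tokens_per_second(19, 32, 608))  # B70, fully in VRAM:      ~32 tok/s
```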
MentalRegular5335@reddit
Excellent explanation, thanks a lot 🙂
relmny@reddit
good explanation! thanks!
Aphid_red@reddit
What matters far more (for single user inference) is the ratio of bandwidth to capacity. For 10 tps at fp8, you want bandwidth of at least 10x capacity: in this case, 320 GB/s. All of the listed GPUs pass this test.
Note that with multiple GPUs, you need tensor parallel. If you're doing layer parallel, then you want more bandwidth per GPU as only one GPU is working at the same time.
From this, you can derive requirements on both prompt processing and generation speed.
For example, if you have a 20:1 prompt ratio, and 200 tps prompt processing, and 10 tps generation, then you have effectively 5 tps generation.
Whereas with a 10:1 prompt ratio, processing 10 tokens takes 0.05s, thus generating one every 0.15s, so you have ~6.7 tps generation.
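That arithmetic as a sketch (my restatement, assuming prompt processing and generation happen strictly sequentially):

```python
# Effective generation speed when every generated token also requires
# processing `prompt_ratio` prompt tokens first.

def effective_tps(prompt_ratio, pp_tps, tg_tps):
    per_token_s = prompt_ratio / pp_tps + 1.0 / tg_tps
    return 1.0 / per_token_s

print(effective_tps(20, 200, 10))  # 5.0  (the 20:1 example)
print(effective_tps(10, 200, 10))  # ~6.7 (the 10:1 example)
```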
The biggest model a GPU can get 5 tps (or whatever your target is) on is what determines how good it is. Same with the biggest resolution you can get 60 fps on with max quality for games. Spending more is... not useful. You're better off with more VRAM so you can run a higher quality model.
CubicleHermit@reddit
Takes some hunting to get a used 3090 that cheap.
QuestionMarker@reddit
Not to be ignored is that you can buy two for less than a single 5090. The memory bandwidth is an annoyance, but otherwise it slots nicely into the niche currently occupied by 3090 pairs, with much more memory and much lower wattage. It's a *very* interesting card.
IntelligentOwnRig@reddit
The dual B70 math works. 64GB for $1,898 means a 70B model at Q4_K_M (~41GB) fits across two cards without touching CPU RAM. Dual 3090s only get you 48GB for roughly the same price used.
The catch: 3090 pairs get NVLink for cross-GPU communication, which matters a lot for multi-GPU inference. Dual B70s are PCIe only. TheBlueMatt's benchmarks in that Vulkan PR showed dual B60 performance was heavily limited by PCIe 3.0, so the interconnect speed really matters here. You'd want PCIe 4.0 x16 slots at minimum.
The wattage point is underrated too. Dual B70s at roughly 580W vs dual 3090s at 700W. That's a meaningful difference in power supply, thermals, and electricity cost over time.
tangled_girl@reddit
Thanks for the analysis. I've been looking at a pair of 3090's with NVLink as my first local rig, but the 48GB felt quite limiting in terms of the models I could run. So two B70's would be a major step up, but I feel like I want to wait for the benchmarks to come out to see how they'd compare in practice, especially w.r.t the software. And losing NVLink will be unfortunate. But from the sounds of it, you'd be leaning towards the B70's in my situation?
IntelligentOwnRig@reddit
Since 48GB feels limiting, I'm guessing you're targeting 70B class models.
For 70B at Q4_K_M (~41GB): 48GB is tight once you add KV cache at any meaningful context length. 64GB on dual B70s gives you actual headroom. That's a real advantage for your use case.
The tradeoff: dual 3090s have 54% more bandwidth per card (936 vs 608 GB/s) and NVLink for clean inter-GPU communication. For anything that fits in 48GB, they'll be noticeably faster.
Your instinct to wait is right. If the B70's Vulkan/SYCL stack lands at even 70-80% of CUDA efficiency, dual B70s look strong for 70B workloads. If it's lower, the math tilts back to 3090s.
TL;DR: if you need a rig now, 3090s are proven. If you can wait a month for real B70 numbers, wait.
Maks244@reddit
can you give a ballpark amount as to what the tok/sec would be if you compared it directly to the 4070 ti?
IntelligentOwnRig@reddit
Rough estimate, since no B70 benchmarks exist yet (card launches March 31).
For the 4070 Ti Super on Qwen 3.5 27B at Q4_K_M (~15GB): the bandwidth shortcut is bandwidth divided by model size, discounted to real-world efficiency (typically 40-50% in llama.cpp). That gives 672 / 15 = ~45 theoretical, times 0.4-0.5 = roughly 18-22 tok/s.
The B70 has 90% of the 4070 Ti Super's bandwidth (608 vs 672 GB/s). If the software were equally optimized, that puts it around 16-20 tok/s.
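For anyone who wants to plug in their own card, that shortcut as a sketch (the 40-50% efficiency factor is an empirical rule of thumb, not a spec):

```python
# Bandwidth-bound decode estimate: each token streams the whole model
# from VRAM once, discounted by a real-world efficiency factor.

def estimate_tps(bandwidth_gbs, model_gb, eff_lo=0.4, eff_hi=0.5):
    theoretical = bandwidth_gbs / model_gb
    return theoretical * eff_lo, theoretical * eff_hi

print(estimate_tps(672, 15))  # 4070 Ti Super: ~(18, 22) tok/s
print(estimate_tps(608, 15))  # B70, if software were equal: ~(16, 20) tok/s
```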
The unknown: CUDA on the 4070 Ti Super has years of optimization behind it. Vulkan/SYCL on Intel is improving fast (that PR linked above shows a near-3x speedup from a single kernel change on Battlemage), but nobody knows where the actual efficiency lands yet. The real B70 number could be lower until the stack matures.
General-Economics-85@reddit
What if one also wants TTS inference on top of that? I don't think I've seen many do benchmarks outside of LLMs on these huge non-nvidia cards.
giant3@reddit
When I tested llama.cpp a few months ago, SYCL was faster than Vulkan.
IntelligentOwnRig@reddit
That tracks. The Vulkan backend for Intel GPUs has been pretty far behind.
But that PR TheBlueMatt linked is worth reading. The benchmarks show a B60 going from 25.66 to 74.06 tok/s on a 20B MoE model with a new shared memory staging kernel. Nearly 3x. And the cross-GPU tests from the maintainer confirm it's specifically a Battlemage/Xe2 optimization. The A770 (older Intel) saw about 26%, NVIDIA was flat, and AMD actually regressed. It's architecture-specific, not a general Vulkan improvement.
The Qwen 3.5 27B at Q8_0 result on two B60s went from 3.45 to 6.41 tok/s, but that was bottlenecked by PCIe 3.0 and splitting 29GB across two 24GB cards. The B70 fits Q8_0 on a single 32GB card with no cross-GPU overhead. Different situation entirely.
Even with the optimization though, the B60 hits 74 tok/s versus 182 for an RTX 3090 on the same Vulkan backend. Bandwidth gap (936 vs 456 GB/s) is still real. The software is catching up fast, but it doesn't close the hardware gap.
TheBlueMatt@reddit
https://github.com/ggml-org/llama.cpp/pull/20897 changes that, but also demonstrates just how much headroom these cards have compared to the state of the drivers/software for them.
IntelligentOwnRig@reddit
Just read through the PR. The numbers make the case.
The B60 going from 25.66 to 74.06 tok/s on that 20B MoE model is nearly 3x. And the cross-GPU benchmarks from 0cc4m show this is specifically a Battlemage/Xe2 win. The A770 barely moved. AMD and NVIDIA saw no gain. So this maps directly to the B70, same architecture.
The Qwen 3.5 27B Q8_0 result on two B60s (3.45 to 6.41) is also telling for the B70 specifically. That test was bottlenecked by PCIe 3.0 interconnects and splitting 29GB across two 24GB cards. The B70 fits Q8_0 on a single card with 32GB. No cross-GPU overhead. Different situation entirely.
Worth noting though: even with the optimization, the B60 hits 74 tok/s versus 182 for an RTX 3090 on the same Vulkan backend. The bandwidth ratio (936 vs 456 GB/s) roughly predicts that gap. Headroom in software is real, but it doesn't close the hardware bandwidth gap.
The mesa driver issue you filed might be the more interesting long-term fix. If the driver handles coalesced loads properly, the kernel workaround becomes unnecessary.
happybydefault@reddit (OP)
And I imagine you can use it for gaming too. I heard the drivers were terrible at the beginning, but that they're much better now.
Stochastic_berserker@reddit
The problems are literally at the software level, not the hardware: pixel errors and texture issues.
SmileLonely5470@reddit
Coding is solved tho now so they'll fix it soon
Candid_Highlight_116@reddit
Just gotta tell them make no mistakes
CyberAttacked@reddit
There is no way you believe AI can write low level code for gpu drivers
milanove@reddit
Opus 4.6 was able to extend the open source Nvidia driver for Linux to provide some functionality I needed. Was quite impressed.
QbitKrish@reddit
No, it can definitely write it. It might not work per se, but it can write it.
Primary_Emphasis_215@reddit
What? Why not? AI is used in embedded just as much as in web.
KnownPride@reddit
AI write? What do you think corporations and other AI developers will do when this is released? Sit on their hands?
mellenger@reddit
Loll
Kale@reddit
I have a side project that is for number theory, factoring numbers. If someone wanted to get an Intel GPU for uint32 math, and possibly some non-division, non-modulo uint64 math, how would they program it? OpenCL? I know ROCm is the library to use for AMD and CUDA for nVidia. I already have some code in OpenCL to run on CPUs.
randylush@reddit
Can you tell me more about your project? It’s fascinating to me that there are still open mathematical problems that consumer hardware can help solve
Kale@reddit
It's not really an unsolved problem. It's not mathematically interesting, just engineering interesting. I try to factor large Fermat numbers or prove giant numbers are prime using Proth's Theorem or things like that.
It's fun writing an integer FFT multiplication algorithm using the Four step method, then completely rewriting it using a different method and still have it work.
It's kind of like doing sudoku. I'm wrapping up an OpenCL implementation that does the Gentleman-Sande transform forwards, then does the multiplication, then does Cooley-Tukey in reverse. I don't have to move stuff around in GPU memory, since GS takes ordered input and produces bit-reversed output, while CT takes bit-reversed input and produces ordered output.
I used the Chinese Remainder Theorem so I could do three 32-bit transforms in the GPU rather than one 90-bit transform. I needed to find three prime numbers where p-1 had 2^28 as a factor, but p had to be less than 2^31, so I could do A+B and know they wouldn't overflow (since both are less than 2^31). I discovered four prime numbers. Literally, that was it. So it was crazy discovering how close to the edge I'm getting with 32-bit math on the GPU.
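For the curious, a sketch of that search as I understand the constraint (trial division is plenty fast at this scale):

```python
# Find primes p < 2**31 with 2**28 dividing p - 1, i.e. p = k * 2**28 + 1.
# Such primes support power-of-two NTTs up to length 2**28, and since
# residues stay below 2**31, a sum A + B still fits in a uint32.

def is_prime(n):
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return n >= 2

for k in range(1, 8):  # k * 2**28 + 1 < 2**31 requires k <= 7
    p = k * 2**28 + 1
    if is_prime(p):
        print(k, p)
```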
To me, this is the fun part. Multiplying numbers by FFT has been known to be the fastest practical method since the 1960s, but which method is fastest can change from GPU to GPU. My algorithm needs compute units with lots of local memory. I've heard the fastest using only global GPU memory is Stockham's algorithm. I've never written that one before.
Stochastic_berserker@reddit
Have you looked at SYCL?
Kale@reddit
Nope. Looks interesting. But I'm not great with C++. And I'm already working with OpenMP and OpenCL which are very different animals, and it seems like this SYCL might not be that close to OpenCL in syntax?
Thanks though, crazy to see Khronos has a third parallel programming standard on top of OpenCL and Vulkan.
randylush@reddit
Literally
4baobao@reddit
a driver is software level
adeadbeathorse@reddit
Apparently the game developer Pearl Abyss refused to share the highly-anticipated game Crimson Desert with Intel early despite doing so with Nvidia and AMD (as well as reviewers) so that they could have game-ready drivers on launch day. Seeing as they’re partnered with AMD, something tells me there’s fishy business afoot.
ANR2ME@reddit
According to AI-Playground, it can also be used for diffusion models https://github.com/intel/AI-Playground
iamaredditboy@reddit
Without drivers how does this work? What’s qualified to run on this?
timschwartz@reddit
Why wouldn't there be drivers?
Anru_Kitakaze@reddit
Because it's Intel, and their GPUs are famous for 2 things:
SKirby00@reddit
If they make a habit of releasing high VRAM GPUs like this, someone's bound to decide it's worth the investment to improve drivers for running LLMs on Intel GPUs.
If these things actually end up being <$1000, they'd be like 1/3 the cost of an RTX 5090 for obviously much less compute, but the same amount of VRAM. With decent driver support (including multi-GPU support), this could easily become the best value consumer GPU for running sparse MoE models much faster than a Strix Halo or DGX Spark.
I certainly wouldn't buy it on the chance that drivers might improve, but it wouldn't shock me if this kind of release acts as a catalyst for them to improve.
qwen_next_gguf_when@reddit
Why not 96gb? What is the difficulty?
happybydefault@reddit (OP)
I imagine memory is very, very expensive.
mertats@reddit
Memory is expensive, but to have more memory you would also need to increase the bus width of the card which is also more expensive.
Succubus-Empress@reddit
Why not keep bus same and increase memory?
Pie_Dealer_co@reddit
Well, in line with your name, Succubus-Empress, imagine that you're surrounded by 20 cylinders, all ready to go. Alas, even if we use all 3 input ports for the 20 cylinders, we can probably stick 6 cylinders in the 3 input ports at best. As such, our succubus can handle only a fraction of the 20 cylinders.
However, if we increase the size of the inputs, or the number of them, we can fit all 20 cylinders, but such a modification of our succubus will of course cost us something.
Infinite_Tiger_3341@reddit
I know this is old, but I'm lost on how their username ties into the explanation and what cylinders have to do with succubi.
Pie_Dealer_co@reddit
Well, the joke is old enough to explain it. The user is named Succubus, which are demons that like to suck on, well, penis. And are supposed to be exceptional experts at it.
Penis = has a cylindrical shape, but it's also a play on a reddit story (just Google "the inner cylinder must not be harmed" reddit).
So with the context done, I basically explained to Succubus-Empress why it will not work by describing how a typical woman can handle at max 6 penises at the same time, because that's the max amount you can fit in. And that even if we add more, they are just going to be on standby.
The funny part is that I explain the actual real issue in quite the filthy way without being vulgar, to a Succubus, which is supposed to make it really relatable and easy to understand for an actual Succubus. It's like explaining a complex math problem to a chef using spoons and forks.
Succubus-Empress@reddit
If you cannot increase voltage, increase ampere to increase power. Can’t we do something like that here?
Wolvenmoon@reddit
Speaking as an electrical engineer, you can do anything you want. The real question is if it'll have the effect you want or not. Increasing current on a signal will, of course, blow it up.
Now, as far as increasing voltage goes? Voltage increases are generally undesirable because electronics are getting smaller. Now, we need to talk about how electricity works. In short, it's water. Voltage is pressure, current is flow rate, resistance is pipe narrowness. Energy is volume ('gallons'), power is volume over time ('gallons per minute').
So, we have an electrical flow (amperage) with a certain amount of pressure (voltage). That tells us how narrow the pipe (resistance) is (Ohm's law, V = I*R).
You know how wires get hot with high power loads? That's called resistive heating. Power of resistive heating = current x current x resistance (or flow x flow x pipe narrowness). This means that, when it comes to managing heat in electronics, decreasing amperage is MUCH more important than decreasing resistance.
Last, increasing voltage is increasing pressure, but the whole goal of electronics advancement is to reduce the size of electronics to smaller and smaller almost atom-scale constructs. But if you put highly pressurized electrons into a thin plastic tube, well, it's almost like making a pipe out of a single layer of plastic wrap and then pressurizing it. The pipe bursts.
So, no. You can't increase voltage or current to handle memory issues.
The way to think of a memory controller is like the loading dock of a warehouse. Each truck is 64 bits wide, the loading docks are 384 bits wide, that means they can dock 6 trucks at once.
Now, it's possible to have trucks with two trailers instead of one, but that adds complications to the docking procedure - is the docking yard wide enough for them to maneuver? What about using trains to dock instead? (LRDIMM, slight latency penalty, needs more power, but the docking yard has less logistics to do in exchange for the train handling it).
With GPU memory? Adding latency fucks things royally. So, a 384-bit wide bus means that it sends/receives 384 bits every time it clocks (the clock speed of the memory). RAM modules tend to have a bit width (i.e. 64 bits). Theoretically you could get more RAM modules with shorter bit widths, but this has diminishing returns. If the memory controller runs at 2.0 GHz and is 384 bits wide, that gives us 2.0×10^9 × 384 / (1024×1024×1024) ≈ 715 gigabits/second maximum theoretical bandwidth. Practical is always less than maximum theoretical, but the point being, with a 64-bit bus width each RAM module gets a 119-gigabit slice of that. If we halve the bus width, each memory module gets a smaller slice, and I'll admit my knowledge of how data is interleaved with memory controllers is lacking, but we run into the concept of a word at some point and have to deal with the ramifications of what happens when your memory bus width is smaller than a single word (either you have to wait multiple clocks to write a word to memory, or you interleave the word).
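To put cleaner units on that: the spec-sheet formula is effective per-pin data rate times bus width. A sketch (the 19 Gbps GDDR6 figure is my assumption; it happens to reproduce the B70's reported 608 GB/s on a 256-bit bus):

```python
# Theoretical bandwidth = per-pin data rate (Gbps) * bus width (bits) / 8.
# GDDR's effective data rate is several times the base clock, which is
# where GHz-vs-Gbps confusion usually creeps in.

def bandwidth_gbs(data_rate_gbps, bus_width_bits):
    return data_rate_gbps * bus_width_bits / 8  # bits -> bytes

print(bandwidth_gbs(19, 256))  # 608.0 GB/s, matching the B70's reported spec
print(bandwidth_gbs(21, 384))  # 1008.0 GB/s, a 384-bit-class card
```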
At high speeds, you have to keep everything synced up, calibrated, and trained (there are videos and articles that explain what calibration and training look like; the article I have in mind is from Altium Designer, which I really wish I could afford). Anyway, it all has to be routed on the PCB, which requires talented engineer time. After which it has to be verified and tested to make sure signal integrity is like what the simulations say it will be.
So, no. Even in this, I assume that memory chips can be connected to arbitrary bus widths. This is an extreme assumption roughly akin to assuming all dogs are friendly and putting your face in their face. I've never read a spec sheet/built something using GDDR of any type, so I don't know if it can be done or not. If it can be done, part of the reason it isn't is that memory is expensive. I've not priced it, yet, but it looks like $51.60 for 16Gb (2GB) of GDDR7 off Aliexpress. Meaning 32GB of GDDR7 is $825.60 retail if the random seller I found was reputable. Assuming Intel pays half that price, it's not an insignificant amount, and would put their cost into a bracket where there are other, better options.
PANIC_EXCEPTION@reddit
Current flow is something that should be minimized, not increased. It's what causes the chip to melt if you let it run amok without throttling. Power consumption and Joule heating are a byproduct of compute, not something you deliberately increase to improve performance.
Increasing the supply voltage can improve SNR/noise margins and enable faster clocks, but increasing the clock speed increases dynamic power, which is a big component of power draw. There is a limit to how much you can crank up the clock or transistor count before you have to start relying on parallelism and toggleable clock domains to prevent your chips from melting. We found that limit roughly in 2006.
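The textbook relation behind that, as a sketch (constants are illustrative, not measurements of any real chip):

```python
# Dynamic power scales as P ~ a * C * V^2 * f, so raising voltage to
# enable a higher clock compounds: both V^2 and f go up.

def dynamic_power_w(activity, capacitance_f, voltage_v, freq_hz):
    return activity * capacitance_f * voltage_v ** 2 * freq_hz

base  = dynamic_power_w(0.2, 1e-9, 1.0, 2.0e9)  # illustrative baseline
boost = dynamic_power_w(0.2, 1e-9, 1.1, 2.4e9)  # +10% voltage, +20% clock
print(boost / base)                             # ~1.45x the power
```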
Pie_Dealer_co@reddit
Not really. To expand on my metaphor:
What you are proposing is to increase the intensity/power of the engaged cylinders, and while this will make for a more explosive culmination and more rapid interaction, it will simply increase the speed at which we can transfer data. The 6 cylinders will be very much engaged, but the other 14 won't be serviced due to physical limitations. We simply have nothing physical to offer the standby cylinders, no matter how hard we work the 6.
WolfeheartGames@reddit
That's why we need middle out compression.
engineerfromhell@reddit
Sighs, don’t forget about Cylinder to Floor (C2F) ratio…
Succubus-Empress@reddit
What did i read just now?
Charuru@reddit
I'll try to reword it, which is that the GPU die needs IO ports to connect to memory, and the GPU is already maxed out PHYSICALLY. There's no way to increase the number of ports the GPU itself has. The only way to increase memory is to put 2 dies together, directly attach memory onto the GPU without using normal ports (CoWoS), or get bigger memory chips (this depends on Samsung etc. to invent them).
Succubus-Empress@reddit
But the B200 GPU does it easily with HBM memory. Higher speed and higher capacity.
Charuru@reddit
Yes, you need CoWoS and HBM and 2 dies. So it is not easy. If that's what you want, just buy it then...
But people are asking for a much lower end product, a single much smaller die using cheap RAM and cheap packaging (no CoWoS) and that is not possible.
rrdubbs@reddit
So you are saying the succubus could upgrade and handle more cylinders per unit time, or, increase the size of the cylinder for a larger load per cylinder.
DreamLearnBuildBurn@reddit
Increasing the bus width would allow more data to pass at once. To me this means larger cylinder but I'll allow that I'm out of my element here and defer to someone else to unpack this metaphor.
tob8943@reddit
w explanation
mertats@reddit
Because bus width basically controls how many memory modules you can have on the GPU.
Memory comes in modules of 1 to 3GB, and each module needs its own 32-bit slice of the bus traced to it. (You can double-stack the modules by putting another module on the other side of the board.)
Let's say you have a 256-bit bus width; that means you can have 256/32 = 8 memory channels. At 3GB per module that is 24GB on one side and 48GB if you double-stack.
At 2GB per module that is 16GB on one side and 32GB if double-stacked.
Higher capacity modules are much much more expensive. So is increasing the bus width to accommodate them.
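That arithmetic as a quick sketch (module sizes and clamshell doubling as described above; the B70 mapping is my guess, not a confirmed spec):

```python
# VRAM capacity = (bus width / 32 bits per module) * module size,
# doubled if modules sit on both sides of the board (clamshell).

def vram_gb(bus_width_bits, module_gb, clamshell=False):
    channels = bus_width_bits // 32
    return channels * module_gb * (2 if clamshell else 1)

print(vram_gb(256, 2))                  # 16 GB
print(vram_gb(256, 2, clamshell=True))  # 32 GB, plausibly the B70's layout
print(vram_gb(256, 3, clamshell=True))  # 48 GB
```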
Succubus-Empress@reddit
Gimme 1024 bus width and 4GB memory module and my soul is yours for 1 year.
Charuru@reddit
It's not possible with GDDR. The only way to increase the bus width that much is to go to HBM.
the_friendly_dildo@reddit
GDDR6 chips are pretty cheap, like less than $10 per 2GB chip, and commonly less than $5/chip.
the__storm@reddit
96 GB of GDDR6 loose in a plastic bag would cost more than $1k. Spot price is like $12/GB.
Aphid_red@reddit
So what you're saying is that a 96GB GPU could be sold for $2K. Add 20% margin on that memory and it's $2,400. That's still a lot better than the ~$8,000 NVIDIA's asking.
mslindqu@reddit
But that's uncut... surely you can bulk it out with beach sand?
NickCanCode@reddit
They want you to buy 3 cards.
touristtam@reddit
Be glad that we are now out of this cycle of good enough.
Specialist-Heat-6414@reddit
The CUDA ecosystem argument is real but it gets weaker every year for inference specifically. Training still lives and dies by CUDA. But for running models locally, llama.cpp's Vulkan backend has gotten good enough that ecosystem lock-in matters less. The real question for the Arc B70 is driver stability and power management on Linux -- Intel's track record there has been shaky, but the last 12 months have been noticeably better. At $949 for 32GB it doesn't need to beat a 5090. It just needs to not brick itself when you leave it running for 48 hours straight. If it clears that bar it will sell well to the local AI crowd.
happybydefault@reddit (OP)
Well said.
Unrelated — I miss when people could freely use em-dashes without being confused with AI. I see your sad, resigned double-dash, but I also sense your humanity.
submarine-quack@reddit
sometimes it's just a matter of ease. i use double-dash -- on my linux and on android, simply because i'm too lazy to set up a shortcut to the em-dash
locuturus@reddit
I'm a big fan of dashes. Always have been. And now for a couple of years I've felt attacked by AI. Oh well—my grammar is too idiosyncratic to be AI. Probably.
Kirin_ll_niriK@reddit
They can take the em-dash from my cold dead hands
It’s the one “might sound like AI” thing I refuse to change my writing style for
Specialist-Heat-6414@reddit
It'll come back :))
EarlMarshal@reddit
$949 is cheap now? Wtf.
happybydefault@reddit (OP)
I mean, relative to other GPUs with ~32 GB of VRAM and ~600 GB/s of bandwidth, not to like a banana.
gargoyle777@reddit
I mean my strix halo with 128 gb shared ram was 1500 for the full mini pc...
Icy-Signature8160@reddit
where did you get strix halo 128gb at 1500?
gargoyle777@reddit
Bos game. Slightly uglier case than gmktec but basically same hardware
Icy-Signature8160@reddit
thank you, what llm are you running on it, is qwen 3.5 27b enough for coding?
hope the new medusa halo will have 384 bits and support for lpcamm2 so I can upgrade to lpddr6 14,400 MT/s for 690 GB/s BW
gargoyle777@reddit
So the work is tricky: you need to move around a couple of kernel configs, but you can run Qwen 3.5 122B (it's a MoE so it runs pretty well). That's definitely the go-to imo. If you want to save some RAM, I remember there is a 30-something MoE too. I'd stick to MoE as inference can be slow on dense models.
Badger-Purple@reddit
The R9700 was originally $1k, now $1,200. At least you're getting a software stack that is kind of functioning with AMD, whereas with Intel it's neither CUDA nor ROCm, so you are at the mercy of whether they will create support and whether people will port the code to that architecture.
Ok_Mammoth589@reddit
And Intel doesn't even do "support" correctly. They forked vllm, llama.cpp and even auto1111. And then never upstreamed those improvements. Then they abandoned the forks.
Badger-Purple@reddit
This here is a huge reason not to want this card. At like half this price it would be worth it, but unless they are actively showing improvement in the stack, it's a risk not worth the investment. You may run oss-120b, but without improvements you won't be running the actual models you want to run with more RAM, since they won't have compatible versions of vLLM or llama.cpp.
Maks244@reddit
so each model needs a compatible version of vllm?
Badger-Purple@reddit
not each model, but often families of models, architecture changes, new updates, etc. can't be compiled without specific ROCm or CUDA (or whatever Intel uses) support.
squired@reddit
Fully agreed. I hate NVIDIA, but I also would not abandon CUDA for less than 50% off. A 5090 competitor for $1k makes sense, this doesn't outside of commercial use where the scale justifies development for a single use case. This board is going to be a nightmare for hobbyists and the price does not justify the pain.
rrdubbs@reddit
It seems crazy that they wouldn’t be throwing top men at improving the AI stack. Every investor is literally throwing money at the segment
MmmmMorphine@reddit
It seemed crazy to me 2 years ago that they weren't throwing as much VRAM as they could into their cards, and frankly I still think they should be trying for 48GB - but regardless.
I think your point stands though; the fact they didn't throw the same at the software is bizarre to me.
Badger-Purple@reddit
Why would nvidia do that? They own intel now.
letsgoiowa@reddit
They do not.
Badger-Purple@reddit
I guess I don't follow the new rumors, but the partnership with Nvidia doesn't seem that complementary. Putting Nvidia chips in Intel SoC systems kind of works against new GPU development, but we will see. https://hothardware.com/news/intel-commits-to-future-gpu-releases#:~:text=Rumors%20of%20the%20demise%20of,got%20to%20be%20more%20flexible.
djdanlib@reddit
If only we could use it as a simple extra VRAM expansion for an RTX card
letsgoiowa@reddit
I had exactly this experience with my A380. RIP IPEX-LLM that got updates 6 months late and then not at all.
Gotta rely on Vulkan now but you would've thought they would've provided a smooth migration plan. No mention at all! No notice!
UltraSPARC@reddit
Hell ya. I'm glad Intel isn't giving up the tradition of dropping the ball with their product lines.
happybydefault@reddit (OP)
I think you are wrong.
These GPUs seem to have (basic, for now) upstream vLLM support, per the support matrix at https://docs.vllm.ai/en/stable/getting_started/installation/gpu
inevitabledeath3@reddit
Actually, vLLM has mainline support now. Intel has been working on this, in fairness to them.
WiseassWolfOfYoitsu@reddit
Yeah, my first thought was immediately that this isn't that compelling over an R9700 unless there's some more info missing. The R9700 isn't much more expensive, has higher compute and bandwidth, and has a more robust ecosystem.
That said I'm still cheering for Intel to succeed here since we need more competition.
BillDStrong@reddit
It depends a bit. Intel has vGPU support for this generation of GPUs, so a 32GB card with no vGPU license needed (unlike Nvidia) will be a big win for enterprises, and if they are standardizing on this for vDesktops, it can make it simpler to keep the same driver stack, etc.
At the same time, they have that dual-card solution for their 24GB card, and there is no reason that can't work with this card as well, so it's possible we might see a dual-card setup with 64GB of memory at some point, though those won't be cheap.
Now, you are probably right, though.
xrailgun@reddit
I mean, a modded 4080 32gb is about $1500 USD. It's much faster and has full CUDA support. I think most people who want to play with a $1000 toy would be able to get a $1500 toy without blinking.
happybydefault@reddit (OP)
Yeah, that makes sense to me.
xrailgun@reddit
My price is based on direct conversion from China/Taobao prices, and it's still true. They mark things up a bit when selling on eBay/AliExpress for the extra platform fees; you can typically get forwarding/freight handled for about $20, so it's much cheaper to order direct from Taobao if you can navigate the language/account-creation stuff.
FinalCap2680@reddit
With other GPUs you are paying for the software stack/support as well.
It should have more VRAM or be even cheaper to be worth the risk and pain. But in the current market that is hard to do.
I remember when looking for a GPU for experiments 3-4 years ago, I saw a very cheap second-hand, original Intel Arc A770 16GB and was seriously considering it for image generation. But then I searched around for LLM usage as well. There was one question about that on the Intel support forum, and the answer from the Intel person was something like "We sold you the hardware, and if it does not work with the software, it is not our problem." Technically true, but the next day I bought a more expensive second-hand RTX 3060 12GB and still have it. You cannot win market share with an attitude like that, and without market share, you cannot sell at prices like the others'.
sixcommissioner@reddit
telling customers that software compatibility isn't your problem is a bold strategy when you're trying to compete with cuda
Much-Researcher6135@reddit
Alright, someone convert the price into a banana count.
tracyde@reddit
Well I saw single bananas for sale at a cafe today for $1.29.
$1500 == 1162.79 bananas 😁
kingwhocares@reddit
So, the Intel Arc Pro B60 with 24 GB is a better value.
mslindqu@reddit
Accepted and solved.
https://www.npr.org/2024/11/29/nx-s1-5210800/6-million-banana-art-piece-eaten
Sticking_to_Decaf@reddit
How much could a banana cost, Michael? Ten dollars?
xXprayerwarrior69Xx@reddit
let me talk to your banana guy if he has 32gb bananas for 10
yossarian328@reddit
For an AI-oriented "GPU" with 32GB of VRAM. Yes. Very.
A used 3090 with 24GB will range $1000-1500.
New 5090 with 32GB is $4000.
This won't run gaming graphics like the 5090. It lacks the straightforward and mature AI toolchain setup of NVidia (and increasingly AMD getting on their level) ... but for AI models, VRAM is king. And 32GB for $1k is a very strong proposition.
DocMadCow@reddit
For current generation plus 32GB VRAM? Oh ya!
Ok_Mammoth589@reddit
Definitely not current generation. It's not even gddr7. It's Intel's current generation which is not current at all.
HiddenoO@reddit
"Current generation" is a practically meaningless term on its own anyway. Even a 3080 still has a higher memory bandwidth and more TFLOPs than half the 50 series cards, and that wasn't even the best card two generations ago.
AC1colossus@reddit
Show me the other time you could buy a $1000 32GB GPU.
EarlMarshal@reddit
I've bought a 24 GB GPU for below 700 years ago.
Mochila-Mochila@reddit
700 years ago, man, that ancient civilisation was pretty advanced for its time.
happybydefault@reddit (OP)
That's still not 32 GB, and memory is far more expensive now than years ago, sadly.
EarlMarshal@reddit
It's not more expensive. It costs the same to create it. Only the greed got bigger. And yeah, sure, it's not 32GB, but 3x24GB is more than 2x32GB, with more compute for almost the same price.
onan@reddit
Any time in the past 6ish years?
Caffdy@reddit
only the M5 Max or the Ultras can reach 600GB/s or more, and those don't run at $1000
AC1colossus@reddit
You and I both know it's not the same
onan@reddit
True. One of them also throws in an entire free computer!
redimkira@reddit
my words. came here for this. rooting for Intel, but this is not a price point I am interested in. The market is so fed up that even $949 looks cheap at this point
brainrotbro@reddit
It’s not even a month of groceries 😭
EarlMarshal@reddit
Where I am from it's 2-3 months of food.
Ryoonya@reddit
Well, if that is the case, you must have other pressing matters, like survival, instead of a computationally intensive hobby.
Just because you are from a country with a shit economy doesn't mean you are going to get subsidised products that originated from another economy.
That is not how the world works.
StoneCypher@reddit
it is half the price of other cards in its performance space
a car can be cheap at $10k, and a house can be cheap at $100k
ldn-ldn@reddit
A house for just $100k, mmm...
KadahCoba@reddit
My market? Add another 0 and you have the cheapest house in the county.
ldn-ldn@reddit
Yeah, that's a more realistic price in my area too.
KadahCoba@reddit
When I was looking last year, the only house on my side of the country that was under $1M was still the same red-tagged fire- and flood-damaged house on a tiny narrow lot directly along a protected stream that floods about annually; also, it's in a rock slide zone. It seems to be off the market currently. I doubt it sold (unless they found a real sucker), as it's very likely impossible to get all the permits to repair it, let alone all the required types of insurance (fire, flood, mud flow, rock slide, earthquake, environmental).
StoneCypher@reddit
hello fellow californian
KadahCoba@reddit
At least we're not Los Angeles.
StoneCypher@reddit
Lol
ldn-ldn@reddit
Just to back it up with official government stats, average house prices in London:
- Detached: £1,152,000
- Semi-detached: £719,000
- Terraced: £646,000
Source: https://www.gov.uk/government/news/uk-house-price-index-for-january-2026#:~:text=Table_title:%20Average%20price%20by%20property%20type%20for,2026:%20£554%2C000%20%7C%20January%202025:%20£564%2C000%20%7C
$100k is like £80k or something. Won't even buy a shed with that money...
KadahCoba@reddit
I was talking floor/minimum. Those averages on detached are about the same if I include the cheaper cities between me and the coast. My side of the county's average is more around or above $2M, mainly due to low volume allowing the $5-30M+ properties to skew an average. Scanning MLS listings and discounting the obvious condemned, auctions, errors, 55+, and scam listings, the cheapest appears to be <10 houses listed at $1.1M
That's a bit over $1.5M
The closest equivalent to that type in my region is condos and "townhomes". Looking at listings for the ones around my office, the prices range from a few at $550-900k for a 1 bed/1 bath to $1-3M for non-studios. Most are in the $1.2-1.5M range for a 2-3 bed unit.
Checked some that I was looking at back in 2017 that were $300-350k then, they are now $1-1.5M. I would say I should have bought them, but I couldn't afford that back then either. xD
I've been looking out of state. The average for one market I've been eyeing is about $350k. When I started looking just before covid, that was around $100k. There is regret for not buying in there, but shit went turbo literally within the month I began looking, and prices on most jumped 10x. :|
AdOne8437@reddit
I'll buy 5.
kaisurniwurer@reddit
It's comparable to a 3090 per GB from a year ago, so not too bad actually.
KadahCoba@reddit
It's apparently a card with 33% more VRAM than a 3090 for about 20% more money than the current used eBay price of a 3090.
It's going to need to be quite a lot faster than a 3090 to compete with that downside, since 3090s work with almost everything out of the box. It's the same problem as AMD compute.
Honestly, 32GB should have been the minimum for any AI compute/high-end gaming GPU in 2025. I've been running 4-8 4090s and that started to be not enough for a lot of the new open models from last year.
bnolsen@reddit
Youtube titles on reddit
muyuu@reddit
"i got a small loan of a million dollars" moment
haagch@reddit
So far the Radeon R9700 was more than 2x the price of an RX 9070 for just +16GB VRAM, and still by far the cheapest "current gen" GPU with 32GB VRAM.
It's not cheap in absolute terms but it's cheap compared to every comparable product.
EarlMarshal@reddit
7900XTX had 24 GB and you could get them below 700.
lol-its-funny@reddit
1k? No thanks
Consistent-Height-75@reddit
practically free. Pocket change.
xmesaj2@reddit
Did I hear the price will be $1249?
Ok-Drawing-2724@reddit
Could be a solid option for running bigger local models like Qwen 3.5 27B at 4-bit. Before using any new GPU with agents, I’d scan the setup with ClawSecure first.
Clayrone@reddit
Hats off to the people who want to experiment with this. I got the R9700 AI PRO with 32GB VRAM for my SFF server build and I am pretty satisfied with 640 GB/s. The speed is acceptable for my needs, llama.cpp built for Vulkan works flawlessly, and it takes 300W max. I believe Intel will be its direct competitor and I am curious how the comparison will turn out.
findingsubtext@reddit
For what it’s worth, my Arc A380 can run LLMs flawlessly aside from the fact it only has 6GB of VRAM. Excited to see what Intel has up their sleeve here.
Anxious-Outside2500@reddit
for which models?
findingsubtext@reddit
I mostly serve Gemma3 4b with it via KoboldCPP API.
bcell4u@reddit
I second this. My motherboard doesn't even support resizable bar and CPU is an old Celeron. Works great at 16-17tps
happybydefault@reddit (OP)
That's an interestingly similar GPU, then.
Have you tried vLLM or SGLang with your GPU? I imagine they would be much faster than llama.cpp, but I'm not really sure.
Clayrone@reddit
I have not tried those yet, but they are on my list!
UltraSPARC@reddit
vLLM was a lot faster than llama.cpp for me.
Ok-Ad-8976@reddit
How was it faster on the R9700? Did you actually get it running properly? Because vLLM on an R9700 is a pain in the ass.
I'm actually right now trying to get Qwen 3.5 27B running properly on the R9700 and, trust me, it's not pleasant.
guywhocode@reddit
I'm 20 compiles into getting Qwen 3.5 quants to work; it took 10 to break 35 t/s on pp512. Now it's at 1440. tg has been 58 t/s since the first try, though.
gdeyoung@reddit
Would love to know more about your recipe for this. I have given up on Qwen 3.5 on my 9700 for now.
Ok-Ad-8976@reddit
I had Claude take kuyz's vllm toolbox and rebuild vLLM for it while keeping everything else pinned; that seems to be more performant than just using the stock vLLM docker image from AMD. Also you need to enable MTP, and an MTP of 3 seems to give about 66-70% token acceptance. Seems like MTP only helps with 2x R9700, though. I still haven't stumbled upon a reasonable solution for running quantized Qwen 3.5 27B in vLLM on the R9700.
llama.cpp with Q4 gives decent performance of 30 t/s if only using one slot and no -kvu on Vulkan, but pp is subpar at around 660 t/s.
Ok-Ad-8976@reddit
Yeah, I've been struggling with it. It doesn't work that well. I have dual R9700s and I can get token generation to, best-case scenario, 35 tokens per second if I'm using MTP3. But that's a very optimistic number. If I use
https://github.com/eugr/llama-benchy
That gives me much lower numbers. I get only 11.5 tokens per second. At depths of 16k, I get 4 tokens per second.
It's still somewhat usable, and it looks better in a chat interface than what the numbers say because pp is almost 1600 t/s, but it's nowhere near as good as, for example, what I can get from TP=2 clustered Sparks for a 397B model: a steady 30 t/s tg128 and 1650 t/s pp2048.
I tried the stock vLLM image you can pull from Docker and that one was quite a bit worse. I ended up having to do my hybrid build where I use, well not me, Claude takes Kuyz's image and then heavily patches it so that it uses the newest vLLM but keeps the Triton kernels pinned at 3.6 or something so they don't crash, plus some other patches that Kuyz has. Bottom line: it's not worth the trouble for the tokens per second you get running on a single R9700 at Q4.
By the way, the above is all trying to run FP8. I have not been able to get any sort of GPTQ or AWQ quants running on the R9700 successfully with vLLM.
sixcommissioner@reddit
the part where Claude has to take a custom docker image and heavily patch it with pinned Triton kernels just to get vLLM running is not exactly a sign the ecosystem is ready
colin_colout@reddit
I had nothing but issues with vllm with my Strix Halo (gfx1151).
Is RDNA4 more compatible? Which gfx target is that board?
Explurt@reddit
gfx1201, same as the rx 9070
letsgoiowa@reddit
My friend got WAYYYYYYYYY better results with ROCm like 8x the TPS on Qwen 3.5 9b.
armeg@reddit
Can I honestly ask - what are you guys actually doing with Qwen 3.5 9b? I’m honestly serious.
letsgoiowa@reddit
Fun and Zeroclaw for "free"
armeg@reddit
Gotchya, alright, my main business use case is OCR, but even then I'll probably end up just using it via somewhere I can host a model since I need fault tolerance.
I guess for me the fun factor for local only really becomes real when I can use it for coding for real which would require approaching Sonnet 4.5 levels.
Clayrone@reddit
The reason I went with Vulkan was that there was constant power drain on idle with ROCm. Might check if this got fixed, though.
6jarjar6@reddit
They are working on a fix https://github.com/ggml-org/llama.cpp/issues/20482#issuecomment-4122628483
ElementNumber6@reddit
That's just the crypto coin miner. Don't mind it at all.
letsgoiowa@reddit
Ah fair. That's pretty weird.
ForsookComparison@reddit
Prompt processing is way better. If you have rdna3/4 it's not even close
FullOf_Bad_Ideas@reddit
I am curious about top BF16 flops achievable on R9700 AI to see compute/cost numbers but I can't find any place to rent them out on-demand for an hour without commitment.
Could you please try to run this? No full run needed, just a few minutes until the max TFLOPS number settles to a stable floor. If you hit a ROCm issue, don't bother troubleshooting it.
https://github.com/mag-/gpu_benchmark/
R9700 AI theoretically could have up to 190 TFLOPS there but I expect it to be lower, the big question is whether it will be a tiny bit lower or 2x lower.
Clayrone@reddit
I ran it with pytorch rocm7.2 and got a stable result of 138.2 TFLOPS, so not 2x but not a tiny bit either.
FullOf_Bad_Ideas@reddit
thanks!!!
Specific-Goose4285@reddit
Two or three years ago I was piecing together an ungodly mess of libraries and broken instructions for ROCm on consumer RDNA2 cards. Setting library paths, using their patched LLVM compiler to build llama.cpp, variables to force-set GFX versions to convince ROCm to work, and all that.
I had fun doing it. Would gladly do it again but at that time I happened to have that AMD laptop with discrete graphics I wanted to make work.
Hopefully intel gets to a decent point soon.
mslindqu@reddit
Can you speak to token rate on a model that mostly fills the card?
Clayrone@reddit
I ran some benchmarks with the basic builds I have so this is what it looks like without any deeper dive:
mslindqu@reddit
Thank you. Wow, looks like that thing crushes it in general.
pixelpoet_nz@reddit
No, it's inanimate
mslindqu@reddit
You can still speak *To* it. Do better.
pixelpoet_nz@reddit
You know I'm talking about the silly turn of phrase and how it's basically pompous "talk about". Try harder, or just keep eroding what is literal and what is figurative, just like your use of "literally".
What do you think emphasizing "to" here distinguishes from the previous use?
Clayrone@reddit
I will see if I can give some benchmark when I have a bit more time for testing it.
Much-Researcher6135@reddit
That's an excellent power profile.
a201905@reddit
I just picked up 2 of these and got them delivered yesterday. Any tips/suggestions? It's my first time switching from CUDA.
TheyCallMeDozer@reddit
Oh nice, I literally just got dual R9700 cards for my build. Awesome to see it runs with llama.cpp; I was thinking I might need to learn how to use vLLM after I build it tonight.
spaceman_@reddit
Are you running Linux, and if so, what distro? I've just gotten two R9700 and on Debian 13 (with kernel and mesa from backports) I'm seeing nothing but issues using Vulkan.
ROCm is a little better but still crashes occasionally.
Clayrone@reddit
I am using Ubuntu 24.04.3 LTS, but honestly I have just a couple of models that I use and it's stable enough so not much tinkering here. I tried Qwen 3.5 35B Q6 and 27B Q6 and Q8 via opencode and some smaller ones and they have been fine so far, however I only just assembled that machine not that long ago.
NineBiscuit@reddit
Intel GPUs are actively getting dumped by video game developers. Skip.
fallingdowndizzyvr@reddit
How are they getting dumped? If the card supports DX12 and/or Vulkan, then it's supported in video games. Regardless, this is a server card. Playing games on it is not its primary purpose.
NineBiscuit@reddit
There are 2 games so far that refused to support Intel GPUs and counting.
fallingdowndizzyvr@reddit
LOL. Talk about gaslighting. Only 2? You made it sound like it was the entire industry.
LOL. Did you miss the part when I said "this is a server card"? Do you think that Intel only makes one GPU? You must if you think that gamers are using this to play games on. Especially since it's not out yet.
There are many Intel GPUs. Some are for consumers. Some are for businesses. This is a business card. The "Pro" in the name is a clue. I guess not enough of a clue for you.
LOL. The only one gaslighting here is you. And you are doing a piss poor job of it.
NineBiscuit@reddit
2 paves the way. you can dismiss as you want.
fallingdowndizzyvr@reddit
One server GPU is not all Intel GPUs. One is not all. You can try to gaslight all you want. You will fail.
Cokadoge@reddit
wow! millions of gamers bought this GPU already? crazy!
sleepy_roger@reddit
Who wants it for video games? I think most high end consumer gpu's are now being used for AI.
NineBiscuit@reddit
who? there are millions. are you that delusional?
Long_comment_san@reddit
Does it support 4 bit natively?
BallsInSufficientSad@reddit
I'm not sold on the notion that LLMs are best at 4-bits. It seems too small when models are trained on so much more.
Long_comment_san@reddit
They aren't massively reduced in precision, but they are massively reduced in size. That's the whole point. That's why I waited for the Blackwell Supers so much: 24GB at 4-bit precision holds a huge-ass model if it's quantized to 4-bit itself.
BallsInSufficientSad@reddit
Being fast and dumb isn't necessarily better.
happybydefault@reddit (OP)
No, not natively, it seems.
Source: https://www.tomshardware.com/pc-components/gpus/intel-arc-pro-b70-and-arc-pro-b65-gpus-bring-32gb-of-ram-to-ai-and-pro-apps-bigger-battlemage-finally-arrives-but-its-not-for-gaming
Long_comment_san@reddit
Meaning no model in particular. So it's BF16, bruh. Well, that's not that big of a deal currently; 32GB is a lot of VRAM in the MoE age.
TechExpert2910@reddit
pretty much every model is available in an int8 quant, though — so this should be fine
TuxRuffian@reddit
They don't seem to publish numbers for it like they do for FP32 and INT8; however, a chart from a WCCFtech article shows the Xe Matrix Extensions support INT2, INT4, INT8, FP16 & BF16.
nospotfer@reddit
but no CUDA so....
dark_bits@reddit
Genuine question: in terms of performance, CC is unbeatable at about $20 per month (this is enough for me since I don't rely on it to write ALL my code), and I've tried local LLMs, and while they're okayish I still fail to see a reason to drop $1k on them. So what's the actual use case for them?
happybydefault@reddit (OP)
For me, personally, there are several reasons:
Reliability. I'm very skeptical of the quality of commercial models at times when they are under heavy load. I think they are not being transparent at all about the quantization or other lossy optimizations they do to their models, maybe sometimes even dynamically. So, you can't even get an accurate grasp of how reliable they are because that reliability can change at any time. They can even update the weights and not update the model version, and you wouldn't know about it.
Privacy. I don't want those companies to have the ability to know/keep my data. To my understanding, they keep logs of your data even for legal reasons, even if they don't end up training on it.
I hate Claude's moral superiority and condescending attitude. I want my model to follow my instructions to the letter, not to do its own thing. That's less of a problem with Gemini and OpenAI models, though, in my experience. But that's definitely something that, if you are knowledgeable enough, you can address yourself with your own models.
Price. You can run a local model in a loop forever and it will not cost you a ton of money besides electricity.
happybydefault@reddit (OP)
Another reason. I would love experimenting with training my own small models. That's possible or at least much better with your own GPU.
dark_bits@reddit
This for me would be the only reason tbh.
We’ve been HEAVILY using CC at work for a rewrite and honestly sometimes it was less performant, but even so it was still miles ahead of any local model I’ve used.
I understand and this is a personal choice, however let’s be realistic we’re not really dealing with top secret rocket science stuff, so who cares even if they end up training it on your code. I tend to open source almost everything remotely complete that I do, so for me it’s a no biggie tbh. Let civilizations make use of your brain power (in this regard I’m 100% pro with Qwen distilling Claude - they can and they should).
Eh I don’t see it tbh, but I believe you.
True, but again if your needs can be satisfied by a $20 subscription then price tends to favor Claude.
Experimenting with AI locally tho? I love it! I’d drop a grand to be able to do that as much as possible.
SneakyPositioning@reddit
for number 2 that was my main concern tbh. LLM could be used for more than coding. If you want to go with a personal/family assistant that's included in your family group chat and handles chores for you at your wish, I wouldn't want any providers to see those even tho they are not top secret. But I am still too poor to build a dedicated local LLM inference machine, so I just have to be patient I guess.
dark_bits@reddit
I think if you go to your account settings you can opt out of data sharing. Now that might not be the same but I guess it’s something?
DeconFrost24@reddit
Ya know, thinking about this, there's probably a concerted industry effort not to give the peasants too much GPU and VRAM, so as not to impact cloud-hosted (paid) models. The bigger this gets (meaning capabilities and use cases), the less I want it in the cloud.
happybydefault@reddit (OP)
I hadn't thought of that but it makes sense. I think it's unlikely but it's definitely plausible.
MentalStatusCode410@reddit
Wouldn't 2x 5060 Ti 16GB be better value?
Each card has 448 GB/s (almost 900 GB/s combined) and occupies an x8 PCIe slot.
Seems more sensible given the optimizations/compatibility for Nvidia.
happybydefault@reddit (OP)
To my understanding, in that case the memory bandwidth wouldn't double; it would remain that of a single GPU (448 GB/s), or even effectively a little lower.
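A sketch of why, assuming layer-split inference (llama.cpp's default multi-GPU mode), where each token passes through the GPUs one after another:

```python
# With layers split across GPUs, per-token time is the sum of the
# per-GPU times, so effective bandwidth stays at roughly one card's worth.

def layer_split_tps(model_gb, gpu_bws_gbs):
    shard_gb = model_gb / len(gpu_bws_gbs)  # assume an even split
    per_token_s = sum(shard_gb / bw for bw in gpu_bws_gbs)
    return 1.0 / per_token_s

print(layer_split_tps(28, [448, 448]))  # two 16GB cards: ~16 tok/s ceiling
print(layer_split_tps(28, [608]))       # one B70 holding it all: ~22 tok/s
```

Tensor parallel can do better, but it needs fast interconnect and framework support, which is exactly where consumer multi-GPU setups struggle.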
tryingtolearn_1234@reddit
This is a smart move they should have made years ago.
Even_Package_8573@reddit
32GB VRAM at that price is honestly kind of wild. Feels like Intel is targeting the “run stuff locally without selling your soul” crowd lol.
I'm more curious how it holds up in real workflows though, like not just inference but the whole loop (loading models, compiling, iterating). Sometimes that's where things start to feel slow even if the raw specs look great.
If this ends up being stable + decent driver support, I can see a lot of people jumping on it just for experimentation alone.
Tai9ch@reddit
Are they really going to sell them, or is this another paper launch with no stock for 6 months and then at 50% higher than announced prices like the B60?
happybydefault@reddit (OP)
Well, taking into consideration that they supposedly start selling them in like a week, I imagine they will have stock. Not sure, though.
lightmatter501@reddit
If you actually have a contact in the enterprise sales space, you will be able to get one very soon. Priority is going to go to companies first since this is a pro card.
Tai9ch@reddit
Intel launched the B60 in May 2025 for $500. The first ones became available for sale online around December for like $800.
happybydefault@reddit (OP)
Oh, that sucks.
jduartedj@reddit
the 608 GB/s bandwidth is honestly the most interesting part for me. for inference that's what actually matters more than raw compute, since most local LLM work is memory-bandwidth bound. at $949 with 32GB that's pretty competitive vs getting a used 3090 for like $800 and dealing with the power draw.
my main concern would be the software stack tho. llama.cpp has SYCL support but it's still not as polished as CUDA. has anyone actually tried running qwen 3 or similar models on the existing arc gpus? curious how the tok/s compares in practice vs what the bandwidth numbers would suggest
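One way to frame "what the bandwidth numbers would suggest": token generation is roughly capped at bandwidth divided by the bytes of weights read per token, so you can score any benchmark run as a fraction of that ceiling. A minimal sketch (the measured number below is made up for illustration):

```python
# Sanity-check a measured tok/s figure against the card's paper bandwidth.
# Ceiling for generation ~= memory bandwidth / model size in memory.

def ceiling_tok_s(model_gb: float, bw_gb_s: float) -> float:
    return bw_gb_s / model_gb

def efficiency(measured_tok_s: float, model_gb: float, bw_gb_s: float) -> float:
    return measured_tok_s / ceiling_tok_s(model_gb, bw_gb_s)

# Hypothetical: a 16 GB Q4 model on the B70's 608 GB/s.
print(f"ceiling: {ceiling_tok_s(16.0, 608.0):.0f} tok/s")                      # ~38
print(f"efficiency if you measure 28 tok/s: {efficiency(28.0, 16.0, 608.0):.0%}")  # ~74%
```

As a rough rule of thumb, mature CUDA stacks tend to land well up that range; a much lower fraction on Arc would point at the SYCL/Vulkan software rather than the silicon.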
wsxedcrf@reddit
As nvidia has said "Free is not cheap enough" in the grand scheme of things. It's the whole ecosystem that matters.
Tight-Requirement-15@reddit
NVIDIA's moat isn't the GPUs; other accelerators have always existed, now more than ever with TPUs, wafer-scale chips, and Trainium. It's CUDA. The tooling is very mature. It's been around for almost 20 years now with the entire toolchain: compilers, drivers, dev tools, and optimized libraries maintained by NVIDIA engineers, like the kernels in cuBLAS. Good luck trying to recreate all that and the trust around the ecosystem in one go. Not saying it's impossible, but there's definitely a lot of stability there.
One-Employment3759@reddit
That's just Nvidia propaganda to justify their rip offs.
Tai9ch@reddit
Nah.
There's still some CUDA wall, but it's not that big a deal for most use cases.
happybydefault@reddit (OP)
I agree with that, but if you only care about inference and vLLM supports the GPU, then I see a lot of value there already.
I would love running Qwen 3.5 27B at a decent speed and quantization, but an NVIDIA GPU with 32 GB of VRAM would be far more expensive than this Intel one.
colin_colout@reddit
Do you know if vllm fully supports the card, or does it only support a subset of functionality via a less-optimized translation layer (like HIP with consumer AMD GPUs)?
happybydefault@reddit (OP)
From vLLM's website:
https://docs.vllm.ai/en/stable/getting_started/installation/gpu
But I'm unsure of what that means exactly.
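For what it's worth, if the Intel (XPU) build exposes the same Python API as the CUDA builds, "supported" would look something like this. A minimal sketch; the model name is a placeholder sized to fit in 32 GB, not something I've tested on this card:

```python
# Hedged sketch: standard vLLM offline-inference API, assuming the
# Intel XPU build behaves like the CUDA one.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-32B-Instruct-AWQ")  # hypothetical model pick
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Why is LLM inference memory-bandwidth bound?"], params)
print(outputs[0].outputs[0].text)
```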
ocean_protocol@reddit
Yeah, the interesting part isn’t performance, it’s the 32GB VRAM at that price that’s basically aimed straight at local AI use, not gaming. Feels like Intel’s betting on “more memory for cheaper” rather than chasing Nvidia on raw speed.
Real question is whether the drivers hold up this time :)
Ok_Warning2146@reddit
Not a bad product, but I think it needs 64GB+ to be competitive
glenrhodes@reddit
32GB at $949 is genuinely interesting for local inference. The bandwidth story is decent at 608 GB/s. My concern is driver quality on Linux though. Intel's GPU drivers have been getting better but they're still nowhere near the CUDA ecosystem for production workloads. Running Qwen 30B at 4-bit would be sweet if the tooling actually supports it without constant wrestling matches.
IntelligentOwnRig@reddit
The bandwidth is the number to watch here. 608 GB/s puts the B70 below the RTX 4070 Ti Super (672 GB/s), which costs $779 with half the VRAM. And the used 3090 at 936 GB/s has 54% more bandwidth for roughly the same price, just with 24GB instead of 32.
The B70's real value is fitting models in the 27B-34B range at Q6 or Q8 without quantizing as aggressively. A 70B at Q4 needs about 41GB, so even 32GB won't get you there. But Qwen 3.5 27B at Q8 sits around 30GB and that's where this card earns its keep.
The catch is the software stack. No CUDA. Vulkan through llama.cpp works but isn't as fast. vLLM having mainline support is promising, but "day one support" and "day one performance parity with CUDA" are very different things.
If 24GB is enough for your models, the used 3090 is still the better buy. If you need 32GB and don't want to deal with AMD's ROCm, this is worth watching once real benchmarks land.
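If you want to sanity-check those quant numbers yourself, the arithmetic is simple: weights take roughly params times bits over 8, plus a small allowance for runtime overhead, and KV cache comes on top depending on context length. A rough sketch (the effective bits per weight are approximations):

```python
# Rough VRAM-fit check behind the quant numbers above. Effective
# bits/weight for GGUF quants are approximate; KV cache is extra and
# grows with context length.

def fits(params_b: float, bits_per_weight: float, vram_gb: float,
         overhead: float = 1.05) -> bool:
    weights_gb = params_b * bits_per_weight / 8
    return weights_gb * overhead <= vram_gb

for bits, name in [(4.5, "Q4_K_M"), (5.5, "Q5_K_M"), (8.5, "Q8_0")]:
    print(f"27B {name}: fits 24GB={fits(27, bits, 24)}  fits 32GB={fits(27, bits, 32)}")

# 70B at ~4.5 bits is ~40 GB of weights alone: no single 32 GB card.
print("70B Q4_K_M on 32GB:", fits(70, 4.5, 32))
```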
redditrasberry@reddit
also compare it to the Apple M5 Max at 600GB/s
It's cheaper, but then with the Apple you get a whole computer with it and unified memory
Alarmed_Wind_4035@reddit
for 999 I will buy two 5060 ti 16gb
Kutoru@reddit
It sucks how NVIDIA pretty much still makes the best hardware.
This is roughly the same TOPS as a DGX Spark but at 2x the power usage. The kicker is that you get about 2x the memory bandwidth as well (also GDDR6 vs LPDDR5X).
Then consider the PCB and chassis size.
AcePilot01@reddit
Eh, they screwed it up with the 608 GB/s tbh.
Odd-Ordinary-5922@reddit
that's a good amount bro
ttkciar@reddit
Why would I buy this when I can get an AMD MI60 with 32GB and 1024 GB/s at 300W for $600?
happybydefault@reddit (OP)
Whoa, that sounds like a much better GPU, then. I didn't know about it.
I wasn't able to find it for $600, but I did find a few MI100s (seemingly better than the MI60), each for around $1,000, which seems like a better option than the new Intel GPU.
ttkciar@reddit
Oof, you're right. There used to be a ton available on eBay, but looking on eBay just now, they seem to have evaporated.
I'm only seeing MI50 upgraded to 32GB (which are technically equivalent to MI60, but carry some risk because the upgrade is third-party and of irregular quality) and MI100 (which is significantly more expensive).
If MI60 availability has gone the way of the dodo, that would be a solid argument in favor of this Intel GPU, though as you point out the MI100 would still be a strong contender.
Zidrewndacht@reddit
The 32GB MI50s aren't "upgraded" the way a 48GB 4090 is. The MI50 is an HBM card, so that kind of memory swap can't be done.
They were born with 32GB; they just have fewer enabled shaders than an MI60.
Tai9ch@reddit
I wouldn't.
I've got a couple MI60's, and they're fun, but it's basically llama.cpp only and prompt processing is sloooow.
happybydefault@reddit (OP)
That's good info. Why would vLLM not work?
Tai9ch@reddit
AMD dropped support a while back, and vllm dropped support at the same time. There's an old vllm fork that works, but it doesn't support any recent models.
The key problem is that the MI60 released back in 2019, which means it was designed before the LLM hype really got going. That means it doesn't have any of the hardware features that really speed up inference. No fast matrix instructions, no FP8 support, it doesn't even have BF16 support.
I actually spent a couple days trying to port modern vllm to it. It's certainly possible. It wouldn't even be that slow. But there's no way in hell I'd recommend MI60 (or even MI100) for ~$500 over a modern supported card like the R9700 or this B70 for ~$1000.
happybydefault@reddit (OP)
That definitely takes those old AMD GPUs out of the question for me, then.
I wish @ttkciar, the OP of this subthread, had given that context if he had it. Otherwise, shame!
ttkciar@reddit
I think because llama.cpp supports the MI60 via its Vulkan back-end, and vLLM only supports AMD via a ROCm back-end, and recent ROCm drops support for MI60.
That's a genuine limitation of MI60 which I omitted from my earlier comment.
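For anyone curious what that Vulkan route looks like in practice, here's a minimal sketch via the llama-cpp-python bindings. The model path is a placeholder, and you need a Vulkan-enabled build at install time:

```python
# Sketch of the llama.cpp Vulkan route via the Python bindings.
# Requires a Vulkan-enabled build, e.g.:
#   CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/some-27b-q5_k_m.gguf",  # placeholder local GGUF
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=8192,
)

out = llm("Q: Why is token generation memory-bandwidth bound?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```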
happybydefault@reddit (OP)
Got it. I appreciate the new context.
Life_is_important@reddit
But can they run AI models the same as NVIDIA? ComfyUI? LTX? WAN? llama.cpp? And other LLM or visual/audio gen?
ttkciar@reddit
I can't speak to those other projects, but llama.cpp's Vulkan back-end supports AMD Instinct GPUs marvelously. A lot of folks in this community (including myself) use them for exactly that.
Life_is_important@reddit
That's amazing!! AMD cards are a lot cheaper. I bought used 3090 for cheap, but my next card might be AMD. By then, probably all these kinks will be worked out even better.
Tai9ch@reddit
Because the MI60 is slow and has basically zero software support.
ttkciar@reddit
But it's significantly faster than this new Intel GPU, which was the point.
As for "basically zero software support": yes and no. llama.cpp's Vulkan back-end has just worked with AMD Instinct GPUs for years now, and PyTorch support for AMD GPUs is now a thing and getting better.
For projects with hard dependencies on CUDA kernels (like Axolotl) the MI60 would be a non-starter, but so would this Intel GPU. Hence it is a valid comparison.
Tai9ch@reddit
No. It's not.
On paper, the MI60 has nearly twice the memory bandwidth. That's great, and there's certainly some possibility that a custom MI60 optimized inference engine could compete on throughput.
But nothing is optimized for the MI60. Most stuff doesn't even support the MI60, because it doesn't have the fast matrix instructions or BF16 data format support that modern inference engines rely on.
And without that hardware support, there's no way to fix the main weakness of the MI60: prompt processing. Literally anything else is 5x-10x faster. You'd get faster prompt processing on 2x Intel B50s than on one MI60. If an RTX 3060 could push 2k tokens per second of prompt processing, an MI60 would give 300 t/s with the same model.
XccesSv2@reddit
i bought mine for $250 btw, but to be clear: you cannot buy it new, so you can't compare prices directly.
Stochastic_berserker@reddit
Any AMD is preferred over Intel GPUs because of software stability
HealthyInteraction90@reddit
32GB VRAM for $949 really hits that 'Goldilocks' zone for local inference. While the CUDA moat is real, the progress llama.cpp has made with the Vulkan backend makes these Intel cards a viable path for hobbyists who want to run big quantized models without selling a kidney for an A100 or dealing with the power draw of dual 3090s (though a 70B still won't fit at Q4; that needs about 41GB). If the drivers hold up under a 48-hour inference load, this is going to be a huge win for the 'Local AI' crowd.
kidflashonnikes@reddit
I run a team at one of the largest AI companies (head of research for a department). My thoughts on the new Intel GPU, as someone who deals with hardware every day of my life, about 11 hours a day from Monday through Saturday night: this GPU is good for cheap VRAM, but it exposes the entire GPU industry. Cheap VRAM is not enough. It just doesn't cut it. If I were to rank this GPU against the entire Nvidia lineup, it sits right below the RTX 3090 and 3090 Ti.
Intel is catching up, but they started a marathon by shooting themselves in the foot before the race even started. That is just the reality. Yes, you will be able to run larger LLMs, but you won't be able to run them the way you can on Nvidia chips. It's just reality. I want Intel to catch up, but it's too late. At the company I work for, the models that will be released in 2027 are beginning to make me question what being human even means. It's too late for Intel.
nntb@reddit
I want 200gb+ vram
Inevitable-Buy9463@reddit
Rats. I just ordered another 3090 because I got tired of waiting for new-gen GPUs to exceed its price/performance ratio.
GloomyRecognition636@reddit
About f time
Podalirius@reddit
I'm all for rooting for the underdog, but owning intel stock is just admitting you're a masochist
cafedude@reddit
Thanks for letting us know your financial incentives.
redditrasberry@reddit
what local stack will work with these? is it supported by e.g. llama.cpp to fully use the GPU memory / acceleration primitives?
happybydefault@reddit (OP)
It seems it's supported by upstream vLLM. I don't know what the support by llama.cpp is.
chuckaholic@reddit
Intel has been making some interesting moves recently. They have some budget CPUs right now that compete with AMD in performance per dollar.
Their Arc GPUs, though... a lot of devs aren't supporting the architecture at all, and a number of triple-A titles don't run on Arc. Kinda sad really, because the GPU industry REALLY needs some competition right now to drive down prices.
If Intel is really interested in entering this market and competing, they need to start writing libraries for PyTorch, TensorFlow, JAX, and all the other stuff that runs faster on CUDA. Either write new libraries, or offer some kind of CUDA compatibility layer.
And will Intel GPUs support any kind of interconnect that's faster than PCIe? 32GB is a good start, but I can't run Kimi on that. The models I WANT to run will need 4 of those cards. And they need unified memory.
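Worth noting on the PyTorch point: upstream PyTorch (roughly 2.4 onward) does ship an `xpu` device backend for Intel GPUs. A minimal smoke test, assuming an XPU-enabled build and Intel's compute runtime are installed; how well it performs is a separate question:

```python
# Smoke test for upstream PyTorch's Intel GPU ("xpu") backend.
# Assumes a PyTorch build with XPU support plus Intel's drivers/runtime.
import torch

if torch.xpu.is_available():
    x = torch.randn(4096, 4096, device="xpu", dtype=torch.float16)
    y = x @ x  # matmul dispatched to the Intel GPU
    torch.xpu.synchronize()
    print("xpu ok:", y.shape, torch.xpu.get_device_name(0))
else:
    print("no xpu device found")
```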
happybydefault@reddit (OP)
Oh, I thought essentially all games except for a few would run on Intel Arc GPUs. Is support really still that bad?
chuckaholic@reddit
https://www.reddit.com/r/IntelArc/comments/zl2dum/arc_incompatible_games_list/
happybydefault@reddit (OP)
That's a list from 2023 (or 20023 in the future, if you want).
GenerativeFart@reddit
I’m not even sure if this is a good deal. The expensive GPUs are expensive because they support NVIDIA compute capability 8 and up. There are plenty of cheap GPU options with lots of VRAM.
lemon07r@reddit
Used 7900 XTXs go for roughly 700 USD in my area (Canada), so I'm not sure how appealing this is. You get like 33% more VRAM at a 42% higher cost, and I imagine it won't be as fast. Not to mention buying used means skipping the 13% tax we'd have to pay here on the new Intel card. I'm not super familiar with the Intel software stack either, but ROCm has been decent for me; I've been able to do most things on my AMD cards. I guess this could still be a good option if per-slot VRAM matters most to you.
nickm_27@reddit
Exactly the same math that I did
tezcatlipoca314@reddit
I’ll go with either an Nvidia card or Apple unified RAM. The Nvidia card is expensive but has CUDA support and good training and inference. Apple RAM is almost inference-only, but it's a lot cheaper than the others.
Late-Assignment8482@reddit
SOMEONE needs to get in on the x86 side besides NVIDIA and AMD, so godspeed to them.
zubairhamed@reddit
They need an NVLink equivalent
jrexthrilla@reddit
I’m running Qwen 27B at 4-bit right now on a 3090 and it has plenty of headroom. Why would you need 32GB for the 4-bit?
KiranjotSingh@reddit
Will it be good enough for video generation?
Vicar_of_Wibbly@reddit
Pre-order at Newegg is live for $949 each, limit 2 per customer. Release day is April 2.
Griznah@reddit
"Cheap"... nope, $950+ not cheap
happybydefault@reddit (OP)
Much cheaper than most other options with 32 GB of VRAM and ~600 GB/s of bandwidth.
tothatl@reddit
People think they will be doing modern LLMs with embedded GPUs.
Nope, not soon anyway.
For that you have to pay.
Griznah@reddit
Just because something is cheaper doesn't make it cheap. Aggressively priced, agreed. Hopefully they can get their drivers in order. I heard a rumor Intel was dropping out of the discrete GPU market. Fake news?
IrisColt@reddit
Anon, I...
nmkd@reddit
> Intel will sell a cheap GPU
> $949
flockonus@reddit
The most disagreeable thing I've read is Intel as a stock. It's atrocious, sell it 😂
fallingdowndizzyvr@reddit
I have to disagree with you on that. I bought Intel during the depths of its recent lows these last years. Now that Trump has nationalized them, I've made a pretty penny. Well, on paper. I'm still holding, since you never bet against a company that the government will not let fail.
happybydefault@reddit (OP)
The atrocious stock made you a pretty penny. We see each other once again, stranger. At this pace we'll end up being friends.
happybydefault@reddit (OP)
Well, your disagreement doesn't correspond to reality. I bought the majority of it about a year ago when it was at less than half of what it's worth at the moment. Of course, it could go down to nothing, but I think that won't be the case.
ArtfulGenie69@reddit
I wish there was some real competition happening. That $1,000 card suggests the 5090 is probably worth a lot less, like $1,500 in reality if NVIDIA didn't have the market by the balls. It's all about stupid CUDA. Wish there was an actual alternative to it.
madrasi2021@reddit
One can hope this drives some market pressure for prices / product offerings...
Palmquistador@reddit
Don’t you need an NVIDIA GPU for inference? Pardon my ignorance.
happybydefault@reddit (OP)
For the most compatible, performant inference, yes. But other GPUs also do inference. I mean, that's what they do when they "run" LLMs or other types of ML models.
fallingdowndizzyvr@reddit
I don't even know what you mean by that. Inference isn't dependent on any particular hardware. It's just software. By the way, Nvidia isn't the most performant. The likes of Cerebras are more performant.
happybydefault@reddit (OP)
I think my response was as accurate as it makes most sense for somebody that didn't know whether other GPUs besides NVIDIA ones can do inference.
fallingdowndizzyvr@reddit
Your response is not accurate period. It's misleading. Educate with facts. Not misinformation.
happybydefault@reddit (OP)
I think it was as accurate as needed. Bye, stranger.
fallingdowndizzyvr@reddit
It was as misleading as it didn't need to be.
fallingdowndizzyvr@reddit
No.
dingo_xd@reddit
Can Intel do what AMD refused to do?
mmhorda@reddit
I tried different backends on Intel (llama.cpp, Ollama, IPEX images) and it seems like OpenVINO works the best, but it lags in supporting the latest models. Maybe I am doing something wrong and someone could point me in the right direction. Otherwise, on an Intel Arc iGPU with OpenVINO I get about 29 t/s generation on the Qwen3 30B A3B Instruct model.
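For anyone who wants to try that route, the OpenVINO path looks roughly like this: convert the model once with optimum-cli, then load it through the openvino-genai bindings. A sketch with placeholder paths and model; the lag-behind-on-new-models caveat above bites at the conversion step:

```python
# Sketch of the OpenVINO route. The model is converted once, e.g.:
#   optimum-cli export openvino --model Qwen/Qwen2.5-7B-Instruct ./qwen-ov
# (model and paths are placeholders), then loaded via openvino-genai.
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("./qwen-ov", "GPU")  # "GPU" selects the Intel GPU plugin
print(pipe.generate("Summarize why VRAM capacity matters for local LLMs.",
                    max_new_tokens=128))
```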
so_chad@reddit
If I get this, can I “casually” game? RDR2, The Last Of Us, etc.. Steam games you know.. I would replace my RX 9070 XT
happybydefault@reddit (OP)
Yeah, I've heard very good things about Intel GPUs, like the B580. At the beginning drivers sucked, but now I think essentially all games run on them except for a few.
Stochastic_berserker@reddit
I have the B580 and it is a shitpacked GPU recycling 2019 graphics and hardware in a modern product
happybydefault@reddit (OP)
Dang, that's disappointing to read.
Rollingplasma4@reddit
The B580 is on the level of a 4060 if you look at actual benchmarks. Don't know where that person is getting their numbers from.
Darth_Candy@reddit
Intel GPUs are pretty reasonable for gaming. Obviously you'll need to look at benchmarks, but I was geared up to buy an Arc B580 for 1080p/60fps gaming (no interest in crazy ray tracing or hyperrealism) before I found a good local deal on an AMD card. Intel was missing a higher-end card, which apparently now they're trying to remedy.
Nattramn@reddit
I've heard good things about Intel gpus for gaming (and watched some benchmarks before deciding to just go with cuda).
Might want to research why Crimson Desert, one of the latest releases, doesn't support Intel GPUs. Not because you want to play it, but it might reveal underlying issues with support. If you want something to last the test of time, it wouldn't hurt to have intel (pun intended) on the situation.
BlindPilot9@reddit
They already sell a 16gb one and no one is able to find it anywhere. I bet that it will be a paper launch without anyone being able to get their hands on it.
squachek@reddit
96-128gb or don’t bother
MissZiggie@reddit
Arch drivers?? 👀👀
leonbollerup@reddit
"cheap" :)
inagy@reddit
Define cheap? Wendell said 4 of them is cheaper than a Strix Halo. I find that hard to believe with the current RAM situation.
Anru_Kitakaze@reddit
GPU
Looks inside
Intel...
Seriously, nobody uses it, so nobody will write drivers or software or make models for it. No ecosystem, therefore impossible to use. And it's 1,000 dollars. Forget it.
pas_possible@reddit
That said, the software support is soooo bad. I have an Arc A770, and it's basically not usable besides simple Adam optimization and using it through Vulkan.
standingstones_dev@reddit
32GB VRAM for ~$1K is interesting for dedicated inference boxes. It puts you in ~30B territory at high quant without multi-GPU; a 70B at Q4 still needs about 41GB, so that class stays out of reach on a single card.
But for that money I'd lean towards a beefier Mac with unified memory. A refurb M4 Max with 128GB runs those models and much bigger ones, with no driver headaches, and yes, you spend a good bit more, but you get a laptop that does actual work too.
The Intel offering makes more sense if you're building a headless inference server that sits in a rack or you already have a dedicated system to do a GPU swap.
The real question is the driver maturity brought up earlier in the thread... Intel's GPU compute stack and driver support have been "almost there" for a while.
TuxRuffian@reddit
Seems like the big draw here is for multi-GPU setups w/its native VRAM pooling. I think the extra $350 for an R9700 would be worth it if you're running just one, but pooling ROCm w/vLLM is a pain and the native pooling via LLM Scaler is appealing. I've seen 8 B60s pooled for 192GiB, and 8 B70s would get you to 256GiB, but at $7,600 plus all the other hardware that means at least a $10k build, when you can currently get a Mac Studio M3 Ultra w/256GiB for $6,000 and the M5 Ultras are supposedly coming in June. I got my Strix Halo box (128GiB UMA) for A-tier MoE models at $2k too, so it's hard for me to see the target market here. Still, the more options the better, and maybe it will help keep costs down if nothing else.
Elite_Crew@reddit
So the same price as a 5070 Ti at scalper prices, but with 32GB of RAM instead of 16GB.
But can it play Crimson Desert?
Big_River_@reddit
well well well what do we have here? city slicker trying to slick some city? i will get one just to high noon square my R9700 and 5090 and rtx 4500 and sip that sweet sarsaparilla
Ok-Measurement-1575@reddit
Way too expensive.
happybydefault@reddit (OP)
Is there anything cheaper that has 32 GB of VRAM and similar or better bandwidth? I don't think so.
toooskies@reddit
You can find old MI50s with 32GB, and some super-old, formerly high-end NVIDIA boards that won't get newer CUDA feature support. Which may still be more useful than Intel's AI support.
etaoin314@reddit
It's great to have, if your model can use it; that depends on the support.
ea_man@reddit
Aye, they should release a "gaming/consumer" GPU, not a pro series, for ~$700 with 32GB of VRAM, and then we may talk.
Wanna sell me a PRO with trash support? At least make it 48GB or 64GB so I can swallow Vulkan-only / LLM-only.
GravitationalGrapple@reddit
Intel GPUs don’t work with CUDA though, correct?
Far_Composer_5714@reddit
Considering CUDA is an Nvidia product... it only runs on Nvidia...
WolfeheartGames@reddit
There are CUDA implementations on RISC-V
drooolingidiot@reddit
How does this compare against Apple's M5 devices when it comes to tok/s throughput? is it better value?
happybydefault@reddit (OP)
I think only the M5 Max has around the same bandwidth as this Intel GPU, so I imagine it would perform similarly, but at a much higher price than the GPU.
qado@reddit
Yes and no. No CUDA, no fun. Not the best option, but in fact not the worst either.
WithoutReason1729@reddit
Upbeat-Cloud1714@reddit
Ya that's still really expensive for a GPU.
HairyAd9854@reddit
They have been on and off with their GPU programs for probably 20 years now. Intel discontinued ipex-llm in May, amid a spending review that cut all their non-core projects. It's very hard to believe this is the start of a long-term, sustained effort toward a competitive inference offering from Intel.
I would really like to be proven wrong, but I am sceptical for the time being.
happybydefault@reddit (OP)
Well, with the rise of ~the machines~ AI, I imagine it's extremely unlikely that Intel abandons their GPU efforts in the foreseeable future.
Whiz_Markie@reddit
Will it be a blower style card?
wind_dude@reddit
What’s the tooling like for Intel? OpenVINO, what else? I haven't paid attention at all.
happybydefault@reddit (OP)
I'm not sure but I've read vLLM supports these Intel GPUs.
Icy_Programmer7186@reddit
Will anything similar to Greenboost be possible on this card?
eidrag@reddit
hope they have a dual-GPU card similar to the MaxSun B60 too
Potential-Bet-1111@reddit
Runs cuda?
sleepingsysadmin@reddit
vulkan only.
lobehubexp@reddit
What is the difficulty?
DarkArtsMastery@reddit
Personally I wouldn't. Running AMD hardware is exotic enough, and contrary to popular opinion even Nvidia is not a bug-free experience. Intel is good for CPUs if you don't mind the extra heat, but I'm afraid their GPUs are rather bound to die. Just look at Crimson Desert and the solution from the devs: a refund for you as a customer. Seems like nobody is taking support for Intel GPUs seriously.
happybydefault@reddit (OP)
Yes, these GPUs are exotic compared to an NVIDIA one, but it seems they've been improving a lot in terms of drivers, and they're already supported by vLLM.
AdamDhahabi@reddit
Why not? Maybe good for offloading MoE expert layers while still mainly running on the Nvidia stack.