[ServeTheHome] AMD Intros Instinct MI350P Accelerator: CDNA 4 Comes to PCIe Cards
Posted by Noble00_@reddit | hardware | View on Reddit | 36 comments
This was a bit of a surprise. All that compute for your local needs, but at what cost? ...very expensive most likely.
Fit-Produce420@reddit
At 144GB this is intended for businesses that don't need full clusters. It's not aimed at home users, and the price will reflect that it's a business investment.
Just slightly less bandwidth than h200 but no cuda.
Probably $30,000.
SirActionhaHAA@reddit
C'mon, a mi355x is around $30k. This is a half sized gpu, half the silicon and half the memory. Somehow ya think that it's gonna cost the same. 🤷🏻
Fit-Produce420@reddit
Huh for some reason I thought they were more, no reason to get this over a cuda solution.
Express_Living2264@reddit
the issue is still 'can you get a cuda solution?' They are living because nvidia can't saturate the market.
alwayswashere@reddit
i have 4x R9700 ripping about 20% slower than a cuda setup for less than 1/2 the cost. soon to be on par.
Express_Living2264@reddit
"Soon to be on par" should be an AMD slogan. Sadly, the reality is that it never happens.
waiting_for_zban@reddit
There is little value in choosing this over a cluster of RTX 6000 Pro cards if AMD rolls that high. You could snag nearly 4 RTX 6000 Pro (96 x 4 = 384 GB of GDDR7 VRAM) for the same price. Not to mention, ROCm is not a positive feature.
I am at least happy there will be more high VRAM PCIe options on the market besides the last versions of the MI210.
pellets@reddit
I’ve always wondered why AMD doesn’t make a CUDA compiler for their GPUs. There used to be a project for that, but last I looked, it was discontinued.
AnActualWizardIRL@reddit
They sort of do with HIP, but it's more a transpiler with warts. It'll get you most of the way there, but you still have some work to do.
pellets@reddit
You can have independent implementations of closed standards. That’s how emulators often work.
Looks like there’s some new effort in that direction since I last looked. https://www.phoronix.com/review/radeon-cuda-zluda
EmergencyCucumber905@reddit
That's what HIP is.
pellets@reddit
Where does it say ROCm supports CUDA? It seems the use cases are similar, but moving between the two would require a rewrite.
EmergencyCucumber905@reddit
Not a re-write, a recompile.
I think this is one of the most misunderstood things about ROCm. HIP is semantically nearly identical to CUDA. The difference is syntactic sugar like function names prefixed with "hip" instead of "cu" or "cuda".
You can auto convert CUDA code to HIP and build for AMD chips. This is what most projects do.
Or you can write HIP and build for both AMD and Nvidia.
Nothing needs to be re-written.
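To make the "recompile, not rewrite" point concrete, here is a toy sketch of the kind of name mapping described above. It is not the actual HIPIFY tool: `toy_hipify` and the small mapping table are hypothetical, the mapping is a hand-picked subset of real CUDA/HIP runtime names, and the `scale` kernel launch is only a placeholder string.

```python
import re

# Toy illustration only: the real HIPIFY tools handle far more cases.
# The table is a small, hand-picked subset of CUDA -> HIP runtime names.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaFree": "hipFree",
}

# Try longer names first so cudaMemcpyHostToDevice wins over cudaMemcpy.
_PATTERN = re.compile(
    "|".join(re.escape(name) for name in sorted(CUDA_TO_HIP, key=len, reverse=True))
)


def toy_hipify(source: str) -> str:
    """Rename the known CUDA identifiers in `source` to their HIP equivalents."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(0)], source)


# Placeholder host code; the kernel launch line is left untouched.
cuda_snippet = """#include <cuda_runtime.h>
cudaMalloc(&d_x, n * sizeof(float));
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
scale<<<blocks, threads>>>(d_x, n);
cudaDeviceSynchronize();
cudaFree(d_x);
"""

print(toy_hipify(cuda_snippet))
```

Only the API names and the header change; the structure of the host code stays the same, which is the sense in which the port is mostly mechanical. Performance tuning for a specific architecture is a separate step.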
Noble00_@reddit (OP)
HIPIFY in short. But IIRC it's not always a fully performant solution for the hardware.
EmergencyCucumber905@reddit
Well yeah you need to account for architecture differences if you want to get the most out of the hardware. But that's true even when writing CUDA code for Nvidia GPUs. Luckily a lot of the important stuff people care about is abstracted away by libraries like cuBLAS/hipBLAS etc that are optimized for each arch.
iBoMbY@reddit
Because CUDA is not an open standard. It's a proprietary interface owned by NVidia.
Noble00_@reddit (OP)
Legality.
There are two projects done in different ways.
https://vosen.github.io/ZLUDA/blog/zluda-update-q4-2025/
https://scale-lang.com/ (I think what you're thinking of)
toalv@reddit
ZLUDA, it's... complicated.
From-UoM@reddit
It should be okay for single cards.
But the H200 will still be far superior for multi GPU setups as you can use NVLink interconnects to get 4 of them connected directly
Mi350p only has PCIe scaleup
duy0699cat@reddit
Don't know why you got downvoted with no explanation. NVLink is multiple times faster than PCIe x16, and we know how much memory bandwidth these models want.
From-UoM@reddit
H200 NVLink is 900 GB/s direct GPU to GPU
Meanwhile PCIe 5 x16 is 64 GB/s and has to go through the CPU
It's a massive advantage for the H200.
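A rough sketch of what those two figures imply, taking the quoted numbers at face value. The headline bandwidths may not be measured the same way, real throughput will be lower, and the 144 GB payload is a hypothetical example (one card's worth of memory), not something stated above.

```python
# Bandwidth figures as quoted in the thread, in GB/s.
NVLINK_H200_GBPS = 900.0
PCIE5_X16_GBPS = 64.0


def transfer_seconds(size_gb: float, bandwidth_gbps: float) -> float:
    """Idealized transfer time: payload size divided by link bandwidth."""
    return size_gb / bandwidth_gbps


ratio = NVLINK_H200_GBPS / PCIE5_X16_GBPS
payload_gb = 144.0  # ASSUMPTION: hypothetical payload, chosen for illustration

print(f"NVLink / PCIe 5 x16 ratio: {ratio:.1f}x")
print(f"{payload_gb:.0f} GB over NVLink: {transfer_seconds(payload_gb, NVLINK_H200_GBPS):.2f} s")
print(f"{payload_gb:.0f} GB over PCIe 5 x16: {transfer_seconds(payload_gb, PCIE5_X16_GBPS):.2f} s")
```

On these numbers the gap is about 14x, before counting the extra hop through the CPU on the PCIe path.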
Vb_33@reddit
* crickets *
upbeatchief@reddit
AMD's fastest product to reach a billion dollars in sales was an enterprise chip. So no, not crickets.
Microsoft, OpenAI and many others are well aware that being beholden only to Nvidia is bad. Having an alternative, even if somewhat inferior, is necessary.
Noble00_@reddit (OP)
Taping out half an MI350X for PCIe is a really interesting move, where AMD seems rather confident in the product. These are chiplets so I guess it does make sense if they were to bin less than ideal XCDs/IODs on a single AID. This would be a baller r/selfhosted project, but most likely for corporations that prefer in house production use (reminds me of tinyboxes). The fragmentation of ROCm on RDNA and CDNA is apparent, so I guess this pulls customers closer to the DC environment. Though, I'm pretty sure these still have to be properly maintained as this just may be another gfx target (which btw AMD doesn't have a good track record of).
Polar_Banny@reddit
Any word about FP64 (Double Precision 64-bit Floating Point) for science workloads? Thanks
SirActionhaHAA@reddit
You'd want the mi430 version of this although there ain't news about it yet.
EmergencyCucumber905@reddit
MI430 is CDNA5. CDNA4 doesn't have cut down FP64.
SirActionhaHAA@reddit
It does compared to cdna3 and 5
Mi300x (FP64) Performance: 81.7 TFLOPs
Mi355x (FP64) Performance: 78.6 TFLOPs
Cdna4 is a low precision upgrade, it regresses slightly on fp64. Mi450 is the same, the non cut version is mi430 which has 200tflops of fp64.
JakeTappersCat@reddit
It still has 96% of the FP64 performance. Doesn't seem like it's that much of a regression
EmergencyCucumber905@reddit
It really isn't a regression. Those are theoretical numbers. In real workloads the extra bandwidth of the MI355 probably more than makes up for the 4% fewer theoretical FLOPS.
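For reference, the 96% and 4% figures follow directly from the two peak numbers quoted earlier in the thread. A minimal check, using only those two values:

```python
# Peak FP64 figures as quoted in the thread, in TFLOPs.
mi300x_fp64 = 81.7
mi355x_fp64 = 78.6

retained = mi355x_fp64 / mi300x_fp64  # fraction of MI300X peak FP64 kept
regression = 1.0 - retained           # fraction lost

print(f"retained: {retained:.1%}, regression: {regression:.1%}")
```

These are theoretical peaks only, so the ratio says nothing about delivered performance in bandwidth-bound workloads.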
EmergencyCucumber905@reddit
Both MI300 and MI355 have full rate FP64 (128 FLOPS per clock per CU).
This is not the same thing with MI430 and MI450. The MI430 has full rate FP64 while the MI450 has 1/16, or whatever.
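The "128 FLOPS per clock per CU" rate can be turned into a peak figure with one multiplication. The sketch below assumes CU counts and peak clocks that are not stated in this thread (304 CUs at 2.1 GHz for MI300X, 256 CUs at 2.4 GHz for MI355X); treat them as assumptions to check against the spec sheets.

```python
FP64_FLOPS_PER_CLOCK_PER_CU = 128  # full-rate figure quoted in the comment above


def peak_fp64_tflops(compute_units: int, clock_ghz: float) -> float:
    """Theoretical peak FP64 = CUs x FLOPs per clock per CU x clock."""
    gflops = compute_units * FP64_FLOPS_PER_CLOCK_PER_CU * clock_ghz
    return gflops / 1000.0


# ASSUMPTION: CU counts and peak clocks below are not from the thread.
mi300x = peak_fp64_tflops(compute_units=304, clock_ghz=2.1)
mi355x = peak_fp64_tflops(compute_units=256, clock_ghz=2.4)

print(f"MI300X peak FP64: {mi300x:.1f} TFLOPs")
print(f"MI355X peak FP64: {mi355x:.1f} TFLOPs")
```

Under those assumptions the results line up with the 81.7 and 78.6 TFLOPs quoted earlier, which is consistent with both parts running FP64 at the same per-CU rate and differing only in CU count and clock.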
SirActionhaHAA@reddit
Never said it wasn't full rate. I said cut, and that was in ref to mi300x in terms of cu count and perf
dsoshahine@reddit
Listed under GPU Specs.
https://www.amd.com/en/products/accelerators/instinct/mi350/mi350p.html
Noble00_@reddit (OP)
Full MI350X peak FP64 is ~72.1 TFLOPs, so these would be ~36.05 TFLOPs.
noiserr@reddit
They didn't need to tape it out. They are chiplets. They just assembled half an mi355x.
chip_thoughts@reddit
What stands out to me is how datacenter-class power draw is slowly becoming normal even outside traditional servers now. 600W PCIe cards would have sounded pretty absurd just a few years ago. Honestly, the 144GB of HBM3E is the craziest part to me. Feels like AMD is betting that a lot of future AI workloads will care more about fitting larger models locally than just chasing peak FLOPS numbers.