[ServeTheHome] AMD Intros Instinct MI350P Accelerator: CDNA 4 Comes to PCIe Cards

Posted by Noble00_@reddit | hardware | 36 comments

This was a bit of a surprise. All that compute for your local needs, but at what cost? ...very expensive most likely.

[-]

Fit-Produce420@reddit

At 144GB this is intended for businesses that don't need full clusters, it's not aimed at home users and the price will be reflective of it as a business investment.

Just slightly less bandwidth than h200 but no cuda.

Probably $30,000.

[-]

SirActionhaHAA@reddit

> Probably $30,000.

C'mon, a mi355x is around $30k. This is a half sized gpu, half the silicon and half the memory. Somehow ya think that it's gonna cost the same. 🤷🏻

[-]

Fit-Produce420@reddit

Huh for some reason I thought they were more, no reason to get this over a cuda solution.

[-]

Express_Living2264@reddit

the issue is still 'can you get a cuda solution?' They are living because nvidia can't saturate the market.

[-]

alwayswashere@reddit

I have 4x R9700 running about 20% slower than a CUDA setup for less than half the cost. Soon to be on par.

[-]

Express_Living2264@reddit

"Soon to be" should be an AMD slogan. The reality, sadly, is that it never happens.

[-]

There is little value in choosing this over a cluster of RTX 6000 Pro cards if AMD prices it that high. You could snag nearly 4 RTX 6000 Pro cards (96 GB x 4 = 384 GB of GDDR7 VRAM) for the same price. Not to mention, ROCm is not a positive feature.

I am at least happy there will be more high-VRAM PCIe options on the market besides the last versions of the MI210.

[-]

pellets@reddit

I’ve always wondered why AMD doesn’t make a CUDA compiler for their GPUs. There used to be a project for that, but last I looked, it was discontinued.

[-]

AnActualWizardIRL@reddit

They sort of do with HIP, but it's more of a transpiler with warts. It'll get you most of the way there, but you still have some work to do.

[-]

pellets@reddit

You can have independent implementations of closed standards. That’s how emulators often work.

Looks like there’s some new effort in that direction since I last looked. https://www.phoronix.com/review/radeon-cuda-zluda

[-]

EmergencyCucumber905@reddit

That's what HIP is.

[-]

pellets@reddit

Where does it say ROCm supports CUDA? It seems the use cases are similar, but moving between the two would require a rewrite.

[-]

EmergencyCucumber905@reddit

Not a re-write, a recompile.

I think this is one of the most misunderstood things about ROCm. HIP is semantically nearly identical to CUDA. The difference is syntactic sugar like function names prefixed with "hip" instead of "cu" or "cuda".

You can auto convert CUDA code to HIP and build for AMD chips. This is what most projects do.

Or you can write HIP and build for both AMD and Nvidia.

Nothing needs to be re-written.
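
To make the "recompile, not rewrite" point concrete, here is a toy sketch in Python of the kind of name mapping involved. It is not the real HIPIFY tooling (hipify-perl and hipify-clang handle far more cases); `toy_hipify` and the small mapping table are hypothetical and only illustrate that the API calls line up nearly one to one.

```python
import re

# Hand-picked subset of CUDA runtime names and their HIP counterparts.
# The real tools cover the whole API surface; this table is illustrative only.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def toy_hipify(source: str) -> str:
    """Rename known CUDA identifiers to their HIP equivalents (hypothetical helper)."""
    # Try longer names first so cudaMemcpyHostToDevice is not cut short by cudaMemcpy.
    names = sorted(CUDA_TO_HIP, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(lambda match: CUDA_TO_HIP[match.group(0)], source)

cuda_src = """#include <cuda_runtime.h>
cudaMalloc(&d_x, n * sizeof(float));
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
cudaDeviceSynchronize();
cudaFree(d_x);
"""

hip_src = toy_hipify(cuda_src)
print(hip_src)
```

The logic of the host code is untouched; only the identifiers change. Kernel tuning for a specific architecture is a separate problem that no renaming pass addresses.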

[-]

Noble00_@reddit (OP)

HIPIFY, in short. But IIRC it doesn't always give a fully performant result for the hardware.

[-]

EmergencyCucumber905@reddit

Well yeah you need to account for architecture differences if you want to get the most out of the hardware. But that's true even when writing CUDA code for Nvidia GPUs. Luckily a lot of the important stuff people care about is abstracted away by libraries like cuBLAS/hipBLAS etc that are optimized for each arch.

[-]

iBoMbY@reddit

Because CUDA is not an open standard. It's a proprietary interface owned by NVidia.

[-]

Noble00_@reddit (OP)

Legality.

There are two projects done in different ways.

https://vosen.github.io/ZLUDA/blog/zluda-update-q4-2025/

https://scale-lang.com/ (I think what you're thinking of)

[-]

toalv@reddit

ZLUDA, it's... complicated.

[-]

From-UoM@reddit

It should be okay for single cards.

But the H200 will still be far superior for multi GPU setups as you can use NVLink interconnects to get 4 of them connected directly

Mi350p only has PCIe scaleup

[-]

duy0699cat@reddit

Don't know why you got downvoted with no explanation, though. NVLink is several times faster than PCIe x16, and we know how much memory bandwidth these models want.

[-]

From-UoM@reddit

H200 NVLink is 900 GB/s direct GPU to GPU

Meanwhile PCIe 5 x16 is 64 GB/s and has to go through the CPU

It's a massive advantage for the H200.
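
As a rough sketch of what that gap means, the snippet below takes the two bandwidth figures quoted above at face value and ignores protocol overhead, latency, and topology. The 144 GB payload is a hypothetical example (one card's worth of memory), not a measured workload.

```python
NVLINK_GB_S = 900.0      # H200 NVLink figure quoted in the thread
PCIE5_X16_GB_S = 64.0    # PCIe 5.0 x16 figure quoted in the thread

def transfer_seconds(size_gb: float, bandwidth_gb_s: float) -> float:
    """Idealized time to move size_gb over a link, ignoring all overhead."""
    return size_gb / bandwidth_gb_s

ratio = NVLINK_GB_S / PCIE5_X16_GB_S

payload_gb = 144.0  # hypothetical payload: one MI350P's worth of memory
t_nvlink = transfer_seconds(payload_gb, NVLINK_GB_S)
t_pcie = transfer_seconds(payload_gb, PCIE5_X16_GB_S)

print(f"NVLink is ~{ratio:.1f}x the PCIe figure")
print(f"{payload_gb:.0f} GB: {t_nvlink:.2f} s over NVLink vs {t_pcie:.2f} s over PCIe")
```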

[-]

Vb_33@reddit

* crickets *

[-]

upbeatchief@reddit

AMD's fastest product to reach a billion dollars in sales was an enterprise chip. So no, not crickets.

Microsoft, OpenAI, and many others are well aware that being beholden only to Nvidia is bad, and that having an alternative, even if somewhat inferior, is necessary.

[-]

Noble00_@reddit (OP)

GPU	MI350P	MI350X
Compute Units	128	256
Matrix Cores	512	1024
Peak Engine Clock	2200MHz	2200MHz
Memory	144GB HBM3E	288GB HBM3E
Memory Bandwidth	4TB/sec (8Gbps x 4096-bits)	8TB/sec (8Gbps x 8192-bits)
Matrix Perf (MXFP8)	2.3 PFLOPS	4.6 PFLOPS
I/O	PCIe Gen5 x16	PCIe Gen5 x16
TBP	600W (Optional: 450W)	1000W
Form Factor	PCIe CEM, 10.5-inch FHFL DS	OAM
Architecture	CDNA 4	CDNA 4

In short, AMD is not using salvaged MI350X chips for this product. Instead, they are building a smaller chip especially for use on the MI350P by leveraging the original’s use of chiplets to make a smaller chip out of the same silicon. Whereas the MI350X was built from two I/O dies (IODs) each with four accelerator complex dies (XCDs) stacked on top (for a total of 8 XCDs), the MI350P’s chip is half of that. It is a single IOD with four XCDs, which is clocked identically to the MI350X and, at peak performance figures, offers half of the performance of AMD’s modular accelerator.

Taping out half an MI350X for PCIe is a really interesting move, and AMD seems rather confident in the product. These are chiplets, so I guess it does make sense if they were to bin less-than-ideal XCDs/IODs on a single AID. This would be a baller r/selfhosted project, but it's most likely for corporations that prefer in-house production use (reminds me of tinyboxes). The fragmentation of ROCm across RDNA and CDNA is apparent, so I guess this pulls customers closer to the DC environment. Though, I'm pretty sure these still have to be properly maintained, as this may just be another gfx target (and AMD doesn't have a good track record of maintaining those, btw).
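
The memory bandwidth rows in the table above can be sanity-checked from the pin rate and bus width they list. This is a minimal sketch; `hbm_bandwidth_gb_s` is a hypothetical helper, and the result is a theoretical peak, not a measured figure.

```python
def hbm_bandwidth_gb_s(pin_rate_gbps: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s: per-pin rate (Gbit/s) times bus width, over 8 bits per byte."""
    return pin_rate_gbps * bus_width_bits / 8

mi350p_bw = hbm_bandwidth_gb_s(8, 4096)  # table: 8Gbps x 4096-bits
mi350x_bw = hbm_bandwidth_gb_s(8, 8192)  # table: 8Gbps x 8192-bits

print(f"MI350P: {mi350p_bw:.0f} GB/s (~{mi350p_bw / 1000:.1f} TB/s)")
print(f"MI350X: {mi350x_bw:.0f} GB/s (~{mi350x_bw / 1000:.1f} TB/s)")
```

Both come out slightly above the rounded 4 TB/s and 8 TB/s in the table, and the MI350P is exactly half of the MI350X, matching the single-IOD description.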

[-]

Polar_Banny@reddit

Any word about FP64 (Double Precision 64-bit Floating Point) for science workloads? Thanks

[-]

SirActionhaHAA@reddit

You'd want the mi430 version of this although there ain't news about it yet.

[-]

EmergencyCucumber905@reddit

MI430 is CDNA5. CDNA4 doesn't have cut down FP64.

[-]

SirActionhaHAA@reddit

> CDNA4 doesn't have cut down FP64

It does compared to cdna3 and 5

Mi300x (FP64) Performance: 81.7 TFLOPs

Mi355x (FP64) Performance: 78.6 TFLOPs

Cdna4 is a low precision upgrade, it regresses slightly on fp64. Mi450 is the same, the non cut version is mi430 which has 200tflops of fp64.

[-]

JakeTappersCat@reddit

It still has 96% of the FP64 performance. Doesn't seem like it's that much of a regression

[-]

EmergencyCucumber905@reddit

It really isn't a regression. Those are theoretical numbers. In real workloads the extra bandwidth of the MI355 probably more than makes up for the 4% fewer theoretical FLOPS.

[-]

EmergencyCucumber905@reddit

Both MI300 and MI355 have full rate FP64 (128 FLOPS per clock per CU).

This is not the same thing with MI430 and MI450. The MI430 has full rate FP64 while the MI450 has 1/16, or whatever.
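
For anyone who wants to check the numbers in this subthread, here is a small sketch of the peak FP64 arithmetic at 128 FLOPS per clock per CU. The MI350X and MI350P CU counts and clocks come from the table posted in the thread; the MI300X (304 CUs, 2100 MHz) and MI355X (256 CUs, 2400 MHz) values are assumptions taken from public spec sheets, not from this thread.

```python
FP64_FLOPS_PER_CLOCK_PER_CU = 128  # full-rate figure quoted above

def peak_fp64_tflops(compute_units: int, clock_mhz: float) -> float:
    """Theoretical peak FP64 in TFLOPS: CUs x clock x FLOPS per clock per CU."""
    return compute_units * clock_mhz * 1e6 * FP64_FLOPS_PER_CLOCK_PER_CU / 1e12

# (compute units, peak engine clock in MHz)
PARTS = {
    "MI300X": (304, 2100),  # assumption: not stated in the thread
    "MI355X": (256, 2400),  # assumption: not stated in the thread
    "MI350X": (256, 2200),  # from the thread's spec table
    "MI350P": (128, 2200),  # from the thread's spec table
}

peaks = {name: peak_fp64_tflops(cus, mhz) for name, (cus, mhz) in PARTS.items()}
for name, tflops in peaks.items():
    print(f"{name}: {tflops:.1f} TFLOPS")

mi355x_vs_mi300x = peaks["MI355X"] / peaks["MI300X"]
print(f"MI355X / MI300X: {mi355x_vs_mi300x:.0%}")
```

Under those assumptions the same per-CU rate reproduces the 81.7 and 78.6 TFLOPS figures, so the gap comes from CU count and clock rather than a reduced FP64 rate.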

[-]

SirActionhaHAA@reddit

Never said it wasn't full rate. I said cut, and that was in ref to mi300x in terms of cu count and perf

[-]

dsoshahine@reddit

Listed under GPU Specs.

https://www.amd.com/en/products/accelerators/instinct/mi350/mi350p.html

[-]

Noble00_@reddit (OP)

Full MI350X peak FP64 is ~72.1 TFLOPS, so these would be ~36.05 TFLOPS.

[-]

noiserr@reddit

> Taping out half an MI350X for PCIe is a really interesting move

They didn't need to tape it out. They are chiplets. They just assembled half an mi355x.

[-]

chip_thoughts@reddit

What stands out to me is how datacenter-class power draw is slowly becoming normal even outside traditional servers now. 600W PCIe cards would have sounded pretty absurd just a few years ago. Honestly, the 144GB of HBM3E is the craziest part to me. Feels like AMD is betting that a lot of future AI workloads will care more about fitting larger models locally than just chasing peak FLOPS numbers.