Why is AMD's new N48 (9070XT) so massive ~390mm² compared to PS5 Pro's die ~279 mm² ?
Posted by fatso486@reddit | hardware | 77 comments
Can someone explain why AMD's new N48 is so massive at an estimated 390mm², despite having basically the same number of CUs as the Viola (RDNA 3.75?), which is under 280mm²?
Pic here for reference: PS5 Pro die ~280mm².
I know Infinity Cache on the N48 is a factor, but I'm not entirely convinced; the PS5 Pro SoC has an 8-core CPU, which should offset that. Are there any other major (area-hungry) features I might have missed? It seems kind of crazy, especially since AMD is usually obsessed with smaller, cheaper dies. Even the lower-tier Krackan Point seems huge, considering it's also on 4nm.
Thoughts?
b3081a@reddit
It's a 4080 Super-class GPU rather than something like a 4060 Ti.
Acrobatic_Age6937@reddit
currently it looks more like a 5070ti competitor at best.
b3081a@reddit
All of them are similar in size anyway. The 5070 Ti is the same GB203 die as the 5080, which is exactly the same size as the 4080 (Super); the 5070 Ti is just a slightly cut-down version.
Acrobatic_Age6937@reddit
my bad, i misread your 4080 as 5080 for some reason.
DuranteA@reddit
I/O and caches etc. probably play a big role, but calling it "RDNA 3.75" is also being extremely generous to the SoC. Each individual CU is likely to be significantly more capable (and thus larger) on 9070XT.
Gachnarsw@reddit
Isn't it more accurate to call PS5 Pro GPU RDNA 2.x rather than 3.75?
bubblesort33@reddit
I think their slide said it's RDNA2.x in some ways, RDNA4 in other ways, and then other custom silicon. So I don't think it matters if you call it 2.5 or 3.5, because they are both right and wrong in some way.
puffz0r@reddit
I think it's wholly inaccurate to call it rdna3 since it isn't MCM, doesn't feature any of the fp32 double-pumping stuff, and the only thing that might be included is the wmma instructions for AI stuff. Even the RT pipeline is different
Gachnarsw@reddit
It sounds like AMD's semi custom is a buffet arrangement. You can pick the features you want, exclude others, and add your own customizations. The resulting chip may not be architecturally identical to any other products on the market.
kuroyume_cl@reddit
Yes, it's RDNA2 with some RDNA3 features
ProperCollar-@reddit
Yes.
fatso486@reddit (OP)
Got it, but there was a common "perception" that the PS5 Pro might be seen as 'RDNA 3' with some RDNA 4 features like the improved ray tracing. The reported ~33 TFLOPS vs ~10 TFLOPS for the PS5 and ~45% better raster performance also add to this idea.
Could you explain why you think it's more closely aligned with RDNA 2? I'd love to understand the specifics behind that.
GateAccomplished2514@reddit
You’re going off pre-launch rumors. Sony did a deep dive and they called it RDNA2.X themselves: https://www.eurogamer.net/digitalfoundry-2024-ps5-pro-deep-dive
PS5 Pro is only 17TFLOPS. It doesn’t have dual issue. And it’s RDNA2 with extensions on top so that existing shaders do not need to be recompiled for PS5 Pro.
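For anyone wondering where the two numbers come from, here's a minimal back-of-envelope sketch in Python. The inputs (60 active CUs, 64 FP32 lanes per CU, ~2.18 GHz clock) are widely reported figures rather than anything stated in this thread; the ~33 TFLOPS rumor is just the same hardware with dual-issue FP32 counted twice.

```python
# Back-of-envelope for the two TFLOPS figures floating around.
# Assumed inputs (widely reported, not confirmed here): 60 active CUs,
# 64 FP32 lanes per CU, ~2.18 GHz GPU clock, 2 FLOPs per FMA.
cus = 60
lanes_per_cu = 64
clock_ghz = 2.18
flops_per_fma = 2

base_tflops = cus * lanes_per_cu * flops_per_fma * clock_ghz / 1000
dual_issue_tflops = base_tflops * 2  # counting RDNA3-style dual issue doubles the paper number

print(f"single-issue: {base_tflops:.1f} TFLOPS")        # ~16.7, the figure Cerny quotes
print(f"dual-issue:   {dual_issue_tflops:.1f} TFLOPS")  # ~33.5, the pre-launch rumor
```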
fatso486@reddit (OP)
Thanks, I saw the Cerny tech brief after posting. Didn't know they corrected all those 33 TFLOPS rumors. It all makes sense now.
Earthborn92@reddit
I'm pretty surprised that you went into die analysis of the PS5 Pro...without watching Cerny's presentation on it.
steinfg@reddit
First of all, no concrete number was provided; only a ~350 mm² figure was estimated by pixel-counting a grainy photo.
N48 includes a lot of features the PS5 Pro die is missing: a PCIe x16 interface, 64MB of last-level cache, the dual-issue thing (or whatever it's called) that allows theoretical TFLOPS to go 2x over RDNA2, media encoders, and more display engines.
uzzi38@reddit
Mostly, but I do have one correction to make:
Navi33 shows that even on the same node, RDNA3 CUs are physically smaller than RDNA2 CUs, even with the dual-issue thing. It clearly doesn't add much die area at all. The PS5 Pro seems to be more like a Frankenstein of RDNA2 and RDNA4 from what we can tell; there's no indication whether the CU itself is more closely aligned with RDNA3/4 with dual issue disabled, or RDNA2 with certain RDNA4 features bolted on.
SirActionhaHAA@reddit
The latter. The CU is mostly an RDNA2 foundation except for RT.
MrMPFR@reddit
...and custom ML HW.
Earthborn92@reddit
That's not more die area. There are no Tensor core equivalents. It is two things:
It is very minimally "custom".
doscomputer@reddit
even on the same node?
uzzi38@reddit
Navi33 is 32 RDNA3 CUs on N6, and comes in at around 208mm² (off the top of my head). Navi23 is 32 RDNA2 CUs on N7 and is more like 232mm² (also off the top of my head).
The node shrink here is only 6%, and that's only for logic; there's no scaling for SRAM or analog circuits. Meaning the vast majority of that area reduction comes from the new compute unit being smaller than the last one, not from the node.
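To make the arithmetic explicit, here's a small sketch using the ballpark die sizes quoted in that comment. The 60% logic fraction is my own assumption, purely for illustration.

```python
# Rough check of the point above, using the ballpark die sizes quoted in the comment.
navi23_mm2 = 232.0    # 32 RDNA2 CUs on N7 (quoted, "off the top of my head")
navi33_mm2 = 208.0    # 32 RDNA3 CUs on N6 (quoted, "off the top of my head")
logic_shrink = 0.06   # ~6% logic-only scaling N7 -> N6; SRAM/analog ~0%
logic_fraction = 0.6  # assumed fraction of the die that is logic (illustrative only)

node_only_mm2 = navi23_mm2 * (1 - logic_shrink * logic_fraction)
print(f"expected from the node alone: ~{node_only_mm2:.0f} mm²")  # ~224 mm²
print(f"actual Navi 33:               ~{navi33_mm2:.0f} mm²")
# Any shrink beyond ~224 mm² has to come from the RDNA3 WGP itself being denser.
```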
Azzcrakbandit@reddit
The whole mix of RDNA 2 and 4 thing gets more complicated when the base PS5 gets thrown in, because that was closer to RDNA 1.5 than to RDNA 2.
bubblesort33@reddit
Is there actually any evidence of 64MB at all? It's been said there were anchor points found on RDNA3 MCDs indicating AMD was actually planning to 3D-stack L3 cache. But because they couldn't see many gains (because they came up short of the 3GHz target their slides claimed), they figured there was no point. The 7800 XT, for example, was initially likely supposed to get 128MB of L3 for 60 CUs.
...So that makes me wonder if maybe the 9070 XT actually has at least 96MB of L3. After all, it's supposed to be better in raster, has more CUs with higher clocks, and is also better in really every other area like RT and ML. Either that, or they turned all the L3 cache into faster L2 cache like Ampere or Blackwell have.
uzzi38@reddit
AMD's Infinity Cache is directly tied to their memory controllers; it's not actually a GPU-side cache at all. It's not tied to any of the GPU structures, the way L0 cache is on each WGP.
With a 256-bit memory bus, there are only going to be 4 pools of Infinity Cache. It would be very irregular for AMD to go with the 24MB per memory controller you'd need for 96MB across the whole die. More likely it's just 16MB per memory controller again, for 64MB across the whole die.
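Put as arithmetic, hedged on the assumption that the cache is still carved up per 64-bit memory partition the way it was on RDNA2/3:

```python
# Pool arithmetic for a MALL-style cache, assuming slices hang off 64-bit
# memory-controller partitions as on RDNA2/RDNA3 (an assumption, not a spec).
bus_width_bits = 256
bits_per_partition = 64
pools = bus_width_bits // bits_per_partition  # -> 4 pools

for mb_per_pool in (8, 16, 24):  # 8/16 MB are the configs seen in drivers; 24 MB would be unusual
    print(f"{mb_per_pool:>2} MB per pool -> {pools * mb_per_pool} MB total")
# prints 32, 64 and 96 MB totals respectively
```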
bubblesort33@reddit
Why would it be tied to the memory controller like this for this design, if it's not a chiplet design like RDNA3? My understanding is that the only reason it was tied to the memory controller last time is so they could separate both out together.
uzzi38@reddit
It's been tied to the memory controller since its introduction in RDNA2.
It's why it's sometimes called "Memory Attached Last Level" (MALL) cache, the "last level" being your memory controller.
bubblesort33@reddit
The ratio between the 6600 XT and 6700 XT is different, though. So it doesn't seem impossible. Maybe 24MB per controller is a bit odd, but it's not like it's not doable. And the whole design could possibly also have changed to be more like Nvidia's, where it's no longer tied to the memory controller but to the CUs instead.
uzzi38@reddit
Yes, because there were two configs for the cache per IMC: 8MB and 16MB. This was defined in drivers.
Apprehensive-Buy3340@reddit
Oh dual-issue is still present in RDNA4? I thought they'd given up on it after seeing it work so rarely
theQuandary@reddit
In 2011, Anandtech reported that average AMD VLIW utilization was 3.4 out of 5 slots, which is part of why AMD went from VLIW5 to VLIW4 before switching to GCN, which was completely compute-focused.
VLIW2 is almost complete upside when it comes to compute density and because it's VLIW, it shouldn't add a burden to the scheduler either.
The problem with RDNA3's implementation was that only a couple of instructions were VLIW2 and all the remaining instructions couldn't use the second SIMD at all.
Rather than eliminate it, I believe the best solution would be to analyze which execution instructions are most used and add VLIW2 versions for the top 10-15 of them. That would almost certainly give a big increase in utilization and performance.
Apprehensive-Buy3340@reddit
Interesting, thanks
noiserr@reddit
You covered it pretty well. I think the key is the 64MB of Infinity Cache. This boosts the effective memory bandwidth.
They could have also used dense libraries for the PS5 Pro since it runs at lower clocks. We've already seen some 9070 XT GPUs being factory OCed to 3060 MHz. The PS5 Pro only boosts to 2350 MHz.
fatso486@reddit (OP)
Good points, and thanks for the insight, guys! I feel like the dense libraries (used for PS5 Pro's lower clock speeds) might be the biggest area-saving factor I overlooked.
As for the dual-issue feature, if I'm not mistaken, the Viola should also have it, considering it's rated at ~33 TFLOPS versus ~10, with Sony suggesting only ~45% better raster performance.
GenZia@reddit
SRAM cells require far more silicon than logic.
I just did some (very) rough calculations (with the help of GPT o1, admittedly) and, as far as I can tell, a 64MB SRAM block should take up anywhere between ~110mm² and ~130mm² of die space on N4, "assuming" N4 has the exact same SRAM cell size as N5 @ 0.021 µm² (as per Wikipedia).
But take it with a grain (or even a pinch) of salt.
theQuandary@reddit
It's actually the opposite for the vast majority of cache.
An SRAM cell takes 6 transistors. A NOT gate takes 1 transistor, a 2-input NAND gate takes 2, and a 2-input XOR takes 4 (pass-transistor counts).
That sounds like SRAM is bigger, right? But what if I told you that not all transistors are the same size? Logic actually gangs multiple transistor fins into one larger transistor (1x2, 2x2, and 2x3 are typical), with more fins being needed for higher switching speed (clock speed) and current.
From those numbers, you can see that a high-performance NAND gate is now as big as 4-12 fins per gate and the XOR is now a massive 24 fins per gate. You need high-performance SRAM for L1, so it'll use bigger layouts. L2 is generally 4x slower than L1, so you can use smaller layouts. L3 is generally 25x slower than L1 (or more), so you can almost certainly use high-density layouts.
Of course, cache requires a cache controller which uses logic gates, so the overall density varies based on how sophisticated and fast the cache controller needs to be. I think I can say with pretty good confidence that L1 is lower density and both L2 and L3 are massively higher density on high-performance CPU/GPU designs.
ColdStoryBro@reddit
Your FET counts are based on transmission gates. This is not what a modern high-speed digital circuit uses. A NAND gate is 4 FETs, 2 PMOS for the pull-up network. You make an XOR out of four NAND gates, so that would be 16 FETs.
Everything else I agree with; the sizing really depends on how the physical layout is done. 16 FETs of XOR will be smaller than 4x NAND laid out separately. And libraries will offer a few different options.
jedijackattack1@reddit
That seems a bit high given Zen 5 has 40MB of cache and it's only about half of an ~80mm² die.
the_dude_that_faps@reddit
I also think that's a bit too much. However, I wouldn't just compare it to a CPU cache. For starters, I think that the bandwidth requirements are very different and that affects transistor costs.
jedijackattack1@reddit
Yes, but they will be different designs in more ways than bandwidth and latency. Cache line size, associativity, latency, prefetch, clear behavior, and queue depths will all play a role. But Zen 5, with about 40mm² of die area, manages 1.4TB/s with a line size of 64 bytes and a latency of 34 cycles (8ns). RDNA will be using 256-byte cache lines and lower associativity, which will save some area, along with having way more latency tolerance. I would honestly be very surprised if it wasn't around 60-70mm² of die space for the 64MB of rumored cache.
the_dude_that_faps@reddit
Fair
Qesa@reddit
Trusting the stochastic bullshit generator for factual information 🤦♂️. Do a sanity check. The Zen 3/4 V-Cache chiplet is 64MB in 36mm² on N7.
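For reference, the sanity check spelled out, assuming the same 0.021 µm² bit cell quoted earlier and a loose 3x array-overhead factor of my own:

```python
# Sanity check on the 64 MB area estimate, using the ~0.021 um²/bit N5-class HD
# cell quoted above. The 3x overhead factor for tags/sense amps/routing is an assumption.
bits = 64 * 1024 * 1024 * 8        # 64 MB expressed in bits
cell_um2 = 0.021
raw_mm2 = bits * cell_um2 / 1e6    # bit cells alone
overhead = 3.0

print(f"raw bit cells:    {raw_mm2:.1f} mm²")              # ~11 mm²
print(f"with 3x overhead: {raw_mm2 * overhead:.1f} mm²")   # ~34 mm²
# That lands right around the 64 MB / 36 mm² V-Cache data point above,
# and nowhere near 110-130 mm².
```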
GenZia@reddit
Sanity check?!
No need to get triggered, mate. Besides, what part of "rough calculations" did you not understand?
I don't claim to be an expert on semiconductor mathematics. In fact, I've a degree in behavioral psychology, so forgive me if I seem a bit out of my element!
Besides, I'm not the one comparing the die areas of 'traditional' planar ("2D") SRAM with stacked "3D" V-Cache.
Take what you will.
Qesa@reddit
A sanity check in a STEM context means "is this number vaguely reasonable" mr behavioral psychologist. I'm not telling you to take a psychiatric exam.
GenZia@reddit
I guess you really don't understand what 'rough calculations' or 'grain of salt' means.
Or even planar and stacked SRAM.
Or the fact that comparing the SRAM of CPUs with that of GPUs isn't an exact science or an apples-to-apples comparison. There's the matter of latency, bandwidth, datapath width, and application in general you have to consider.
CPUs are about latency. GPUs are about bandwidth.
Hence my "pinch of salt" footnote.
But I'm sure you already knew that!
ET3D@reddit
The way I see it, one of the following is likely:
Either 390 mm² severely overestimates the chip size, or the chip doesn't have only 64 CUs. I'd assume the first, but won't rule out the second.
Someone else also mentioned the idea that the WGPs are larger to allow for higher frequencies, and there's definitely more hardware for ray tracing and AI, but I doubt that, even with everything, it would still reach 390 mm² with 64 CUs.
Aywololo@reddit
Is there a die shot of ps5 pro die?
Sheep_CSGO@reddit
Why do I only hear and see about this one AMD GPU? Are no others coming?
Boreras@reddit
For comparison, the PS5 was 286 mm²; the same CU count on the 6700 is a 335 mm² die.
fatso486@reddit (OP)
The PS5 (7nm) is 309 mm². Also, the 6700 is a binned-down die, so that doesn't really count as a fair comparison; it would have been less than 300 mm².
IDONTGIVEASHISH@reddit
The PS5 Pro die is 280 mm²? Sony is getting like $200 from every purchase or something.
tukatu0@reddit
And it is still unironically the cheapest way to get new gaming experiences. Even if you have to pay to play multiplayer.
I guess microsoft was right about no pro console being possible. But they were 10 years too early once again lmao. What a shame.
Fromarine@reddit
yup
DYMAXIONman@reddit
The 6750xt is just as powerful as the PS5's gpu and it is only 237 mm²
Aggravating-Dot132@reddit
Techpowerup mentions 200, no?
hey_you_too_buckaroo@reddit
The PS5 pro GPU was not redesigned. It was really just a refresh to make it smaller and more power efficient. RDNA4 is several generations newer and will support a different feature set. Basically there's no point comparing these two.
bubblesort33@reddit
I couldn't find any info on the Pro die size. I read it was 9mm² bigger than the base PS5, which already was over 300. Not smaller. Is this 279 mm² accurate?
fatso486@reddit (OP)
The only numbers I found are estimates; I linked them here: https://imgur.com/a/0qL2DTj
I'll look for the source I originally took them from. Oberon is 309 mm², Oberon+ is 260 mm², and Viola is 279 mm².
bubblesort33@reddit
Oh, so I guess the source I was thinking of was saying it was bigger than Oberon+. That makes sense.
ProperCollar-@reddit
CUs are not directly comparable across generations, and we don't know the exact size yet. Also Infinity Cache and shit.
PS5 is RDNA 2.X, not 3.X.
fatso486@reddit (OP)
Got it, but there was a common "perception" that the PS5 Pro might be seen as 'RDNA 3' with some RDNA 4 features like the improved ray tracing. The reported ~33 TFLOPS vs ~10 TFLOPS for the PS5 with only ~45% better raster performance also adds to this idea.
Could you explain why it's more closely aligned with RDNA 2? I'd love to understand the specifics behind that.
kukusek@reddit
I don't know if you'll get a full reply, and I would be interested in it too, but I believe it's about simple logic.
The PS5 is RDNA2, and the PS5 Pro is totally compatible with it, playing all PS5 titles natively. By going next-gen they would have had a lot of software work for each game. By sticking to this enhanced RDNA 2 they get better performance with new features without much hassle, and they skip the RDNA 3 cost from AMD, however the semi-custom chip business works.
MrMPFR@reddit
The PS5 is not RDNA 2, it's RDNA 1.5. It doesn't support VRS, sampler feedback, or mesh shaders. The PS5 Pro does support mesh shaders, which are superior to primitive shaders (introduced with Vega).
fatso486@reddit (OP)
You're right, I just watched the PS5 Pro tech brief where he denies the 33 number and says it's actually ~16. This clearly explains the RDNA2 part.
FloundersEdition@reddit
Mark Cerny himself said RDNA2.x. It has 16.7 TFLOPS (no dual issue and thus lower ML throughput as well, but it's undisclosed how ML works on the PS5 Pro; maybe it runs concurrently, but press X for doubt). The PS5 Pro has ~50 new instructions, some ML, some RT (traversal, BVH8, potentially something like SER because it works better with divergent rays), and potentially some for controlling the caches and prefetching/syncing data between CUs (like RDNA4).
yeeeeman27@reddit
9070xt is another architecture entirely with a different performance target first of all.
they can't be compared.
MrMPFR@reddit
RDNA 3.75? No, the PS5 Pro is RDNA 2.75 at best. Custom ML and backported RDNA 4 RT + full RDNA 2 features (mesh shaders etc.) is all you're getting.
The design is much wider CU-wise and built for higher core clocks + has a massive L3 cache which could be as large as 96MB according to rumours.
redsunstar@reddit
What is interesting is that N48 is about the same size as a 5080/5070 Ti. I'm hoping that it's the same performance class, at least in raster, but you know what they say about hope.
fatso486@reddit (OP)
If the leaked benchmarks are correct, then I'm expecting the 9070 XT to have the same performance level as the 5070 Ti.
Bear in mind that the 5080 in Nvidia's benchmarks yesterday is only ~15% faster than the 4080.
Sopel97@reddit
"CU" is not a unit of performance
teutorix_aleria@reddit
Yeah, it's a physical component that takes up space on the die, which is what OP is asking about.
INITMalcanis@reddit
OP didn't say it was. They're implicitly asking for speculation or information about why the RDNA 4 CU is apparently so much larger than its predecessor.
HyruleanKnight37@reddit
Firstly, that 390mm² is not an official number but an estimate based on a grainy image taken at an angle.
Secondly, RDNA4 (allegedly) incorporates some new IP blocks that we still know nothing about, and they are most certainly not present in the PS5 Pro's chip.
And finally, you're discounting Infinity Cache too much. Cache, or more specifically SRAM cells, takes up a lot of space compared to logic, and there has also been a stagnation in the shrinkage of SRAM on recent nodes. Because none of the current-gen console chips have anything akin to Infinity Cache, they are much smaller than even their RDNA2 dGPU counterparts.
For example, the PS5's chip (7nm, 36CU, 305mm²) has a similar die size to the Navi 22 (40CU, 335mm²) chip despite having an entire 8-core Zen 2 CPU block, a wider 256-bit bus (Navi 22 is 192-bit), and other PS5-specific IP blocks like the FPU, which all Navi 2x chips lack.
Take all these away and keep only the 36 CUs, and you'll see the total chip is less than 2/3rds the size of the Navi 22 chip. That is how much space Infinity Cache takes.
FloundersEdition@reddit
The PS5 had 40 CUs as well, and N22 had a massive 96MB of IFC. N48 was always 64MB, which is ~50mm²; 2x MCDs would be 75mm² with memory controller and Infinity Fabric overhead. There is certainly something fishy with RDNA4's density. Maybe it's just for clocks and wider RT, maybe there is more to it.
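To illustrate why the density looks off, here's a rough budget using the figures above plus the commonly cited Navi 32 GCD size. Every input is a ballpark or rumoured number, and 390 mm² is itself only a pixel-counted estimate.

```python
# Rough area budget following the comment above; all inputs are ballpark figures.
n48_estimate_mm2 = 390.0
cache_plus_phy_mm2 = 75.0   # ~2x RDNA3-MCD equivalent: 64 MB IFC + 256-bit GDDR6 PHY
remainder_mm2 = n48_estimate_mm2 - cache_plus_phy_mm2

navi32_gcd_mm2 = 200.0      # RDNA3 GCD: 60 CUs plus front end/media/display/PCIe on N5
print(f"N48 minus cache+PHY: ~{remainder_mm2:.0f} mm²")   # ~315 mm²
print(f"Navi 32 GCD:         ~{navi32_gcd_mm2:.0f} mm²")
# The Navi 32 GCD already includes media, display and PCIe, so most of the gap
# would have to sit in the CUs/front end themselves, unless the 390 mm² estimate is simply high.
```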
HyruleanKnight37@reddit
Didn't know that, thanks.
Yeah you're right, something is amiss. RDNA4 probably has something different about it that makes it completely incomparable to all previous RDNA designs. You can only spread out logic so far to achieve a higher clock.
If only AMD would fucking tell us by now.
TheAgentOfTheNine@reddit
Not all cores are made equal
ElementII5@reddit
Because nobody has said it yet: the new RT capabilities should add some space as well. Also dedicated ML logic?