cettm@reddit
Can the tensor core run in parallel with alu units?
PointSpecialist1863@reddit
No. All the execution units are tied to a unified register file. The register file doesn't have enough ports to issue enough operands to execute multiple operations at once. There are a few narrow scenarios where it can dual issue, but not ones that feed the tensor units and the ALUs at the same time.
cettm@reddit
Does this happen on Nvidia too?
PointSpecialist1863@reddit
I'm not very familiar with Nvidia's architecture, but I suspect it's the same. Superscalar support is very expensive in transistor count, and GPUs derive their parallelism from SIMD, so there is not much to be gained by going superscalar beyond some limited support.
cettm@reddit
Thank you. Do you know why the RX 9070 XT has double the number of shaders, but AMD only reports half of that (4096)?
PointSpecialist1863@reddit
Yes, the RDNA3 architecture introduced dual issue, which is a limited form of superscalar. But because the register file cannot support feeding two execution engines at the same time, it's only in rare situations that both ALUs are working at once. So AMD can't report double the number when only half of the shaders are working most of the time.
cettm@reddit
why make it this way then if only half are used most of the time?
PointSpecialist1863@reddit
It's not exactly half; there are some minor improvements, and it was a preliminary step. In RDNA4 they have managed to improve the utilization rate, and that's where most of RDNA4's performance improvement comes from: using the second ALU more.
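To make the dual-issue idea concrete, here's a minimal CPU-side sketch (plain C++, not AMD's compiler or hardware, and the functions are made up for illustration). The point is the dependency structure: the hardware can only pack two adjacent instructions into a dual-issue pair when they are independent and fit the operand limits discussed above, so code written as one long dependent chain leaves the second ALU idle.

```cpp
#include <cstdio>
#include <vector>

// Dependent chain: every multiply-add needs the previous result, so there is
// never a second independent instruction available to pair with it. On a
// dual-issue design, one ALU would sit idle here.
float dependent_chain(const std::vector<float>& x) {
    float acc = 0.0f;
    for (float v : x)
        acc = acc * 0.5f + v;           // next iteration depends on acc
    return acc;
}

// Two independent accumulators: the even and odd streams never read each
// other's results, so each pair of multiply-adds could in principle be
// co-issued - the kind of pattern a shader compiler can pack into a
// dual-issue (VOPD) pair.
float independent_chains(const std::vector<float>& x) {
    float acc0 = 0.0f, acc1 = 0.0f;
    for (size_t i = 0; i + 1 < x.size(); i += 2) {
        acc0 = acc0 * 0.5f + x[i];      // independent of acc1
        acc1 = acc1 * 0.5f + x[i + 1];  // independent of acc0
    }
    return acc0 + acc1;
}

int main() {
    std::vector<float> data(1024, 1.0f);
    std::printf("dependent:   %f\n", dependent_chain(data));
    std::printf("independent: %f\n", independent_chains(data));
}
```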
cettm@reddit
do you know if rdna4 supports neural rendering like rtx50 series?
PointSpecialist1863@reddit
What's neural rendering?
cettm@reddit
Neural Shaders: It is possible to run a small neural network on shaders (without relying on tensor cores) on Blackwell, and I’m curious if this will be feasible on RDNA4 as well. This isn't merely a software solution. The core concept involves using a compact neural network, stored on the GPU, to approximate computations that would typically be too resource-intensive, either in terms of shaders or data. RTX Neural Shaders integrate AI into programmable shaders.
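For a sense of what that looks like in practice, here is a minimal sketch of the core idea: a tiny fixed-weight MLP evaluated per invocation in place of an expensive analytic function. It's plain C++ rather than an actual shader, and the layer sizes, weights and the function being approximated are placeholders, not any real trained model.

```cpp
#include <cmath>
#include <cstdio>

// Minimal sketch of the "neural shader" idea: a tiny fixed-weight MLP that a
// per-pixel shader could evaluate instead of an expensive analytic function.
// Layer sizes and weights here are placeholders, not any real trained model.

constexpr int IN = 2, HIDDEN = 8, OUT = 1;

// Weights would normally be trained offline and stored in GPU memory.
float w1[HIDDEN][IN], b1[HIDDEN];
float w2[OUT][HIDDEN], b2[OUT];

float relu(float x) { return x > 0.0f ? x : 0.0f; }

// One inference is just a handful of multiply-adds per layer - the kind of
// arithmetic that maps directly onto shader ALUs (or matrix units when batched).
float tiny_mlp(float u, float v) {
    float in[IN] = {u, v};
    float h[HIDDEN];
    for (int i = 0; i < HIDDEN; ++i) {
        float acc = b1[i];
        for (int j = 0; j < IN; ++j)
            acc += w1[i][j] * in[j];
        h[i] = relu(acc);
    }
    float out = b2[0];
    for (int i = 0; i < HIDDEN; ++i)
        out += w2[0][i] * h[i];
    return out;
}

int main() {
    // Fill with arbitrary values just so the example runs end to end.
    for (int i = 0; i < HIDDEN; ++i) {
        b1[i] = 0.01f * i;
        for (int j = 0; j < IN; ++j) w1[i][j] = 0.1f * (i + j + 1);
        w2[0][i] = 0.05f * (HIDDEN - i);
    }
    b2[0] = 0.0f;
    std::printf("approximation at (0.3, 0.7): %f\n", tiny_mlp(0.3f, 0.7f));
}
```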
PointSpecialist1863@reddit
Both RDNA3 and RDNA4 have WMMA instructions to run neural networks on their shaders, so it is simply a software problem. The hardware is already there.
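For reference, a WMMA operation is a small matrix multiply-accumulate on 16x16 tiles (D = A*B + C, typically FP16 inputs with FP32 accumulation). The scalar reference below shows only the math, not the actual intrinsic or its packed per-lane register layout, to illustrate why the per-layer arithmetic of a small network maps onto it so directly.

```cpp
#include <array>
#include <cstdio>

// Scalar reference of what one 16x16x16 WMMA operation computes: D = A*B + C.
// The real instruction works on packed registers spread across a wave; this
// only shows the arithmetic.

constexpr int N = 16;
using Tile = std::array<std::array<float, N>, N>;

Tile wmma_reference(const Tile& A, const Tile& B, const Tile& C) {
    Tile D{};
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            float acc = C[i][j];
            for (int k = 0; k < N; ++k)
                acc += A[i][k] * B[k][j];  // multiply-accumulate chain
            D[i][j] = acc;
        }
    return D;
}

int main() {
    Tile A{}, B{}, C{};
    for (int i = 0; i < N; ++i) { A[i][i] = 1.0f; B[i][i] = 2.0f; C[i][i] = 0.5f; }
    Tile D = wmma_reference(A, B, C);
    std::printf("D[0][0] = %f (expect 2.5)\n", D[0][0]);
}
```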
cettm@reddit
https://developer.nvidia.com/blog/nvidia-rtx-neural-rendering-introduces-next-era-of-ai-powered-graphics-innovation/
PointSpecialist1863@reddit
It's just shaders with AI, so yes, AMD can do something similar. The hard part is programming the software, which is not really AMD's strong point, but it can be done with RDNA3 and RDNA4 hardware.
EmergencyCucumber905@reddit
AMD likes to keep shader count proportional to CU count. A shader is a shader whether it's dual-issue or not.
Since they are dual-issue shaders, it's not the same as doubling the CUs. It doesn't give you the ability to schedule more threads at a time.
Even on MI300 where dual issue is quite good they don't count those extra ALUs as shaders.
ResponsibleJudge3172@reddit
The Nvidia SM has 4 partitions, so each can independently issue a tensor or other operation per clock.
fatso486@reddit (OP)
AMD moved away from the RDNA2 days. Navi 23 (RX 6650 XT), with its 11B transistors, performing about the same as a 4060 with freaking twice the transistors at more than 19B. Hope it's worth it.
ResponsibleJudge3172@reddit
It doesn't perform the same. Even the 7600 is slower than the 4060.
high_yield_yt@reddit
Two corrections: The "DMA" part in the block diagram, while being a part of the direct memory access, probably isn't the part that allows SAM/ReBAR.
And second, the SMT/Hyper-Threading example to explain the dual-issue shaders isn't the best one. With SMT, the scheduler uses clock cycles where thread #1 has to wait for something like a memory request, to execute thread #2. Dual-issue shaders actually can execute two instructions at the same time, if they meet certain requirements. So it's not another "thread" running on the ALUs, but some of the instructions are just done two at the same time.
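A rough way to picture the difference (a made-up C++ example, not real scheduler or shader code): SMT fills long stall cycles with work from a different thread, while dual issue executes two independent instructions from the same thread in the same cycle.

```cpp
#include <vector>

// Made-up example: comments mark where each mechanism would help.
float example(const std::vector<float>& a, const std::vector<float>& b,
              const std::vector<int>& idx) {
    float s0 = 0.0f, s1 = 0.0f;
    for (size_t i = 0; i < idx.size(); ++i) {
        // Latency-bound: this gather may stall for a long time. SMT (or a GPU
        // switching to another wavefront) hides the wait by running a
        // *different* thread in the gap.
        float loaded = a[idx[i]];

        // Throughput-bound: these two multiply-adds are independent of each
        // other, so a dual-issue shader can execute both in the *same* cycle
        // for the *same* thread - no second thread involved.
        s0 = s0 + loaded * 0.5f;
        s1 = s1 + b[i] * 2.0f;
    }
    return s0 + s1;
}

int main() {
    std::vector<float> a(256, 1.0f), b(256, 1.0f);
    std::vector<int> idx(256);
    for (int i = 0; i < 256; ++i) idx[i] = (i * 37) % 256;
    return example(a, b, idx) > 0.0f ? 0 : 1;
}
```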
Sevastous-of-Caria@reddit
Danke, these days I look forward to your die shot guesses on flagship silicon more than to the flagship launches themselves ;)
JuanElMinero@reddit
It's nice that we have someone regularly doing these die shot videos after Locuza (also German) hung up his hat some 2 years ago, though I wish he had gotten the same level of success with his analysis content.
Apprehensive-Buy3340@reddit
The answer to the question "how did AMD achieve higher transistor density than Nvidia", revealed towards the end of the video, is: "I don't know".
Otherwise enjoyable video.
Strazdas1@reddit
The most honest answer.
ParthProLegend@reddit
Dang🤣
Qesa@reddit
As an extremely coarse comparison, Navi 48 has 102 MB of SRAM between its caches, shared memory and register file, while GB203 has 95.5 MB. Conversely, it has 8192 FP ALUs to 10752. It has a bit less triangle cull throughput (16/clk vs 21) and a bit more raster (8 tris or 128 px per clock vs 7 or 112), and while direct comparisons are harder, it's safe to say its RT implementation is less sophisticated.
Overall it's got a bit more in some areas, a bit less in others. But the extra 18% transistors clearly aren't translating to 18% more functional units.
There are a ton of ways you can lay out logically equivalent circuits. It's not just a matter of high-performance vs high-density libraries - you can use the simplest possible logic and densely packed transistors, but your achievable clocks will be garbage. To improve speed you can use lots of transistors to minimise any given wire delay, you can increase the spacing to avoid brownouts, you can add decoupling capacitors for the same purpose, or any combination of the above. And on the decoupling capacitor point, there isn't even universal agreement on whether decap cells are counted as transistors or not! Either way the configuration space is massive, it's not clear what the best configuration is, and it can wildly swing transistor counts with no change to the high-level design and only small effects on PPA.
AMD has clearly gone with an approach for RDNA4 that favours lots of highly dense transistors. This isn't really a surprise, since the RDNA3 GCD before it also had a similarly high density - albeit with both less cache and less analogue. But it is in stark contrast to RDNA 1 and 2, which had very low densities.
Quatro_Leches@reddit
TSMC 4N is the same node as the last-gen GPUs, which had the exact same density. Can't compare apples to burgers.
mtthefirst@reddit
Great work. Very easy to follow and understand.
BenchmarkLowwa@reddit
Excellent work, love it! :)
Xbux89@reddit
In dummy terms, is it impressive?
Improvement2242@reddit
I love High Yield's videos. Maybe I should have chosen electrical engineering over physics for my degree ^^
Sevastous-of-Caria@reddit
When it comes to photolithography, EEE is actually further from it than physics is. Maybe you learn logic and basic ICs, but photolithography, VLSI design, materials science and the quantum mechanics behind transistor scaling are all over our courses' heads. You, being in physics, are actually closer to photolithography than we are, imo, as are the chemical engineers involved in the R&D of UV tech and silicon manufacturing. Source: am an EEE undergrad, but if I see a VLSI career path I might get back on it :)))
EmergencyCucumber905@reddit
As a software engineer, to me EEs and silicon design engineers are like wizards tuned into some higher power. I've learned that if they added some weird instruction, or a chip works in a way you think isn't optimal, don't question it.
moofunk@reddit
As a trained but non-practicing EE, that was so damn stressful that I sometimes wished for something soft and cuddly, like physics.
xole@reddit
I switched to CS after dropping the Electromagnetic Waves class. We were already on our 3rd or 4th professor by the time I dropped it. One got stuck in Mexico, another got sick, and I don't remember what else happened, but I took it as a sign to get out.
I will say, you get pretty good at solving differential equations in EE, although I couldn't solve one now to save my life.
ElementII5@reddit
Funny how that completely gets wiped from your mind, isn't it? Somebody said it's like a muscle you need to train to keep it functional. I disagree; it's like a bruise. It stays purple only as long as you keep hitting it with more differential equations.
Improvement2242@reddit
I feel you. I did 2 semesters of engineering first, but I hated the EE classes (calculate the resistance and current across these three junctions...). Now I am doing semiconductor physics, so I guess it's a compromise lol
moofunk@reddit
For me, there was mostly a culture of "if you don't understand this concept immediately in the way I'm explaining it (poorly) right now, you might as well quit", which shed 50% of the class in the first year.
Some had to take 1-2 extra years to keep up with the courses, and many times you just had to triage which reports you wrote, because there simply wasn't enough time.
It wasn't the subject matter as much as the abject time crunch and teachers lying to your face about the exams to make you flunk.
F*ck that whole degree.
punktd0t@reddit
Why is it marked as a "spoiler"?
fatso486@reddit (OP)
My bad... Fixed