Every Architectural Change For RTX 50 Series Disclosed So Far
Posted by MrMPFR@reddit | hardware | View on Reddit | 162 comments
Disclaimer: Flagged as a rumor due to cautious commentary on publicly available information.
There are some key changes in the Blackwell 2.0 design, aka the RTX 50 series, that seem to have flown under the radar on Reddit and in the general media coverage. Here I’ll be covering those in addition to the more widely reported changes. With that said, we still need the whitepaper for the full picture.
The info is derived from the official keynote and the NVIDIA website post on the 50 series laptops.
If you want to know what the implications are, this igor’sLAB article is good, and this Tom’s Hardware article is worth a read for additional details and analysis.
Neural Shaders
Hardware support for neural shaders is the result of integrating neural networks inside the programmable shader pipeline. This is possible because Blackwell has tighter co-integration of Tensor and CUDA cores, which optimizes performance. In addition, Shader Execution Reordering (SER) has been enhanced with software- and hardware-level improvements; for example, the new reorder logic is twice as efficient as Ada Lovelace’s. This increases the speed of neural shaders.
Improved Tensor Cores
New support for FP6 and FP4 is functionality ported over from datacenter Blackwell, as part of the Second-Generation Transformer Engine. To drive Multiple Frame Generation, Blackwell’s tensor cores have doubled throughput (INT8 and other formats) vs Ada Lovelace, and 4x with FP4.
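As a rule of thumb (generic hardware math, not an NVIDIA-disclosed spec): at a fixed tensor datapath width, each halving of operand precision roughly doubles the ops per clock and halves the per-weight storage, which is where the appeal of the narrow formats comes from:

$$\mathrm{TOPS}_{\mathrm{FP4}} \approx 2 \times \mathrm{TOPS}_{\mathrm{FP8}} \approx 4 \times \mathrm{TOPS}_{\mathrm{FP16}}, \qquad \text{bytes per weight: } \mathrm{FP16}=2,\ \mathrm{FP8}=1,\ \mathrm{FP4}=0.5$$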
Flip metering
The display engine has been updated with flip metering logic that allows for much more consistent frame pacing for Multiple Frame Generation and Frame Generation on 50 series.
Redesigned RT cores
The ray-triangle intersection rate per RT core is doubled yet again, to 8x, as it has been with every generation since Turing. Here’s the ray-triangle intersection rate for each generation:
- Turing = 1x
- Ampere = 2x
- Ada Lovelace = 4x
- Blackwell = 8x
Like previous generations no changes for BVH traversal and ray box intersections have been disclosed.
The new SER implementation also seems to benefit ray tracing, as per the RTX Kit site:
”SER allows applications to easily reorder threads on the GPU, reducing the divergence effects that occur in particularly challenging ray tracing workloads like path tracing. New SER innovations in GeForce RTX 50 Series GPUs further improve efficiency and precision of shader reordering operations compared to GeForce RTX 40 Series GPUs.”
Like Ada Lovelace’s SER it’s likely that the additional functionality requires integration in games, but it’s possible these advances are simply low level hardware optimizations.
RT cores are getting enhanced compression designed to reduce memory footprint. Whether this also boosts performance and bandwidth, or simply implies a smaller BVH storage cost in VRAM, remains to be seen. If it’s SRAM compression then this could be “sparsity for RT” (the analogy is high level, don’t take it too seriously), but the technology behind it remains undisclosed.
All these changes to the RT core compound, which is why NVIDIA made this statement:
”This allows Blackwell GPUs to ray trace levels of geometry that were never before possible.”
This also aligns with NVIDIA’s statements about the new RT cores being made for RTX Mega Geometry (see the RTX 5090 product page), but what this actually means remains to be seen. We can, however, infer reasonable conclusions based on the Ada Lovelace whitepaper:
”When we ray trace complex environments, tracing costs increase slowly, a one-hundred-fold increase in geometry might only double tracing time. However, creating the data structure (BVH) that makes that small increase in time possible requires roughly linear time and memory; 100x more geometry could mean 100x more BVH build time, and 100x more memory.”
The RTX Mega Geometry SDK takes care of reducing the BVH build time and memory costs, which allows for up to 100x more geometric detail and support for infinitely complex animated characters. But we still need much higher ray intersection rates and effective throughput (coherency management), and all the aforementioned advances in the RT core logic should accomplish that. With additional geometric complexity in future games, the performance gap between generations should widen further.
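Reading that quote as rough asymptotics (my paraphrase, not wording from the whitepaper):

$$T_{\text{trace}}(N) \sim O(\log N), \qquad T_{\text{BVH build}}(N) \sim O(N), \qquad M_{\text{BVH}}(N) \sim O(N)$$

So 100x the triangles costs roughly 2x the trace time but roughly 100x the BVH build time and memory, which is exactly the cost RTX Mega Geometry and the new RT core compression are aimed at.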
Hardware Advances Powering MFG and Enhanced DLSS Transformer Model
With Ampere, NVIDIA introduced sparsity, a feature that allows pruning of trained weights in the neural network. This compression enables up to a 2x increase in effective memory bandwidth and storage, and up to 2x more math throughput. Ada Lovelace doubles these theoretical benefits with structural sparsity support.
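To make the sparsity feature concrete, here’s a minimal numpy sketch of the 2:4 fine-grained structured pruning pattern NVIDIA describes for Ampere’s sparse tensor cores (illustrative only; the real flow prunes during training and stores metadata so the hardware can skip the zeros):

```python
import numpy as np

def prune_2_4(weights):
    """Zero the 2 smallest-magnitude values in every group of 4 weights (2:4 structured sparsity)."""
    w = weights.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(w), axis=1)[:, :2]  # the two smallest-magnitude entries per group of four
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

dense = np.random.randn(8, 8).astype(np.float32)
sparse = prune_2_4(dense)
print((sparse == 0).mean())  # -> 0.5: half the weights are zero, which is where the "up to 2x" math claim comes from
```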
For the new MFG, FG, and the transformer-enhanced Ray Reconstruction, Upscaling and DLAA models, it’s likely they’re built from the ground up to utilize all the architectural benefits of Blackwell: structural sparsity for dense math, plus FP4, FP6 and FP8 support (Second-Generation Transformer Engine).
Whether DLSS CNN models use the sparsity feature is undisclosed.
NVIDIA said the new DLSS 4 transformer models for ray reconstruction and upscaling have 2x more parameters and require 4x higher compute. How this translates to ms overhead vs the CNN model is unknown, but don’t expect a miracle; the ms overhead will be significantly higher than the CNN version’s. This is a performance vs visuals trade-off.
Here’s the FP16 tensor math throughput per SM for each generation at iso-clocks:
- Turing: 1x
- Ampere: 1x (2x with sparsity)
- Ada Lovelace: 2x (8x with sparsity + structural sparsity), 4x FP8 (not supported previously)
- Blackwell: 4x (16x with sparsity + structural sparsity), 16x FP4 (not supported previously)
And as you can see, the deltas in theoretical FP16 throughput, the lack of support for FP4–FP8 tensor math (Transformer Engine), and sparsity will worsen the model’s ms overhead and VRAM storage cost with every previous generation. Note this is relative, as we still don’t know the exact overhead and storage cost for the new transformer models.
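As a back-of-the-envelope sketch using the multipliers from the list above (theoretical per-SM ceilings at iso-clocks, not measured DLSS frame times; the 4x figure is NVIDIA’s stated compute increase for the transformer model):

```python
# Relative tensor throughput per SM at iso-clocks, copied from the list above
# (the "with sparsity" figures). Theoretical ceilings only, not measured numbers.
rel_throughput = {"Turing": 1, "Ampere": 2, "Ada Lovelace": 8, "Blackwell": 16}
transformer_cost = 4  # NVIDIA: the transformer model needs ~4x the compute of the CNN

for arch, rate in rel_throughput.items():
    rel_time = transformer_cost / rate  # tensor time per frame relative to the CNN model on Turing
    print(f"{arch:12s} ~{rel_time:.2f}x the Turing-CNN tensor time per SM")
```

Real frame-time overhead won’t track these ceilings one-to-one, but it shows why the gap should widen on older cards.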
Blackwell CUDA Cores
During the keynote I heard that an Ada Lovelace SM and a Blackwell SM are not apples to apples at all. Based on the limited information given during the keynote by Jensen:
"...there is actually a concurent shader teraflops as well as an integer unit of equal performance so two dual shaders one is for floating point and the other is for integer."
NVIDIA's website also mentions this:
"The Blackwell streaming multiprocessor (SM) has been updated with more processing throughput"
How this implementation differs from Ampere and Turing remains to be seen. We don’t know if it is a beefed-up version of the dual-issue pipeline from RDNA 3, or if the datapaths and logic for each FP and INT unit are “Turing doubled”. Turing doubled is most likely, as RDNA 3 doesn’t advertise dual issue as doubled cores per CU. If it’s an RDNA 3-like implementation and NVIDIA still advertises the cores, then it is as bad as the Bulldozer marketing blunder, which advertised 4 true cores as 8.
Here are the two options for Blackwell compared at the SM level against Ada Lovelace, Ampere, Turing and Pascal (a quick arithmetic sketch of what each layout implies follows the list):
- Blackwell dual issue cores: 64 FP32x2 + 64 INT32x2
- Blackwell true cores: 128 FP32 + 128 INT32
- Ada Lovelace/Ampere: 64 FP32/INT32 + 64 FP32
- Turing: 64 FP32 + 64 INT32
- Pascal: 128 FP32/INT32
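A quick sketch of what those layouts imply per SM per clock, counting issue lanes only (my own arithmetic, ignoring schedulers, register files and caches; the dual-issue option would count to the same peaks in this crude model):

```python
# Pure lane counting for the layouts above. "shared" lanes can run FP32 or
# INT32 on a given clock, but not both at once.
layouts = {
    "Blackwell ('Turing doubled' reading)": {"fp32": 128, "int32": 128, "shared": 0},
    "Ada Lovelace / Ampere":                {"fp32": 64,  "int32": 0,   "shared": 64},
    "Turing":                               {"fp32": 64,  "int32": 64,  "shared": 0},
    "Pascal":                               {"fp32": 0,   "int32": 0,   "shared": 128},
}

for name, l in layouts.items():
    peak_fp32 = l["fp32"] + l["shared"]   # every shared lane running FP32
    mixed_fp  = l["fp32"]                 # FP32 still issued when the shared lanes switch to INT32
    mixed_int = l["int32"] + l["shared"]  # INT32 issued on that same clock
    print(f"{name:38s} peak {peak_fp32:3d} FP32/clk | mixed {mixed_fp:3d} FP32 + {mixed_int:3d} INT32")
```

Note that peak FP32 per clock comes out the same (128) for the Turing-doubled reading of Blackwell and for Ada/Ampere; the difference is only how much INT32 can run concurrently, which is why advertised TFLOPS alone won’t settle the question.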
Many people seem baffled by how NVIDIA managed more performance (Far Cry 6) per SM with the 50 series despite the sometimes lower clocks compared to the 40 series. This could explain some of the increase.
Media and Display Engine Changes
Display:
”Blackwell has also been enhanced with PCIe Gen5 and DisplayPort 2.1b UHBR20, driving displays up to 8K 165Hz.”
The media engine’s encoder and decoder have been upgraded:
”The RTX 50 chips support the 4:2:2 color format often used by professional videographers and include new support for multiview-HEVC for 3D and virtual reality (VR) video and a new AV1 Ultra High-Quality Mode.”
Hardware support for 4:2:2 is new and the 5090 can decode up to 8x 4K 60 FPS streams per decoder.
5% better quality with HEVC and AV1 encoding + 2x speed for H.264 video decoding.
Improved Power Management:
”For GeForce RTX 50 Series laptops, new Max-Q technologies such as Advanced Power Gating, Low Latency Sleep, and Accelerated Frequency Switching increases battery life by up to 40%, compared to the previous generation.”
”Advanced Power Gating technologies greatly reduce power by rapidly toggling unused parts of the GPU.
Blackwell has significantly faster low power states. Low Latency Sleep allows the GPU to go to sleep more often, saving power even when the GPU is being used. This reduces power for gaming, Small Language Models (SLMs), and other creator and AI workloads on battery.
Accelerated Frequency Switching boosts performance by adaptively optimizing clocks to each unique workload at microsecond level speeds.
Voltage Optimized GDDR7 tunes graphics memory for optimal power efficiency with ultra low voltage states, delivering a massive jump in performance compared to last-generation’s GDDR6 VRAM.”
Laptops will benefit more from these changes, but desktops should still see some benefits. These will probably come mostly from Advanced Power Gating and Low Latency Sleep, but it’s possible they could also benefit from Accelerated Frequency Switching.
GDDR7
Blackwell uses GDDR7 which lowers power draw and memory latencies.
Blackwell’s Very High Compute Capability
The ballooned compute capability of Blackwell 2.0, or the 50 series, at launch remains an enigma. Normally the compute capability of a card at launch trails the version of the official CUDA toolkit by years, but this time it’s the opposite: the CUDA toolkit trails the compute capability of Blackwell 2.0 by 0.2 (12.8 compute capability vs 12.6 CUDA toolkit). Whether this supports Jensen’s assertion of consumer Blackwell being the biggest architectural redesign since programmable shading was introduced (NVIDIA marketed the 1999 GeForce 256 as the world’s first GPU, though programmable shaders only arrived with the GeForce 3 in 2001) remains to be seen. The increased compute capability number could have something to do with neural shaders and tighter Tensor and CUDA core co-integration, plus other undisclosed changes. But it’s too early to say what’s behind it. A quick way to check what a given card reports is shown after the list below.
For reference here’s the official compute capabilities of the different architectures going all the way back to CUDA’s inception with Tesla in 2006:
- Note: As you can see, in one generation from Ada Lovelace to Blackwell, compute capability takes a larger numerical jump than across the three generations from Pascal to Ada Lovelace.
- Blackwell: 12.8
- Enterprise – Blackwell: 10.0
- Enterprise – Hopper: 9.0
- Ada Lovelace: 8.9
- Ampere: 8.6
- Enterprise – Ampere: 8.0
- Turing: 7.5
- Enterprise – Volta: 7.0
- Pascal: 6.1
- Enterprise – Pascal: 6.0
- Maxwell 2.0: 5.2
- Maxwell: 5.0
- Big Kepler: 3.5
- Kepler: 3.0
- Small Fermi: 2.1
- Fermi: 2.0
- Tesla: 1.0 + 1.3
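If you want to check what your own card reports (a minimal sketch assuming a PyTorch install with CUDA; other toolchains expose the same major.minor pair):

```python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
    # e.g. an RTX 4090 reports 8.9; the post above lists consumer Blackwell at 12.8
```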
EmergencyCucumber905@reddit
CUDA toolkit version and compute capability are two different things.
MrMPFR@reddit (OP)
Yes you're right they have nothing to do with each other and I have removed the part suggesting that.
konawolv@reddit
You still have CUDA compute capability as v12. I don't think that is the case. The CUDA SDK is on version 12, but compute capability is on v10.
MrMPFR@reddit (OP)
I'm just repeating what NVIDIA reports for 50 series here. A week after the launch and still at compute capability 12.8. Very odd.
WHY_DO_I_SHOUT@reddit
Correction: shaders were introduced in GeForce3 in 2001.
Pinksters@reddit
Not to mention GeForce wasn't nearly the "Worlds first" GPU.
There were MANY "GPUs" before then but the term wasn't coined at the time.
MrMPFR@reddit (OP)
Yes indeed, but I believe they offloaded the remainder of the rendering pipeline to the GPU. Before the GeForce 256, a lot of the rendering was still done on the CPU.
ibeerianhamhock@reddit
Definitely wasn't the first graphics card with hardware transformation, clipping, and lighting.
PSX, Saturn, and N64 all had hardware TCL coprocessors that performed these functions.
But yeah I suppose it was the first home graphics card with hardware TCL capabilities and it was all on the same die.
MrMPFR@reddit (OP)
Thanks for the info. Probably shouldn't place too much faith in anything NVIDIA says.
nismotigerwvu@reddit
That's really only true in the context of gaming oriented cards for desktops. The professional market (think SGI and 3D Labs) used this kind of approach from the very start (late 80's if my memory serves correctly) since trying to run geometry calculations on a 386 (well 387 is more correct here I guess) is a baaaaaaad idea. The biggest reason for 3Dfx's early success was that they were able to correctly predict which stages made the most sense to kick back out to the (rapidly evolving) CPU and what stages to double down on in 1996. The REALLY interesting aspect here is that the T&L engine on the GeForce 256 was MASSIVELY underpowered and even at launch, typical CPUs could outpace it. It's easy to see there was such a swift pivot to putting some control logic in front of those ALUs. In all fairness to NV, CPUs were doubling in clock speed annually AND gaining new features/higher IPC so it would have been a monumental task to outcompete the Athlon or Pentium III during that time.
f3n2x@reddit
The idea of a GPU is to have the entire pipeline on-chip. This absolutely wasn't the case for SGI which had a myriad of chips doing different things, similar to the Voodoo cards but on a much bigger scale.
nismotigerwvu@reddit
Okay, if we're saying only single chip solutions, Permedia NT still predates the GeForce 256 by years. This isn't meant as a dig or to diminish the importance of the card, it's just that the marketing was a little bombastic. In the end, NV coined the term and can apply it as they wish, but it's simply just marketing fluff.
Adromedae@reddit
Neither Permedia nor Geforce 256 were proper 'GPUs' either, since they only implemented the back end of a traditional (GL) graphics pipeline.
Adromedae@reddit
The term GPU had been in use since the 70s. And just like CPU, it doesn't necessarily imply a single chip implementation.
FWIW according to that standard, NVIDIA didn't have a proper GPU until G80. ;-) Since prior to Tesla, most of the geometry transforms were still done on the CPU (MMX/SSE) rather than on the GeForce itself.
capybooya@reddit
I remember the release, I and several others thought the 'GPU' labeling was a bit cringe, and there was definitely arguing about it on the internet. Very good marketing move though.
MrMPFR@reddit (OP)
Thanks for the GPU history lesson. Exciting read.
halgari@reddit
Fun fact, when SGI went downhill a whole group of the engineers left, and joined this new startup called NVidia.
dankhorse25@reddit
My Riva 128 was certainly a GPU.
Plank_With_A_Nail_In@reddit
GPU term was also first used by Sony in relation to the playstation.
JakeTappersCat@reddit
Actually the ATI Radeon DDR had shaders before the Geforce 3 and was the first GPU with full DirectX 8 support
MrMPFR@reddit (OP)
Corrected the mistake.
FloundersEdition@reddit
Nvidia's spec sheet indicates 1.33x RT performance per core and clock, so not a doubling. IIRC 94 RT TFLOPS on the 5070 vs 67 on the 4070.
Nvidia wasn't too keen to talk about their definition of TOPS, but claims 2x per core and clock. Probably the same, but with data sizes cut in half. Otherwise they may have only increased INT matrix throughput due to the new SIMDs.
They probably went with a "Turing 2" layout. Ada seems to natively execute warp32 on 16-wide SIMDs in two cycles, but has another 16-wide FP32 pipe. AMD executes wave32 on a 32-wide SIMD in a single cycle.
MrMPFR@reddit (OP)
Rewatched the RTX 4090 reveal again and something didn't look right regarding 2080 Ti vs 3090 Ti vs 4090 AI TOPS. Apparently AI TOPS is a BS marketing term which means the highest possible AI FPx throughput, proof is here. So your suspicion was not without merit. NVIDIA are using FP16 for the RTX 20 series, sparse FP16 for the 30 series, sparse FP8 for the 40 series, and sparse FP4 for the 50 series. The underlying FP16 and INT8 throughput has remained unchanged on a per-SM basis since Turing; only additional FPx functionality + sparsity has been added.
Sorry for the confusion and I'll edit the post and probably do a post in r/hardware given how many people have watched the post. We also need to quell the AI TOPS panic for the older cards. People think DLSS Transformer won't be able to run on 20 and 30 series.
MrMPFR@reddit (OP)
RT tflops is an aggregate of various RT metrics like BVH traversal, ray box intersections and ray triangle intersections. I only listed ray intersection because it's the only one I can find mentioned in the White Papers with Ampere and later gens.
AI tops is based on INT4 math, and not FP8 and FP4. Has been that way since Turing (go through the whitepapers). I doubt that tensor FP and INT math has remained 1/1 since Turing, but it's impossible to know for sure.
Thanks for the info.
Cute-Pomegranate-966@reddit
Jensen quoted during announcement:
"And of course 125 Shader TFLOPs, there is actually a concurrent shader TFLOPs as well as an integer unit of equal performance, so 2 dual shaders, 1 is for FP one is for integer"
so that leans towards 128 FP32 and 128 INT32 per SM.
TheNiebuhr@reddit
That directly contradicts history. They had that with Turing and decided that 2:1 FP/INT ratio was going to be much more balanced for graphics and rendering, and indeed they stuck with it 4 years.
Anyways, LJM's explanation was terrible; nothing can truly be drawn from it. Just wait for the whitepaper.
ResponsibleJudge3172@reddit
They didn't; they had 64 FP32 + 64 INT32 on Turing, while Ampere and Ada had 64 FP32 + 64 FP32/INT32.
Hopper has 128 FP32 + 64 INT32 + 16 FP64.
Maybe Blackwell has 128 FP32 + 128 INT32.
TheNiebuhr@reddit
They had the same number of Int and Float, which is the entire point. It's empirically proved that having more FP than Int is better for rendering and GPGPU computing. 6 years of Geforce hardware shows it's the superior design.
Cute-Pomegranate-966@reddit
It's not really relevant if it contradicts history, this is simply what the man said. I am of course interested in the white paper.
ChrisFromIT@reddit
It seems odd to go back to Turing's dual shader setup, as Nvidia found that roughly 70% of instructions are FP32, with 30% being INT32. That was why Ampere had a very good performance jump compared to Turing: its 32 FP32 + 32 FP32/INT32 layout better reflected that ratio, allowing the best footprint per core.
MrMPFR@reddit (OP)
That's only at 4K; 1440p (35% INT) and 1080p are much more integer heavy (~40%). With the increased reliance on upscaling it actually makes sense to revert to a Turing-like SM design.
Yes at 4K, but the gains were smaller at 1440p and 1080p across the entire stack.
ChrisFromIT@reddit
Resolution doesn't change what operations are done in the shaders, so that information is wrong.
MrMPFR@reddit (OP)
Are you sure? Read somewhere (can’t remember where) that integer math rendering workloads scales less with increased pixels vs floating point rendering workloads.
This is also reflected in Ampere’s odd scaling across resolutions where gains vs Turing are much larger at 4K than 1440p and 1080p.
ChrisFromIT@reddit
Yes, I am sure.
The only thing that changes is the number of pixel shading operations. So if a game has compute shaders that are heavy on integer math, and those compute shaders don't scale with the number of pixels, then, yes, the integer-to-floating-point workload ratio will change based on resolution.
But that isn't exactly common.
MrMPFR@reddit (OP)
Understood. Misleading comments deleted.
MrMPFR@reddit (OP)
Indeed. The other outcome (dual issue) is very unlikely.
It'll be interesting how this doubled concurrent theoretical throughput translates to actual performance across different workloads on a per SM basis vs Ada Lovelace.
Elios000@reddit
Has nV given up on DirectStorage? Seems like there was a bit about it and nothing else. I feel like if these cards could make use of it, the lower VRAM wouldn't be as much of an issue.
ResponsibleJudge3172@reddit
Microsoft does not even support the vision of bypassing CPU entirely that Jensen talked about, and what they do support literally was years late. I too have given up
MrOmgWtfHaxor@reddit
The tech is there, but it's up to the dev to learn it and choose to fully utilize it. I'd assume right now it's not super utilized due to devs focusing more on compatibility with older cards and non-NVMe drives.
https://steamdb.info/tech/SDK/DirectStorage/
ProjectPhysX@reddit
Blackwell CUDA cores don't have FP32 dual-issuing, according to Nvidia's website. They are still (64 FP32/INT32 + 64 FP32), same as Ampere/Ada. Dual-issuing only is a (not particularly useful) thing on AMD's RDNA3.
MrMPFR@reddit (OP)
Can you link to the part where it says that? 99.9% sure the SM is changed. Jensen confirmed it during the keynote + the laptop 50 series post says the SM has been redesigned for more throughput.
We're prob getting Turing doubled SM: 128 INT32 + 128 FP32, I didn't think dual issue is likely, just added it to be more cautious and avoid takedown of post.
ProjectPhysX@reddit
https://www.nvidia.com/de-de/geforce/graphics-cards/compare/
Here it says:
It's the same CUDA cores as Ampere and Ada. Probably not even 2x FP16 throughput in tensor cores compared to Ada. Only real new thing they added was support for FP4 bit sludge to be able to claim higher perf in apples-to-oranges comparisons with FP8 on Ada.
MrMPFR@reddit (OP)
Doesn't prove anything. Ampere and Ada have a dedicated FP32 path + a shared FP32 + INT32 path (similar to the Pascal implementation). This is not reflected in the comparison because it only shows FP32 throughput and not the entire SM implementation.
Jensen said that the integer and floating point were concurrent and they were using dual shaders for both + read the post I quoted. This is not Ada Lovelace CUDA cores, it's Turing doubled.
That remains to be seen, but the AI tops have used INT4 throughput since Turing, the AI tops for ADA is INT4 not FP8. Compare the number in the Whitepaper with the number on their website for the 4090, it's INT4.
I know, which is why I keep referencing consumer as Blackwell 2.0, because that's the leaked name on TechPowerUp.
ProjectPhysX@reddit
The dedicated FP32 + FP32/INT32 path is exactly what Ampere/Ada have, and what Jensen referred to in the keynote. This is not new. Pascal can do FP32/INT32 on all CUDA cores. Nvidia stating "2x FP32" on Blackwell/Ada/Ampere refers to peak FP32 throughput, which is the same for those three architectures.
Your claim of doubled (Turing) throughput is just wrong; Turing was also a particularly bad design, as more than half of the dedicated INT32 cores were idle at any time. Massive silicon area for nothing.
ResponsibleJudge3172@reddit
Read what was said: Ampere and Ada DO NOT have equal peak integer and float performance. Float is 2x INT because the design is
64 FP32 + 64 FP32/INT32.
What we heard is 2 dual independent shaders. INT has not been independent for Ampere or RTX 40. The only scenario that makes sense, if he is not lying, is
(64 FP32 + 64 FP32) + (64 INT32 + 64 INT32), which is something not seen before. It doesn't increase TFLOPS over the previous gen, but it allows better compute scaling.
tioga064@reddit
Consumer Blackwell seems really interesting, lots of uarch changes on every aspect, not just the standard raster and RT improvements. Can't wait for the Blackwell whitepaper and some reviews of the encoding/decoding capabilities, plus a review of the newly implemented flip metering. Frame reprojection also seems nice. It's incredible that Nvidia is basically adding every feature we always asked for, and they for sure are charging for it lol
MrMPFR@reddit (OP)
Think we're in for a real treat. The whitepaper is prob going to be as long as or longer than the Turing one. There's so much new functionality to explain, e.g. what the neural shader integration is about, etc.
Jensen would not make the architecture claim (biggest since 1999) if it wasn't a big deal. I've not seen him this bold since Turing which was arguably the biggest architectural change since Tesla.
Indeed tons of advances as requested both in software and hardware.
From-UoM@reddit
Turing will easily be one of the most influential generations of all time.
It introduced hardware accelerated RT and Performance boosts using AI.
Set the standard for GPUs.
It also allowed Tensor Cores to reach the masses.
Vb_33@reddit
And everyone hated it for like 3 years :(
Christian_R_Lech@reddit
The initial RTX 2000 lineup (2080 Ti, 2080, and 2070), when it launched, provided essentially zero performance per dollar improvement over the preceding GTX 1000 series with only the 2080 Ti providing any performance improvement over the top end Pascal Titan.
Personally, I do remember being a defender of Turing, even if I never got one, because I knew back then that ray tracing was more than a gimmick and would eventually become the future of real-time computer graphics. There was going to be a cost initially to incorporating RT hardware into GPUs.
shadAC_II@reddit
So Turing was in this case just like Ada (no/minimal perf/$ increase), but had more architectural improvements than Ada.
Gaming Blackwell seems to be extremely interesting and I see some similarities to Turing. So far there's a large backlash due to not enough of a performance increase, but this time you get all the architecture changes without a large cost increase. Very interesting indeed.
Nicholas-Steel@reddit
because although it could achieve real time Ray Tracing, you had to have a 2070Ti or higher to maybe average 30FPS. It was useless on weaker cards outside of non-real-time gameplay.
Ryankujoestar@reddit
There wasn't a 2070Ti, I assume you meant Super
Nicholas-Steel@reddit
Thanks
Beylerbey@reddit
"Maybe" averaging 30fps is a stretch, unless you're saying without DLSS at 4k, but DLSS was born to make RT possible.
Control (the "richest" RT implementation when it came out) on a 2080:
- High RT: 1080p 56 fps, 1440p 35 fps, 4K 20 fps
- High RT + DLSS: 85, 61 and 36
- Medium RT: 74, 51, 29
- Medium RT + DLSS: 103, 74, 47
Quake II RTX with path tracing got to around 60fps 1080p at max settings after a couple patches (4-6x 1080 Ti).
In terms of RT Turing was surprisingly capable for a first gen product, but back then many people still believed Nvidia was scamming everyone with the RT cores, waiting for AMD to demonstrate how there was no need for dedicated specialized hardware, and mostly upset because of the low improvement compared to Pascal in terms of raster performance.
Nicholas-Steel@reddit
IIRC it took a year or two before DLSS 1.9 released, which used AI training and improved significantly.
Beylerbey@reddit
DLSS always had AI training since it uses a deep learning model to base its reconstruction on, I think you're remembering that the earliest version of Control's implementation didn't rely on tensor cores for upscaling but would run on shaders.
What made it much more viable since 2.0 was that they were able to use a generalized model for all games. Before 2.0, models had to be trained by Nvidia for each game, and there was no "switch on" method to add it; it had to be implemented from scratch, which made adoption quite slow at first. I still remember when they presented the RTX branch of UE4 with plug-and-play DLSS; all it needed was ticking a box and that was it (optimizing it was another matter of course).
Nicholas-Steel@reddit
Right, I must have gotten confused. You're right that it was always AI Trained, it's literally what the name stands for lol. I must have been vaguely thinking of the change from game specific to generalized training.
I am also sure I've been told by some people that the change to generalized training happened before v2.0 but most of the marketing around the change was when 2.0 released and that's what everyone remembers as being the start of this change. I could be wrong, maybe the people telling me this were wrong.
ShadowRomeo@reddit
Yet your average r/pcmasterrace folks will only care about this one nothing else matters lol
Renanina@reddit
They only care about games when the subreddit is more than just that. Had a guy trying to tell me not to buy a 5090 because of "buyer's remorse". What kind of bullshit is that? Especially when the focus is just the occasional LLM work and not something I'd rather buy a 5070 Ti for. Before the 2070, I had the 970. I'm not planning to wait another generation, and since it's a generation of AI implementation, people hate it, but I always liked the things that ppl hate that eventually become bigger in the long run due to the efficiency of utilizing that tool... in this case... it's what everyone calls AI lol
Traditional-Lab5331@reddit
They care about wrong reason gaming. If Blackwell can deliver x4 FG at lower latency than current FG, then it's a win and completely usable for games. Lossless Scaling now provides a usable x3 mode that has no latency increase feeling to it, Blackwell is going to do it so much better. Too many people believe they need 400hz and 2ms latency to play Roblox and they have a reaction time of 300ms.
Gaming isn't competitive online bullshit battle Royale. It's a rich story and crafted worlds. We need to get back to games that are fun and not this quick turn out multiplayer junk.
lubits@reddit
I wonder if the compute capability of 12.8 is a typo, either that or I think Nvidia is going to do something fucky like reserving certain features for data center GPUs from cc 9-11.
MrMPFR@reddit (OP)
They’ve been reserving functionality for Data center for a long time. But yeah this big jump in compute capability is very odd
lubits@reddit
Ah true. In this case, I meant the TMA and warpgroup mma instructions.
MrMPFR@reddit (OP)
Interesting, we'll see. That Whitepaper can't come soon enough.
bAaDwRiTiNg@reddit
So DLSS4 upscaling improvements will come at the price of a higher performance cost then. Could this cost be significantly higher for let's say RTX2000 cards than RTX5000 cards?
MrMPFR@reddit (OP)
Yes I explained that in the post. Fear it is going to run like shit on 20 and 30 series.
Acrobatic-Paint7185@reddit
Here's a table with the cost of DLSS3 upscaling, from Nvidia's documentation: https://imgur.com/qXflrYd
Multiply by 4x to get the cost of the new Transformer model. It won't be cheap, especially at higher output resolutions.
Fever308@reddit
I took the 4x compute statement as meaning it used 4x more compute to train the model, not that it takes 4x more to run.
Acrobatic-Paint7185@reddit
"4x more compute during inference"
Inference means literally running the model.
Fever308@reddit
Please show me anywhere it mentions "inference" cause every official Nvidia article I've read hasn't mentioned that at all.
Acrobatic-Paint7185@reddit
https://youtu.be/qQn3bsPNTyI?t=4m22s
MrMPFR@reddit (OP)
Thanks for the table. Transformer and CNN models can't be compared apples to apples due to underlying differences in the architecture. With that said we should expect a much larger overhead with DLSS transformer.
RedIndianRobin@reddit
It has a performance overhead of 5% as per an Nvidia spokesperson, so it's not a whole lot.
midnightmiragemusic@reddit
Nobody said that. Stop making stuff up.
RedIndianRobin@reddit
NVIDIA's tech marketing manager:
Alexandre Ziebert
Link to his profile:
Alexandre Ziebert (@aziebert) / X
MrMPFR@reddit (OP)
Doesn't tell me a lot, what's the resolution, FPS and quality of upscaling used. Fingers crossed it'll run as well as he says.
RedIndianRobin@reddit
Yeah 5% is alright. I am guessing the overhead will be less on 40 and 50 series but more so on the 20 and 30 series.
MrMPFR@reddit (OP)
Yep sounds fine. For sure, the newer cards have tensor cores built for transformer processing.
midnightmiragemusic@reddit
Even the 40 series?
MrMPFR@reddit (OP)
Yes they ported the transformer engine functionality from Hopper to Ada Lovelace in 2022.
midnightmiragemusic@reddit
Wow, I stand corrected. Thank you for sharing! I'm even more excited for this tech now!
MrMPFR@reddit (OP)
Here's the full Twitter thread. Does sound like it's less heavy than we feared.
MrMPFR@reddit (OP)
Can you post the link to the statement?
RedIndianRobin@reddit
I can't. It was said in twitter, I tried to find it but my posts got drowned. I will see if I can dig it up when I am on my PC.
Apprehensive-Buy3340@reddit
Better hope there's gonna be a software switch between the two models, otherwise we're gonna have to downgrade the DLL manually
yaosio@reddit
They said you'll be able to switch between the CNN and Transformer versions in the app. They are also adding native support for backporting the newest version of DLSS to older games. All of this is transparent to the game.
Apprehensive-Buy3340@reddit
That's great then
MrMPFR@reddit (OP)
Note when I said run like shit it doesn’t mean it won’t work but a card like the 2060 or 2070 could have a low FPS cap. For cinematic experiences (excluding very high FPS) I still think it’ll be good on 20 series and 30 series.
The Cyberpunk 2077 5090 early footage with Linus from LTT shows the game UI where you can toggle between DLSS Convolutional Neural Network and transformer. Seems pretty likely they’ll continue to support both versions.
We need to think of the transformer mode as DLSS overdrive to better understand the difference I think.
GARGEAN@reddit
I would expect still having a noticeable performance uplift with DLSS 4 SR compared to native, even if the uplift will be noticeably lower than DLSS 2 SR on the 30 and especially 20 series. It might even be negated by the better image quality allowing easier use of lower internal res modes (DLSS Balanced replacing standard DLSS Quality for 1440p, etc.).
MrMPFR@reddit (OP)
Interesting thoughts, and yes that's certainly a possibility if it's as good as NVIDIA claims. Can't wait for HUB, DF and GN to do some testing on how it'll run on 30 and 20 series.
ResponsibleJudge3172@reddit
The actual effect scales with native FPS. The worse the average FPS, the less DLSS speed will matter, to a certain level.
At 120 FPS, the frame-time budget DLSS needs to fit into is much smaller than if native gives you 40 FPS.
That could be why the RTX 2080 can do it.
midnightmiragemusic@reddit
Hi, thank you for your post!
Based on everything we know so far, do you think the 40 series will run the DLSS transformer model well? I have a 4070 Ti Super so I'm curious. How do you think it will compare to the 50 series?
MrMPFR@reddit (OP)
Will prob run slower than on the 50 series, but the difference shouldn't matter. The 40 series already has very strong tensor cores.
nukleabomb@reddit
I don't remember where I read/heard it, but apparently running DLSS (CNN) only resulted in sub-20% usage of the tensor cores. This was a while back.
f3n2x@reddit
The current spatial DLSS model is extremely light, basically just a blip on nsight, so even 4x should barely move the needle.
ResponsibleJudge3172@reddit
Maybe those will be stuck with say performance mode
bubblesort33@reddit
I'm still skeptical about its raster performance. Not that it matters that much when you hit RTX 5070 levels. But the fact they haven't shown a single title without ray tracing, is a bit odd.
Vb_33@reddit
Expect 20-35% faster than their precursor.
bubblesort33@reddit
I don't know if I should. If I look at RT results, sure. Given Nvidia's claims that rasterization gains are too hard to get now, and that they are almost giving up on them, I'm skeptical there is much here at all. Personally I'm expecting under 20%. I know they have slides with RT performance, but there should be reasons why those are the way they are that don't apply to other games in raster. I think there is a good reason why they dropped prices, and people might feel misled at release time after reviews.
Vb_33@reddit
20% is worse than 4070 to 4070 Super; it can't be that bad with an actual architectural change.
bubblesort33@reddit
The architecture changed mainly for machine learning reasons, and maybe ray tracing as well. AMD doubled some compute going from RDNA2 to RDNA3 by having dual-issue SIMD32, but their gains in gaming per core were like 5% per clock, if not less. I think AMD responded to someone somewhere and even said it was a 4% IPC increase on average. I have no idea how they saw that as worth doing. But RDNA3 is twice as fast in Stable Diffusion as RDNA2, so maybe that was the idea. But they've done nothing with that.
redsunstar@reddit
I'm not, it's pretty obvious it's going to be lackluster. Taking a few steps back, raster performance is determined by raw compute power, which is in turn dependent on die size and transistor performance and size. We're on the same lithography process as Ada and transistor counts haven't ballooned, so expecting large improvements on the raster side is a pipe dream.
This is another way of saying that raster performance improves when transistor manufacturing improves. There are some exceptions when a company figures out some computing inefficiency in their architecture, the Pascal generation comes to mind, but as I said, this is uncommon and there comes a point when things are already as efficient as they can be.
To be frank, I'm eagerly awaiting the 4080S vs 5080 benchmarks; we're looking at similarly sized chips and similar frequencies, though with much faster memory on the 5080. If Nvidia manages to get more than 20% raster performance out of the 5080, that's a good feat of engineering.
bubblesort33@reddit
Look what op wrote regarding compute, and the large compute jump Blackwell made. That is as pretty big jump if true. Question is if that's the kind of compute needed for games, or needed for AI and other things.
redsunstar@reddit
OP posits a jump in compute due to a doubling up of FP32 units, in some way, in every SM.
We don't know that. Nvidia says more throughput per SM but they didn't exactly say how they achieved that. Given that GB203 and AD103 are essentially the same size despite GB203 having more space dedicated to AI and a handful more SMs, I think it is unlikely that a sweeping change such as doubling up the number of FP32 units per SM has been enacted. It is likely that minor tweaks were made to improve efficiency, possibly through better utilisation, but those are always limited.
MrMPFR@reddit (OP)
"...there is actually a concurent shader teraflops as well as an integer unit of equal performance so two dual shaders one is for floating point and the other is for integer."
And
"The Blackwell streaming multiprocessor (SM) has been updated with more processing throughput"
This sounds a lot like Turing doubled, but it doesn't align with what we know about the die sizes. Hope Blackwell 2.0 Whitepaper explains this properly.
redsunstar@reddit
I agree that "it sounds a lot like", but I'll wait for more information, I'm not that convinced by what has been shown yet.
MrMPFR@reddit (OP)
The gains are theoretical. Just because you double math units doesn't mean double performance. All the underlying support logic and data stores (VRF and cache) also needs to be doubled to see good scaling.
They're reverting to 1/1 instead of 2/1 FP+INT because 1080p is much more integer heavy and NVIDIA is relying on upscaling more than ever.
fogoticus@reddit
Not really. Raster performance is limited by CPU power as well. And GPUs have continued to improve even significantly compared to CPUs. And Hardware Unboxed showed that even with games maxed out, a better CPU will offer better fps even in scenarios where you think you hit the ceiling of the GPU.
That's why someone with a 4090 for example could buy the next X3D CPU from AMD and they will see better fps in a lot of titles.
I don't think Nvidia is hiding anything but just that they wanted to put all their focus on DLSS.
cyperalien@reddit
i don't understand how they managed to add a separate 128 int32 pipe without any increase in transistors per SM especially when they need to beef up other parts of the SM like the warp scheduler and the register files to be able to utilize it.
jasmansky@reddit
I'm no expert on the subject matter but isn't the 2x more parameters and 4x more compute claim for the upcoming DLSS4 transformer model referring to the model training that's done on Nvidia's supercomputers rather than the inferencing done on the GPU?
MrMPFR@reddit (OP)
Inference speed. This is the biggest downside of transformers: compute for the attention step scales roughly quadratically (n²) with the amount of context being attended over, rather than linearly. But since CNNs and transformers are not apples to apples, we'll need independent testing to draw any conclusions on ms overhead.
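For reference, the usual cost split for a transformer layer (generic transformer math, nothing DLSS-specific), with n the amount of context attended over (tokens, or pixels/frames) and d the model width:

$$\text{FLOPs}_{\text{attention}} \sim O(n^2 \cdot d), \qquad \text{FLOPs}_{\text{MLP}} \sim O(n \cdot d^2)$$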
Fever308@reddit
Can you tell me where they say it's inference speed? Every article I've read doesn't mention it.
Fever308@reddit
I took this as it took 4x more compute to train the model, not that it took 4x more to run.
EmergencyCucumber905@reddit
Ignoring the marketing, more pipelines/dual issue is not a bad thing. You need both: the instruction level parallelism (more work per thread) and more cores (theoretically more active threads, provided they can be scheduled).
tyr8338@reddit
I'm hopeful the 5070 Ti will be a decent upgrade from my 3080 Ti, but the raw specs are a bit underwhelming: not that many more cores, only ROPs are increased compared to the 4070 Ti. I'm waiting for real-world benchmarks, especially with RT and DLSS, as I play in 4K.
MrMPFR@reddit (OP)
Do we have rops published anywhere, because I can't seem to find them on NVIDIA's page or anywhere else?
As explained in the post, the switch from FP32/INT32 + FP32 to 2x FP32 + 2x INT32 is huge. This is especially true for lower resolutions, as these involve more integer math. With Ampere, the more integer work you have, the less benefit there is to the additional FP32. Ada Lovelace's higher clocks somewhat addressed this, allowing better scaling at lower resolutions, but it's still not ideal.
A couple of weeks ago I took Hardware Unboxed's FPS numbers for the 2080S vs the 3070 Ti at different resolutions and calculated the gains per SM at iso-clocks, and here's what I got: +38.00% (4K), +28.96% (1440p) and +24.10% (1080p). As you can see, the gains are much larger at 4K because that resolution can better utilize the additional FP32. We can infer the workload distribution from the performance uplifts vs Turing: 1080p = 62% FP + 38% INT, 1440p = 64.5% FP + 35.5% INT, 4K = 69% FP + 31% INT.
Moving to a Turing-doubled design should theoretically allow for much higher FPS per SM across all games, with larger gains at lower resolutions: +45% at 4K, +55% at 1440p and +61% at 1080p. This isn't happening, since it would require a doubling of all supporting logic vs Turing, which is unfeasible without a node shrink. We're also unlikely to see performance gains per SM large enough for these differences across resolutions to matter.
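Here's the toy issue-rate model behind that arithmetic as a sketch (my own assumption: the SM is limited purely by FP32/INT32 lanes, which is exactly why the real-world gains will land lower):

```python
# Toy model: a workload is a fraction f of FP32 ops and (1 - f) of INT32 ops,
# and per-SM time is set purely by issue lanes. Illustrative only.
def time_turing(f):          # 64 FP32 + 64 INT32, concurrent
    return max(f / 64, (1 - f) / 64)

def time_ampere_ada(f):      # 64 FP32 + 64 shared FP32/INT32 lanes
    return max(1 / 128, (1 - f) / 64)

def time_turing_doubled(f):  # hypothetical Blackwell: 128 FP32 + 128 INT32, concurrent
    return max(f / 128, (1 - f) / 128)

for label, f in [("4K", 0.69), ("1440p", 0.645), ("1080p", 0.62)]:
    ampere_gain = time_turing(f) / time_ampere_ada(f) - 1
    doubled_gain = time_ampere_ada(f) / time_turing_doubled(f) - 1
    print(f"{label:6s} FP share {f:.3f}: Ampere vs Turing +{ampere_gain:.0%}, "
          f"Turing-doubled vs Ada +{doubled_gain:.0%}")
```

Plugging in the FP shares inferred above reproduces (to rounding) the +38/+29/+24% Ampere numbers and the +45/+55/+61% theoretical ceilings.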
It'll be interesting to see the reviews and if this massive boost to INT math has any benefits at lower resolutions (doubt it).
Fromarine@reddit
Should be able to tell pretty easily from the GPCs, like with Ada and Ampere. GB202 is 16 SMs per GPC with 12 GPCs. The 5090 has 170 SMs, which is less than 11 full GPCs but more than 10 like the 4090, so 11 GPCs with 16 ROPs per GPC is 176 ROPs again.
The 5080 has 7 GPCs and uses the full die: 7x16 is 112 ROPs. The 5070 Ti very likely cuts one GPC like the 4070 Ti Super, so you get 6x16 or 96 ROPs.
The 5070 (GB205) seems to use 10 SMs per GPC rather than 12 SMs across 4 GPCs, seeing as the laptop 5070 Ti has 50 SMs, which makes that impossible; so 5 GPCs and 5x16 = 80 ROPs.
In other words, ROP counts should be equal to Ada Lovelace with the Super series instead of their vanilla counterparts for the applicable cards (rough arithmetic sketched below).
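The same arithmetic as a quick sketch; every count here is the rumoured figure from the comment above, not a confirmed spec:

```python
# ROP estimate from rumoured GPC counts, assuming 16 ROPs per GPC as on Ada Lovelace.
ROPS_PER_GPC = 16

rumoured_gpcs = {
    "RTX 5090 (GB202, 11 of 12 GPCs enabled)": 11,
    "RTX 5080 (GB203, full 7 GPCs)":           7,
    "RTX 5070 Ti (GB203, 6 GPCs)":             6,
    "RTX 5070 (GB205, 5 GPCs)":                5,
}

for card, gpcs in rumoured_gpcs.items():
    print(f"{card:42s} -> {gpcs * ROPS_PER_GPC} ROPs")
```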
MrMPFR@reddit (OP)
Thanks for the explanation. Your math assumed that SMs per GPC is unchanged vs Ada Lovelace which is likely but unconfirmed as we haven’t got the Whitepaper or any GPU diagrams.
Fromarine@reddit
No it actually doesn't; it uses a combination of the long-rumored GPC counts, sanity checked by whether the SM count is possible with other multiples of SMs per GPC. E.g. GB202 uses 16 SMs per GPC, up from 12 on AD102. Now, it could be 16 GPCs at Ada's level of 12 SMs per GPC, but that would mean the 5090 with 170 SMs has 1 GPC disabled plus another GPC with 10 of its 12 SMs disabled, which makes no sense at all because Nvidia would've just disabled 2 GPCs at that point.
With the 5070/GB205, nothing but 5 GPCs is feasible to get to 50 SMs flat for the full die.
With the 5080/GB203: AD103 had a really weird config where one of the 7 GPCs had 8 SMs instead of 12, so seeing that GB203 has exactly 4 more SMs, it is almost guaranteed it is just 7 GPCs again, but this time every GPC gets 12 SMs like you'd expect.
Still ofc just guessing as you said but I think it's all but certainly correct.
What I'm curious about most is the cache and if the 5070, 5070ti and 5090 will have it cut down like their predecessors or not
MrMPFR@reddit (OP)
Sorry, didn't check before replying. Thanks for explaining the reasoning behind it, it sounds much more plausible now.
This is just speculation but think we'll get this: 128MB on GB202 (5090 rumoured at 112MB), 64MB on GB203, 40-48MB on GB205 and 32MB on GB206. The additional bandwidth and lower latency of GDDR7 + potentially some architectural changes to cache management could help boost performance.
Fromarine@reddit
Apparently with Ada they can use either 8 MB, or cut it down to 6 MB, of L2 per 32 bits of memory bus. Hence why the 4070 base had 36 MB of L2 while the 4070 Super had 48 MB (6x6 MB vs 6x8 MB), and the 4070 Ti Super had 48 MB instead of the 4080's 64 MB (8x6 MB vs 8x8 MB); the 4090 was 12x6 MB, etc. So theoretically it should be 36 MB or 48 MB for the 5070, 48 MB or 64 MB for the 5070 Ti, and 96 MB or 128 MB for the 5090 seeing as it's got a 512-bit bus now.
I'm all but certain the full GB203 will have 64 MB; it's just whether Nvidia will cut down the 5070 Ti to 48 MB or not. But yeah, the 6/8 MB rule works for the entire Ada stack, so I'd presume it's the same for Blackwell.
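The same rule as a sketch (speculative: it assumes the 6-or-8 MB per 32-bit memory controller pattern from Ada carries over, and uses the announced bus widths):

```python
# L2 size options from bus width, using the Ada-era 6 MB (cut) or 8 MB (full)
# per 32-bit memory controller pattern. Speculative until the whitepaper.
def l2_options(bus_width_bits):
    controllers = bus_width_bits // 32
    return controllers * 6, controllers * 8  # (cut-down, full) in MB

for card, bus in [("RTX 5070 (192-bit)", 192), ("RTX 5070 Ti (256-bit)", 256),
                  ("RTX 5080 (256-bit)", 256), ("RTX 5090 (512-bit)", 512)]:
    cut, full = l2_options(bus)
    print(f"{card:22s} -> {cut} MB or {full} MB of L2")
```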
MrMPFR@reddit (OP)
Didn't know that. Thanks for explaining.
We need that Whitepaper ASAP lol.
Fromarine@reddit
ong
tyr8338@reddit
5070 ti specs
https://www.techpowerup.com/gpu-specs/geforce-rtx-5070-ti.c4243
4070 ti specs
https://www.techpowerup.com/gpu-specs/geforce-rtx-4070-ti.c3950
I used this site for comparisons, no idea if 5070 ti specs are 100% accurate
MrMPFR@reddit (OP)
Wouldn't put too much faith in the TechPowerUp numbers. It's not official info but based on leaks. We need the whitepaper.
tyr8338@reddit
We will know in a few weeks.
atatassault47@reddit
I'm on a 3090 Ti (got it at end-of-generation sales for $1000), and the 50 series having 4x RT core power is tempting. But the 5080 is a RAM downgrade, and the 5090 is way too expensive. Hopefully there will be a 5080 Ti/Super with 24 GB of VRAM.
MrMPFR@reddit (OP)
Too early to say if it'll actually matter. We'll need independent testing + that 3090 Ti's 24GB of VRAM is needed for ultra 4K gaming going forward.
atatassault47@reddit
Yeah, I know lol. Exact reason I got it. I game at 5120x1440, which is 88.89% of 4k, and I already feel the VRAM usage.
Tasty_Toast_Son@reddit
Valid, I can feel the VRAM crunch with my 3080 at 1440p. I have to turn down settings that computationally the card can handle, but it can't fit in memory more often than I would like to admit.
kontis@reddit
Not just future games. The inability to ray trace Nanite is a problem in many UE5 games that use it. They have to use a separate lower-poly model (proxy) just to RT it.
I think EPIC was the one pushing for this feature. They talked about it since UE 5.0
MrMPFR@reddit (OP)
Future games could also mean upgrades for existing games, just being cautious with the wording.
Oh for sure, this was a feature Epic requested for UE5. Yes you can see the impact of RTX Mega Geometry on traced geometric detail here.
Fromarine@reddit
Idk man, don't the better RT cores + GDDR7 simply explain the gain per SM on Blackwell in Far Cry 6 RT? Both the die size and GDDR7 point to 128 FP32 + 128 INT32 being too good to be true, because you should be seeing even more improvement per SM and much bigger dies per SM.
MrMPFR@reddit (OP)
I’m just reporting what NVIDIA has officially said. Theoretical TFLOP doesn’t equate to real world perf.
Like I said it could explain some of the difference, but I suspect the majority of it is due to GDDR7 and possibly smarter caches.
FantomasARM@reddit
So something like a 3080 will be able to run the transformer model decently?
Nicholas-Steel@reddit
At significantly reduced benefit to FPS, yeah. Since it'll require significantly more processing power & VRAM due to needing to be optimized for FP8 instead of FP4.
hackenclaw@reddit
I am still wondering what use the dedicated FP16 in my Turing TU116 GTX 1660 Ti GPU is in common consumer software.
GTX Turing doesn't have tensor cores, but Nvidia went out of their way to add dedicated FP16 units (which the Pascal architecture doesn't have). Why?
MrMPFR@reddit (OP)
It accelerates gaming performance. AMD called it Rapid Packed Math and it was a key feature of the PS4 Pro.
gluon-free@reddit
What about FP64 cores? They could potentially be thrown away to save silicon space.
PIIFX@reddit
The FP64/FP32 rate seems to be 1/64.
ResponsibleJudge3172@reddit
FP64 and FP16 are, I believe, given to the tensor cores. Which is interesting, since that means they somehow do non-tensor math.
EETrainee@reddit
They're almost guaranteed to still have them for compatibility; the space savings per SM would be marginal at best. Non-datacenter SKUs already had only two FP64 pipelines compared to ~128 32-bit lanes.
Fromarine@reddit
Still, I wonder, with how light the ms overhead of DLSS has become, if the 4x higher compute requirement of DLSS 4 will be much of an issue on older cards, especially 30 series and up. Because if you look at this table, the 2080 Ti having slightly less overhead than the 3070 strongly suggests it is not currently using sparsity, correct? Here
MrMPFR@reddit (OP)
Doubt DLSS CNN is using sparsity feature.
Throwawaymotivation2@reddit
Quality post! How did you calculate the compute capability of each gen?
Please update the post when they’re released!
MrMPFR@reddit (OP)
Thank you. Oh, I didn't; I just took the values from here. Compute capability is just a number that signifies the way the underlying hardware handles scheduling and execution of math. Because we're getting a big number increase, it's likely that NVIDIA has been adding a lot of new functionality and changing how things are done.
Don't think we need to wait that long as the Whitepaper should arrive in about a week or two at most. But I'll make sure to add the additional info when it gets released.
XYHopGuy@reddit
Doubt the CNN model made much use of sparsity. Sparsity comes up most in embedding lookups and causal attention masks, mostly seen in gpt-like models, hence the importance for ML. If the new transformer model is causal, it will benefit.
MrMPFR@reddit (OP)
Thanks for the input. I’ve edited the post.
Nvidia said the new transformer implementation is much more temporal and can retain and compare info across many frames? Does this constitute causal behaviour?
XYHopGuy@reddit
Possibly but depends on implementation. I don't think I can explain it well here, but I'll try.
When you compute attention over a sentence, you create an NxN matrix of attention scores, where N is the length of the sentence. With causal attention, this is a lower triangular matrix, meaning words from the "future" have an attention score of zero and thus cannot be used to attend to any previous words.
With vision transformers this relationship isn't so clear to me. As you said though, if it's "much more temporal", they could be attending to a sequence of frames. Also, these code paths tend to be more optimized, so sometimes people use causal attention even when traditional attention is more intuitive.
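A tiny numpy illustration of the causal mask being described (generic attention math, nothing DLSS-specific):

```python
import numpy as np

n = 5                                       # sequence length (tokens, or frames in a temporal model)
scores = np.random.randn(n, n)              # raw attention scores, one row per query position
causal_mask = np.tril(np.ones((n, n)))      # lower triangular: position i may only attend to positions <= i
masked = np.where(causal_mask == 1, scores, -np.inf)                  # "future" positions are masked out
weights = np.exp(masked) / np.exp(masked).sum(axis=1, keepdims=True)  # row-wise softmax
print(np.round(weights, 2))                 # upper triangle is all zeros
```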
MrMPFR@reddit (OP)
Thanks for the explanation but I'm afraid it's above my level of understanding.
Kiwi_CunderThunt@reddit
Holy hell you did your homework! Good effort ! Generally speaking though, guys go have a nap and wait. This market is crap so wait for prices to drop, even a little. Budget your card against what games you play. Do I want a new card? Yes...am I going to pay extortion $4000? No. I'll run my card into the ground, then there's frame gen TAA discussions etc etc. just game on and be happy imo
Emperor_Idreaus@reddit
This does not necessarily entail a direct reduction in frames per second (FPS), but it does indicate that older GPUs will have to work significantly harder, potentially resulting in increased power consumption and heat.
The performance impact will vary depending on the game or application and the capacity of the GPU to manage the additional computational load, so it's not of direct importance for currently released titles, but it likely will be with driver updates and future game engines implementing the new enhanced functionality.
MrMPFR@reddit (OP)
Which part of the post are you replying to?
Or is this general thoughts on the Blackwell architecture?
Emperor_Idreaus@reddit
My apologies, im referring to the following:
[...] Here’s the FP16 tensor math throughput per SM for each generation at iso-clocks [...] And as you can see, the deltas in theoretical FP16 throughput, the lack of support for FP4–FP8 tensor math (Transformer Engine), and sparsity will worsen the model’s ms overhead and VRAM storage cost with every previous generation. Note this is relative, as we still don’t know the exact overhead and storage cost for the new transformer models [...]
MrMPFR@reddit (OP)
As I thought.
Yes, it’s not like performance will tank completely on older cards; the overhead will just be much bigger than on newer cards due to the larger model. Older cards should still be able to get higher FPS in most cases even with the new DLSS models.
And yes, tensor performance is very workload dependent. I'm just assuming the DLSS transformer models will run better on the 40 and 50 series because they literally have an engine for that (more than just the reduced FP math) + stronger tensor cores in general.
9897969594938281@reddit
Great post! Insightful read.
octatone@reddit
Ignore rumors, wait for the embargo to lift.
MrMPFR@reddit (OP)
This is not a rumour, this is officially disclosed info.
Only added rumour tag due to cautious commentary.