Every Architectural Change For RTX 50 Series Disclosed So Far
Posted by MrMPFR@reddit | hardware | View on Reddit | 162 comments
Disclaimer: Flagged as a rumor due to cautious commentary on publicly available information.
There are some key changes in the Blackwell 2.0 design, aka the RTX 50 series, that seem to have flown under the radar on Reddit and in the general media coverage. Here I’ll be covering those in addition to the more widely reported changes. With that said, we still need the whitepaper for the full picture.
The info is derived from the official keynote and the NVIDIA website post on the 50 series laptops.
If you want to know what the implications are, this igor’sLAB article is good, and this Tom’s Hardware article is worth a read for additional details and analysis.
Neural Shaders
Hardware support for neural shaders is the result of integrating neural networks inside the programmable shader pipeline. This is possible because Blackwell has tighter co-integration of Tensor and CUDA cores, which optimizes performance. In addition, Shader Execution Reordering (SER) has been enhanced with software- and hardware-level improvements; for example, the new reorder logic is twice as efficient as Ada Lovelace’s. This increases the speed of neural shaders.
Improved Tensor Cores
New support for FP6 and FP4 is functionality ported over from datacenter Blackwell, as part of the Second-Generation Transformer Engine. To drive Multiple Frame Generation, Blackwell’s tensor cores have doubled throughput (INT8 and other formats) vs Ada Lovelace, and 4x with FP4.
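As a rule of thumb (generic hardware math, not an NVIDIA-disclosed spec): at a fixed tensor datapath width, each halving of operand precision roughly doubles the ops per clock and halves the per-weight storage, which is where the appeal of the narrow formats comes from:

$$\mathrm{TOPS}_{\mathrm{FP4}} \approx 2 \times \mathrm{TOPS}_{\mathrm{FP8}} \approx 4 \times \mathrm{TOPS}_{\mathrm{FP16}}, \qquad \text{bytes per weight: } \mathrm{FP16}=2,\ \mathrm{FP8}=1,\ \mathrm{FP4}=0.5$$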
Flip metering
The display engine has been updated with flip metering logic that allows for much more consistent frame pacing for Multiple Frame Generation and Frame Generation on 50 series.
Redesigned RT cores
The ray-triangle intersection rate per RT core is doubled yet again, to 8x, as it has been with every generation since Turing. Here’s the ray-triangle intersection rate for each generation:
- Turing = 1x
- Ampere = 2x
- Ada Lovelace = 4x
- Blackwell = 8x
Like previous generations no changes for BVH traversal and ray box intersections have been disclosed.
The new SER implementation also seems to benefit ray tracing, as per the RTX Kit site:
”SER allows applications to easily reorder threads on the GPU, reducing the divergence effects that occur in particularly challenging ray tracing workloads like path tracing. New SER innovations in GeForce RTX 50 Series GPUs further improve efficiency and precision of shader reordering operations compared to GeForce RTX 40 Series GPUs.”
Like Ada Lovelace’s SER it’s likely that the additional functionality requires integration in games, but it’s possible these advances are simply low level hardware optimizations.
RT cores are getting enhanced compression designed to reduce memory footprint. Whether this also boosts performance and bandwidth, or simply implies a smaller BVH storage cost in VRAM, remains to be seen. If it’s SRAM compression then this could be “sparsity for RT” (the analogy is high level, don’t take it too seriously), but the technology behind it remains undisclosed.
All these changes to the RT core compound, which is why NVIDIA made this statement:
”This allows Blackwell GPUs to ray trace levels of geometry that were never before possible.”
This also aligns with NVIDIA’s statements about the new RT cores being made for RTX Mega Geometry (see the RTX 5090 product page), but what this actually means remains to be seen. We can, however, infer reasonable conclusions based on the Ada Lovelace whitepaper:
”When we ray trace complex environments, tracing costs increase slowly, a one-hundred-fold increase in geometry might only double tracing time. However, creating the data structure (BVH) that makes that small increase in time possible requires roughly linear time and memory; 100x more geometry could mean 100x more BVH build time, and 100x more memory.”
The RTX Mega Geometry SDK takes care of reducing the BVH build time and memory costs, which allows for up to 100x more geometric detail and support for infinitely complex animated characters. But we still need much higher ray intersection rates and effective throughput (coherency management), and all the aforementioned advances in the RT core logic should accomplish that. With additional geometric complexity in future games, the performance gap between generations should widen further.
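Reading that quote as rough asymptotics (my paraphrase, not wording from the whitepaper):

$$T_{\text{trace}}(N) \sim O(\log N), \qquad T_{\text{BVH build}}(N) \sim O(N), \qquad M_{\text{BVH}}(N) \sim O(N)$$

So 100x the triangles costs roughly 2x the trace time but roughly 100x the BVH build time and memory, which is exactly the cost RTX Mega Geometry and the new RT core compression are aimed at.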
Hardware Advances Powering MFG and Enhanced DLSS Transformer Model
With Ampere, NVIDIA introduced sparsity, a feature that allows pruning of trained weights in the neural network. This compression enables up to a 2x increase in effective memory bandwidth and storage, and up to 2x more math throughput. Ada Lovelace doubles these theoretical benefits with structural sparsity support.
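To make the sparsity feature concrete, here’s a minimal numpy sketch of the 2:4 fine-grained structured pruning pattern NVIDIA describes for Ampere’s sparse tensor cores (illustrative only; the real flow prunes during training and stores metadata so the hardware can skip the zeros):

```python
import numpy as np

def prune_2_4(weights):
    """Zero the 2 smallest-magnitude values in every group of 4 weights (2:4 structured sparsity)."""
    w = weights.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(w), axis=1)[:, :2]  # the two smallest-magnitude entries per group of four
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

dense = np.random.randn(8, 8).astype(np.float32)
sparse = prune_2_4(dense)
print((sparse == 0).mean())  # -> 0.5: half the weights are zero, which is where the "up to 2x" math claim comes from
```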
For the new MFG, FG, and the transformer-enhanced Ray Reconstruction, Upscaling and DLAA models, it’s likely they’re built from the ground up to utilize all the architectural benefits of Blackwell: structural sparsity for dense math, plus FP4, FP6 and FP8 support (Second-Generation Transformer Engine).
Whether DLSS CNN models use the sparsity feature is undisclosed.
NVIDIA said the new DLSS 4 transformer models for ray reconstruction and upscaling have 2x more parameters and require 4x higher compute. How this translates to ms overhead vs the CNN model is unknown, but don’t expect a miracle; the ms overhead will be significantly higher than the CNN version’s. This is a performance vs visuals trade-off.
Here’s the FP16 tensor math throughput per SM for each generation at iso-clocks:
- Turing: 1x
- Ampere: 1x (2x with sparsity)
- Ada Lovelace: 2x (8x with sparsity + structural sparsity), 4x FP8 (not supported previously)
- Blackwell: 4x (16x with sparsity + structural sparsity), 16x FP4 (not supported previously)
And as you can see, the deltas in theoretical FP16 throughput, the lack of support for FP4–FP8 tensor math (Transformer Engine), and sparsity will worsen the model’s ms overhead and VRAM storage cost with every previous generation. Note this is relative, as we still don’t know the exact overhead and storage cost for the new transformer models.
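As a back-of-the-envelope sketch using the multipliers from the list above (theoretical per-SM ceilings at iso-clocks, not measured DLSS frame times; the 4x figure is NVIDIA’s stated compute increase for the transformer model):

```python
# Relative tensor throughput per SM at iso-clocks, copied from the list above
# (the "with sparsity" figures). Theoretical ceilings only, not measured numbers.
rel_throughput = {"Turing": 1, "Ampere": 2, "Ada Lovelace": 8, "Blackwell": 16}
transformer_cost = 4  # NVIDIA: the transformer model needs ~4x the compute of the CNN

for arch, rate in rel_throughput.items():
    rel_time = transformer_cost / rate  # tensor time per frame relative to the CNN model on Turing
    print(f"{arch:12s} ~{rel_time:.2f}x the Turing-CNN tensor time per SM")
```

Real frame-time overhead won’t track these ceilings one-to-one, but it shows why the gap should widen on older cards.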
Blackwell CUDA Cores
During the keynote I heard that an Ada Lovelace SM and a Blackwell SM are not apples to apples at all. Based on the limited information given during the keynote by Jensen:
"...there is actually a concurent shader teraflops as well as an integer unit of equal performance so two dual shaders one is for floating point and the other is for integer."
NVIDIA's website also mentions this:
"The Blackwell streaming multiprocessor (SM) has been updated with more processing throughput"
How this implementation differs from Ampere and Turing remains to be seen. We don’t know if it is a beefed-up version of the dual-issue pipeline from RDNA 3, or if the datapaths and logic for each FP and INT unit are “Turing doubled”. Turing doubled is most likely, as RDNA 3 doesn’t advertise dual issue as doubled cores per CU. If it’s an RDNA 3-like implementation and NVIDIA still advertises the cores, then it is as bad as the Bulldozer marketing blunder, which advertised 4 true cores as 8.
Here are the two options for Blackwell compared at the SM level against Ada Lovelace, Ampere, Turing and Pascal (a quick arithmetic sketch of what each layout implies follows the list):
- Blackwell dual issue cores: 64 FP32x2 + 64 INT32x2
- Blackwell true cores: 128 FP32 + 128 INT32
- Ada Lovelace/Ampere: 64 FP32/INT32 + 64 FP32
- Turing: 64 FP32 + 64 INT32
- Pascal: 128 FP32/INT32
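A quick sketch of what those layouts imply per SM per clock, counting issue lanes only (my own arithmetic, ignoring schedulers, register files and caches; the dual-issue option would count to the same peaks in this crude model):

```python
# Pure lane counting for the layouts above. "shared" lanes can run FP32 or
# INT32 on a given clock, but not both at once.
layouts = {
    "Blackwell ('Turing doubled' reading)": {"fp32": 128, "int32": 128, "shared": 0},
    "Ada Lovelace / Ampere":                {"fp32": 64,  "int32": 0,   "shared": 64},
    "Turing":                               {"fp32": 64,  "int32": 64,  "shared": 0},
    "Pascal":                               {"fp32": 0,   "int32": 0,   "shared": 128},
}

for name, l in layouts.items():
    peak_fp32 = l["fp32"] + l["shared"]   # every shared lane running FP32
    mixed_fp  = l["fp32"]                 # FP32 still issued when the shared lanes switch to INT32
    mixed_int = l["int32"] + l["shared"]  # INT32 issued on that same clock
    print(f"{name:38s} peak {peak_fp32:3d} FP32/clk | mixed {mixed_fp:3d} FP32 + {mixed_int:3d} INT32")
```

Note that peak FP32 per clock comes out the same (128) for the Turing-doubled reading of Blackwell and for Ada/Ampere; the difference is only how much INT32 can run concurrently, which is why advertised TFLOPS alone won’t settle the question.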
Many people seem baffled by how NVIDIA managed more performance (Far Cry 6) per SM with the 50 series despite the sometimes lower clocks compared to the 40 series. This could explain some of the increase.
Media and Display Engine Changes
Display:
”Blackwell has also been enhanced with PCIe Gen5 and DisplayPort 2.1b UHBR20, driving displays up to 8K 165Hz.”
The media engine’s encoder and decoder have been upgraded:
”The RTX 50 chips support the 4:2:2 color format often used by professional videographers and include new support for multiview-HEVC for 3D and virtual reality (VR) video and a new AV1 Ultra High-Quality Mode.”
Hardware support for 4:2:2 is new and the 5090 can decode up to 8x 4K 60 FPS streams per decoder.
5% better quality with HEVC and AV1 encoding + 2x speed for H.264 video decoding.
Improved Power Management:
”For GeForce RTX 50 Series laptops, new Max-Q technologies such as Advanced Power Gating, Low Latency Sleep, and Accelerated Frequency Switching increases battery life by up to 40%, compared to the previous generation.”
”Advanced Power Gating technologies greatly reduce power by rapidly toggling unused parts of the GPU.
Blackwell has significantly faster low power states. Low Latency Sleep allows the GPU to go to sleep more often, saving power even when the GPU is being used. This reduces power for gaming, Small Language Models (SLMs), and other creator and AI workloads on battery.
Accelerated Frequency Switching boosts performance by adaptively optimizing clocks to each unique workload at microsecond level speeds.
Voltage Optimized GDDR7 tunes graphics memory for optimal power efficiency with ultra low voltage states, delivering a massive jump in performance compared to last-generation’s GDDR6 VRAM.”
Laptops will benefit more from these changes, but desktops should still see some benefits. These will probably come mostly from Advanced Power Gating and Low Latency Sleep, but it’s possible they could also benefit from Accelerated Frequency Switching.
GDDR7
Blackwell uses GDDR7 which lowers power draw and memory latencies.
Blackwell’s Very High Compute Capability
The ballooned compute capability of Blackwell 2.0, or the 50 series, at launch remains an enigma. Normally the compute capability of a card at launch trails the version of the official CUDA toolkit by years, but this time it’s the opposite: the CUDA toolkit trails the compute capability of Blackwell 2.0 by 0.2 (12.8 compute capability vs 12.6 CUDA toolkit). Whether this supports Jensen’s assertion of consumer Blackwell being the biggest architectural redesign since programmable shading was introduced (NVIDIA marketed the 1999 GeForce 256 as the world’s first GPU, though programmable shaders only arrived with the GeForce 3 in 2001) remains to be seen. The increased compute capability number could have something to do with neural shaders and tighter Tensor and CUDA core co-integration, plus other undisclosed changes. But it’s too early to say what’s behind it. A quick way to check what a given card reports is shown after the list below.
For reference here’s the official compute capabilities of the different architectures going all the way back to CUDA’s inception with Tesla in 2006:
- Note: As you can see, in one generation from Ada Lovelace to Blackwell, compute capability takes a larger numerical jump than across the three generations from Pascal to Ada Lovelace.
- Blackwell: 12.8
- Enterprise – Blackwell: 10.0
- Enterprise – Hopper: 9.0
- Ada Lovelace: 8.9
- Ampere: 8.6
- Enterprise – Ampere: 8.0
- Turing: 7.5
- Enterprise – Volta: 7.0
- Pascal: 6.1
- Enterprise – Pascal: 6.0
- Maxwell 2.0: 5.2
- Maxwell: 5.0
- Big Kepler: 3.5
- Kepler: 3.0
- Small Fermi: 2.1
- Fermi: 2.0
- Tesla: 1.0 + 1.3
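If you want to check what your own card reports (a minimal sketch assuming a PyTorch install with CUDA; other toolchains expose the same major.minor pair):

```python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
    # e.g. an RTX 4090 reports 8.9; the post above lists consumer Blackwell at 12.8
```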
EmergencyCucumber905@reddit
CUDA toolkit version and compute capability are two different things.
MrMPFR@reddit (OP)
Yes you're right they have nothing to do with each other and I have removed the part suggesting that.
konawolv@reddit
You still have CUDA compute capability as v12. I don't think that is the case. The CUDA SDK is on version 12, but compute capability is on v10.
MrMPFR@reddit (OP)
I'm just repeating what NVIDIA reports for 50 series here. A week after the launch and still at compute capability 12.8. Very odd.
WHY_DO_I_SHOUT@reddit
Correction: shaders were introduced in GeForce3 in 2001.
Pinksters@reddit
Not to mention GeForce wasn't nearly the "Worlds first" GPU.
There were MANY "GPUs" before then but the term wasn't coined at the time.
MrMPFR@reddit (OP)
Yes indeed, but I believe they offloaded the remainder of the rendering pipeline to the GPU. Before the GeForce 256, a lot of the rendering was still done on the CPU.
ibeerianhamhock@reddit
Definitely wasn't the first graphics card with hardware transformation, clipping, and lighting.
PSX, Saturn, and N64 all had hardware TCL coprocessors that performed these functions.
But yeah I suppose it was the first home graphics card with hardware TCL capabilities and it was all on the same die.
MrMPFR@reddit (OP)
Thanks for the info. Probably shouldn't place too much faith in anything NVIDIA says.
nismotigerwvu@reddit
That's really only true in the context of gaming oriented cards for desktops. The professional market (think SGI and 3D Labs) used this kind of approach from the very start (late 80's if my memory serves correctly) since trying to run geometry calculations on a 386 (well 387 is more correct here I guess) is a baaaaaaad idea. The biggest reason for 3Dfx's early success was that they were able to correctly predict which stages made the most sense to kick back out to the (rapidly evolving) CPU and what stages to double down on in 1996. The REALLY interesting aspect here is that the T&L engine on the GeForce 256 was MASSIVELY underpowered and even at launch, typical CPUs could outpace it. It's easy to see there was such a swift pivot to putting some control logic in front of those ALUs. In all fairness to NV, CPUs were doubling in clock speed annually AND gaining new features/higher IPC so it would have been a monumental task to outcompete the Athlon or Pentium III during that time.
f3n2x@reddit
The idea of a GPU is to have the entire pipeline on-chip. This absolutely wasn't the case for SGI which had a myriad of chips doing different things, similar to the Voodoo cards but on a much bigger scale.
nismotigerwvu@reddit
Okay, if we're saying only single chip solutions, Permedia NT still predates the GeForce 256 by years. This isn't meant as a dig or to diminish the importance of the card, it's just that the marketing was a little bombastic. In the end, NV coined the term and can apply it as they wish, but it's simply just marketing fluff.
Adromedae@reddit
Neither Permedia nor Geforce 256 were proper 'GPUs' either, since they only implemented the back end of a traditional (GL) graphics pipeline.
Adromedae@reddit
The term GPU had been in use since the 70s. And just like CPU, it doesn't necessarily imply a single chip implementation.
FWIW according to that standard, NVIDIA didn't have a proper GPU until G80. ;-) Since prior to Tesla, most of the geometry transforms were still done on the CPU (MMX/SSE) rather than on the GeForce itself.
capybooya@reddit
I remember the release, I and several others thought the 'GPU' labeling was a bit cringe, and there was definitely arguing about it on the internet. Very good marketing move though.
MrMPFR@reddit (OP)
Thanks for the GPU history lesson. Exciting read.
halgari@reddit
Fun fact, when SGI went downhill a whole group of the engineers left, and joined this new startup called NVidia.
dankhorse25@reddit
My Riva 128 was certainly a GPU.
Plank_With_A_Nail_In@reddit
GPU term was also first used by Sony in relation to the playstation.
JakeTappersCat@reddit
Actually the ATI Radeon DDR had shaders before the Geforce 3 and was the first GPU with full DirectX 8 support
MrMPFR@reddit (OP)
Corrected the mistake.
FloundersEdition@reddit
Nvidia's spec sheet indicates 1.33x RT performance per core and clock, so not a doubling. IIRC 94 RT TFLOPS on the 5070 vs 67 on the 4070.
Nvidia wasn't too keen to talk about their definition of TOPS, but claims 2x per core and clock. Probably the same, but with data sizes cut in half. Otherwise they may have only increased INT matrix throughput due to the new SIMDs.
They probably went with a "Turing 2" layout. Ada seems to natively execute warp32 on 16-wide SIMDs in two cycles, but has another 16-wide FP32 pipe. AMD executes wave32 on a 32-wide SIMD in a single cycle.
MrMPFR@reddit (OP)
Rewatched the RTX 4090 reveal again and something didn't look right regarding 2080 Ti vs 3090 Ti vs 4090 AI TOPS. Apparently AI TOPS is a BS marketing term which means the highest possible AI FPx throughput, proof is here. So your suspicion was not without merit. NVIDIA are using FP16 for the RTX 20 series, sparse FP16 for the 30 series, sparse FP8 for the 40 series, and sparse FP4 for the 50 series. The underlying FP16 and INT8 throughput has remained unchanged on a per-SM basis since Turing; only additional FPx functionality + sparsity has been added.
Sorry for the confusion and I'll edit the post and probably do a post in r/hardware given how many people have watched the post. We also need to quell the AI TOPS panic for the older cards. People think DLSS Transformer won't be able to run on 20 and 30 series.
MrMPFR@reddit (OP)
RT tflops is an aggregate of various RT metrics like BVH traversal, ray box intersections and ray triangle intersections. I only listed ray intersection because it's the only one I can find mentioned in the White Papers with Ampere and later gens.
AI tops is based on INT4 math, and not FP8 and FP4. Has been that way since Turing (go through the whitepapers). I doubt that tensor FP and INT math has remained 1/1 since Turing, but it's impossible to know for sure.
Thanks for the info.
Cute-Pomegranate-966@reddit
Jensen quoted during announcement:
"And of course 125 Shader TFLOPs, there is actually a concurrent shader TFLOPs as well as an integer unit of equal performance, so 2 dual shaders, 1 is for FP one is for integer"
so that leans towards 128 FP32 and 128 INT32 per SM.
TheNiebuhr@reddit
That directly contradicts history. They had that with Turing and decided that 2:1 FP/INT ratio was going to be much more balanced for graphics and rendering, and indeed they stuck with it 4 years.
Anyways, LJM's explanation was terrible; nothing can truly be drawn from it. Just wait for the whitepaper.
ResponsibleJudge3172@reddit
They didn't; they had 64 FP32 + 64 INT32 on Turing, while Ampere and Ada had 64 FP32 + 64 FP32/INT32.
Hopper has 128 FP32 + 64 INT32 + 16 FP64.
Maybe Blackwell has 128 FP32 + 128 INT32.
TheNiebuhr@reddit
They had the same number of Int and Float, which is the entire point. It's empirically proved that having more FP than Int is better for rendering and GPGPU computing. 6 years of Geforce hardware shows it's the superior design.
Cute-Pomegranate-966@reddit
It's not really relevant if it contradicts history, this is simply what the man said. I am of course interested in the white paper.
ChrisFromIT@reddit
It seems odd to go back to Turing's dual shader setup, as Nvidia found that roughly 70% of instructions are FP32, with 30% being INT32. That was why Ampere had a very good performance jump compared to Turing: its 32 FP32 + 32 FP32/INT32 layout better reflected that ratio, allowing the best footprint per core.
MrMPFR@reddit (OP)
That's only at 4K; 1440p (35% INT) and 1080p are much more integer heavy (~40%). With the increased reliance on upscaling it actually makes sense to revert to a Turing-like SM design.
Yes at 4K, but the gains were smaller at 1440p and 1080p across the entire stack.
ChrisFromIT@reddit
Resolution doesn't change what operations are done in the shaders, so that information is wrong.
MrMPFR@reddit (OP)
Are you sure? Read somewhere (can’t remember where) that integer math rendering workloads scales less with increased pixels vs floating point rendering workloads.
This is also reflected in Ampere’s odd scaling across resolutions where gains vs Turing are much larger at 4K than 1440p and 1080p.
ChrisFromIT@reddit
Yes, I am sure.
The only thing that changes is the number of pixel shading operations. So if a game has compute shaders that are heavy on integer math, and those compute shaders don't scale with the number of pixels, then, yes, the integer-to-floating-point workload ratio will change based on resolution.
But that isn't exactly common.
MrMPFR@reddit (OP)
Understood. Misleading comments deleted.
MrMPFR@reddit (OP)
Indeed. The other outcome (dual issue) is very unlikely.
It'll be interesting how this doubled concurrent theoretical throughput translates to actual performance across different workloads on a per SM basis vs Ada Lovelace.
Elios000@reddit
Has nV given up on DirectStorage? Seems like there was a bit about it and nothing else. I feel like if these cards could make use of it, the lower VRAM wouldn't be as much of an issue.
ResponsibleJudge3172@reddit
Microsoft does not even support the vision of bypassing CPU entirely that Jensen talked about, and what they do support literally was years late. I too have given up
MrOmgWtfHaxor@reddit
The tech is there, but it's up to the dev to learn it and choose to fully utilize it. I'd assume right now it's not super utilized due to devs focusing more on compatibility with older cards and non-NVMe drives.
https://steamdb.info/tech/SDK/DirectStorage/
ProjectPhysX@reddit
Blackwell CUDA cores don't have FP32 dual-issuing, according to Nvidia's website. They are still (64 FP32/INT32 + 64 FP32), same as Ampere/Ada. Dual-issuing only is a (not particularly useful) thing on AMD's RDNA3.
MrMPFR@reddit (OP)
Can you link to the part where it says that? 99.9% sure the SM is changed. Jensen confirmed it during the keynote + the laptop 50 series post says the SM has been redesigned for more throughput.
We're prob getting Turing doubled SM: 128 INT32 + 128 FP32, I didn't think dual issue is likely, just added it to be more cautious and avoid takedown of post.
ProjectPhysX@reddit
https://www.nvidia.com/de-de/geforce/graphics-cards/compare/
Here it says:
It's the same CUDA cores as Ampere and Ada. Probably not even 2x FP16 throughput in tensor cores compared to Ada. Only real new thing they added was support for FP4 bit sludge to be able to claim higher perf in apples-to-oranges comparisons with FP8 on Ada.
MrMPFR@reddit (OP)
Doesn't prove anything. Ampere and Ada have a dedicated FP32 path + a shared FP32 + INT32 path (similar to the Pascal implementation). This is not reflected in the comparison because it only shows FP32 throughput and not the entire SM implementation.
Jensen said that the integer and floating point were concurrent and they were using dual shaders for both + read the post I quoted. This is not Ada Lovelace CUDA cores, it's Turing doubled.
That remains to be seen, but the AI tops have used INT4 throughput since Turing, the AI tops for ADA is INT4 not FP8. Compare the number in the Whitepaper with the number on their website for the 4090, it's INT4.
I know, which is why I keep referencing consumer as Blackwell 2.0, because that's the leaked name on TechPowerUp.
ProjectPhysX@reddit
The dedicated FP32 + FP32/INT32 path is exactly what Ampere/Ada have, and what Jensen referred to in the keynote. This is not new. Pascal can do FP32/INT32 on all CUDA cores. Nvidia stating "2x FP32" on Blackwell/Ada/Ampere refers to peak FP32 throughput, which is the same for those three architectures.
Your claim of doubled (Turing) throughput is just wrong; Turing was also a particularly bad design, as more than half of the dedicated INT32 cores were idle at any time. Massive silicon area for nothing.
ResponsibleJudge3172@reddit
Read what was said: Ampere and Ada DO NOT have equal peak integer and float performance. Float is 2x INT because the design is
64 FP32 + 64 FP32/INT32.
What we heard is 2 dual independent shaders. INT has not been independent for Ampere or RTX 40. The only scenario that makes sense, if he is not lying, is
(64 FP32 + 64 FP32) + (64 INT32 + 64 INT32), which is something not seen before. It doesn't increase TFLOPS over the previous gen, but it allows better compute scaling.
tioga064@reddit
Consumer Blackwell seems really interesting, lots of uarch changes on every aspect, not just the standard raster and RT improvements. Can't wait for the Blackwell whitepaper and some reviews of the encoding/decoding capabilities, plus a review of the newly implemented flip metering. Frame reprojection also seems nice. It's incredible that Nvidia is basically adding every feature we always asked for, and they for sure are charging for it lol
MrMPFR@reddit (OP)
Think we're in for a real treat. The whitepaper is prob going to be as long as or longer than the Turing one. There's so much new functionality to explain, e.g. what the neural shader integration is about, etc.
Jensen would not make the architecture claim (biggest since 1999) if it wasn't a big deal. I've not seen him this bold since Turing which was arguably the biggest architectural change since Tesla.
Indeed tons of advances as requested both in software and hardware.
From-UoM@reddit
Turing will easily be one of the most influential generations of all time.
It introduced hardware accelerated RT and Performance boosts using AI.
Set the standard for GPUs.
It also allowed Tensor Cores to reach the masses.
Vb_33@reddit
And everyone hated it for like 3 years :(
Christian_R_Lech@reddit
The initial RTX 2000 lineup (2080 Ti, 2080, and 2070), when it launched, provided essentially zero performance per dollar improvement over the preceding GTX 1000 series with only the 2080 Ti providing any performance improvement over the top end Pascal Titan.
Personally, I do remember being a defender of Turing, even if I never got one, because I knew back then that ray tracing was more than a gimmick and would eventually become the future of real-time computer graphics. There was going to be a cost initially to incorporating RT hardware into GPUs.
shadAC_II@reddit
So Turing was in this case just like Ada (no/minimal perf/$ increase), but had more architectural improvements than Ada.
Gaming Blackwell seems to be extremely interesting and I see some similarities to Turing. So far there's a large backlash due to not enough of a performance increase, but this time you get all the architecture changes without a large cost increase. Very interesting indeed.
Nicholas-Steel@reddit
because although it could achieve real time Ray Tracing, you had to have a 2070Ti or higher to maybe average 30FPS. It was useless on weaker cards outside of non-real-time gameplay.
Ryankujoestar@reddit
There wasn't a 2070Ti, I assume you meant Super
Nicholas-Steel@reddit
Thanks
Beylerbey@reddit
"Maybe" averaging 30fps is a stretch, unless you're saying without DLSS at 4k, but DLSS was born to make RT possible.
Control (the "richest" RT implementation when it came out) on a 2080:
- High RT: 1080p 56 fps, 1440p 35 fps, 4K 20 fps
- High RT + DLSS: 85, 61 and 36
- Medium RT: 74, 51, 29
- Medium RT + DLSS: 103, 74, 47
Quake II RTX with path tracing got to around 60fps 1080p at max settings after a couple patches (4-6x 1080 Ti).
In terms of RT Turing was surprisingly capable for a first gen product, but back then many people still believed Nvidia was scamming everyone with the RT cores, waiting for AMD to demonstrate how there was no need for dedicated specialized hardware, and mostly upset because of the low improvement compared to Pascal in terms of raster performance.
Nicholas-Steel@reddit
IIRC it took a year or two before DLSS 1.9 released, which used AI training and improved significantly.
Beylerbey@reddit
DLSS always had AI training since it uses a deep learning model to base its reconstruction on, I think you're remembering that the earliest version of Control's implementation didn't rely on tensor cores for upscaling but would run on shaders.
What made it much more viable since 2.0 was that they were able to use a generalized model for all games. Before 2.0, models had to be trained by Nvidia for each game, and there was no "switch on" method to add it; it had to be implemented from scratch, which made adoption quite slow at first. I still remember when they presented the RTX branch of UE4 with plug-and-play DLSS; all it needed was ticking a box and that was it (optimizing it was another matter of course).
Nicholas-Steel@reddit
Right, I must have gotten confused. You're right that it was always AI Trained, it's literally what the name stands for lol. I must have been vaguely thinking of the change from game specific to generalized training.
I am also sure I've been told by some people that the change to generalized training happened before v2.0 but most of the marketing around the change was when 2.0 released and that's what everyone remembers as being the start of this change. I could be wrong, maybe the people telling me this were wrong.
ShadowRomeo@reddit
Yet your average r/pcmasterrace folks will only care about this one nothing else matters lol
Renanina@reddit
They only care about games when the subreddit is more than just that. Had a guy trying to tell me not to buy a 5090 because of "buyer's remorse". What kind of bullshit is that? Especially when the focus is just the occasional LLM work and not something I'd rather buy a 5070 Ti for. Before the 2070, I had the 970. I'm not planning to wait another generation, and since it's a generation of AI implementation, people hate it, but I always liked the things that ppl hate that eventually become bigger in the long run due to the efficiency of utilizing that tool... in this case... it's what everyone calls AI lol
Traditional-Lab5331@reddit
They care about wrong reason gaming. If Blackwell can deliver x4 FG at lower latency than current FG, then it's a win and completely usable for games. Lossless Scaling now provides a usable x3 mode that has no latency increase feeling to it, Blackwell is going to do it so much better. Too many people believe they need 400hz and 2ms latency to play Roblox and they have a reaction time of 300ms.
Gaming isn't competitive online bullshit battle Royale. It's a rich story and crafted worlds. We need to get back to games that are fun and not this quick turn out multiplayer junk.
lubits@reddit
I wonder if the compute capability of 12.8 is a typo, either that or I think Nvidia is going to do something fucky like reserving certain features for data center GPUs from cc 9-11.
MrMPFR@reddit (OP)
They’ve been reserving functionality for Data center for a long time. But yeah this big jump in compute capability is very odd
lubits@reddit
Ah true. In this case, I meant the TMA and warpgroup mma instructions.
MrMPFR@reddit (OP)
Interesting, we'll see. That Whitepaper can't come soon enough.
bAaDwRiTiNg@reddit
So DLSS4 upscaling improvements will come at the price of a higher performance cost then. Could this cost be significantly higher for let's say RTX2000 cards than RTX5000 cards?
MrMPFR@reddit (OP)
Yes I explained that in the post. Fear it is going to run like shit on 20 and 30 series.
Acrobatic-Paint7185@reddit
Here's a table with the cost of DLSS3 upscaling, from Nvidia's documentation: https://imgur.com/qXflrYd
Multiply by 4x to get the cost of the new Transformer model. It won't be cheap, especially at higher output resolutions.
Fever308@reddit
I took the 4x compute statement as meaning it used 4x more compute to train the model, not that it takes 4x more to run.
Acrobatic-Paint7185@reddit
"4x more compute during inference"
Inference means literally running the model.
Fever308@reddit
Please show me anywhere it mentions "inference" cause every official Nvidia article I've read hasn't mentioned that at all.
Acrobatic-Paint7185@reddit
https://youtu.be/qQn3bsPNTyI?t=4m22s
MrMPFR@reddit (OP)
Thanks for the table. Transformer and CNN models can't be compared apples to apples due to underlying differences in the architecture. With that said we should expect a much larger overhead with DLSS transformer.
RedIndianRobin@reddit
It has a performance overhead of 5% as per an Nvidia spokesperson, so it's not a whole lot.
midnightmiragemusic@reddit
Nobody said that. Stop making stuff up.
RedIndianRobin@reddit
NVIDIA's tech marketing manager:
Alexandre Ziebert
Link to his profile:
Alexandre Ziebert (@aziebert) / X
MrMPFR@reddit (OP)
Doesn't tell me a lot, what's the resolution, FPS and quality of upscaling used. Fingers crossed it'll run as well as he says.
RedIndianRobin@reddit
Yeah 5% is alright. I am guessing the overhead will be less on 40 and 50 series but more so on the 20 and 30 series.
MrMPFR@reddit (OP)
Yep sounds fine. For sure, the newer cards have tensor cores built for transformer processing.
midnightmiragemusic@reddit
Even the 40 series?
MrMPFR@reddit (OP)
Yes they ported the transformer engine functionality from Hopper to Ada Lovelace in 2022.
midnightmiragemusic@reddit
Wow, I stand corrected. Thank you for sharing! I'm even more excited for this tech now!
MrMPFR@reddit (OP)
Here's the full Twitter thread. Does sound like it's less heavy than we feared.
MrMPFR@reddit (OP)
Can you post the link to the statement?
RedIndianRobin@reddit
I can't. It was said in twitter, I tried to find it but my posts got drowned. I will see if I can dig it up when I am on my PC.
Apprehensive-Buy3340@reddit
Better hope there's gonna be a software switch between the two models, otherwise we're gonna have to downgrade the DLL manually
yaosio@reddit
They said you'll be able to switch between the CNN and Transformer versions in the app. They are also adding native support for backporting the newest version of DLSS to older games. All of this is transparent to the game.
Apprehensive-Buy3340@reddit
That's great then
MrMPFR@reddit (OP)
Note when I said run like shit it doesn’t mean it won’t work but a card like the 2060 or 2070 could have a low FPS cap. For cinematic experiences (excluding very high FPS) I still think it’ll be good on 20 series and 30 series.
The Cyberpunk 2077 5090 early footage with Linus from LTT shows the game UI where you can toggle between DLSS Convolutional Neural Network and transformer. Seems pretty likely they’ll continue to support both versions.
We need to think of the transformer mode as DLSS overdrive to better understand the difference I think.
GARGEAN@reddit
I would expect still having a noticeable performance uplift with DLSS 4 SR compared to native, even if the uplift will be noticeably lower than DLSS 2 SR on the 30 and especially 20 series. It might even be negated by the better image quality allowing easier use of lower internal res modes (DLSS Balanced replacing standard DLSS Quality for 1440p, etc.).
MrMPFR@reddit (OP)
Interesting thoughts, and yes that's certainly a possibility if it's as good as NVIDIA claims. Can't wait for HUB, DF and GN to do some testing on how it'll run on 30 and 20 series.
ResponsibleJudge3172@reddit
The actual effect scales with native FPS. The worse the average FPS, the less DLSS speed will matter, to a certain level.
At 120 FPS, the frame-time budget DLSS needs to fit into is much smaller than if native gives you 40 FPS.
That could be why the RTX 2080 can do it.
midnightmiragemusic@reddit
Hi, thank you for your post!
Based on everything we know so far, do you think the 40 series will run the DLSS transformer model well? I have a 4070 Ti Super so I'm curious. How do you think it will compare to the 50 series?
MrMPFR@reddit (OP)
Will prob run slower than on the 50 series, but the difference shouldn't matter. The 40 series already has very strong tensor cores.
nukleabomb@reddit
I don't remember where I read/heard it, but apparently running DLSS (CNN) only resulted in sub-20% usage of the tensor cores. This was a while back.
f3n2x@reddit
The current spatial DLSS model is extremely light, basically just a blip on nsight, so even 4x should barely move the needle.
ResponsibleJudge3172@reddit
Maybe those will be stuck with say performance mode
bubblesort33@reddit
I'm still skeptical about its raster performance. Not that it matters that much when you hit RTX 5070 levels. But the fact they haven't shown a single title without ray tracing, is a bit odd.
Vb_33@reddit
Expect 20-35% faster than their precursor.
bubblesort33@reddit
I don't know if I should. If I look at RT results, sure. Given Nvidia's claims that rasterization gains are too hard to get now, and that they are almost giving up on them, I'm skeptical there is much here at all. Personally I'm expecting under 20%. I know they have slides with RT performance, but there should be reasons why those are the way they are that don't apply to other games in raster. I think there is a good reason why they dropped prices, and people might feel misled at release time after reviews.
Vb_33@reddit
20% is worse than 4070 to 4070 Super; it can't be that bad with an actual architectural change.
bubblesort33@reddit
The architecture changed mainly for machine learning reasons, and maybe ray tracing as well. AMD doubled some compute going from RDNA2 to RDNA3 by having dual-issue SIMD32, but their gains in gaming per core were like 5% per clock, if not less. I think AMD responded to someone somewhere and even said it was a 4% IPC increase on average. I have no idea how they saw that as worth doing. But RDNA3 is twice as fast in Stable Diffusion as RDNA2, so maybe that was the idea. But they've done nothing with that.
redsunstar@reddit
I'm not, it's pretty obvious it's going to be lackluster. Taking a few steps back, raster performance is determined by raw compute power, which is in turn dependent on die size and transistor performance and size. We're on the same lithography process as Ada and transistor counts haven't ballooned, so expecting large improvements on the raster side is a pipe dream.
This is another way of saying that raster performance improves when transistor manufacturing improves. There are some exceptions when a company figures out some computing inefficiency in their architecture, the Pascal generation comes to mind, but as I said, this is uncommon and there comes a point when things are already as efficient as they can be.
To be frank, I'm eagerly awaiting the 4080S vs 5080 benchmarks; we're looking at similarly sized chips and similar frequencies, though with much faster memory on the 5080. If Nvidia manages to get more than 20% raster performance out of the 5080, that's a good feat of engineering.
bubblesort33@reddit
Look what op wrote regarding compute, and the large compute jump Blackwell made. That is as pretty big jump if true. Question is if that's the kind of compute needed for games, or needed for AI and other things.
redsunstar@reddit
OP posits a jump in compute due to a doubling up of FP32 units, in some way, in every SM.
We don't know that. Nvidia says more throughput per SM but they didn't exactly say how they achieved that. Given that GB203 and AD103 are essentially the same size despite GB203 having more space dedicated to AI and a handful more SMs, I think it is unlikely that a sweeping change such as doubling up the number of FP32 units per SM has been enacted. It is likely that minor tweaks were made to improve efficiency, possibly through better utilisation, but those are always limited.
MrMPFR@reddit (OP)
"...there is actually a concurent shader teraflops as well as an integer unit of equal performance so two dual shaders one is for floating point and the other is for integer."
And
"The Blackwell streaming multiprocessor (SM) has been updated with more processing throughput"
This sounds a lot like Turing doubled, but it doesn't align with what we know about the die sizes. Hope Blackwell 2.0 Whitepaper explains this properly.
redsunstar@reddit
I agree that "it sounds a lot like", but I'll wait for more information, I'm not that convinced by what has been shown yet.
MrMPFR@reddit (OP)
The gains are theoretical. Just because you double math units doesn't mean double performance. All the underlying support logic and data stores (VRF and cache) also needs to be doubled to see good scaling.
They're reverting to 1/1 instead of 2/1 FP+INT because 1080p is much more integer heavy and NVIDIA is relying on upscaling more than ever.
fogoticus@reddit
Not really. Raster performance is limited by CPU power as well. And GPUs have continued to improve even significantly compared to CPUs. And Hardware Unboxed showed that even with games maxed out, a better CPU will offer better fps even in scenarios where you think you hit the ceiling of the GPU.
That's why someone with a 4090 for example could buy the next X3D CPU from AMD and they will see better fps in a lot of titles.
I don't think Nvidia is hiding anything but just that they wanted to put all their focus on DLSS.
cyperalien@reddit
i don't understand how they managed to add a separate 128 int32 pipe without any increase in transistors per SM especially when they need to beef up other parts of the SM like the warp scheduler and the register files to be able to utilize it.
jasmansky@reddit
I'm no expert on the subject matter but isn't the 2x more parameters and 4x more compute claim for the upcoming DLSS4 transformer model referring to the model training that's done on Nvidia's supercomputers rather than the inferencing done on the GPU?
MrMPFR@reddit (OP)
Inference speed. This is the biggest downside of transformers: compute for the attention step scales roughly quadratically (n²) with the amount of context being attended over, rather than linearly. But since CNNs and transformers are not apples to apples, we'll need independent testing to draw any conclusions on ms overhead.
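For reference, the usual cost split for a transformer layer (generic transformer math, nothing DLSS-specific), with n the amount of context attended over (tokens, or pixels/frames) and d the model width:

$$\text{FLOPs}_{\text{attention}} \sim O(n^2 \cdot d), \qquad \text{FLOPs}_{\text{MLP}} \sim O(n \cdot d^2)$$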
Fever308@reddit
Can you tell me where they say it's inference speed? Every article I've read doesn't mention it.
Fever308@reddit
I took this as it took 4x more compute to train the model, not that it took 4x more to run.
EmergencyCucumber905@reddit
Ignoring the marketing, more pipelines/dual issue is not a bad thing. You need both: the instruction level parallelism (more work per thread) and more cores (theoretically more active threads, provided they can be scheduled).
tyr8338@reddit
I'm hopeful the 5070 Ti will be a decent upgrade from my 3080 Ti, but the raw specs are a bit underwhelming: not that many more cores, only ROPs are increased compared to the 4070 Ti. I'm waiting for real-world benchmarks, especially with RT and DLSS, as I play in 4K.
MrMPFR@reddit (OP)
Do we have rops published anywhere, because I can't seem to find them on NVIDIA's page or anywhere else?
As explained in the post, the switch from FP32/INT32 + FP32 to 2x FP32 + 2x INT32 is huge. This is especially true for lower resolutions, as these involve more integer math. With Ampere, the more integer work you have, the less benefit there is to the additional FP32. Ada Lovelace's higher clocks somewhat addressed this, allowing better scaling at lower resolutions, but it's still not ideal.
A couple of weeks ago I took Hardware Unboxed's FPS numbers for the 2080S vs the 3070 Ti at different resolutions and calculated the gains per SM at iso-clocks, and here's what I got: +38.00% (4K), +28.96% (1440p) and +24.10% (1080p). As you can see, the gains are much larger at 4K because that resolution can better utilize the additional FP32. We can infer the workload distribution from the performance uplifts vs Turing: 1080p = 62% FP + 38% INT, 1440p = 64.5% FP + 35.5% INT, 4K = 69% FP + 31% INT.
Moving to a Turing-doubled design should theoretically allow for much higher FPS per SM across all games, with larger gains at lower resolutions: +45% at 4K, +55% at 1440p and +61% at 1080p. This isn't happening, since it would require a doubling of all supporting logic vs Turing, which is unfeasible without a node shrink. We're also unlikely to see performance gains per SM large enough for these differences across resolutions to matter.
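Here's the toy issue-rate model behind that arithmetic as a sketch (my own assumption: the SM is limited purely by FP32/INT32 lanes, which is exactly why the real-world gains will land lower):

```python
# Toy model: a workload is a fraction f of FP32 ops and (1 - f) of INT32 ops,
# and per-SM time is set purely by issue lanes. Illustrative only.
def time_turing(f):          # 64 FP32 + 64 INT32, concurrent
    return max(f / 64, (1 - f) / 64)

def time_ampere_ada(f):      # 64 FP32 + 64 shared FP32/INT32 lanes
    return max(1 / 128, (1 - f) / 64)

def time_turing_doubled(f):  # hypothetical Blackwell: 128 FP32 + 128 INT32, concurrent
    return max(f / 128, (1 - f) / 128)

for label, f in [("4K", 0.69), ("1440p", 0.645), ("1080p", 0.62)]:
    ampere_gain = time_turing(f) / time_ampere_ada(f) - 1
    doubled_gain = time_ampere_ada(f) / time_turing_doubled(f) - 1
    print(f"{label:6s} FP share {f:.3f}: Ampere vs Turing +{ampere_gain:.0%}, "
          f"Turing-doubled vs Ada +{doubled_gain:.0%}")
```

Plugging in the FP shares inferred above reproduces (to rounding) the +38/+29/+24% Ampere numbers and the +45/+55/+61% theoretical ceilings.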
It'll be interesting to see the reviews and if this massive boost to INT math has any benefits at lower resolutions (doubt it).
Fromarine@reddit
Should be able to tell pretty easily from the GPCs, like with Ada and Ampere. GB202 is 16 SMs per GPC with 12 GPCs. The 5090 has 170 SMs, which is less than 11 full GPCs but more than 10 like the 4090, so 11 GPCs with 16 ROPs per GPC is 176 ROPs again.
The 5080 has 7 GPCs and uses the full die: 7x16 is 112 ROPs. The 5070 Ti very likely cuts one GPC like the 4070 Ti Super, so you get 6x16 or 96 ROPs.
The 5070 (GB205) seems to use 10 SMs per GPC rather than 12 SMs across 4 GPCs, seeing as the laptop 5070 Ti has 50 SMs, which makes that impossible; so 5 GPCs and 5x16 = 80 ROPs.
In other words, ROP counts should be equal to Ada Lovelace with the Super series instead of their vanilla counterparts for the applicable cards (rough arithmetic sketched below).
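The same arithmetic as a quick sketch; every count here is the rumoured figure from the comment above, not a confirmed spec:

```python
# ROP estimate from rumoured GPC counts, assuming 16 ROPs per GPC as on Ada Lovelace.
ROPS_PER_GPC = 16

rumoured_gpcs = {
    "RTX 5090 (GB202, 11 of 12 GPCs enabled)": 11,
    "RTX 5080 (GB203, full 7 GPCs)":           7,
    "RTX 5070 Ti (GB203, 6 GPCs)":             6,
    "RTX 5070 (GB205, 5 GPCs)":                5,
}

for card, gpcs in rumoured_gpcs.items():
    print(f"{card:42s} -> {gpcs * ROPS_PER_GPC} ROPs")
```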
MrMPFR@reddit (OP)
Thanks for the explanation. Your math assumed that SMs per GPC is unchanged vs Ada Lovelace which is likely but unconfirmed as we haven’t got the Whitepaper or any GPU diagrams.
Fromarine@reddit
No it actually doesn't; it uses a combination of the long-rumored GPC counts, sanity checked by whether the SM count is possible with other multiples of SMs per GPC. E.g. GB202 uses 16 SMs per GPC, up from 12 on AD102. Now, it could be 16 GPCs at Ada's level of 12 SMs per GPC, but that would mean the 5090 with 170 SMs has 1 GPC disabled plus another GPC with 10 of its 12 SMs disabled, which makes no sense at all because Nvidia would've just disabled 2 GPCs at that point.
With the 5070/GB205, nothing but 5 GPCs is feasible to get to 50 SMs flat for the full die.
With the 5080/GB203: AD103 had a really weird config where one of the 7 GPCs had 8 SMs instead of 12, so seeing that GB203 has exactly 4 more SMs, it is almost guaranteed it is just 7 GPCs again, but this time every GPC gets 12 SMs like you'd expect.
Still ofc just guessing as you said but I think it's all but certainly correct.
What I'm curious about most is the cache and if the 5070, 5070ti and 5090 will have it cut down like their predecessors or not
MrMPFR@reddit (OP)
Sorry, didn't check before replying. Thanks for explaining the reasoning behind it, it sounds much more plausible now.
This is just speculation but think we'll get this: 128MB on GB202 (5090 rumoured at 112MB), 64MB on GB203, 40-48MB on GB205 and 32MB on GB206. The additional bandwidth and lower latency of GDDR7 + potentially some architectural changes to cache management could help boost performance.
Fromarine@reddit
Apparently with Ada they can use either 8 MB, or cut it down to 6 MB, of L2 per 32 bits of memory bus. Hence why the 4070 base had 36 MB of L2 while the 4070 Super had 48 MB (6x6 MB vs 6x8 MB), and the 4070 Ti Super had 48 MB instead of the 4080's 64 MB (8x6 MB vs 8x8 MB); the 4090 was 12x6 MB, etc. So theoretically it should be 36 MB or 48 MB for the 5070, 48 MB or 64 MB for the 5070 Ti, and 96 MB or 128 MB for the 5090 seeing as it's got a 512-bit bus now.
I'm all but certain the full GB203 will have 64 MB; it's just whether Nvidia will cut down the 5070 Ti to 48 MB or not. But yeah, the 6/8 MB rule works for the entire Ada stack, so I'd presume it's the same for Blackwell.
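The same rule as a sketch (speculative: it assumes the 6-or-8 MB per 32-bit memory controller pattern from Ada carries over, and uses the announced bus widths):

```python
# L2 size options from bus width, using the Ada-era 6 MB (cut) or 8 MB (full)
# per 32-bit memory controller pattern. Speculative until the whitepaper.
def l2_options(bus_width_bits):
    controllers = bus_width_bits // 32
    return controllers * 6, controllers * 8  # (cut-down, full) in MB

for card, bus in [("RTX 5070 (192-bit)", 192), ("RTX 5070 Ti (256-bit)", 256),
                  ("RTX 5080 (256-bit)", 256), ("RTX 5090 (512-bit)", 512)]:
    cut, full = l2_options(bus)
    print(f"{card:22s} -> {cut} MB or {full} MB of L2")
```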
MrMPFR@reddit (OP)
Didn't know that. Thanks for explaining.
We need that Whitepaper ASAP lol.
Fromarine@reddit
ong
tyr8338@reddit
5070 ti specs
https://www.techpowerup.com/gpu-specs/geforce-rtx-5070-ti.c4243
4070 ti specs
https://www.techpowerup.com/gpu-specs/geforce-rtx-4070-ti.c3950
I used this site for comparisons, no idea if 5070 ti specs are 100% accurate
MrMPFR@reddit (OP)
Wouldn't put too much faith in the TechPowerUp numbers. It's not official info but based on leaks. We need the whitepaper.
tyr8338@reddit
We will know in a few weeks.
atatassault47@reddit
I'm on a 3090 Ti (got it at end-of-generation sales for $1000), and the 50 series having 4x RT core power is tempting. But the 5080 is a RAM downgrade, and the 5090 is way too expensive. Hopefully there will be a 5080 Ti/Super with 24 GB of VRAM.
MrMPFR@reddit (OP)
Too early to say if it'll actually matter. We'll need independent testing + that 3090 Ti's 24GB of VRAM is needed for ultra 4K gaming going forward.
atatassault47@reddit
Yeah, I know lol. Exact reason I got it. I game at 5120x1440, which is 88.89% of 4k, and I already feel the VRAM usage.
Tasty_Toast_Son@reddit
Valid, I can feel the VRAM crunch with my 3080 at 1440p. I have to turn down settings that computationally the card can handle, but it can't fit in memory more often than I would like to admit.
kontis@reddit
Not just future games. The inability to ray trace Nanite is a problem in many UE5 games that use it. They have to use a separate lower-poly model (proxy) just to RT it.
I think EPIC was the one pushing for this feature. They talked about it since UE 5.0
MrMPFR@reddit (OP)
Future games could also mean upgrades for existing games, just being cautious with the wording.
Oh for sure, this was a feature Epic requested for UE5. Yes you can see the impact of RTX Mega Geometry on traced geometric detail here.
Fromarine@reddit
Idk man, don't the better RT cores + GDDR7 simply explain the gain per SM on Blackwell in Far Cry 6 RT? Both the die size and GDDR7 point to 128 FP32 + 128 INT32 being too good to be true, because you should be seeing even more improvement per SM and much bigger dies per SM.
MrMPFR@reddit (OP)
I’m just reporting what NVIDIA has officially said. Theoretical TFLOP doesn’t equate to real world perf.
Like I said it could explain some of the difference, but I suspect the majority of it is due to GDDR7 and possibly smarter caches.
FantomasARM@reddit
So something like a 3080 will be able to run the transformer model decently?
Nicholas-Steel@reddit
At significantly reduced benefit to FPS, yeah. Since it'll require significantly more processing power & VRAM due to needing to be optimized for FP8 instead of FP4.
hackenclaw@reddit
I am still wondering what use the dedicated FP16 in my Turing TU116 GTX 1660 Ti GPU is in common consumer software.
GTX Turing doesn't have tensor cores, but Nvidia went out of their way to add dedicated FP16 units (which the Pascal architecture doesn't have). Why?
MrMPFR@reddit (OP)
It accelerates gaming performance. AMD called it Rapid Packed Math and it was a key feature of the PS4 Pro.
gluon-free@reddit
What about FP64 cores? They could potentially be thrown away to save silicon space.
PIIFX@reddit
The FP64/FP32 rate seems to be 1/64.
ResponsibleJudge3172@reddit
FP64 and FP16 are, I believe, given to the tensor cores. Which is interesting, since that means they somehow do non-tensor math.
EETrainee@reddit
They're almost guaranteed to still have them for compatibility; the space savings per SM would be marginal at best. Non-datacenter SKUs already had only two FP64 pipelines compared to ~128 32-bit lanes.
Fromarine@reddit
Still, I wonder, with how light the ms overhead of DLSS has become, if the 4x higher compute requirement of DLSS 4 will be much of an issue on older cards, especially 30 series and up. Because if you look at this table, the 2080 Ti having slightly less overhead than the 3070 strongly suggests it is not currently using sparsity, correct? Here
MrMPFR@reddit (OP)
Doubt DLSS CNN is using sparsity feature.
Throwawaymotivation2@reddit
Quality post! How did you calculate the compute capability of each gen?
Please update the post when they’re released!
MrMPFR@reddit (OP)
Thank you. Oh, I didn't; I just took the values from here. Compute capability is just a number that signifies the way the underlying hardware handles scheduling and execution of math. Because we're getting a big number increase, it's likely that NVIDIA has been adding a lot of new functionality and changing how things are done.
Don't think we need to wait that long as the Whitepaper should arrive in about a week or two at most. But I'll make sure to add the additional info when it gets released.
XYHopGuy@reddit
Doubt the CNN model made much use of sparsity. Sparsity comes up most in embedding lookups and causal attention masks, mostly seen in gpt-like models, hence the importance for ML. If the new transformer model is causal, it will benefit.
MrMPFR@reddit (OP)
Thanks for the input. I’ve edited the post.
Nvidia said the new transformer implementation is much more temporal and can retain and compare info across many frames? Does this constitute causal behaviour?
XYHopGuy@reddit
Possibly but depends on implementation. I don't think I can explain it well here, but I'll try.
When you compute attention over a sentence, you create an NxN matrix of attention scores, where N is the length of the sentence. With causal attention, this is a lower triangular matrix, meaning words from the "future" have an attention score of zero and thus cannot be used to attend to any previous words.
With vision transformers this relationship isn't so clear to me. As you said though, if it's "much more temporal", they could be attending to a sequence of frames. Also, these code paths tend to be more optimized, so sometimes people use causal attention even when traditional attention is more intuitive.
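A tiny numpy illustration of the causal mask being described (generic attention math, nothing DLSS-specific):

```python
import numpy as np

n = 5                                       # sequence length (tokens, or frames in a temporal model)
scores = np.random.randn(n, n)              # raw attention scores, one row per query position
causal_mask = np.tril(np.ones((n, n)))      # lower triangular: position i may only attend to positions <= i
masked = np.where(causal_mask == 1, scores, -np.inf)                  # "future" positions are masked out
weights = np.exp(masked) / np.exp(masked).sum(axis=1, keepdims=True)  # row-wise softmax
print(np.round(weights, 2))                 # upper triangle is all zeros
```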
MrMPFR@reddit (OP)
Thanks for the explanation but I'm afraid it's above my level of understanding.
Kiwi_CunderThunt@reddit
Holy hell you did your homework! Good effort ! Generally speaking though, guys go have a nap and wait. This market is crap so wait for prices to drop, even a little. Budget your card against what games you play. Do I want a new card? Yes...am I going to pay extortion $4000? No. I'll run my card into the ground, then there's frame gen TAA discussions etc etc. just game on and be happy imo
Emperor_Idreaus@reddit
This does not necessarily entail a direct reduction in frames per second (FPS), but it does indicate that older GPUs will have to work significantly harder, potentially resulting in increased power consumption and heat.
The performance impact will vary depending on the game or application and the capacity of the GPU to manage the additional computational load, so it's not of direct importance for currently released titles, but it likely will be with driver updates and future game engines implementing the new enhanced functionality.
MrMPFR@reddit (OP)
Which part of the post are you replying to?
Or is this general thoughts on the Blackwell architecture?
Emperor_Idreaus@reddit
My apologies, im referring to the following:
[...] Here’s the FP16 tensor math throughput per SM for each generation at iso-clocks [...] And as you can see, the deltas in theoretical FP16 throughput, the lack of support for FP4–FP8 tensor math (Transformer Engine), and sparsity will worsen the model’s ms overhead and VRAM storage cost with every previous generation. Note this is relative, as we still don’t know the exact overhead and storage cost for the new transformer models [...]
MrMPFR@reddit (OP)
As I thought.
Yes, it’s not like performance will tank completely on older cards; the overhead will just be much bigger than on newer cards due to the larger model. Older cards should still be able to get higher FPS in most cases even with the new DLSS models.
And yes, tensor performance is very workload dependent. I'm just assuming the DLSS transformer models will run better on the 40 and 50 series because they literally have an engine for that (more than just the reduced FP math) + stronger tensor cores in general.
9897969594938281@reddit
Great post! Insightful read.
octatone@reddit
Ignore rumors, wait for the embargo to lift.
MrMPFR@reddit (OP)
This is not a rumour, this is officially disclosed info.
Only added rumour tag due to cautious commentary.