AMD's Post-RDNA 4 Patent Filings Signal Major Changes Ahead

Posted by MrMPFR@reddit | hardware

(To Mod/Disclaimer) This is a response to the latest patent filings shared by u/Kepler_L2 on the NeoGAF forums. Everything written here is reporting and selective analysis of patent filings, and the implications are hypothetical, not finalized, so please don't take any of it as fact. IDK how many of these will end up in RDNA 5/UDNA or later architectures. But as with my previous post, looking through patent filings can reveal the priorities of AMD R&D and signal possible shifts ahead. After all, lots of them do end up in finalized silicon and shipping products. This also isn't an exhaustive indication of the possible changes that lie ahead: as we near EoY 2025 and enter 2026, more filings are certain to pop up leading up to the launch of AMD's next-gen GPU architecture.

Once again, I'm no expert, and trying to grasp some of these concepts was really hard. I could've made some mistakes, and if so please tell me. With that out of the way, let's look at the AMD patent filings.

Dense Geometry Format (DGF)

Kepler_L2 called this basically HW-level Nanite, but IDK how accurate that description is. This is the AMD patent filing for DGF, which they announced in February via GPUOpen. The Dense Geometry Format is all about making the BVH footprint as small as possible while reducing redundant memory transactions, as per the blog:

"DGF is engineered to meet the needs of hardware by packing as many triangles as possible into a cache aligned structure. This enables a triangle to be retrieved using one memory transaction, which is an essential property for ray tracing, and also highly desirable for rasterization."

It'll be supported in hardware by future AMD GPU architectures. AMD hasn't mentioned RDNA 4 support, so this is reserved for next-gen.
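To make the packing idea concrete, here's a rough C++ sketch of a cache line-sized triangle block. The layout (vertex/triangle counts, quantization, field names) is my own guess for illustration; the real DGF bit layout documented on GPUOpen is more elaborate:

```cpp
// Hypothetical sketch in the spirit of DGF: pack many triangles plus their
// quantized vertices into one cache line-sized structure, so decoding a
// triangle needs exactly one memory transaction. Not the real DGF layout.
#include <cstdint>
#include <cstdio>

struct alignas(128) TriangleBlock {   // one 128 B cache line
    float    anchor[3];               // block-local origin
    float    scale;                   // dequantization scale
    uint16_t vertices[12][3];         // up to 12 quantized vertices
    uint8_t  indices[10][3];          // up to 10 triangles, block-local indices
    uint8_t  triCount;
    uint8_t  pad[9];
};
static_assert(sizeof(TriangleBlock) == 128, "must fit one cache line");

// Decoding a triangle touches only this block: no extra index/vertex fetches.
void decodeTriangle(const TriangleBlock& b, int t, float out[3][3]) {
    for (int v = 0; v < 3; ++v)
        for (int c = 0; c < 3; ++c)
            out[v][c] = b.anchor[c] + b.scale * b.vertices[b.indices[t][v]][c];
}

int main() {
    TriangleBlock b{};
    b.triCount = 1;
    b.scale = 1.0f / 65535.0f;
    float tri[3][3];
    decodeTriangle(b, 0, tri);
    std::printf("block size: %zu bytes\n", sizeof(TriangleBlock));
}
```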

Another patent filing addresses ray tracing's bandwidth issues by adding a low-precision prefiltering stage: bulk processing of primitive packets is done by default for prefilter nodes (an alternative route to DGF), and full-precision intersection tests are only required for inconclusive results. Both DGF and prefilter nodes have major benefits: lowering the area required (low precision), eliminating redundant duplicative data, reducing node data fetching, and increasing the compute-to-memory ratio of ray tracing. Here is the full quote from the filing:

"In the implementations described herein, parallel rejection testing of large groups of triangles enables a ray tracing circuitry to perform many ray-triangle intersections without fetching additional node data (since the data can be simply decoded from the DGF node, without the need of duplicating data at multiple memory locations). This improves the compute-to-bandwidth ratio of ray traversal and provides a corresponding speedup. These methods further reduce the area required for bulk ray-triangle intersection by using cheap low-precision pipelines to filter data ahead of the more expensive full-precision pipeline."

In conclusion, prefilter and DGF nodes massively reduce the load on the memory subsystem while permitting fast low-precision bulk processing of triangles, resulting in a speedup. All this while having an even lower area cost.
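Here's a hedged sketch of that two-stage flow, using float vs. double as a stand-in for the cheap low-precision and expensive full-precision pipelines (the actual hardware filters differently; this only shows the filter-then-confirm idea):

```cpp
// Moeller-Trumbore ray-triangle test with a tolerance band. With eps > 0 it
// acts as a conservative low-precision filter that can only answer
// "definitely miss" or "maybe hit"; with eps == 0 it's the full test.
#include <cstdio>

template <typename T> struct V3 { T x, y, z; };
template <typename T> V3<T> sub(V3<T> a, V3<T> b) { return {a.x-b.x, a.y-b.y, a.z-b.z}; }
template <typename T> V3<T> cross(V3<T> a, V3<T> b) {
    return {a.y*b.z-a.z*b.y, a.z*b.x-a.x*b.z, a.x*b.y-a.y*b.x};
}
template <typename T> T dot(V3<T> a, V3<T> b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

template <typename T>
bool maybeHits(V3<T> o, V3<T> d, const V3<T> v[3], T eps) {
    V3<T> e1 = sub(v[1], v[0]), e2 = sub(v[2], v[0]);
    V3<T> p  = cross(d, e2);
    T det = dot(e1, p);
    if (det > -eps && det < eps) return true;      // near-degenerate: stay conservative
    T inv = T(1) / det;
    V3<T> tv = sub(o, v[0]);
    T u = dot(tv, p) * inv;
    if (u < -eps || u > T(1) + eps) return false;  // definite miss
    V3<T> q = cross(tv, e1);
    T w = dot(d, q) * inv;
    if (w < -eps || u + w > T(1) + eps) return false;
    return dot(e2, q) * inv > -eps;                // in front of the ray?
}

int main() {
    V3<float>  tf[3] = {{0,0,5},{1,0,5},{0,1,5}};
    V3<double> td[3] = {{0,0,5},{1,0,5},{0,1,5}};
    V3<float>  of{0.2f,0.2f,0}, df{0,0,1};
    V3<double> od{0.2,0.2,0},  dd{0,0,1};
    // Stage 1: cheap low-precision filter rejects the bulk of candidates...
    if (maybeHits(of, df, tf, 1e-3f))
        // Stage 2: ...only survivors pay for the full-precision test.
        std::printf("full-precision hit: %d\n", maybeHits(od, dd, td, 0.0));
}
```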

Multiple Ray Tracing Patent Filings

I won't repeat all the patents from the previous post I made months ago, so these are only the patent filings shared by Kepler_L2.

One is about configurable convex polygon ray/edge testing, which allows results from edges to be shared between polygons, eliminating duplicative intersection tests. This has the following benefit:

"By efficiently sharing edge test results among polygons with shared edges, inside/outside testing for groups of polygons can be made more efficient."

It can be implemented via full or reduced precision and makes ray tracing more cost-effective.
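A minimal sketch of the edge-sharing idea: in 2D (after projection), the signed edge function for edge (a,b) is exactly the negation of the one for (b,a), so one evaluation can serve both triangles that share the edge. The cache structure here is my own illustration, not the patent's hardware:

```cpp
// Share edge-test results between polygons with a common edge: each unique
// edge is evaluated once, and the sign is flipped for the neighbor that
// traverses the edge in the opposite direction.
#include <algorithm>
#include <cstdio>
#include <map>
#include <utility>

struct P2 { double x, y; };

// Signed edge function: > 0 means p lies to the left of a->b.
double edgeFn(P2 a, P2 b, P2 p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

struct EdgeCache {
    std::map<std::pair<int,int>, double> results;  // keyed by (min,max) vertex ids
    int evaluations = 0;

    double get(int ia, int ib, const P2* verts, P2 p) {
        bool flipped = ia > ib;                     // requested direction vs. canonical
        std::pair<int,int> key = std::minmax(ia, ib);
        auto it = results.find(key);
        if (it == results.end()) {
            ++evaluations;                          // computed once per shared edge
            it = results.emplace(key, edgeFn(verts[key.first], verts[key.second], p)).first;
        }
        return flipped ? -it->second : it->second;
    }
};

bool inside(const int tri[3], const P2* verts, P2 p, EdgeCache& c) {
    double e0 = c.get(tri[0], tri[1], verts, p);
    double e1 = c.get(tri[1], tri[2], verts, p);
    double e2 = c.get(tri[2], tri[0], verts, p);
    return (e0 >= 0 && e1 >= 0 && e2 >= 0) || (e0 <= 0 && e1 <= 0 && e2 <= 0);
}

int main() {
    P2 verts[4] = {{0,0},{1,0},{1,1},{0,1}};   // a quad split into two triangles
    int t0[3] = {0,1,2}, t1[3] = {0,2,3};      // edge (0,2) is shared
    P2 p = {0.25, 0.5};
    EdgeCache cache;
    bool h0 = inside(t0, verts, p, cache);
    bool h1 = inside(t1, verts, p, cache);
    std::printf("hits: %d %d, edge evaluations: %d (not 6)\n", h0, h1, cache.evaluations);
}
```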

Three other patent filings leverage displaced micro-meshes (DMMs) and an accelerator unit (AU) that creates them.
The first patent filing introduces prism volumes for displaced subdivided triangles (inferred from DMM). The AU creates a bounding volume around the DMM mesh, then adds more bounding volumes, thereby creating a prism (3D triangle) shape around the base triangle, corresponding to the three corners and the low and high extents of the interpolated DMM normals. The AUs then "...determine whether a ray intersects the prism volume bounding the first base triangle of the DMM"

The second patent filing concerns ray tracing of DMMs using a bounding prism hierarchy. A base mesh is used, which can be broken down into micro-meshes that can be adjusted with displacement to accurately capture scene detail. The intersection method is described the same as in the other filings, except this one also mentions prisms at the sub-base-triangle level, which together make one big prism in accordance with the first filing.
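Here's a small sketch of the prism construction as I understand it from these filings: extrude each base triangle corner along its interpolated normal between the minimum and maximum displacement, giving the two caps of a bounding prism. All names and the CPU-side framing are my assumptions:

```cpp
// Illustrative prism construction over a displaced base triangle: the six
// corners (low cap + high cap) enclose every displaced micro-vertex as long
// as displacements stay within [dMin, dMax].
#include <cstdio>

struct V3 { float x, y, z; };
V3 add(V3 a, V3 b) { return {a.x+b.x, a.y+b.y, a.z+b.z}; }
V3 mul(V3 a, float s) { return {a.x*s, a.y*s, a.z*s}; }

struct Prism { V3 bottom[3]; V3 top[3]; };

Prism buildPrism(const V3 base[3], const V3 normal[3], float dMin, float dMax) {
    Prism p;
    for (int i = 0; i < 3; ++i) {
        p.bottom[i] = add(base[i], mul(normal[i], dMin));  // low cap
        p.top[i]    = add(base[i], mul(normal[i], dMax));  // high cap
    }
    return p;
}

int main() {
    V3 base[3]   = {{0,0,0},{1,0,0},{0,1,0}};
    V3 normal[3] = {{0,0,1},{0,0,1},{0,0,1}};
    Prism p = buildPrism(base, normal, -0.1f, 0.3f);  // the DMM's displacement range
    std::printf("top corner 0: (%g, %g, %g)\n", p.top[0].x, p.top[0].y, p.top[0].z);
}
```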

The third describes the specific method for detecting ray intersections with DMMs:

"Instead of detecting intersection with the bilinear patches directly, tetrahedrons that circumscribe the bilinear patches can be used instead. The two bases and the three tetrahedra make fourteen triangles. The device tests for potential intersection with the displaced micro-mesh by testing for an intersection with any of the fourteen triangles. Various other methods and systems are also disclosed."

I cannot figure out how this DMM implementation differs from NVIDIA's now-deprecated DMM implementation in Ada Lovelace; it sounds very similar, although some differences are probably to be expected. IDK what benefits to expect here except perhaps lower BVH build cost and size.
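For what it's worth, here's my reconstruction of the fourteen-triangle count: each of the prism's three bilinear side patches lies inside the convex hull of its four corners, i.e. a tetrahedron with four triangular faces, so 2 caps + 3×4 faces = 14 triangles. A hedged sketch, with the decomposition being my own guess:

```cpp
// Test a ray against the 14 triangles that bound a prism: the two caps plus
// the four faces of each of the three corner-tetrahedra circumscribing the
// bilinear side patches. Any hit flags a potential DMM intersection.
#include <cstdio>

struct V3 { double x, y, z; };
V3 sub(V3 a, V3 b) { return {a.x-b.x, a.y-b.y, a.z-b.z}; }
V3 cross(V3 a, V3 b) { return {a.y*b.z-a.z*b.y, a.z*b.x-a.x*b.z, a.x*b.y-a.y*b.x}; }
double dot(V3 a, V3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

bool rayTri(V3 o, V3 d, V3 a, V3 b, V3 c) {            // Moeller-Trumbore
    V3 e1 = sub(b,a), e2 = sub(c,a), p = cross(d,e2);
    double det = dot(e1,p);
    if (det > -1e-12 && det < 1e-12) return false;
    double inv = 1.0/det; V3 t = sub(o,a);
    double u = dot(t,p)*inv; if (u < 0 || u > 1) return false;
    V3 q = cross(t,e1);
    double v = dot(d,q)*inv; if (v < 0 || u+v > 1) return false;
    return dot(e2,q)*inv > 0;
}

// b[0..2]: bottom cap corners, t[0..2]: top cap corners of the prism.
bool mayHitPrism(V3 o, V3 d, const V3 b[3], const V3 t[3]) {
    if (rayTri(o,d,b[0],b[1],b[2]) || rayTri(o,d,t[0],t[1],t[2])) return true; // 2 caps
    for (int i = 0; i < 3; ++i) {                       // 3 tetrahedra, 4 faces each
        int j = (i+1) % 3;
        if (rayTri(o,d,b[i],b[j],t[i]) || rayTri(o,d,b[i],b[j],t[j]) ||
            rayTri(o,d,b[i],t[i],t[j]) || rayTri(o,d,b[j],t[i],t[j])) return true;
    }
    return false;                                       // all 14 triangles missed
}

int main() {
    V3 b[3] = {{0,0,0},{1,0,0},{0,1,0}}, t[3] = {{0,0,1},{1,0,1},{0,1,1}};
    V3 o = {0.2,0.2,-1}, d = {0,0,1};
    std::printf("potential intersection: %d\n", mayHitPrism(o,d,b,t));
}
```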

Streaming Wave Coalescer (SWC)

The Streaming Wave Coalescer implements comprehensive out-of-order execution (RDNA 4 only has OoO memory requests). It does this by using sorting bins and hard keys to sort divergent threads from across waves into groups that follow the same instruction path, thereby coalescing the threads into new waves.
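Here's a CPU-side sketch of what that coalescing could look like in software terms. The wave size, the bin structure, and the use of a per-thread sort key are all my assumptions:

```cpp
// Divergent threads from many waves are binned by a sort key (imagined here
// as the shader/material they branch to) and repacked into new, coherent
// waves so each new wave executes a single instruction path.
#include <algorithm>
#include <cstdio>
#include <map>
#include <vector>

constexpr int WAVE_SIZE = 32;

struct Thread { int id; int sortKey; };   // sortKey ~ instruction path

std::vector<std::vector<Thread>> coalesce(const std::vector<Thread>& divergent) {
    std::map<int, std::vector<Thread>> bins;          // one sorting bin per key
    for (const Thread& t : divergent) bins[t.sortKey].push_back(t);

    std::vector<std::vector<Thread>> waves;           // repack bins into waves
    for (auto& [key, bin] : bins)
        for (std::size_t i = 0; i < bin.size(); i += WAVE_SIZE)
            waves.emplace_back(bin.begin() + i,
                               bin.begin() + std::min(i + WAVE_SIZE, bin.size()));
    return waves;
}

int main() {
    std::vector<Thread> threads;
    for (int i = 0; i < 256; ++i)
        threads.push_back({i, i % 5});                // 5 divergent paths, interleaved
    auto waves = coalesce(threads);
    std::printf("%zu coherent waves, first wave key %d\n",
                waves.size(), waves[0][0].sortKey);
}
```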

The spill-after programming model offers developers granular control over when and how thread state is spilled to memory when execution is reordered to different lanes. This helps avoid the excessive cache usage and memory accesses that would otherwise cause large latency increases and costly front-end stalls when leveraging the SWC.
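A purely illustrative sketch of what a spill-after reorder point might look like to a developer. reorderThreadsByKey() is a made-up stand-in (a no-op stub here), not a real intrinsic:

```cpp
// The point of the model: the developer decides exactly which live state is
// spilled around the reorder, instead of hardware conservatively saving
// everything. Everything below is an assumption for illustration.
#include <cstdio>

struct RayState { float hitT; int materialId; };

void reorderThreadsByKey(int /*key*/) {
    // No-op stub: on real SWC hardware this lane could migrate into a new,
    // coherent wave grouped with other lanes on the same instruction path.
}

void shadeHit(RayState& s, float* spill) {
    spill[0] = s.hitT;                  // 1. spill only what survives the reorder
    reorderThreadsByKey(s.materialId);  // 2. coalesce lanes by key
    s.hitT = spill[0];                  // 3. refill and continue on the new wave
    std::printf("shading material %d at t=%g\n", s.materialId, s.hitT);
}

int main() {
    float spill[1];
    RayState s{4.2f, 7};
    shadeHit(s, spill);
}
```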

Just like NVIDIA's SER, the SWC should help boost path tracing performance, although the implementation looks different and appears to be enabled by default.

Local Launchers and Work Graph Scheduler

One patent filing mentions that each Workgroup Processor (WGP) can now use local launchers to generate work and start shader threads independently of the Shader Program Interface (SPI). They maintain their own queues and resource management but ask the SPI for help and lease resources for each shader thread. Scheduling and dispatching work locally results in reduced latency, more dynamic work launches, and reduced GPU front-end bottlenecks.
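In software terms, the division of labor might look something like this sketch (all names and the leasing scheme are illustrative assumptions):

```cpp
// A per-WGP local launcher keeps its own queue and launches shader threads
// itself; it only contacts the global SPI to lease the resources each launch
// needs, rather than routing every launch through the front-end.
#include <cstdio>
#include <queue>

struct ShaderWork { int threads; int vgprsNeeded; };

struct Spi {                      // global Shader Program Interface, simplified
    int freeVgprs = 1024;
    bool lease(int vgprs) {       // the launcher "asks for help" / leases resources
        if (freeVgprs < vgprs) return false;
        freeVgprs -= vgprs;
        return true;
    }
};

struct LocalLauncher {            // one per WGP; owns its queue
    std::queue<ShaderWork> queue;
    void submit(ShaderWork w) { queue.push(w); }   // e.g. a shader spawning work

    void pump(Spi& spi) {         // launch locally, no front-end round-trip
        while (!queue.empty() && spi.lease(queue.front().vgprsNeeded)) {
            std::printf("launched %d threads locally\n", queue.front().threads);
            queue.pop();
        }
    }
};

int main() {
    Spi spi;
    LocalLauncher wgp0;
    wgp0.submit({64, 256});       // work generated on the WGP itself
    wgp0.submit({128, 512});
    wgp0.pump(spi);
}
```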

This patent filing introduces a hierarchical scheduler made up of a global scheduler and one or more local schedulers called Work Graph Schedulers (WGS), located within each Shader Engine. Tasks are stored in a global mailbox/shared cache fed by the global scheduler, and when a task (work item) is ready it notifies one WGS to fetch it. Meanwhile, scheduling and management of the work queue are offloaded to the local WGS. Each WGS independently schedules and maintains its own work queue for the WGPs and has its own private local cache. This results in quicker accesses and lower-latency scheduling while at the same time enabling much better core scaling, especially in larger designs, as explained here:

"In an implementation, the WGS 306 is configured to directly access the local cache 310, thereby avoiding the need to communicate through higher levels of the scheduling hierarchy. In this manner, scheduling latencies are reduced and a finer grained scheduling can be achieved. That is, WGS 306 can schedule work items faster to the one or more WGP 308 and on a more local basis. Further, the structure of the shader engine 304 is such that a single WGS 306 is available per shader 304, thereby making the shader engine 304 more easily scalable. For example, because each of the shader engines 304 is configured to perform local scheduling, additional shader engines can readily be added to the processor."

In essence, each SE becomes its own pseudo-autonomous GPU that handles scheduling and its work queue independently of the global scheduler. Instead of orchestrating everything and micromanaging, the global scheduler can simply provide work via the global mailbox, thereby offloading the scheduling of that work to each Shader Engine.
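Here's a toy model of that hierarchy: the global scheduler only fills the mailbox, and each WGS pulls from it and schedules locally. Queue shapes and names are my own illustration, not the filing's:

```cpp
// Global scheduler drops ready work items into a shared mailbox; each shader
// engine's WGS fetches from it and does all fine-grained scheduling against
// its own private queue, so adding shader engines adds schedulers.
#include <cstdio>
#include <deque>
#include <vector>

struct WorkItem { int id; };

struct GlobalMailbox { std::deque<WorkItem> ready; };

struct Wgs {                                  // one per shader engine
    int engineId;
    std::deque<WorkItem> localQueue;          // backed by a private local cache

    void fetch(GlobalMailbox& mb) {           // notified that an item is ready
        if (!mb.ready.empty()) {
            localQueue.push_back(mb.ready.front());
            mb.ready.pop_front();
        }
    }
    void dispatch() {                         // low-latency local scheduling to WGPs
        while (!localQueue.empty()) {
            std::printf("SE%d runs item %d\n", engineId, localQueue.front().id);
            localQueue.pop_front();
        }
    }
};

int main() {
    GlobalMailbox mb;
    for (int i = 0; i < 4; ++i) mb.ready.push_back({i});  // global scheduler's job ends here
    std::vector<Wgs> engines = {{0}, {1}};
    while (!mb.ready.empty())
        for (Wgs& e : engines) e.fetch(mb);               // WGSs pull independently
    for (Wgs& e : engines) e.dispatch();
}
```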

The patent filing also mentions that WGSs may communicate with each other and that WGPs can assist in scheduling. The implementation is such that the WGS schedules work and sends a work schedule to the Asynchronous Dispatch Controller (ADC, one per Shader Engine). The ADC builds waves and launches work for the WGPs in the Shader Engine.

When a WGS is underutilized, it can communicate that to the global scheduler and request more work. When it's overloaded, work items are exported to an external global cache. This helps with load balancing and keeping the Shader Engines fed.

It's possible that a local scheduler might become overburdened, but AMD has another patent filing addressing this by allowing each WGS to offload work items to the global scheduler if they overwhelm its scheduling capabilities. These are then redistributed to one or more other WGSs residing within different scheduling domains/Shader Engines.
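A tiny sketch of that load-balancing behavior, with made-up thresholds: an underfed WGS asks for more work, while an overloaded one exports the excess to the global cache for redistribution to WGSs in other scheduling domains:

```cpp
// Illustrative WGS load balancing: request more work below a low-water mark,
// export excess work items above a high-water mark so other shader engines
// can pick them up. Thresholds and structures are assumptions.
#include <cstddef>
#include <cstdio>
#include <deque>

struct WorkItem { int id; };

struct BalancedWgs {
    std::deque<WorkItem> localQueue;
    std::deque<WorkItem>& globalCache;            // spill target shared across engines

    static constexpr std::size_t LOW = 2, HIGH = 8;

    bool wantsMoreWork() const { return localQueue.size() < LOW; }

    void maybeOffload() {
        // Export excess items so a WGS in a different scheduling domain /
        // shader engine can take them over.
        while (localQueue.size() > HIGH) {
            globalCache.push_back(localQueue.back());
            localQueue.pop_back();
        }
    }
};

int main() {
    std::deque<WorkItem> globalCache;
    BalancedWgs wgs{{}, globalCache};
    for (int i = 0; i < 12; ++i) wgs.localQueue.push_back({i});
    wgs.maybeOffload();
    std::printf("kept %zu, exported %zu, starving: %d\n",
                wgs.localQueue.size(), globalCache.size(), wgs.wantsMoreWork());
}
```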

A Few Important Patent Filings

The RECONFIGURABLE VIRTUAL GRAPHICS AND COMPUTE PROCESSOR PIPELINE patent filing allows general-purpose shaders to emulate fixed-function HW and take over when a fixed-function bottleneck occurs.
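In software terms, the fallback logic might look like this trivial sketch (the queue-depth heuristic and all names are my assumptions):

```cpp
// Work normally goes to the fixed-function unit; when that unit is the
// bottleneck, a general-purpose shader emulation path takes over the stage.
#include <cstdio>

struct PrimitiveBatch { int count; };

void fixedFunctionPath(PrimitiveBatch b)  { std::printf("FF unit: %d prims\n", b.count); }
void shaderEmulationPath(PrimitiveBatch b){ std::printf("shader emulation: %d prims\n", b.count); }

void submit(PrimitiveBatch b, int ffQueueDepth) {
    if (ffQueueDepth > 4)          // fixed-function unit is backed up
        shaderEmulationPath(b);    // reconfigure: shaders take over
    else
        fixedFunctionPath(b);
}

int main() {
    submit({128}, 1);   // FF unit idle -> use it
    submit({128}, 9);   // FF bottleneck -> emulate on shaders
}
```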

Another patent filing, ACCELERATED DRAW INDIRECT FETCHING, leverages fixed-function hardware (an Accelerator) to speed up indirect fetching. This lowers computational latency and means "...different types of aligned or unaligned data structures are usable with equivalent or nearly equivalent performance."
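For context, here's what draw-indirect fetching means in software terms; the filing's Accelerator does this fetch in fixed-function hardware instead. The struct mirrors the common indirect-args shape, and the unaligned memcpy stands in for the "aligned or unaligned" claim (offsets and layout are illustrative):

```cpp
// Draw-indirect: the draw's arguments live in a GPU buffer and must be
// fetched before the draw can be issued. The fetch below is alignment-
// agnostic, hinting at why unaligned structures need not cost extra.
#include <cstdint>
#include <cstdio>
#include <cstring>

struct DrawArgs {   // typical indirect draw argument layout
    uint32_t vertexCount, instanceCount, firstVertex, firstInstance;
};

DrawArgs fetchIndirect(const uint8_t* buffer, size_t byteOffset) {
    DrawArgs a;
    std::memcpy(&a, buffer + byteOffset, sizeof a);   // works at any byte offset
    return a;
}

int main() {
    uint8_t buffer[64] = {};
    DrawArgs src{36, 10, 0, 0};
    std::memcpy(buffer + 3, &src, sizeof src);        // deliberately unaligned
    DrawArgs a = fetchIndirect(buffer, 3);
    std::printf("draw %u verts x %u instances\n", a.vertexCount, a.instanceCount);
}
```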