AMD's Post-RDNA 4 Patent Filings Signal Major Changes Ahead

Posted by MrMPFR@reddit | hardware

(To Mod/Disclaimer) This is a response to the latest patent filings shared by u/Kepler_L2 on the NeoGAF forums. Everything written here is reporting and selective analysis of patent filings, and the implications are hypothetical, not finalized, so please don't take any of it as fact. IDK how many of these will end up in RDNA 5/UDNA or later architectures. But as with my previous post, looking through patent filings can reveal the priorities of AMD R&D and signal possible shifts ahead. After all, lots of them do end up in finalized silicon and shipping products. This also isn't an exhaustive indication of the possible changes that lie ahead: as we near EoY 2025 and enter 2026, more filings are certain to pop up leading up to the launch of AMD's next-gen GPU architecture.

Once again, I'm no expert, and trying to grasp some of these concepts was really hard. I could've made some mistakes, and if so please tell me. With that out of the way, let's look at the AMD patent filings.

Dense Geometry Format (DGF)

Kepler_L2 called this basically HW-level Nanite, but IDK how accurate that description is. This is the AMD patent filing for DGF, which they announced in February via GPUOpen. The Dense Geometry Format is all about making the BVH footprint as small as possible while reducing redundant memory transactions, as per the blog:

"DGF is engineered to meet the needs of hardware by packing as many triangles as possible into a cache aligned structure. This enables a triangle to be retrieved using one memory transaction, which is an essential property for ray tracing, and also highly desirable for rasterization."

It'll be supported in hardware by future AMD GPU architectures. AMD hasn't mentioned RDNA 4 support, so this is reserved for next-gen.
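To make the packing idea concrete, here's a rough C++ sketch of a cache line-sized triangle block. The layout (vertex/triangle counts, quantization, field names) is my own guess for illustration; the real DGF bit layout documented on GPUOpen is more elaborate:

```cpp
// Hypothetical sketch in the spirit of DGF: pack many triangles plus their
// quantized vertices into one cache line-sized structure, so decoding a
// triangle needs exactly one memory transaction. Not the real DGF layout.
#include <cstdint>
#include <cstdio>

struct alignas(128) TriangleBlock {   // one 128 B cache line
    float    anchor[3];               // block-local origin
    float    scale;                   // dequantization scale
    uint16_t vertices[12][3];         // up to 12 quantized vertices
    uint8_t  indices[10][3];          // up to 10 triangles, block-local indices
    uint8_t  triCount;
    uint8_t  pad[9];
};
static_assert(sizeof(TriangleBlock) == 128, "must fit one cache line");

// Decoding a triangle touches only this block: no extra index/vertex fetches.
void decodeTriangle(const TriangleBlock& b, int t, float out[3][3]) {
    for (int v = 0; v < 3; ++v)
        for (int c = 0; c < 3; ++c)
            out[v][c] = b.anchor[c] + b.scale * b.vertices[b.indices[t][v]][c];
}

int main() {
    TriangleBlock b{};
    b.triCount = 1;
    b.scale = 1.0f / 65535.0f;
    float tri[3][3];
    decodeTriangle(b, 0, tri);
    std::printf("block size: %zu bytes\n", sizeof(TriangleBlock));
}
```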

Another patent filing addresses ray tracing's bandwidth issues by adding a low-precision prefiltering stage: bulk processing of primitive packets is done by default for prefilter nodes (an alternative route to DGF), and full-precision intersection tests are only required for inconclusive results. Both DGF and prefilter nodes have major benefits: lowering the area required (low precision), eliminating redundant duplicative data, reducing node data fetching, and increasing the compute-to-memory ratio of ray tracing. Here is the full quote from the filing:

"In the implementations described herein, parallel rejection testing of large groups of triangles enables a ray tracing circuitry to perform many ray-triangle intersections without fetching additional node data (since the data can be simply decoded from the DGF node, without the need of duplicating data at multiple memory locations). This improves the compute-to-bandwidth ratio of ray traversal and provides a corresponding speedup. These methods further reduce the area required for bulk ray-triangle intersection by using cheap low-precision pipelines to filter data ahead of the more expensive full-precision pipeline."

In conclusion, prefilter and DGF nodes massively reduce the load on the memory subsystem while permitting fast low-precision bulk processing of triangles, resulting in a speedup. All this while having an even lower area cost.
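Here's a hedged sketch of that two-stage flow, using float vs. double as a stand-in for the cheap low-precision and expensive full-precision pipelines (the actual hardware filters differently; this only shows the filter-then-confirm idea):

```cpp
// Moeller-Trumbore ray-triangle test with a tolerance band. With eps > 0 it
// acts as a conservative low-precision filter that can only answer
// "definitely miss" or "maybe hit"; with eps == 0 it's the full test.
#include <cstdio>

template <typename T> struct V3 { T x, y, z; };
template <typename T> V3<T> sub(V3<T> a, V3<T> b) { return {a.x-b.x, a.y-b.y, a.z-b.z}; }
template <typename T> V3<T> cross(V3<T> a, V3<T> b) {
    return {a.y*b.z-a.z*b.y, a.z*b.x-a.x*b.z, a.x*b.y-a.y*b.x};
}
template <typename T> T dot(V3<T> a, V3<T> b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

template <typename T>
bool maybeHits(V3<T> o, V3<T> d, const V3<T> v[3], T eps) {
    V3<T> e1 = sub(v[1], v[0]), e2 = sub(v[2], v[0]);
    V3<T> p  = cross(d, e2);
    T det = dot(e1, p);
    if (det > -eps && det < eps) return true;      // near-degenerate: stay conservative
    T inv = T(1) / det;
    V3<T> tv = sub(o, v[0]);
    T u = dot(tv, p) * inv;
    if (u < -eps || u > T(1) + eps) return false;  // definite miss
    V3<T> q = cross(tv, e1);
    T w = dot(d, q) * inv;
    if (w < -eps || u + w > T(1) + eps) return false;
    return dot(e2, q) * inv > -eps;                // in front of the ray?
}

int main() {
    V3<float>  tf[3] = {{0,0,5},{1,0,5},{0,1,5}};
    V3<double> td[3] = {{0,0,5},{1,0,5},{0,1,5}};
    V3<float>  of{0.2f,0.2f,0}, df{0,0,1};
    V3<double> od{0.2,0.2,0},  dd{0,0,1};
    // Stage 1: cheap low-precision filter rejects the bulk of candidates...
    if (maybeHits(of, df, tf, 1e-3f))
        // Stage 2: ...only survivors pay for the full-precision test.
        std::printf("full-precision hit: %d\n", maybeHits(od, dd, td, 0.0));
}
```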

Multiple Ray Tracing Patent Filings

I won't repeat all the patents from the previous post I made months ago, so these are only the patent filings shared by Kepler_L2.

One is about configurable convex polygon ray/edge testing, which allows results from edges to be shared between polygons, eliminating duplicative intersection tests. This has the following benefit:

"By efficiently sharing edge test results among polygons with shared edges, inside/outside testing for groups of polygons can be made more efficient."

It can be implemented via full or reduced precision and makes ray tracing more cost-effective.
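A minimal sketch of the edge-sharing idea: in 2D (after projection), the signed edge function for edge (a,b) is exactly the negation of the one for (b,a), so one evaluation can serve both triangles that share the edge. The cache structure here is my own illustration, not the patent's hardware:

```cpp
// Share edge-test results between polygons with a common edge: each unique
// edge is evaluated once, and the sign is flipped for the neighbor that
// traverses the edge in the opposite direction.
#include <algorithm>
#include <cstdio>
#include <map>
#include <utility>

struct P2 { double x, y; };

// Signed edge function: > 0 means p lies to the left of a->b.
double edgeFn(P2 a, P2 b, P2 p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

struct EdgeCache {
    std::map<std::pair<int,int>, double> results;  // keyed by (min,max) vertex ids
    int evaluations = 0;

    double get(int ia, int ib, const P2* verts, P2 p) {
        bool flipped = ia > ib;                     // requested direction vs. canonical
        std::pair<int,int> key = std::minmax(ia, ib);
        auto it = results.find(key);
        if (it == results.end()) {
            ++evaluations;                          // computed once per shared edge
            it = results.emplace(key, edgeFn(verts[key.first], verts[key.second], p)).first;
        }
        return flipped ? -it->second : it->second;
    }
};

bool inside(const int tri[3], const P2* verts, P2 p, EdgeCache& c) {
    double e0 = c.get(tri[0], tri[1], verts, p);
    double e1 = c.get(tri[1], tri[2], verts, p);
    double e2 = c.get(tri[2], tri[0], verts, p);
    return (e0 >= 0 && e1 >= 0 && e2 >= 0) || (e0 <= 0 && e1 <= 0 && e2 <= 0);
}

int main() {
    P2 verts[4] = {{0,0},{1,0},{1,1},{0,1}};   // a quad split into two triangles
    int t0[3] = {0,1,2}, t1[3] = {0,2,3};      // edge (0,2) is shared
    P2 p = {0.25, 0.5};
    EdgeCache cache;
    bool h0 = inside(t0, verts, p, cache);
    bool h1 = inside(t1, verts, p, cache);
    std::printf("hits: %d %d, edge evaluations: %d (not 6)\n", h0, h1, cache.evaluations);
}
```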

Three other patent filings leverage displaced micro-meshes (DMMs) and an accelerator unit (AU) that creates them.
The first patent filing introduces prism volumes for displaced subdivided triangles (inferred from DMM). The AU creates a bounding volume around the DMM mesh, then adds more bounding volumes, thereby creating a prism (3D triangle) shape around the base triangle, corresponding to the three corners and the low and high extents of the interpolated DMM normals. The AUs then "...determine whether a ray intersects the prism volume bounding the first base triangle of the DMM"

The second patent filing concerns ray tracing of DMMs using a bounding prism hierarchy. A base mesh is used, which can be broken down into micro-meshes that can be adjusted with displacement to accurately capture scene detail. The intersection method is described the same as in the other filings, except this one also mentions prisms at the sub-base-triangle level, which together make one big prism in accordance with the first filing.
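Here's a small sketch of the prism construction as I understand it from these filings: extrude each base triangle corner along its interpolated normal between the minimum and maximum displacement, giving the two caps of a bounding prism. All names and the CPU-side framing are my assumptions:

```cpp
// Illustrative prism construction over a displaced base triangle: the six
// corners (low cap + high cap) enclose every displaced micro-vertex as long
// as displacements stay within [dMin, dMax].
#include <cstdio>

struct V3 { float x, y, z; };
V3 add(V3 a, V3 b) { return {a.x+b.x, a.y+b.y, a.z+b.z}; }
V3 mul(V3 a, float s) { return {a.x*s, a.y*s, a.z*s}; }

struct Prism { V3 bottom[3]; V3 top[3]; };

Prism buildPrism(const V3 base[3], const V3 normal[3], float dMin, float dMax) {
    Prism p;
    for (int i = 0; i < 3; ++i) {
        p.bottom[i] = add(base[i], mul(normal[i], dMin));  // low cap
        p.top[i]    = add(base[i], mul(normal[i], dMax));  // high cap
    }
    return p;
}

int main() {
    V3 base[3]   = {{0,0,0},{1,0,0},{0,1,0}};
    V3 normal[3] = {{0,0,1},{0,0,1},{0,0,1}};
    Prism p = buildPrism(base, normal, -0.1f, 0.3f);  // the DMM's displacement range
    std::printf("top corner 0: (%g, %g, %g)\n", p.top[0].x, p.top[0].y, p.top[0].z);
}
```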

The third describes the specific method for detecting ray intersections with DMMs:

"Instead of detecting intersection with the bilinear patches directly, tetrahedrons that circumscribe the bilinear patches can be used instead. The two bases and the three tetrahedra make fourteen triangles. The device tests for potential intersection with the displaced micro-mesh by testing for an intersection with any of the fourteen triangles. Various other methods and systems are also disclosed."

I cannot figure out how this DMM implementation differs from NVIDIA's now-deprecated DMM implementation in Ada Lovelace; it sounds very similar, although some differences are probably to be expected. IDK what benefits to expect here except perhaps lower BVH build cost and size.
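For what it's worth, here's my reconstruction of the fourteen-triangle count: each of the prism's three bilinear side patches lies inside the convex hull of its four corners, i.e. a tetrahedron with four triangular faces, so 2 caps + 3×4 faces = 14 triangles. A hedged sketch, with the decomposition being my own guess:

```cpp
// Test a ray against the 14 triangles that bound a prism: the two caps plus
// the four faces of each of the three corner-tetrahedra circumscribing the
// bilinear side patches. Any hit flags a potential DMM intersection.
#include <cstdio>

struct V3 { double x, y, z; };
V3 sub(V3 a, V3 b) { return {a.x-b.x, a.y-b.y, a.z-b.z}; }
V3 cross(V3 a, V3 b) { return {a.y*b.z-a.z*b.y, a.z*b.x-a.x*b.z, a.x*b.y-a.y*b.x}; }
double dot(V3 a, V3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

bool rayTri(V3 o, V3 d, V3 a, V3 b, V3 c) {            // Moeller-Trumbore
    V3 e1 = sub(b,a), e2 = sub(c,a), p = cross(d,e2);
    double det = dot(e1,p);
    if (det > -1e-12 && det < 1e-12) return false;
    double inv = 1.0/det; V3 t = sub(o,a);
    double u = dot(t,p)*inv; if (u < 0 || u > 1) return false;
    V3 q = cross(t,e1);
    double v = dot(d,q)*inv; if (v < 0 || u+v > 1) return false;
    return dot(e2,q)*inv > 0;
}

// b[0..2]: bottom cap corners, t[0..2]: top cap corners of the prism.
bool mayHitPrism(V3 o, V3 d, const V3 b[3], const V3 t[3]) {
    if (rayTri(o,d,b[0],b[1],b[2]) || rayTri(o,d,t[0],t[1],t[2])) return true; // 2 caps
    for (int i = 0; i < 3; ++i) {                       // 3 tetrahedra, 4 faces each
        int j = (i+1) % 3;
        if (rayTri(o,d,b[i],b[j],t[i]) || rayTri(o,d,b[i],b[j],t[j]) ||
            rayTri(o,d,b[i],t[i],t[j]) || rayTri(o,d,b[j],t[i],t[j])) return true;
    }
    return false;                                       // all 14 triangles missed
}

int main() {
    V3 b[3] = {{0,0,0},{1,0,0},{0,1,0}}, t[3] = {{0,0,1},{1,0,1},{0,1,1}};
    V3 o = {0.2,0.2,-1}, d = {0,0,1};
    std::printf("potential intersection: %d\n", mayHitPrism(o,d,b,t));
}
```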

Streaming Wave Coalescer (SWC)

The Streaming Wave Coalescer implements comprehensive out-of-order execution (RDNA 4 only has OoO memory requests). It does this by using sorting bins and hard keys to sort divergent threads from across waves into groups that follow the same instruction path, thereby coalescing the threads into new waves.
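Here's a CPU-side sketch of what that coalescing could look like in software terms. The wave size, the bin structure, and the use of a per-thread sort key are all my assumptions:

```cpp
// Divergent threads from many waves are binned by a sort key (imagined here
// as the shader/material they branch to) and repacked into new, coherent
// waves so each new wave executes a single instruction path.
#include <algorithm>
#include <cstdio>
#include <map>
#include <vector>

constexpr int WAVE_SIZE = 32;

struct Thread { int id; int sortKey; };   // sortKey ~ instruction path

std::vector<std::vector<Thread>> coalesce(const std::vector<Thread>& divergent) {
    std::map<int, std::vector<Thread>> bins;          // one sorting bin per key
    for (const Thread& t : divergent) bins[t.sortKey].push_back(t);

    std::vector<std::vector<Thread>> waves;           // repack bins into waves
    for (auto& [key, bin] : bins)
        for (std::size_t i = 0; i < bin.size(); i += WAVE_SIZE)
            waves.emplace_back(bin.begin() + i,
                               bin.begin() + std::min(i + WAVE_SIZE, bin.size()));
    return waves;
}

int main() {
    std::vector<Thread> threads;
    for (int i = 0; i < 256; ++i)
        threads.push_back({i, i % 5});                // 5 divergent paths, interleaved
    auto waves = coalesce(threads);
    std::printf("%zu coherent waves, first wave key %d\n",
                waves.size(), waves[0][0].sortKey);
}
```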

The spill-after programming model offers developers granular control over when and how thread state is spilled to memory when execution is reordered to different lanes. This helps avoid the excessive cache usage and memory accesses that would otherwise cause large latency increases and costly front-end stalls when leveraging the SWC.
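A purely illustrative sketch of what a spill-after reorder point might look like to a developer. reorderThreadsByKey() is a made-up stand-in (a no-op stub here), not a real intrinsic:

```cpp
// The point of the model: the developer decides exactly which live state is
// spilled around the reorder, instead of hardware conservatively saving
// everything. Everything below is an assumption for illustration.
#include <cstdio>

struct RayState { float hitT; int materialId; };

void reorderThreadsByKey(int /*key*/) {
    // No-op stub: on real SWC hardware this lane could migrate into a new,
    // coherent wave grouped with other lanes on the same instruction path.
}

void shadeHit(RayState& s, float* spill) {
    spill[0] = s.hitT;                  // 1. spill only what survives the reorder
    reorderThreadsByKey(s.materialId);  // 2. coalesce lanes by key
    s.hitT = spill[0];                  // 3. refill and continue on the new wave
    std::printf("shading material %d at t=%g\n", s.materialId, s.hitT);
}

int main() {
    float spill[1];
    RayState s{4.2f, 7};
    shadeHit(s, spill);
}
```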

Just like NVIDIA's SER, the SWC should help boost path tracing performance, although the implementation looks different and appears to be enabled by default.

Local Launchers and Work Graph Scheduler

One patent filing mentions that each Workgroup Processor (WGP) can now use local launchers to generate work and start shader threads independently of the Shader Program Interface (SPI). They maintain their own queues and resource management but ask the SPI for help and lease resources for each shader thread. Scheduling and dispatching work locally results in reduced latency, more dynamic work launches, and reduced GPU front-end bottlenecks.
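In software terms, the division of labor might look something like this sketch (all names and the leasing scheme are illustrative assumptions):

```cpp
// A per-WGP local launcher keeps its own queue and launches shader threads
// itself; it only contacts the global SPI to lease the resources each launch
// needs, rather than routing every launch through the front-end.
#include <cstdio>
#include <queue>

struct ShaderWork { int threads; int vgprsNeeded; };

struct Spi {                      // global Shader Program Interface, simplified
    int freeVgprs = 1024;
    bool lease(int vgprs) {       // the launcher "asks for help" / leases resources
        if (freeVgprs < vgprs) return false;
        freeVgprs -= vgprs;
        return true;
    }
};

struct LocalLauncher {            // one per WGP; owns its queue
    std::queue<ShaderWork> queue;
    void submit(ShaderWork w) { queue.push(w); }   // e.g. a shader spawning work

    void pump(Spi& spi) {         // launch locally, no front-end round-trip
        while (!queue.empty() && spi.lease(queue.front().vgprsNeeded)) {
            std::printf("launched %d threads locally\n", queue.front().threads);
            queue.pop();
        }
    }
};

int main() {
    Spi spi;
    LocalLauncher wgp0;
    wgp0.submit({64, 256});       // work generated on the WGP itself
    wgp0.submit({128, 512});
    wgp0.pump(spi);
}
```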

This patent filing introduces a hierarchical scheduler made up of a global scheduler and one or more local schedulers called Work Graph Schedulers (WGS), located within each Shader Engine. Tasks are stored in a global mailbox/shared cache fed by the global scheduler, and when a task (work item) is ready it notifies one WGS to fetch it. Meanwhile, scheduling and management of the work queue are offloaded to the local WGS. Each WGS independently schedules and maintains its own work queue for the WGPs and has its own private local cache. This results in quicker accesses and lower-latency scheduling while at the same time enabling much better core scaling, especially in larger designs, as explained here:

"In an implementation, the WGS 306 is configured to directly access the local cache 310, thereby avoiding the need to communicate through higher levels of the scheduling hierarchy. In this manner, scheduling latencies are reduced and a finer grained scheduling can be achieved. That is, WGS 306 can schedule work items faster to the one or more WGP 308 and on a more local basis. Further, the structure of the shader engine 304 is such that a single WGS 306 is available per shader 304, thereby making the shader engine 304 more easily scalable. For example, because each of the shader engines 304 is configured to perform local scheduling, additional shader engines can readily be added to the processor."

In essence, each SE becomes its own pseudo-autonomous GPU that handles scheduling and its work queue independently of the global scheduler. Instead of orchestrating everything and micromanaging, the global scheduler can simply provide work via the global mailbox, thereby offloading the scheduling of that work to each Shader Engine.
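Here's a toy model of that hierarchy: the global scheduler only fills the mailbox, and each WGS pulls from it and schedules locally. Queue shapes and names are my own illustration, not the filing's:

```cpp
// Global scheduler drops ready work items into a shared mailbox; each shader
// engine's WGS fetches from it and does all fine-grained scheduling against
// its own private queue, so adding shader engines adds schedulers.
#include <cstdio>
#include <deque>
#include <vector>

struct WorkItem { int id; };

struct GlobalMailbox { std::deque<WorkItem> ready; };

struct Wgs {                                  // one per shader engine
    int engineId;
    std::deque<WorkItem> localQueue;          // backed by a private local cache

    void fetch(GlobalMailbox& mb) {           // notified that an item is ready
        if (!mb.ready.empty()) {
            localQueue.push_back(mb.ready.front());
            mb.ready.pop_front();
        }
    }
    void dispatch() {                         // low-latency local scheduling to WGPs
        while (!localQueue.empty()) {
            std::printf("SE%d runs item %d\n", engineId, localQueue.front().id);
            localQueue.pop_front();
        }
    }
};

int main() {
    GlobalMailbox mb;
    for (int i = 0; i < 4; ++i) mb.ready.push_back({i});  // global scheduler's job ends here
    std::vector<Wgs> engines = {{0}, {1}};
    while (!mb.ready.empty())
        for (Wgs& e : engines) e.fetch(mb);               // WGSs pull independently
    for (Wgs& e : engines) e.dispatch();
}
```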

The patent filing also mentions that WGSs may communicate with each other and that WGPs can assist in scheduling. The implementation is such that the WGS schedules work and sends a work schedule to the Asynchronous Dispatch Controller (ADC, one per Shader Engine). The ADC builds waves and launches work for the WGPs in the Shader Engine.

When a WGS is underutilized, it can communicate that to the global scheduler and request more work. When it's overloaded, work items are exported to an external global cache. This helps with load balancing and keeping the Shader Engines fed.

It's possible that a local scheduler might become overburdened, but AMD has another patent filing addressing this by allowing each WGS to offload work items to the global scheduler if they overwhelm its scheduling capabilities. These are then redistributed to one or more other WGSs residing within different scheduling domains/Shader Engines.
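A tiny sketch of that load-balancing behavior, with made-up thresholds: an underfed WGS asks for more work, while an overloaded one exports the excess to the global cache for redistribution to WGSs in other scheduling domains:

```cpp
// Illustrative WGS load balancing: request more work below a low-water mark,
// export excess work items above a high-water mark so other shader engines
// can pick them up. Thresholds and structures are assumptions.
#include <cstddef>
#include <cstdio>
#include <deque>

struct WorkItem { int id; };

struct BalancedWgs {
    std::deque<WorkItem> localQueue;
    std::deque<WorkItem>& globalCache;            // spill target shared across engines

    static constexpr std::size_t LOW = 2, HIGH = 8;

    bool wantsMoreWork() const { return localQueue.size() < LOW; }

    void maybeOffload() {
        // Export excess items so a WGS in a different scheduling domain /
        // shader engine can take them over.
        while (localQueue.size() > HIGH) {
            globalCache.push_back(localQueue.back());
            localQueue.pop_back();
        }
    }
};

int main() {
    std::deque<WorkItem> globalCache;
    BalancedWgs wgs{{}, globalCache};
    for (int i = 0; i < 12; ++i) wgs.localQueue.push_back({i});
    wgs.maybeOffload();
    std::printf("kept %zu, exported %zu, starving: %d\n",
                wgs.localQueue.size(), globalCache.size(), wgs.wantsMoreWork());
}
```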

A Few Important Patent Filings

The RECONFIGURABLE VIRTUAL GRAPHICS AND COMPUTE PROCESSOR PIPELINE patent filing allows general-purpose shaders to emulate fixed-function HW and take over when a fixed-function bottleneck occurs.
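In software terms, the fallback logic might look like this trivial sketch (the queue-depth heuristic and all names are my assumptions):

```cpp
// Work normally goes to the fixed-function unit; when that unit is the
// bottleneck, a general-purpose shader emulation path takes over the stage.
#include <cstdio>

struct PrimitiveBatch { int count; };

void fixedFunctionPath(PrimitiveBatch b)  { std::printf("FF unit: %d prims\n", b.count); }
void shaderEmulationPath(PrimitiveBatch b){ std::printf("shader emulation: %d prims\n", b.count); }

void submit(PrimitiveBatch b, int ffQueueDepth) {
    if (ffQueueDepth > 4)          // fixed-function unit is backed up
        shaderEmulationPath(b);    // reconfigure: shaders take over
    else
        fixedFunctionPath(b);
}

int main() {
    submit({128}, 1);   // FF unit idle -> use it
    submit({128}, 9);   // FF bottleneck -> emulate on shaders
}
```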

Another patent filing, ACCELERATED DRAW INDIRECT FETCHING, leverages fixed-function hardware (an Accelerator) to speed up indirect fetching. This lowers computational latency and means "...different types of aligned or unaligned data structures are usable with equivalent or nearly equivalent performance."
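For context, here's what draw-indirect fetching means in software terms; the filing's Accelerator does this fetch in fixed-function hardware instead. The struct mirrors the common indirect-args shape, and the unaligned memcpy stands in for the "aligned or unaligned" claim (offsets and layout are illustrative):

```cpp
// Draw-indirect: the draw's arguments live in a GPU buffer and must be
// fetched before the draw can be issued. The fetch below is alignment-
// agnostic, hinting at why unaligned structures need not cost extra.
#include <cstdint>
#include <cstdio>
#include <cstring>

struct DrawArgs {   // typical indirect draw argument layout
    uint32_t vertexCount, instanceCount, firstVertex, firstInstance;
};

DrawArgs fetchIndirect(const uint8_t* buffer, size_t byteOffset) {
    DrawArgs a;
    std::memcpy(&a, buffer + byteOffset, sizeof a);   // works at any byte offset
    return a;
}

int main() {
    uint8_t buffer[64] = {};
    DrawArgs src{36, 10, 0, 0};
    std::memcpy(buffer + 3, &src, sizeof src);        // deliberately unaligned
    DrawArgs a = fetchIndirect(buffer, 3);
    std::printf("draw %u verts x %u instances\n", a.vertexCount, a.instanceCount);
}
```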