Is a dedicated ray tracing chip possible?
Posted by upbeatchief@reddit | hardware | View on Reddit | 81 comments
Can there be a ray tracing co-processor? Like how PhysX could be offloaded to a different card, there are dedicated ray tracing cards for 3D movie studios. If you could target millions of users and cut some of the enterprise-level features, could there be a consumer solution?
Plazmatic@reddit
No. In fact, the slowest part of ray tracing today is no longer the ray intersections and BVH memory operations but the material evaluation (regular shaders), so much so that since Ada, Nvidia has kept the ratio of RT cores to CUDA cores on their GPUs flat. We've basically already hit the RT hardware wall architecturally, and we should expect increases in RT performance to scale with compute/regular rendering performance from here on out.
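To make that split concrete, here's a rough CPU-side sketch of the per-bounce structure (all names and stubs are made up, not any real API): the traceRay step is what RT cores accelerate, while evaluateMaterial stands in for the regular shader work that now dominates.

```cpp
// Toy sketch of one path-traced sample, just to show where the time goes
// conceptually. "traceRay" stands in for the fixed-function RT-core work
// (BVH traversal + triangle tests); "evaluateMaterial" stands in for the
// regular shader-core work (texture fetches, BRDF math). Everything here
// is illustrative, not a real renderer.
#include <cstdio>

struct Ray   { float origin[3], dir[3]; };
struct Hit   { bool valid; int materialId; float position[3], normal[3]; };
struct Color { float r, g, b; };

Hit traceRay(const Ray&)                       // RT cores: intersection + BVH
{ return Hit{true, 0, {0, 0, 0}, {0, 1, 0}}; } // stubbed result

Color evaluateMaterial(const Hit&, Ray& next)  // shader cores: the heavy part
{ next = Ray{{0, 0, 0}, {0, 1, 0}}; return Color{0.8f, 0.8f, 0.8f}; }

int main() {
    Ray ray{{0, 0, 0}, {0, 0, 1}};
    Color throughput{1, 1, 1};
    for (int bounce = 0; bounce < 4; ++bounce) {
        Hit hit = traceRay(ray);                  // hardware-accelerated step
        if (!hit.valid) break;
        Ray next;
        Color brdf = evaluateMaterial(hit, next); // dominates frame time today
        throughput = {throughput.r * brdf.r, throughput.g * brdf.g,
                      throughput.b * brdf.b};
        ray = next;
    }
    std::printf("throughput %.2f %.2f %.2f\n",
                throughput.r, throughput.g, throughput.b);
}
```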
billkakou@reddit
That's what Bolt Graphics is doing with the Zeus GPU. It's at an early stage for now, but let's see.
nanonan@reddit
That's a raster/raytracing solution much like Nvidia and AMD, just with more emphasis on rays.
moofunk@reddit
Zeus isn't a GPU per se (even if they claim it is), as there is no mention of rasterization; it's an old-fashioned general-purpose HPC chip with no AI features.
The FP64 performance of it is much more interesting than the raytracing part. Supposedly over 20 TFLOPS, where a 5090 offers 1.6 TFLOPS.
I'm actually thinking they are talking about gaming to get more attention, but so far, this is the least interesting part of the chip.
nanonan@reddit
It's a GPU in every sense of the word, and they do mention rasterisation. Not sure what you mean by AI features, but it certainly can be utilised for AI.
moofunk@reddit
Where do they mention rasterization? It's not present in any of their promotional material. Did Tom's Hardware chat with an employee off the record?
AI is particularly left out, giving no meaningful benchmark comparisons with Nvidia cards.
FP32 and FP16 vector operations for their smallest chip perform at 30-50% of a 5080, according to their benchmarks. Even their biggest chip is benchmarked as slower than a 5090 for FP32 and FP16 on their own promotional material.
But, FP64 is over 10x faster than a 5090.
These chips are clearly for offline raytracing and FP64 work and networked in larger clusters with lots of slow memory, even if they claim a gaming interest.
This is closer to a Tenstorrent chip than any GPU. It would absolutely have its uses, but the gaming angle is very dubious.
nanonan@reddit
The promotional focus is on the path tracing, as that's where it outperforms, but it is designed as a general GPU with full DirectX and Vulkan support.
moofunk@reddit
There's only a single speculative statement from PCGamer, about DX and Vulkan support.
No official talk about supporting these two frameworks in any capacity.
nanonan@reddit
It's listed as coming soon in the docs, so support is planned. Their aim is a standalone product; it wouldn't make much sense if you needed another GPU alongside it.
https://bolt-graphics.atlassian.net/wiki/spaces/EAP/pages/324468810/Zeus#GPU-APIs
moofunk@reddit
It doesn't say that. I still have no firm source that they are actually going to support these APIs, or if they are using any of them for Glowstick.
You can have a good HPC standalone product without considering gaming or traditional GPUs.
YairJ@reddit
I'm pretty sure that's not for consumers, though.
Strazdas1@reddit
Currently they don't have a single prototype. It's all just software simulation. It's not for anyone yet. And given their seed capital, I don't think they will actually produce anything.
IanCutress@reddit
They're super early, but they have an FPGA-based demo PoC already. Silicon is due Q4 2026.
IanCutress@reddit
Yes. Bolt Graphics is a seed-stage startup expecting silicon by the end of 2026. Consumer is a way off, and they still need to raise a couple more funding rounds, but they're working on it. /selfpromotion
https://youtu.be/-rMCeusWM8M?si=r07A7t2kID1kGsT3
upbeatchief@reddit (OP)
Hey doc, thanks for the vid, it is exactly what I was thinking of. Did they ever talk about how a consumer device might work?
And in general, if things like NVLink never return to the consumer side, would it be possible for a separate chip to handle raytracing, or is the latency too high, like many commenters here have said?
ThePresident44@reddit
Ray tracing is so deeply ingrained into the rendering process that it would work even worse than multi-GPU (which could split work by alternating frames for example)
PhysX cards only really worked (somewhat) because physics is its own contained thing that mostly runs at fixed intervals
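Roughly, the two update models look like this (a toy sketch with placeholder function names and timings, not a real engine loop):

```cpp
// Minimal sketch of why PhysX-style offload tolerated latency: physics steps
// at a fixed interval and the renderer only ever reads the last finished
// state, while ray traced lighting has to be produced inside the frame that
// uses it. Function names and numbers are placeholders, not a real engine API.
#include <cstdio>

void stepPhysics(double dt)        { std::printf("physics step %.1f ms\n", dt * 1000); }
void renderFrame(double frameTime) { std::printf("render frame %.1f ms\n", frameTime * 1000); }

int main() {
    const double physicsDt = 1.0 / 60.0;    // fixed ~16.7 ms physics tick
    double accumulator = 0.0;

    for (int frame = 0; frame < 5; ++frame) {
        double frameTime = 0.012;           // pretend the frame took 12 ms
        accumulator += frameTime;

        // Physics runs zero, one or more fixed steps; a co-processor can
        // lag a step behind and nobody notices much.
        while (accumulator >= physicsDt) {
            stepPhysics(physicsDt);
            accumulator -= physicsDt;
        }

        // Ray traced shadows/reflections/GI are inputs to *this* frame's
        // shading, so they can't be handed to a remote chip and collected
        // a frame or two later without visible artifacts.
        renderFrame(frameTime);
    }
}
```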
skycake10@reddit
That latency is also one of the biggest reasons why dedicated PhysX cards died as a concept pretty quickly and it was rolled into GPU compute after Nvidia bought them.
poorlycooked@reddit
Afterwards, testing also showed that even with a 4090, a dedicated PhysX card improved framerates greatly. The physics calculation is too inefficient to be done by the main GPU nowadays.
SharkBaitDLS@reddit
Except ironically now it’s back since you can’t run older PhysX games on new NVIDIA GPUs. I’ve got a slot-powered 750Ti that I’m keeping around to use as a coprocessor for when I upgrade off my 3080Ti.
Strazdas1@reddit
You cannot run a certain set of games that used 32-bit PhysX with a PhysX version below 3.0. Except you actually can, just with CPU emulation. Or if you disable the PhysX options it runs just like it did before as well. And in some games you don't even feel the change. In AC:BF, for example, the PhysX effects being computed on the CPU are so light it makes no real impact on performance.
Trzlog@reddit
It runs like before ... you know, except for the complete lack of physics effects. The difference with PhysX off (or even on normal) is staggering, especially in the hallucination with big Joker.
upbeatchief@reddit (OP)
Can a raytracing element be fixed to a lower update rate, like the sunlight updating on a 33 ms interval while other elements like reflections are allowed faster intervals?
Gachnarsw@reddit
Yes, this is already done; reflections usually update at a lower rate already. AFAIK lighting in most games, and especially raytraced games, is computed at a lower resolution and accumulated over multiple frames.
Realtime raytracing requires massive hardware resources and games that use it are held together by gobs of rendering tricks to trace as few rays and shade as few pixels as possible.
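As a rough sketch of what that accumulation looks like (illustrative blend factor and a made-up traceNoisySample, not any particular engine):

```cpp
// Sketch of temporal accumulation: each frame traces only a few noisy
// samples per pixel and blends them into a running history, so the image
// converges over time instead of being fully resolved every frame.
#include <cstdio>
#include <cstdlib>

float traceNoisySample() {                     // stands in for ~1 ray/pixel
    return 0.5f + 0.5f * (std::rand() / (float)RAND_MAX - 0.5f);
}

int main() {
    float history = 0.0f;
    const float alpha = 0.1f;                  // ~10% new data per frame

    for (int frame = 0; frame < 30; ++frame) {
        float fresh = traceNoisySample();
        // Exponential moving average: cheap per frame, converges over many.
        // Real renderers also reproject the history with motion vectors and
        // reject it when the scene moves, which is why skipped or stale
        // updates show up as lag or ghosting.
        history = (1.0f - alpha) * history + alpha * fresh;
        if (frame % 10 == 9)
            std::printf("frame %2d: accumulated %.3f\n", frame + 1, history);
    }
}
```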
Strazdas1@reddit
You still have a lower-resolution trace in frame 1, though. You just keep improving it over time with accumulation. If you skip frames, though, it becomes quite visible.
jcm2606@reddit
They're "updated" at a lower rate across space, not time, which is what OP was talking about.
ThePresident44@reddit
Accumulating is different from the crude interpolation between time steps done for physics. Ray tracing is still happening every frame, but the results are added together to reduce noise.
ThePresident44@reddit
Not really. Players, NPCs, the camera, something will always be moving which will affect light bounces and change reflections/shadows
Ray traced elements would be “jittering” around the place or look “stuttery” if they desync’d from the native frame rate, leave ghosts when objects get destroyed, etc. etc.
Strazdas1@reddit
We had RT where bounces would only be calculated every X frames. It looked like the lighting was stuttering and lagging behind. Very visible.
Strazdas1@reddit
Only if the scene is static; as soon as there is movement it needs to update more often. There was some experimentation with updating traced bounces every X frames, but that tended to be quite visible to the player.
Jawesome1988@reddit
It already does that automatically
Strazdas1@reddit
Note that the alternating-frame thing was tried with SLI, and keeping frame pacing consistent was something they never managed to solve before SLI got abandoned. It's not really a real option unless you buffer a lot of frames and ignore input latency.
ibeerianhamhock@reddit
All the RT/PT you've seen in the last 20+ years of movies, and more recently in video games, has been a hybrid raster/RT render pipeline. So there's really no point.
KARMAAACS@reddit
In theory it could happen, but it won't purely because of latency. By the time any raster calculations are done, the dedicated ray tracing chip would probably hold up the rest of the pipeline.
What is more likely is that NVIDIA and AMD will in future move to a chiplet architecture where they can part out the GPU into different sections. That way one chiplet could be the RT part, another could handle raster, texture mapping, etc., and then there would be a tensor chiplet. This would improve yields, potentially allow for faster GPUs because you no longer have to worry about reticle limits, and it would give a better opportunity to mix and match capabilities, meaning you could keep "bad" professional and AI parts and move them to consumer.
Considering we don't yet have interconnects that are fast and low-power enough for real-time rendering, it will be a while before any of that happens.
Shadow647@reddit
Maybe, but GPUs are quite good at it, so what's the point?
ghenriks@reddit
We’ll see if they can deliver, but Bolt Graphics is claiming lower power requirements
KARMAAACS@reddit
Bolt is a whatever company at this point, they can claim anything and everything, they have no product out and by the time they do NVIDIA or AMD will have something better at the same cost or slightly more expensive. I wouldn't take Bolt seriously, Intel has a higher chance of creating a successful GPU product than Bolt does.
Strazdas1@reddit
I can claim even lower power requirements than Bolt. We would be equally correct because Bolt, just like me, has nothing to show for it.
upbeatchief@reddit (OP)
There are frame-breakdown tools that show how long a frame takes to render, and raytracing is a big chunk of a frame. If you could halve the frame cost of raytracing you could very well double your framerate, or add more raytracing elements (reflections, shadows, sound, etc.), or go full path tracing more easily.
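As a rough sanity check (with made-up frame timings, since the real split varies by game), halving the RT cost only approaches a 2x framerate gain when RT dominates the frame:

```cpp
// Back-of-envelope version of the claim above, with assumed frame numbers:
// halving the RT portion only doubles the framerate if RT dominates the frame.
#include <cstdio>

int main() {
    const double rasterMs = 6.0;   // assumed non-RT work per frame
    const double rtMs     = 10.0;  // assumed RT work per frame

    double before = rasterMs + rtMs;          // 16 ms -> ~62 fps
    double after  = rasterMs + rtMs / 2.0;    // 11 ms -> ~91 fps

    std::printf("before: %.0f ms (%.0f fps)\n", before, 1000.0 / before);
    std::printf("after:  %.0f ms (%.0f fps)\n", after, 1000.0 / after);
    // Only if rasterMs were near zero would halving RT approach a 2x speedup.
}
```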
onetwoseven94@reddit
The easiest way to do that is to buy a better GPU. There is no scenario where the combination of a regular GPU and an RT accelerator gives better performance at a lower price than just getting a 5080 or 5090.
And as others have said, the GPU needs the ray-trace results back immediately for shading. The latency over PCIe is absolutely unacceptable. It can't work for the same reasons SLI and CrossFire don't work in modern titles.
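As a toy illustration of the latency point (all numbers here are assumed, order-of-magnitude placeholders rather than measurements):

```cpp
// Toy arithmetic: even a few microseconds of round-trip per dispatch adds up
// when a frame issues many dependent trace/shade passes and the whole budget
// is a handful of milliseconds. The per-hop cost and pass count are assumed.
#include <cstdio>

int main() {
    const double frameBudgetMs   = 1000.0 / 120.0; // ~8.3 ms at 120 fps
    const double roundTripUs     = 10.0;           // assumed PCIe + sync cost
    const int    dependentPasses = 200;            // assumed trace->shade hops

    double lostMs = dependentPasses * roundTripUs / 1000.0;
    std::printf("round-trip overhead: %.1f ms of an %.1f ms frame (%.0f%%)\n",
                lostMs, frameBudgetMs, 100.0 * lostMs / frameBudgetMs);
}
```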
Strazdas1@reddit
Well, there is the case where you already have a 5090 and need even better ray tracing (real-time CGI production, for example).
wrosecrans@reddit
Sure... Now, how do you do that?
Just having a chip that does raytracing doesn't mean it does raytracing faster than a GPU that does ray tracing and a lot of other stuff as well. Nvidia is basically selling the current state of the art in raytracing hardware, so if you wanted to make something faster, you'd need to be doing something fundamentally different to outdo Nvidia's advantages in R&D scale and years of engineering refinement. And good luck with that. If there were easy low-hanging fruit left in how Nvidia does raytracing, they'd quickly adopt that method inside their RTX GPUs.
UsernameAvaylable@reddit
But modern GPUs DO already accelerate raytracing in hardware. Ripping it out of the GPU and putting it into an external chip (or worse, a card) with all the data transfer that requires would make it slower, not faster.
So your problem boils down to "if I had a GPU twice as fast, we could have twice the fps".
anders_hansson@reddit
Over twenty years ago there was an attempt by SaarCOR, but I don't think they made it to production.
I think there have been other attempts too, and it's certainly possible, but I think it's very, very hard to break through in actual software products. E.g. the rendering pipeline would be quite different from what you get from DirectX/Vulkan/... so you would need new APIs, and adoption from game engines and/or 3D authoring software, etc.
BigPurpleBlob@reddit
Agreed. SaarCOR was fast for ray tracing but used axis-aligned binary space partitioning (BSP) trees. I don't think there was a quick way at the time to generate axis-aligned BSP trees.
Zaptruder@reddit
It'd make more sense to replace raster with ray tracing entirely, but that wouldn't work for backwards compatibility... which is why we're seeing this slow hand-off over many years. A path-trace-only solution is more efficient than a hybrid solution, but it's hard to sell to a market that wants to play existing games as well.
Die4Ever@reddit
you'd still need to do texture sampling and filtering
also mesh/geometry/pixel shaders (CUDA cores)
rddman@reddit
With RTX, NVIDIA went straight to the next logical (and better) thing to do: integrating raytracing hardware into the GPU.
sahui@reddit
I loved the article, thanks a lot
Jonny_H@reddit
PowerVR had a dedicated RT card (well, Caustic [0] who were purchased by PowerVR), though after the purchase they quickly started trying to integrate it into their GPU as, like others have stated here, often you want to be running something like a shader on the RT results anyway, and transferring data between the GPU and RT accelerator quickly becomes a bottleneck.
I think they mostly tried to sell it into "professional"/visualization sectors, though I don't think it ever actually shipped many units. I think the plan was always to integrate it into the GPU IP, but they figured they "may as well" sell the devices they already had before that was complete.
[0] https://en.wikipedia.org/wiki/Caustic_Graphics
surf_greatriver_v4@reddit
slot-to-slot communication takes too long
AssBlastingRobot@reddit
Any modern GPU uses an AI accelerator specifically for ray tracing, so yes.
Using another entire GPU specifically for ray tracing is certainly possible, but the framework needed to achieve that doesn't exist right now.
You'd need to write a driver extension that tells the GAPI to send ray tracing requests to a separate GPU, then you would need to write an algorithm that re-combines the ray traced elements back into the final frame before presentation.
There would be significant latency costs, as the final rendered frame would constantly be waiting for the finished ray tracing request, since that specific workload is resource heavy, compared to generating a frame. (there might be ways around it, or ways to reduce the cost)
Ultimately, like with most things, it's better to have a specific ASIC for that task, on a single GPU, in order to achieve what you're asking, which is exactly what AMD and Nvidia have been doing.
Gachnarsw@reddit
Raytracing is not done in AI accelerators (matrix math units).
Hardware-accelerated ray tracing is fundamentally done on ray/triangle intersection units that are their own hardware block, but that's just one step. Generation and traversal of the BVH tree is done on dedicated hardware or in shaders. The denoising stage can be done on matrix math units, and done quickly, but that's also just one step.
Raytracing gets complicated to understand, but it's not accurate to say it is done on AI accelerators.
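For a concrete picture, this is roughly the kind of test a ray/triangle intersection unit performs in fixed-function hardware, written out on the CPU (standard Moller-Trumbore, purely illustrative):

```cpp
// Ray/triangle intersection (Moller-Trumbore). Note there's no matrix math
// here at all, which is why it isn't tensor-core work: BVH traversal and
// then shading the hit wrap around millions of these tests per frame.
#include <cstdio>

struct Vec3 { float x, y, z; };
static Vec3  sub(Vec3 a, Vec3 b)   { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static float dot(Vec3 a, Vec3 b)   { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3  cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Returns true and the hit distance t if the ray hits the triangle (v0,v1,v2).
bool intersect(Vec3 orig, Vec3 dir, Vec3 v0, Vec3 v1, Vec3 v2, float& t) {
    const float eps = 1e-7f;
    Vec3 e1 = sub(v1, v0), e2 = sub(v2, v0);
    Vec3 h = cross(dir, e2);
    float a = dot(e1, h);
    if (a > -eps && a < eps) return false;        // ray parallel to triangle
    float f = 1.0f / a;
    Vec3 s = sub(orig, v0);
    float u = f * dot(s, h);
    if (u < 0.0f || u > 1.0f) return false;       // outside barycentric range
    Vec3 q = cross(s, e1);
    float v = f * dot(dir, q);
    if (v < 0.0f || u + v > 1.0f) return false;
    t = f * dot(e2, q);
    return t > eps;                               // hit in front of the origin
}

int main() {
    float t;
    bool hit = intersect({0, 0, -1}, {0, 0, 1},
                         {-1, -1, 0}, {1, -1, 0}, {0, 1, 0}, t);
    std::printf("hit=%d t=%.2f\n", hit, t);       // expect hit=1 t=1.00
}
```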
AssBlastingRobot@reddit
An RTU is a type of AI accelerator.
Instead of using a tensor core, the physics of light is specifically offloaded to an RTU, to allow the tensor core to calculate when and how it's applied.
So if you want to be technical, a ray tracing core is an AI accelerator, for an AI accelerator.
jcm2606@reddit
Maybe if you're using NRC or the newer neural materials, but with traditional ray/path tracing, tensor cores are not used during RT work. Also, RTUs are not AI accelerators at all, they're ASICs intended to perform ray-box/ray-triangle intersection tests and traverse an acceleration structure. If you consider RTUs AI accelerators, then by the same logic texture units, the geometry engine, load store units, etc are all AI accelerators.
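For reference, the other fixed-function test, the ray/box "slab" check used while walking the acceleration structure, looks roughly like this on the CPU (purely illustrative; the real units do this in dedicated logic):

```cpp
// Ray/AABB "slab" intersection of the kind used during BVH traversal.
// Again plain float math, nothing an AI/matrix unit would be used for.
#include <algorithm>
#include <cstdio>
#include <utility>

// Returns true if the ray (origin o, inverse direction invDir) hits the box
// [bmin, bmax] within [tMin, tMax]. Using the inverse direction avoids
// divisions in the inner loop, a common trick in software traversal.
bool hitAABB(const float o[3], const float invDir[3],
             const float bmin[3], const float bmax[3],
             float tMin, float tMax) {
    for (int axis = 0; axis < 3; ++axis) {
        float t0 = (bmin[axis] - o[axis]) * invDir[axis];
        float t1 = (bmax[axis] - o[axis]) * invDir[axis];
        if (t0 > t1) std::swap(t0, t1);     // handle negative direction
        tMin = std::max(tMin, t0);
        tMax = std::min(tMax, t1);
        if (tMax < tMin) return false;      // slabs no longer overlap
    }
    return true;
}

int main() {
    // Ray along +z; huge values stand in for 1/0 on the zero components.
    float o[3] = {0, 0, -5}, invDir[3] = {1e30f, 1e30f, 1.0f};
    float bmin[3] = {-1, -1, -1}, bmax[3] = {1, 1, 1};
    std::printf("hit=%d\n", hitAABB(o, invDir, bmin, bmax, 0.0f, 100.0f));
}
```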
AssBlastingRobot@reddit
They technically are, the entire graphics pipeline is driven by lots of different algorithms.
In fact, it wouldn't be incorrect to call all ASICs AI accelerators, at least where GPUs are concerned.
Traditional RT work is tensor-core specific, but parts of it are offloaded to another ASIC specifically for the physics calculations of light.
The RT core does the math, but the tensor core does all the rest, including the position points of rays relative to the view point.
Henrarzz@reddit
Tensor cores don’t do “position points of rays relative to the viewpoint”
AssBlastingRobot@reddit
An incorrect assumption.
https://developer.nvidia.com/optix-denoiser
You'll need to make an account for an explanation, but in short, you're wrong, and have been since at least 2019.
Henrarzz@reddit
OptiX is not DXR. Also, it's using AI cores for denoising, not for what you wrote.
AssBlastingRobot@reddit
What part of "all the rest" did you not understand?
I used "positions of rays relative to view point" as an example.
Henrarzz@reddit
Which AI cores don’t do. They also don’t handle solving materials in any hit shaders, closest hit shaders or miss shaders, which are the biggest RT work besides solving ray-triangle intersections.
AssBlastingRobot@reddit
I mean, I just gave you proof directly from Nvidia themselves, that says they do.
It's not like it's a secret that tensor cores have been accelerating GAPI workloads for some time now.
What more proof would you possibly need? Jesus Christ.
Just read what the OptiX engine does and you'll see for yourself.
Henrarzz@reddit
Except you didn’t. You’ve shown that OptiX denoiser uses tensor cores, which nobody here argued.
The DXR SDK is available and Nsight is free; I encourage you to analyze the DXR/Vulkan RT samples to see what units are used for RT.
AssBlastingRobot@reddit
https://developer.nvidia.com/blog/flexible-and-powerful-ray-tracing-with-optix-8
Holy shit, why am I spoon feeding you, isn't this embarrassing for you??
Henrarzz@reddit
Are you actually reading the contents of the links you post? Lmao
AssBlastingRobot@reddit
Yes.
Multi-level instancing: Helps you scale your project, especially when working with large scenes.
NVIDIA OptiX denoiser: Provides support for many denoising modes including HDR, temporal, AOV, and upscaling.
NVIDIA OptiX primitives: Offers many supported primitive types, such as triangles, curves, and spheres. Also, opacity micromaps (OMMs) and displacement micromaps (DMMs) have recently been added for greater flexibility and complexity in your scene.
Henrarzz@reddit
So please tell me, from these points, what parts of RT work in OptiX are handled via tensor cores and not SMs (aside from denoise/neural materials, which nobody argued against). I’m waiting.
Also please do tell us how OptiX relates to real time ray tracing with DXR.
AssBlastingRobot@reddit
Here's an entire thesis on that subject.
https://cacm.acm.org/research/gpu-ray-tracing/
You should be extremely embarrassed.
Henrarzz@reddit
So where does this thesis mention tensor cores as units that handle execution of the various ray tracing shaders?
You’ve pasted the link, so you’ve obviously read it, right? There must be a suggestion there that some new type of unit that does sparse matrix operations is suitable for actual ray tracing work. Right?
AssBlastingRobot@reddit
https://developer.nvidia.com/blog/essential-ray-tracing-sdks-for-game-and-professional-development/
The three different models RTX 2000 and onward use for RT acceleration, which details how they work, gives examples of how they work, and even gives you a fucking GitHub repo to try it yourself.
You very obviously don't understand what you're talking about; AI accelerators don't just use one operation algorithm, and in fact tensor cores are good for basically ALL operational formats.
https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/
I mean, how much proof do you actually need?
This is just ridiculous at this point.
Henrarzz@reddit
First link doesn’t mention anything about tensor cores. The second:
I’ll ask again: which part of ray tracing beyond denoising and neural materials is executed on tensor cores?
AssBlastingRobot@reddit
I think at this point I've given you access to pretty much all you need to find the information you're looking for, it's a shame that you literally don't have the base level of intelligence needed to understand what's in front of you, but that really isn't my problem.
Sucks to be you. 🤷♀️
jcm2606@reddit
No, it doesn't. The rest of the SM (the ordinary shader hardware) does all of the lighting calculations. This is literally why NVIDIA introduced shader execution reordering, as the SM wasn't built for the level of instruction and data divergence that RT workloads brought to the table, even with the few opportunities that the RT API provided to let the SM reorder threads.
AssBlastingRobot@reddit
You have a lot of learning to do.
https://developer.nvidia.com/rtx/ray-tracing?sortBy=developer_learning_library%2Fsort%2Ftitle%3Aasc
jcm2606@reddit
Right back at ya since you're just stringing together terms with no understanding of what they mean. Maybe give these a read to learn how GPUs actually work.
https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
https://images.nvidia.com/aem-dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf
https://images.nvidia.com/aem-dam/Solutions/Data-Center/l4/nvidia-ada-gpu-architecture-whitepaper-v2.1.pdf
https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf
https://simonschreibt.de/gat/renderhell/
https://developer.nvidia.com/content/life-triangle-nvidias-logical-pipeline
https://microsoft.github.io/DirectX-Specs/d3d/Raytracing.html
https://docs.vulkan.org/guide/latest/extensions/ray_tracing.html
AssBlastingRobot@reddit
Lmao, you posted the architectural papers for specific graphics cards, which don't even explain your point at all, or how the GAPI interacts with the RTU.
You also posted software RT algorithms, which do not explain your point, or how the GAPI interacts with the RTU.
You needed to post how the hardware RT accelerators interact with the software RT algorithms, but you let hubris get the better of you.
A real shame, because what I posted would've explained it for you, but I guess you won't even check out those videos and blog posts even though they're directly made by Nvidia, who explain how the GAPI interacts with the RTU, oh well. Idc.
jcm2606@reddit
Because you obviously don't know what's actually inside of a GPU. You're just stringing together terms that you read in articles that you half understood, making it sound to others like you know what you're talking about, when anybody with even a little bit of experience in the graphics development space can tell you have no idea what you're talking about.
RTUs don't do math. Tensor cores don't do "position points of rays relative to the view point". That's not even close to what these units do. Had you read the actual DXR spec (which is the API that hardware RT implementations actually use) or a breakdown of what tensor cores actually do (which, by the way, are fused multiply-add operations on matrices that may be sparse), you'd know that. But you didn't. You'd rather string together terms to make yourself sound smart.
Read what I linked. Start with Render Hell and A Life of a Triangle so that you actually know what the GPU does when you issue a draw call, then look up how compute pipelines work since raytracing pipelines are a superset of compute pipelines, then read the DXR spec since it details how raytracing pipelines work.
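For contrast, here's a tiny CPU sketch of what a tensor-core-style operation actually is, a matrix fused multiply-add (tile size here is arbitrary; real tensor cores operate on fixed hardware tile shapes):

```cpp
// A matrix fused multiply-add, D = A*B + C, over a small tile: useful for
// denoising and neural materials, but nothing to do with traversing a BVH
// or evaluating a hit shader.
#include <cstdio>

constexpr int N = 4;

void matrixFMA(const float A[N][N], const float B[N][N],
               const float C[N][N], float D[N][N]) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            float acc = C[i][j];                // the "+ C" accumulate
            for (int k = 0; k < N; ++k)
                acc += A[i][k] * B[k][j];       // the multiply-add part
            D[i][j] = acc;
        }
}

int main() {
    float A[N][N] = {}, B[N][N] = {}, C[N][N] = {}, D[N][N];
    for (int i = 0; i < N; ++i) { A[i][i] = 2.0f; B[i][i] = 3.0f; C[i][i] = 1.0f; }
    matrixFMA(A, B, C, D);
    std::printf("D[0][0]=%.0f (expect 7)\n", D[0][0]);  // 2*3 + 1
}
```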
AssBlastingRobot@reddit
Yikes, you're getting a lot wrong here and frankly, I can't be bothered to correct you anymore, since I literally cannot simplify this any further.
I think I'll stick with the explanation I was given, directly from Nvidia, made for developers, specifically to make RT hardware visible to the GAPI, but thanks.
Gachnarsw@reddit
Well then I misunderstood what you were saying. My mistake.
f3n2x@reddit
No. A modern RT pipeline isn't just ray intersections but intersections woven into shader operations, texture filtering, etc. Separation generally doesn't make sense.
YairJ@reddit
I imagine that if it was practical, we'd see GPUs where this function has its own chip on the same card or package, so the main one could be smaller.
Aggrokid@reddit
https://chipsandcheese.com/p/raytracing-on-amds-rdna-2-3-and-nvidias-turing-and-pascal
For the current rendering paradigm, the short answer is no, due to latency sensitivity.