Tailslayer: a hedged reads solution for DRAM refresh latency
Posted by mennydrives@reddit | programming | View on Reddit | 33 comments
vancha113@reddit
I wonder why they don't use the tech with more transistors but without the need for a refresh? It's not even sold as more-expensive high-performance RAM :o Is it not feasible?
happyscrappy@reddit
It's hard for me to see how it is higher performance. Doing everything 3 times to evade the occasional refresh cycle feels like a losing proposition overall to me.
Also she passes the "final work" (i.e. completion routine) to a function. This might result in a call to a code pointer (indirect call) and that's going to be a loser every time compared to just doing the read in place.
She marks the completion routine as inlined, despite passing it by reference. Wonder if that works.
Anyway, here is the implementation:
https://github.com/LaurieWired/tailslayer/blob/main/include/tailslayer/hedged_reader.hpp
Any time someone tries to tell you that using a couple hundred lines of code instead of a single load instruction is a speed up it's best to be very skeptical. ...before you even see the 10ms usleep in there (yes, I know that is in the setup path, not read path, but still).
I'm not saying this example doesn't work, or that it doesn't produce the graphs she links. But I cannot comprehend a way to use this code such that it improves overall performance for common operations.
I do admit there is some cleverness to being able to figure out how to create code in C++ that will complete when any one of multiple reads completes instead of waiting for them all. Making it work cross multiple processor architectures might take some more work.
semi-@reddit
It's optimizing for lowest latency, not highest throughput or most concurrent streams.
It really is the same concept as the Google work she cites earlier in the video; it's just as wasteful to make redundant queries against multiple servers and throw away the slower results, and you'd handle more total RPS not doing that. But doing it gets you faster results more reliably.
happyscrappy@reddit
Let me clarify:
Doing everything 3 times to evade the occasional refresh cycle feels like a losing proposition overall to me. Even on observed latency.
The details differ enough that the math is completely different. Latency is high enough on internet queries that you make different tradeoffs. On internet queries you can even have significant (in time) retries due to packet loss. Not so with RAM.
You never do a query to your RAM and find out that an Iranian missile took out your DRAM (and not your CPU).
Right. Now you're talking about something else. You're talking about maximum experienced latency. Not overall performance. Your total latency for all your operations is all your latencies added up. I'm saying you're going to lose enough on the ones that you don't "win" on that you'll fall behind overall.
Sure, your slowest operation is quicker. But you never catch up.
Again, I'm not saying the code doesn't work. I'm not saying it doesn't produce the results it says it does. I'm saying I cannot comprehend a way to use this code in such a way as to improve overall performance. You're just not going to "lose" enough to make the wins dominate.
If you've got that one load that has to be quick then just put some SRAM in your machine and put that data there. Similar to how routers use CAM for the "really hot spots".
admalledd@reddit
If you are making this argument, it is likely you are unfamiliar with the use cases where it does make sense, and that's OK.
This is you not understanding the value proposition of this technique or its niche. The latency being measured is often only a very specific call-and-response action path. That path being faster by nanoseconds or microseconds can mean millions of dollars or more. There is "the rest" of the big system whose response times, while still DEADLINE/REALTIME, can often be far more OK with their latency/processing spanning 100ms to a few seconds. But it is all in support of that critical hot path responding as fast as possible.
You can't build or buy systems with enough SRAM to compete with DRAM-backed systems in these use cases. The machines I have seen have often been 2TB-RAM machines, because that was as much as could viably exist in one system at the time. Granted, those machines programmed LUTs into a rotating fabric of FPGAs, reading events over a non-IP-based network and packaging a buy/sell/whatever order back over to the exchange.
admalledd@reddit
The actual hot loop is required to still be only a few dozen instructions maximum, with all the other info desired already in caches, and the sample code can achieve that presuming what you pass in takes the inline hint correctly. In actual production code it is reasonably trivial to disassemble and validate the required hot-loop properties. Lastly, there is a strong presumption that if you were doing this in an HFT shop or similar, you'd be executing this hot loop either directly in a NIC FPGA or in kernel space/XDP/etc., where fewer guard rails exist to get in the way.
The idea is that only a critical section of code would need this, in a very specific use case. All others can carry on ignoring DRAM refresh cycle latency stalls like they have been.
Hardware-wise, as to the posited "why don't we do this for all RAM?": as others mention, this is already done for the L1/L2/L3 caches, but those all have space and power demands far in excess of current DRAM. Sadly, for all that I wish for a 10x (or more!) improvement in memory performance, the economics just aren't there; it is easier/cheaper for hardware to paper over the latency and for developers to build tools (such as XDP) for maximally threaded/parallel logic where single-threaded would have been "trivial".
happyscrappy@reddit
Only a few dozen. Compared to one inlined instruction. I don't see why that was even worth attempting a rebuttal over.
And as to the already in caches stuff, that's not how caches work. You are thinking of IRAM or closely-coupled RAM I guess. Which is SRAM.
Which is also the answer to your below "this is for just that one specific use case" point. If you really have just that one load that has to be quick in your system and you are making the system then put in a little bit of SRAM and put that little bit of data in that SRAM. No need for any of these shenanigans.
For one architecture.
Then it has nothing to do with this then. If you are coding your own hardware then put this process in the hardware, not the code.
I don't know what XDP or etc is but this is immaterial. This isn't an OS-level issue. It's hardware. Running that much more code just takes longer. In the kernel. Outside of the kernel.
Because it doesn't actually make sense to do so. Take a look at what I linked elsewhere, which indicates how hiding this latency is possible for even what is now pretty close to a toy usage of it. With as much RAM as you have in a system you really can't hide refreshes, because you cannot have everything you are thinking of accessing sitting in a cache elsewhere that you pre-read. The cache would have to be on the RAM itself, because to do the trick MoSys does you need a massively parallel read. And that means it's going to have to be on the same die. Problem is... putting SRAM (the cache) and DRAM on the same die isn't something we do. I'm not saying it's impossible, but it seems like it is. Or close enough that no one tries it.
MoSys was able to do this because they use 1T-SRAM. And 1T-SRAM is on a normal process node. It can be put on the same die as regular SRAM. Problem is 1T-SRAM just isn't dense enough compared to DRAM so we can't just switch to 1T-SRAM.
It's impossible to see how this could produce a 10x increase in performance given the infrequency of refresh accesses. What is your metric? Maximum latency instead of average latency or throughput?
admalledd@reddit
For any case where you are counting nanoseconds you can do a lot in a few dozen instructions. This is the whole world of HFT/XDP and related stuff, where they still need second-over-second "big" compute but dynamically respond on the principle that "faster is better/more money".
There are really maybe three: x64, ARM, and RISC-V, and all have trivial tools to include in linking pipelines to assert assembly logic. This isn't new tooling; I was using it in middle school in the 2000s because I thought it was hip to know the assembly my C code gen'd.
The concepts of how to interface with DRAM in latency-sensitive manners are the same, especially since in such a system it would likely be a memory-bus-attached FPGA (via PCIe, with a direct in-path NIC; I've used them) that would still host an otherwise "normal" rest-of-the-server, being AMD x64 EPYC or whatever. This code on the host CPU is more for when, for whatever reason, you couldn't do the memory lookup plus the other critical logic in the FPGA (it's not easy, actually); otherwise you'd use the FPGA or other inline-network tooling (in this hypothetical HFT case) to do the de-duplication.
You are... agreeing with me, right? Or are you just not reading what I've written clearly in the rest of those sentences? DRAM is currently the most economical method of mass random-access, low-ish-latency working-memory storage. That doesn't mean the memory wall doesn't suck ass to have to program around. I've written about this recently elsewhere: the 10-20ns fetch latency hasn't really improved since the late 90s, and memory bandwidth is multiple orders of magnitude behind the relative improvements to CPUs. If there were a magical replacement for DRAM with any of its properties 10x better (preferably bandwidth, IMO), there wouldn't be billions spent on Processor-In-Memory architectures (none have made it to market, though AI hype re-kindled high investment, so who knows).
mennydrives@reddit (OP)
So, the GameCube's RAM actually did something even better. It was called "1T-SRAM", and it had an SRAM bank (6 transistors per bit instead of 1, but zero refresh penalty) that it used to regularly swap out whichever memory bank was being refreshed at the moment, so it behaved like SRAM in that it never had a refresh penalty.
happyscrappy@reddit
You got your 6 and 1 backwards.
The trick you mention covers like 90% of the cases. Another trick covers another 9%. This still leaves one last case and MoSys solves this by confining the size of the memory in certain ways.
Regardless of all this the real advantage of 1T-SRAM is not hiding refresh but instead being able to put the memory on the same die as the CPU. DRAM uses specialized process design rules that preclude putting general logic on the same die. 1T-SRAM gets around this. And so IBM/Nintendo were able to avoid having separate DRAM in the system and save money. I'm pretty sure this would not be feasible in a PC-style system where there is so much memory onboard and a much larger CPU. GameCube only had 27M of RAM.
mennydrives@reddit (OP)
Yeah, I just realized I missed a lot of details from what I half remembered about 1T-SRAM from a couple decades ago, versus the info available on it today. But I don't think it was on the same die as the CPU, at least not in anywhere near the same way it is for Intel's Lunar Lake or Apple's M SoCs today.
happyscrappy@reddit
Apple's RAM is not on the same die as the CPU. They use package-on-package technology. There are at least 2 packages on top of each other in that one spot on the board. One is DRAM and one is the CPU. The two connect to each other with balls just as if connecting to the motherboard. It's similar to how HBM works, but not the same.
https://en.wikipedia.org/wiki/High_Bandwidth_Memory#Interface
mennydrives@reddit (OP)
On a side note, apparently Samsung is re-branding their wide-I/O RAM interface as "Mobile HBM" and Apple might be using it, which would be pretty cool to see in future laptops.
ShinyHappyREM@reddit
You can hide the refresh with PSRAM.
Not sure if it's still used today, but it was a thing at least in the '90s and 2000s.
zsaleeba@reddit
PSRAM is still used in ESP32 MCUs
happyscrappy@reddit
I admit I am not 100% sure. But the value of PSRAM is closer to giving DRAM a bus interface similar to SRAM's than it is to hiding refreshes. As long as you have refresh (and PSRAM does), I can't see how you avoid a read coming in to an address that is busy and cannot be read, given access patterns pseudorandom enough that the hardware cannot predict the data will be needed "soon" and load it into an SRAM cache before the refresh cycle starts.
The only sure-fire way I see around this is to define the cycle time to be so long that it includes the time needed to finish a refresh and still do a read within the cycle time. And that would make such a design pointless for the purposes of performance.
True SRAM does avoid this completely of course.
elperroborrachotoo@reddit
We do - in L1 and L2 cache.
It's much more expensive per bit, requires more power (running hotter, requiring more cooling), and at some point the lower density means longer wires, which increases latency.
rageling@reddit
lies on top of lies, clickbait promo for its YouTube channel
Bananenkot@reddit
Her content is great. I remember the first time watching something of hers, thinking that something's fishy, it's too good, there has to be a catch.
myhf@reddit
ikr, Asuka Langley humiliating my code quality? this is too good to be true
happyscrappy@reddit
https://github.com/LaurieWired/tailslayer
im-ba@reddit
My goodness, she is insanely intelligent!
-Hi-Reddit@reddit
Very cool. Imagine a CPU architecture designed to take advantage of this with a dedicated e-core!
Resident_Educator251@reddit
Fact: I only watch that channel cause she's hot.
lonelyroom-eklaghor@reddit
Fact: I only watch that channel cause she puts the best videos on the nichest of topics
DrShocker@reddit
Fact: I watch her channel because she roasted sorting algorithms in Asuka cosplay.
grabthefish@reddit
For those interested https://youtu.be/u0aoByec99Q
grabthefish@reddit
For those interested https://youtu.be/u0aoByec99Q
Kimo-@reddit
Wait, why do you think anyone cares that you’re an incel?
Fact: some thoughts should remain internal.
mennydrives@reddit (OP)
I mean, TBH, take every advantage you can get in entertainment. It's cutthroat AF and most creators don't make it.
sean_hash@reddit
That tracks, the idea sounds straightforward but the implementation is the impressive part.
Diligent-Draft6687@reddit
fantastic video!
TheUnamedSecond@reddit
A really cool video and project.
The idea is relatively simple, but actually figuring out how to implement this is impressive.