Why is writing software with SSDs in mind so undocumented
Posted by z_latent@reddit | hardware | View on Reddit | 100 comments
I've recently gotten quite interested in how SSDs work. I was surprised at how fast they can be, how they're parallel by construction, and that their read speeds are apparently only ~4x slower than RAM?! (under high-occupancy loads)
But somehow, this seems to be an extremely niche topic. I could seldom find any videos, tutorials, or even books on it. Most information centers on PC-building advice rather than on developing software that takes advantage of SSDs.
I've only recently started to find good sources of information about it, after trying for a while. It's hard to find search terms that actually give useful results.
- This one r/hardware post is what sparked my interest, once I realized "sequential reads" is an unfortunate term inherited from HDDs which causes misconceptions on SSDs.
- Coding for SSDs is a nice blog series, even if over a decade old. Part 6 gives some good advice, and the other parts have good information too, with citations.
- Everything I know about SSDs is a single massive page talking about their design and low-level function.
- Plus, the oddly rare YouTube video (like this one), or random doctoral theses somewhat relevant to the topic.
These are all useful for understanding SSDs themselves; some of you might enjoy them.
But the thing is, while they explain well how the devices work and are designed, none of them actually go concretely into code examples that might be good or bad. It seems clear to me that the assumptions you make for SSDs and HDDs are different, and the code patterns that work best for one may not be optimal for the other. That's what I wanted to learn.
I wish I knew a good book on the topic! Or any other kind of material. SSDs are cool. If you know anything you can share, I'd be really grateful.
veritron@reddit
as a developer you will interact with the file system by calling apis and system calls that use it, and so the nature of the device that you are writing to is abstracted to the point where you don't know or care about the nature of the disk that you're using. perhaps the performance characteristics of a system will differ depending on whether an ssd is installed or not, but generally all the dev knows about ssd vs. non-ssd is that ssds perform better than hdd, and they don't even need to know that to write a program that uses the file system. there are some games that will check if an ssd is installed and maybe show a warning that the game won't perform well, but not much changes at the api level ssd vs. non-ssd.
z_latent@reddit (OP)
If what I understood about them is correct, it could change.
Like, HDDs can only do ~100 independent read operations per second, due to the head's seek time. But because an SSD is parallel, it can do 1000x that or more, but only if your program has concurrency.
It seemed as though doing blocking read operations, for instance, is much worse on an SSD than an HDD, since you lose the parallelism you'd otherwise have. But I don't know for sure, and that's why I wanted more material on this.
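A minimal sketch of that difference (Python, POSIX-only; block sizes, file size, and worker counts are all made up for illustration): the data read is identical either way, but the thread-pool version keeps many requests in flight, which is what lets an SSD's internal parallelism show up.

```python
import os
import random
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096
N_READS = 256

# Create a 4 MiB scratch file to read from.
fd, path = tempfile.mkstemp()
os.write(fd, os.urandom(BLOCK * 1024))
os.close(fd)

fd = os.open(path, os.O_RDONLY)
offsets = [random.randrange(1024) * BLOCK for _ in range(N_READS)]

def read_at(offset):
    # os.pread takes an explicit offset and shares no file position,
    # so many of these calls can safely be in flight at once.
    return os.pread(fd, BLOCK, offset)

# Blocking loop: one outstanding request at a time (queue depth 1).
serial = [read_at(o) for o in offsets]

# Thread pool: up to 32 requests in flight, letting the drive
# service independent reads in parallel.
with ThreadPoolExecutor(max_workers=32) as pool:
    parallel = list(pool.map(read_at, offsets))

assert serial == parallel  # same bytes, very different queue depth
os.close(fd)
os.unlink(path)
```

On an HDD the pool buys you little (one head, one seek at a time); on an SSD the higher queue depth is roughly where the "1000x" headroom comes from.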
ShareACokeWithBoonen@reddit
I mean your basic premise is somewhat flawed. Any developer in any circumstance that actually cares about performance in any kind of program is going to figure out how to perform that operation in cache or RAM, full stop.
If 'we can improve code and thus better utilize non-volatile memory, and therefore unlock new business cases / sell more to consumers' was ever a thing in any major codebase, then 3D Xpoint wouldn't have died out.
Take for example the single most performance-heavy edge case / niche for NAND: video editing. Even with all the software and hardware advancements surrounding NAND, in 2026 the singular focus is still 'how can we use NAND less in this program so it doesn't suck as much' (i.e. proxies, compression-efficient formats, etc etc etc).
iluvchromosomes@reddit
NVMe SSDs are effectively RAM when using the appropriate storage controller. This is a perfect microcosm that explains the lies and incompetence of developers. They do not want to put in the effort. They don't care. Do as little work as possible.
wankthisway@reddit
Brother. Just stop talking. Your comment history is just full of confidently incorrect statements with a huge amount of hubris and pent up irritation. Take a freaking break man.
cakemates@reddit
... Mate, SSD latency is so ridiculously high that the CPU has time to go on a vacation to Hawaii, drink some piña coladas, and come back before the SSD responds with the data. There is no way an SSD can provide data to the CPU even half as fast as RAM.
Zironic@reddit
That's not even slightly true. There are many workloads that cannot fit into RAM, especially not on consumer hardware, which forces developers to find ways to efficiently stream data from disk into RAM.
In the case of texture streaming, there are even dedicated APIs that let you bypass the CPU entirely to read the data from disk faster.
pac_cresco@reddit
Yeah, just to give an example, [the install size for Helldivers was shrunk](https://www.neowin.net/news/helldivers-2-game-size-goes-from-154gb-to-mere-23gb-by-removing-an-hdd-tech/) because it used to have duplicated data so loading times were faster when reading off an HDD.
Narishma@reddit
That's what they thought when they initially designed it that way, but they didn't bother benchmarking it and it turned out that the duplicated data wasn't even faster on HDDs. That's why they removed it later to reduce install size.
pac_cresco@reddit
Welp, a better example then would be when the devs of Myst manually allocated the chapters of the game on the tracks of the CD-ROM so that the seek times when loading them one after another would be minimal.
ShareACokeWithBoonen@reddit
Unless I’m completely misreading between the lines of OP’s post, there’s this idea from them that NAND is somehow underutilized because people don’t realize how fast it is - what I’m saying is that, spoiler alert, it’s not.
Your own example of the video game world shows this perfectly - if the sheer size of something literally forces you to stream from disk, then yeah, there’s no alternative, but games and engines today are much more often saying “how much RAM is available? OK, shove as much of the level in there as possible” rather than using the super complex streaming setups of the past (e.g. devs literally writing custom filesystems for their game).
Zironic@reddit
Well, the thing about SSDs is that they make that much less necessary than it used to be. When you had to worry about physical heads moving around a spinning disk, optimizing reads involved manually packing your data next to each other so the head didn't have to move between reading related data.
With the way SSDs work, there are very few ways a custom filesystem can squeeze out any kind of efficiency. The only thing you want to avoid is very small files, since SSDs can only do block-size writes.
MWink64@reddit
No, SSDs write in pages (often 16KB each) but can only erase in blocks (which can contain thousands of pages).
z_latent@reddit (OP)
Hey, let's not talk about 3D Xpoint, you'll make me cry /hj
I get your point, NAND is slower so whatever it is, you'd rather do it on RAM. Still, you can't fully avoid the SSD (that would be a pretty bad video editor), so you'd want to make that part as efficient as possible as well.
Fun_Fault4035@reddit
Bioinformatics: am I a joke to you?
Context: datasets are so huge they can’t fit in RAM, so storage is the working memory.
doscomputer@reddit
the person with 300 upvotes literally doesn't know what they're talking about even in the slightest
easy experiment anyone can do: copy a Minecraft world from one disk to another, then stick that world in a zip file and copy that (it's much faster, by more than compression alone can explain)
Operating systems do 'just handle' the IO and stuff, but when it comes to actually maximizing the performance of a data transfer there are a lot of smaller details. I think why they're being upvoted and you're being downvoted is that programming in general has a lot of gatekeepers. It's not easy to get into as a hobby, but it's easier to learn as a job.
notam00se@reddit
https://blog.cloudflare.com/speeding-up-linux-disk-encryption/
Should be a good read. The biggest takeaway relevant to this discussion is that when dm-crypt was created, SSDs didn't exist, so nobody cared about anything more than ~100MB/s at the time. Cloudflare has some amazing Linux developers, and this blog post is the result of them taking a pass at modernizing dm-crypt, and everything it relies on, for SSDs.
None of this is relevant for applications themselves though, as pointed out they just interact with the storage API
rouen_sk@reddit
I like what Postgres did: it lets you set the "cost" of sequential reads and random reads (in abstract numbers, so only the ratio matters). So they abstract the storage away from you, but let you give hints about its characteristics, which the query planner will use when it has options.
BFBooger@reddit
and yet, the best 'cost' values in Postgres for SSDs end up similar to HDDs, because Postgres operates in 8k blocks, and reading 1000 8k blocks at random is a lot slower than reading 1000 8k blocks sequentially even from an SSD. If the I/O size is 64k or 256k, then the difference from HDD is massive, as random access on an SSD reaches almost sequential speeds. But for small I/O like a random index seek? random_page_cost should still be quite a bit higher than the sequential cost.
meodd8@reddit
Well, if you just make sure the whole DB can just exist in RAM, what’s the difference? /s
CMBDSP@reddit
In-memory databases are very much a thing (especially for analytical workloads). The largest such systems I have come across have a total of ~100 TB of RAM with a few thousand cores spread across a couple of ginormous NUMA machines.
Fosteredlol@reddit
Unfortunately, RAM is also extremely slow. I need everything to fit into L3 at a minimum
kivimango23@reddit
L3 ? lol Bro, L3 is 2016. You are better off putting the whole DB into the L2 cache.
callanrocks@reddit
We just need vcache on both sides of the die, it's so crazy it might work.
Alphasite@reddit
Can’t Postgres be compiled with huge pages? That should resolve this issue
farnoy@reddit
Ironically for this thread, Postgres is stuck in the synchronous single-tuple pull-based iterator model and unable to feed higher queue depth from a single worker. You need the query planner to choose parallel evaluation of a specific node in the plan to get real I/O parallelism, or one of the few nodes with internal I/O batching, like Bitmap Scan. But those have trade offs that a different iterator model could avoid.
htj@reddit
Postgres 18 added support for async IO and uses it in readahead, though only in certain parts. 19 uses async in a lot more places, and has better tuning of it.
farnoy@reddit
I don't think it makes a major difference for the bread and butter of OLTP queries, like a Nested Loop of two Index Scans. It can't evaluate the inner index scans concurrently because of the iterator model it uses. For each tuple from the outer scan, it synchronously looks up the inner index before moving to the next iteration. No I/O concurrency to be seen in this workload.
RogueHeroAkatsuki@reddit
Yeah, exactly this. Just use the system API, as it's well maintained and has been polished, sometimes for decades. Even if you deeply understand the difference, it doesn't change much, because you will still reach the conclusion that the system API is very optimized, and implementing a custom solution, with the risk that it will be buggy, doesn't make much sense.
randylush@reddit
Looking at the actual post here. OP is pointing out that the system API hasn’t been updated to optimize for SSDs, even though SSDs have been around for decades. So assuming it’s polished is actually incorrect.
fireflash38@reddit
It matters a lot if you care about immediate persistence. There's a lot that is elided away, even in system calls, about whether something is actually written or not. Drives can also lie to the kernel about if something was persisted or not...
ProfessionalPrincipa@reddit
Relevant because Microsoft didn't have a native driver for NVME devices until very recently.
iluvchromosomes@reddit
Sony had it for playstation a decade ago :p
SethDusek5@reddit
Getting to whatever MB/s Read/Write figures are written on your SSD box is not trivial in a lot of cases unless you specifically try to parallelize as much of your file I/O as possible, or you're reading one gigantic file in which case your OS will probably prefetch blocks which will lead to high queue depths. If for example you're reading a bunch of smaller files like doing some sort of batch ingest/parsing then you need to find a way to parallelize so your CPU time isn't dominated by syscall + IO latency. That's a large reason why newer APIs like io_uring exist, but writing code that takes full advantage of them is not trivial
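A sketch of the batch-ingest case described above (file names, counts, and worker numbers are all invented for the example): a blocking loop pays the full open + read latency for every file in turn, while a thread pool keeps several of those in flight at once.

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Create 100 small stand-in files to ingest.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(100):
    p = os.path.join(tmpdir, f"record_{i}.txt")
    with open(p, "w") as f:
        f.write(f"{i}\n")
    paths.append(p)

def parse(path):
    # Stand-in for real parsing; each call pays open+read latency.
    with open(path) as f:
        return int(f.read())

# Blocking loop: each file's read must finish before the next starts.
serial = [parse(p) for p in paths]

# Overlapped: up to 16 opens/reads in flight, hiding syscall+IO latency.
with ThreadPoolExecutor(max_workers=16) as pool:
    overlapped = list(pool.map(parse, paths))

assert serial == overlapped == list(range(100))
shutil.rmtree(tmpdir)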
iluvchromosomes@reddit
No. You do not care. You know that most systems are using SSDs now, and that is who should be prioritized. This is why we have issues.
This is also why console gaming is surpassing PC gaming in performance. Sony is vertically integrated and they force their devs to care. Everyone else making games for PCs doesn't care.
0xe1e10d68@reddit
The bigger issue might be that some (mainly older, and less consumer-focused) filesystems aren't necessarily designed with SSDs primarily in mind.
ZFS (not a consumer fs, ofc) and btrfs are old enough to have been exclusively or primarily designed with HDDs in mind and are therefore known to cause significant write amplification on SSDs. Using enterprise-grade SSDs for their longer lifespan doesn't make them any less wasteful. You can mitigate it to a degree, but you can't fix the core architectural decisions that were made without considering flash storage.
Flash-native filesystems like F2FS and APFS (exFAT for other flash media) do somewhat have an advantage here. I don't know how much they can improve in regards to performance thanks to their flash-first design, but I can imagine there have been improvements made there as well.
SwiftAndDecisive@reddit
A ton of abstraction; not many people deal with low-level hardware nowadays, just like few people, outside of university, need to know about memory address abstraction nowadays.
SwiftAndDecisive@reddit
Can send you compiled notes from the freshman-level PowerPoint my school provides, if you need.
pdp10@reddit
Too long, won't read. Here's the tea:
mmap(2) syscall, or equivalent on your system of choice.
anor_wondo@reddit
I assume it's just a very small subset of devs: databases, message brokers, caches with non-volatile fallback (Redis), and game engines
how many other systems operate with large loose files that they have to optimise for the physical storage?
realcoray@reddit
I agree with this, and even some of these cases have improved, and not necessarily because of the switch from HDD to SSD. Games used to ship on CDs and DVDs, which were way worse, so those people had to do all sorts of crazy things to optimize their games for that limitation, and now they have to do way less of that.
Herve-M@reddit
Archiving, IT forensics, any software with large files requiring memory-mapped loading?
Also low level like file system :)
admalledd@reddit
Generally it is this that is going on. The majority of stuff that developers develop doesn't do high amounts of disk IO directly, that is the responsibility of the database, brokers, cache services, etc. Majority of the time, developers are operating in a way that is largely in-memory or network-IO bound. For those (bioinfomatics, "large but not yet big data", data transform/import tooling, etc) that do maybe deal with lots of on-disk data directly you will start to find developers cluing in more on using the more advanced IO APIs for performance (io_uring, etc etc).
htj@reddit
As others have mentioned, you typically use abstractions on top of it.
However, not all SSDs are the same. They use different internal algorithms for wear-leveling that make their behavior different in subtle but important ways. Not much is published on their internal behavior. There is a paper called SSD-IQ that discusses this and studies how different drives act in various scenarios.
steik@reddit
I read over the thread and I believe your problem is that you don't have a use case. This post might as well be "Why is writing software with RAM in mind so undocumented?" - because for 99.9% of software it doesn't really matter. Everything is abstracted away and plenty good enough for almost all scenarios.
What you need to dive deeper is a use case that actually make sense, where the simple way of interfacing with the hardware through abstraction layers isn't good enough. Unfortunately I don't have an example to provide because this is an extremely specialized area that realistically just isn't necessary except for very niche software.
It might benefit you to think about this in terms of the Cell processor on the PS3. It couldn't be used for just anything, and it doesn't make sense to try to understand it without a use case in mind. We didn't use it at all on the first game I shipped; it was confusing to everyone and the documentation was super sparse. But later on, one of our programmers came up with an idea to offload a particular workload onto the Cell. At that point the wheels started turning for all of us - no amount of documentation or "examples" helped, we needed to find a use case that made sense for us to put everything into context.
Assuming you aren't already working on software that handles massive amounts of data, I'd recommend trying to search GitHub for projects that do. See how they do things, benchmark the program, record the SSD utilization - how far from the hardware specs are you? Does the program have other CPU bottlenecks that are blocking disk read/write, or is it simply doing disk IO inefficiently? Once you have context and a problem to solve, you'll have a much easier time moving forward.
z_latent@reddit (OP)
Well, I was avoiding talking about my "main" use-case as some people could have prejudice against it (I'll say in advance: no I didn't use AI for my writing at all and I'm sorry for the SSD price hikes), but I'm realizing now that this use case is so specific that most people fail to understand why I even care unless I explain it lol, so here:
It's related to local/self-hosted LLMs. Large language models have an enormous amount of parameters (individual numbers that make up their matrices and vectors), sometimes much bigger than what you can fit in consumer VRAM or RAM. Even smaller models can use 8+ GB.
My personal interest was in streaming some of the parameters straight from the SSD, a specific subset of the parameters which is large in total (like 4 GB) but only a small amount is needed per token (like <1 MB).
Currently, most inference engines seem to keep all parameters in memory. I wanted to see if you could keep just those parameters on disk and save the memory for more important things, or if it would be far too slow. Based on the theory, it sounds feasible, but I don't know how to test it in practice.
Your suggestion is probably the best I can do, I guess. Since it's a new idea, I can't look up what other people are doing, but I can try to think of programs that do "equivalent" things... I'll give it some thought, thanks for the idea.
martijnonreddit@reddit
You don’t need to optimize that scenario for SSDs. Just memory map your file, maybe ensure that you use 4kB pages, and let the OS handle the rest.
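A minimal sketch of that mmap approach, with invented sizes (this is not from any real inference engine): map the parameter file read-only and slice out just the rows you need per token; the OS faults in only the pages you touch and keeps them in the page cache.

```python
import mmap
import os
import tempfile

EMB_DIM = 256          # hypothetical embedding width, in floats
ROW = EMB_DIM * 4      # bytes per row (float32)
N_ROWS = 1000          # hypothetical vocabulary slice

# Stand-in for the on-disk parameter file.
fd, path = tempfile.mkstemp()
os.write(fd, os.urandom(ROW * N_ROWS))
os.close(fd)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    def embedding(token_id):
        # Only the pages backing this slice get faulted in from disk;
        # the rest of the file is never read.
        start = token_id * ROW
        return mm[start:start + ROW]

    row = embedding(42)
    assert len(row) == ROW
    mm.close()

os.unlink(path)
```

The appeal is exactly what the comment says: caching, readahead, and eviction are all the kernel's problem, not yours.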
is-this-a-nick@reddit
They keep it in RAM because even the fastest SSD is an order of magnitude slower than even cheap consumer dual-channel DDR5 memory (and GPU memory is another order of magnitude up), so by streaming from the SSD you are just wasting all your expensive compute resources sitting around waiting for the data.
doscomputer@reddit
dude, you legit aren't reading what they're saying
no SSD is too slow for 1 MB of data
z_latent@reddit (OP)
Thanks for reading!
Admittedly my replies are a bit too long... but it's not an easy concept to explain.
z_latent@reddit (OP)
That is not an issue for these specific parameters. They are so sparse that 99.999% are not used per token, even though their total footprint is still large.
The speed of generation is limited by how much memory you move per token, and here that'd be very small, ~100,000x less than the rest of the parameters. Even if SSDs were 1000x slower, moving these would still take 1/100th of the time the rest of the parameters (the ones in memory) take to be moved into registers.
The main limiting factor is more likely the latency of SSDs, but even then, it would only become a problem at 1000+ tok/s of decode speeds. For local inference, you rarely reach that.
Now sure, if you have infinite (V)RAM, there's no reason to use the SSD. But we don't, and since the difference in performance would be negligible, that memory is better spent on other things.
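A back-of-envelope version of that argument, with rough assumed numbers (bandwidths and per-token sizes are ballpark figures, not measurements):

```python
# Rough, assumed numbers -- not measurements.
ram_bw   = 50e9   # bytes/s, dual-channel DDR5-ish
ssd_bw   = 5e9    # bytes/s, PCIe 4.0 NVMe-ish
in_mem   = 4e9    # bytes of dense weights read from RAM per token
from_ssd = 1e6    # bytes of sparse parameters read from SSD per token

t_ram = in_mem / ram_bw    # ~80 ms per token just moving dense weights
t_ssd = from_ssd / ssd_bw  # ~0.2 ms per token for the SSD leg

# Even with a ~100 us NVMe read latency added on top, the SSD leg is
# well under 1% of the per-token time, so it hides behind the RAM cost.
assert t_ssd / t_ram < 0.01
```

Under these assumptions the SSD reads are effectively free until decode speeds get high enough that the per-read latency starts to dominate.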
goldcakes@reddit
Are you exploring things like spiking neural networks? Those seem pretty promising actually (even though obviously very early), given the high sparsity I can see NVMe working well for this use case.
doscomputer@reddit
this literally is very highly documented and memory address management isn't uncommon
gamebrigada@reddit
There are several reasons:
Nicholas-Steel@reddit
One of the issues I've seen in games like Assassin's Creed: Odyssey is that the thread handling I/O operations is set to too high a priority. On an HDD this is perfectly fine, but on an SSD the game's dynamic asset streaming system can dominate the CPU's time and negatively affect the frame rate, with the situation worsening as the speed of the SSD increases (SATA, NVMe 3.0, NVMe 4.0, NVMe 5.0, etc.).
The low-latency, high-speed performance of an SSD results in fewer opportunities for other threads to gain sufficient time with the CPU.
Ways of solving it are to either de-prioritize the I/O thread or add some artificial pacing to the I/O thread.
Hagelslag5@reddit
I have noticed that the game Starbound performs much better on an SSD than an HDD. I haven't looked at the code myself, but it is out there if you want to find it.
omegafivethreefive@reddit
Development is first and foremost about shipping features that sell as fast as possible for as cheaply as possible.
Everything else is treated as secondary.
I'm not saying this is a good thing but as someone who manages 50 SDE and has designed and led well into the 9 figure range in software development spend over the last decade, that's just how it's done.
These types of optimizations are done as needed; very often a slight performance increase doesn't lead to more revenue on the whole, but it does increase TCO.
Essentially nobody cares until they have to.
goldcakes@reddit
Yep. There's a few niches where it makes sense to optimise lower level, for example if you are building a highly scalable database engine like MongoDB or something. But for gaming, video editing, heck, even most 'professional' use cases like scientific computing, SSD speeds are not your bottleneck.
jhenryscott@reddit
check out the branch education video on youtube.
But yeah, people don't understand how SSDs work at all. There is no reason for any normal consumer to own a Gen 5 NVMe. (I say that, but I also own a WD 8100, so maybe take my own advice.) The difference between 4K random IOPS and sequential speeds is mostly misunderstood by consumers.
goldcakes@reddit
Yep, and just because a NVMe is Gen5 doesn't mean it's faster; in the same way that a PCIe Gen5 GPU (e.g. the RTX 5060) can be slower than a PCIe Gen4 GPU (e.g. the RTX 4090).
Heck, there are even QLC Gen5 NVMes hitting the market now. Lol. Pure marketing BS.
Really, there's only a few cases where Gen5 makes ANY difference, and they're not average consumer use cases. For example...
Professional video editors who transfer RAW video footage (we're not talking about your phone or a consumer camera); we're talking about a 300 GB file for, say, 30 mins of footage, all the time. But even so, your NVMe is unlikely to be your bottleneck -- how fast is your CFexpress reader?
People who run home-labs with large, multi-user (and I mean like 100+ concurrent user) databases.
z_latent@reddit (OP)
That immediately makes you not a normal consumer then :)
But Branch Education is spectacular indeed, I've watched their videos on manufacturing of CPUs, and on GPUs as well. Great for the "design of an SSD" part as well.
jhenryscott@reddit
Yeah I mean I’m an enthusiast, souped up gaming rigs and a large home lab device stack.
ethanjscott@reddit
So I program mainframes with source codes from the 80s and possibly farther or newer. It’s a weekly occurrence that I have to work on a program that hasn’t been changed in a decade or two.
What you are missing, and this will explain why you can't find much, is that there is code that minimizes disk reads and writes, which helps when you're on an HDD.
An SSD is in some cases thousands of times faster than an HDD, such that when a programmer is writing programs with SQL, the query returns the full set of results immediately. There's not really an opportunity to improve performance when we're measuring sub-second performance.
Now when I do have a query that runs a long time now days, that usually means I’m writing a query like an idiot and I need to rethink the problem.
Now, this is my experience working in a data-intensive environment. Your personal computer won't ever get these improvements; the datasets just aren't big enough.
TheImmortalLS@reddit
why don't game designers make games exclusively for 5090's? everyone else outside of target can enjoy slideshow, thanks
Top-Vermicelli-6495@reddit
About 15 years ago, organizations that use super computers had some people suggest that a concept called "burst buffers" would be a good idea. These days, they're pretty common for systems at that scale because they solve certain problems and accelerate certain operations that are only relevant in that space.
Having said that, the themes OP raises are all present in burst buffers. Take a look and consider your future in supercomputing, I guess OP?
apudapus@reddit
p-mem was a thing but it didn’t take off. DirectStorage from DirectX and PS5’s equivalent are great for gaming. I used to write SSD firmware and worked in the FTL layer. I now work on the system software layer and deal with fast local storage and clustered network storage. Feel free to DM me if you want to chat.
Kat-but-SFW@reddit
Maybe check out y-cruncher, it uses disk space (into the petabyte range) to calculate huge numbers and has a lot of tuning to maximize IO performance for HDD, SSD etc. As well as built in benchmarks to tune to your individual SSD and setup. There has been a lot of work recently on optimizing for SSDs now that there are SSDs with enough write endurance to not die from running it.
https://numberworld.org/y-cruncher/news/2023.html#2023_11_13
https://numberworld.org/y-cruncher/guides/swapmode.html
z_latent@reddit (OP)
Thank you! This is really interesting...
They are points that I and a few other people here were trying to convey, but they were mostly dismissed, as if using the system APIs meant your I/O was great. No, these APIs were made long before SSDs were popular. They did not account for concurrent I/O being faster, and the code for calling them was built for devices 100x slower, which meant it could afford to be 100x slower.
But anyways, I'm glad they improved it, and I'll look deeper into it! Thanks for sharing.
Kat-but-SFW@reddit
I'm not sure how much would be applicable for your use case, but there's also different algorithms it uses in a calculation depending on the current FFT size. Basically it can read/write less total data by maximizing seeks, or seek less by reading/writing increasing amounts of total data, to optimize the tradeoffs of different memory/storage to minimize runtime (since that can be days/weeks/months)
Mister__Mediocre@reddit
Interesting discussion. I’m no expert but I think the main reason is how most programs interact with the disk.
Usually the first step would be to bring the content over to RAM and then operate on it. Libraries must already solve for this. So you're restricting yourself to a class of programs that read from parts of a file on disk often enough for it to be the bottleneck. The reads must be random enough that the pages are usually not already in RAM in the steady state of the program. Who is this program?
I would suspect that for most applications, even if you’re doing random reads from disk, there’s a pattern that lends itself to decent page caching.
z_latent@reddit (OP)
Yeah, it has become clear to me, I more or less am dealing with one of the few applications where this makes sense, which is likely why for anyone else it feels so weird a thing to care about.
I answered in this other comment here, but tl;dr: for running LLMs locally, there are certain model parameters that can go on the SSD without affecting performance, to save on memory usage. While most parameters are like reading a multi-GB file per token, these special parameters are like querying a large database for just a few tens of kB, so streaming from storage is fine.
Mister__Mediocre@reddit
That's a fairly mainstream use case these days. Doesn't llama.cpp have support for disk offloading by default? I'm sure they do it fine?
Also, what are these parameters that are so rare? Even in MoE models, don't they usually make it so that all the experts get used equally?
I do RAM offloading with half layers CPU half layers GPU. Usually the entire layer gets offloaded, so I'm curious what are these parameters for you that are so "rare" that you can keep them on the SSD?
z_latent@reddit (OP)
Oh, nice! Yeah it isn't viable for MoEs, those are too large.
It's a new architecture introduced in Gemma 3n, specifically the E2B and E4B models. Per-layer embeddings, they're the reason why the model may be "4B" in name but the total parameter count is 7.5B. There are 2.8B PLE parameters in total, plus 0.7B normal token embeddings (but the latter you don't offload, since they're the same parameters used at the end of the model to get the token probabilities).
Google added them because they improve the model's performance, for nearly "free". Since they're token embeddings, each layer only needs one vector per token, out of 262k (Gemma's vocabulary size). Hopefully you can see now why this is good!
(Also, because they depend only on token id, you can immediately fetch the embeddings for all layers, which allows for higher parallelism/read size)
Mister__Mediocre@reddit
Actually, you might find insights in database literature, since they actually care about random reads from disk.
reveil@reddit
On an M5 MacBook Air you have 153 GB/s of memory bandwidth, while the SSD, which just got twice as fast as the last model's, gets 6 GB/s. On what planet is 6 only 4x slower than 153? It is more than 25x slower, and the biggest difference is not even the throughput but the latency.
z_latent@reddit (OP)
Apple admittedly has crazy memory bandwidth, makes me envious.
For clarity, that comparison was single-channel DDR4-3200 RAM vs PCIe Gen 4.0 SSD. It's 12x slower if you're using dual channel or DDR5, and 24x slower if both (though chances are, if you have DDR5, you can also have a PCIe Gen 5.0 SSD)
sboyette2@reddit
Wear-levelling, correction for bad sectors, and other such minutia of SSD use are accounted for in kernel drivers of all modern OSes, and/or at the firmware level of modern SSDs themselves.
This is why the resources you found are so old. As a developer, now, you don't need to care unless you're developing a new storage driver.
lorimar@reddit
Not a developer myself, but I think this is what the DirectStorage portion of newer DirectX versions is all about
BFBooger@reddit
have a look at what Apache Cassandra did for their newer disk format. They went from a B-tree-like structure to a disk-based trie, and optimized more for total data read than for raw IOPS. Though some of the motivation is also due to their common case being large variable-length keys with a lot of common prefixes.
See https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-25%3A+Trie-indexed+SSTable+format
dbxp@reddit
With a lot of modern development you would have a virtualisation layer and SAN between you and the actual storage so the performance is unlikely to be the same as accessing an SSD directly. If you go into niches like HFT, OSs or supercomputing you might find more hardware optimisation.
BoringElection5652@reddit
Memory-mapping the file and accessing it with multiple threads gives you surprisingly good SSD speed, while also being by far the easiest approach to reading files.
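(A minimal sketch of that idea; toy file size, and note that in CPython the GIL limits how much pure-Python processing overlaps between threads, though the kernel still services the page faults with reads from the device:)

```python
import mmap
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK = 256 * 1024  # bytes handled per task

def sum_chunk(mm, start, end):
    # Slicing the mmap touches those pages; the kernel fills any missing
    # ones with reads from the underlying device.
    return sum(mm[start:end])

# Tiny demo file standing in for your real data.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(range(256)) * 4096)  # 1 MiB of test data
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    size = len(mm)
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(sum_chunk, mm, off, min(off + CHUNK, size))
                   for off in range(0, size, CHUNK)]
        total = sum(fut.result() for fut in futures)
    mm.close()

os.unlink(path)
print(total)
```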
mkaypl@reddit
You can take a look at something like SPDK (https://spdk.io/doc/), there's a lot of bits and bobs inside of it, though at its core it's a way to code access to storage devices (mainly NVMe, over fabric or local) from userspace.
Candid-Border6562@reddit
Simple answer: most programs (99%+) can just treat SSDs as fast HDs. Very few of us ever have to go any deeper than that.
DeliciousIncident@reddit
Because there isn't much to writing software with SSDs in mind. You just write software however you want.
Now, if you were talking about HDDs, then yes, there is such a thing as writing with HDD in mind, i.e. you might want to place the data on the disk in the same order you will be reading it, since sequential reads are faster on an HDD. This is typically done by packing the data in a single file, like how game engines pack assets into a single file.
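(A toy sketch of that packing idea; the format here is hypothetical, not any real engine's: assets concatenated in read order behind a small index, so each asset is served with one seek into one open file instead of an open/read/close per asset:)

```python
import json
import os
import struct
import tempfile

# Hypothetical pack layout: 4-byte index length, JSON index of
# (offset, length) pairs, then the asset bytes in read order.
assets = {"level1.mesh": b"m" * 100, "level1.tex": b"t" * 200}

index, blob, off = {}, b"", 0
for name, data in assets.items():
    index[name] = (off, len(data))
    blob += data
    off += len(data)

header = json.dumps(index).encode()
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(struct.pack("<I", len(header)) + header + blob)
    path = f.name

# Reading one asset back: parse the index, then a single seek + read.
with open(path, "rb") as f:
    hlen = struct.unpack("<I", f.read(4))[0]
    idx = json.loads(f.read(hlen))
    o, n = idx["level1.tex"]
    f.seek(4 + hlen + o)
    data = f.read(n)

os.unlink(path)
print(len(data))  # 200
```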
clearlybreghldalzee@reddit
Read the f2fs filesystem paper
Maxorus73@reddit
For almost all programming purposes, your interaction with storage systems will be a black box. You're gonna call APIs that someone else made, and hope the benefits of better hardware are visible in whatever you're doing
Sopel97@reddit
because it's quite simple, you just use async APIs to saturate the queue as much as the algorithm allows
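(As a sketch of what "saturating the queue" can look like even without true async APIs, here's Python threads around `os.pread` on a made-up demo file; each call releases the GIL for the duration of the syscall, so many reads stay in flight at once:)

```python
import os
import random
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096

# Demo file standing in for a large dataset.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(BLOCK * 256))
    path = f.name

fd = os.open(path, os.O_RDONLY)
size = os.fstat(fd).st_size
offsets = [random.randrange(0, size - BLOCK) for _ in range(64)]

# os.pread releases the GIL during the syscall, so a pool of threads keeps
# many requests in flight -- the queue depth an SSD needs to get anywhere
# near its rated throughput.
with ThreadPoolExecutor(max_workers=16) as pool:
    blocks = list(pool.map(lambda o: os.pread(fd, BLOCK, o), offsets))

os.close(fd)
os.unlink(path)
print(len(blocks))  # 64
```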
z_latent@reddit (OP)
I suppose using async gets you most of the way there.
But you can tell from that proposal (awesome share btw) that they realized SSDs were so fast, they demanded you reconsider what counts as acceptable overhead.
I can't help but imagine all the code out there, that similarly has large overhead around I/O simply because the developers have not yet realized how fast the I/O part itself is nowadays. As this author put it, I/O is no longer the bottleneck.
Sopel97@reddit
yea, though this article is pretty bad, as even the optimized go code is absolutely terrible
z_latent@reddit (OP)
Could be better, but not that bad in practice really. It's only about 2x slower than using `wc` on Linux. I even tested a (supposedly faster) Rust version called `ripwc`, and here are the results: (using drop_caches before every one)
symmetry81@reddit
All this might be relevant when you're writing code for a particular computer where you know the quantity of RAM, the particular SSD, and what other programs might be present and contending for those resources.
Most of the time, though, the programs we write are for running on a server, PC, or phone that will be running other tasks and where we have to accommodate a variety of hardware. In that case the OS is managing the block storage and you don't know if your write is being cached to RAM, going to a fast SSD, or ending up on spinning rust.
I currently work with robots where I know the exact hardware configuration my code is going to be running on this year, but I don't try to specialize it to the particular memory setup we have because I also want my code to be performant on future hardware that might be introduced later.
wretcheddawn@reddit
I think optimizing for SSDs is exceedingly rare because SSDs are so fast and have such high throughput that they're almost certainly not the bottleneck in the overwhelming majority of cases.
Early SSD optimization was basically just removing the Spinning Disk optimizations.
Optimizing for SSDs means you also have to optimize your program so that it can actually process data that fast, at multiple gigabytes per second, which is probably the harder optimization problem. Most file-system APIs aren't even close to maxing out an SSD's performance.
sccocrwn@reddit
I don't know about you, but I do think about cache sizes all the way from CPU to disk, including page misses and read times. I've actually been insisting on swapping HDDs for SSDs at work, because they want performance you'd never get from HDDs (and I did some napkin math to prove it). They let me run a couple of real tests and got convinced. So... some of us do think about it.
z_latent@reddit (OP)
Damn, how much of a difference did it make in the end?
sccocrwn@reddit
20x for some queries as an example
iBoMbY@reddit
It only matters if you want to really optimize your code. Today pretty much nobody does that anymore, because it costs time, and time is money.
For example, even today you still see a lot of games shipped with all their assets in simple ZIP archives, and then they wonder why loading still takes a very long time, even on extremely fast SSDs. It's because zlib decompression is usually the biggest bottleneck, and a very simple first step would be to swap in something like Zstandard for the compression.
SaviorX@reddit
There's a lot of dismissive chatter in this post about abstraction via the available APIs. But those APIs often include support for scatter/gather reads and writes, which I presume can be used to optimize I/O for SSDs. OP's question seems valid.
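(For example, a minimal scatter-read sketch on a made-up demo file, using `os.preadv`, which wraps `preadv(2)` and is available on Linux: one syscall reads a contiguous range of the file and splits it across several destination buffers:)

```python
import os
import tempfile

# Demo file; the first 4 KiB is "A"s, the second 4 KiB is "B"s.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"A" * 4096 + b"B" * 4096)
    path = f.name

fd = os.open(path, os.O_RDONLY)

# Scatter read: a single syscall fills both buffers from the contiguous
# bytes starting at the given offset.
buf1, buf2 = bytearray(4096), bytearray(4096)
nread = os.preadv(fd, [buf1, buf2], 0)

os.close(fd)
os.unlink(path)
print(nread)  # 8192
```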
Mina_Sora@reddit
Abstraction, that's it
Glad_Courage_5063@reddit
You found the entire internet's worth of SSD programming content. All 3.5 sources of it.
AutoModerator@reddit