dd block size
Posted by etyrnal_@reddit | linux | 60 comments
is the bs= in the dd parameters nothing more than manual chunking for the read & write phases of the process? if I have a gig of free memory, why wouldn't I just set bs=500m ?
I see so many seemingly arbitrary numbers out there in example land. I used to think it had something to do with the structure of the image like hdd sector size or something, but it seems like it's nothing more than the chunking size of the reads and writes, no?
kopsis@reddit
The idea is to use a size that is big enough to reduce overhead while being small enough to benefit from buffering. If you go too big, you end up largely serializing the read/write which slows things down. Optimal is going to be system dependent, so benchmark with a range of sizes to see what works best for yours.
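For example, a rough benchmark loop along these lines (purely a sketch; /dev/sdX is a placeholder for your actual source device, and the sizes are arbitrary):
# Read the first 1 GiB of the device at several block sizes and compare the
# throughput dd reports. iflag=direct bypasses the page cache so repeated
# runs aren't just served from RAM.
for bs in 64K 256K 1M 4M 16M; do
    echo "== bs=$bs =="
    dd if=/dev/sdX of=/dev/null bs=$bs count=$(( (1 << 30) / $(numfmt --from=iec $bs) )) iflag=direct 2>&1 | tail -n 1
done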
DFS_0019287@reddit
This is the right answer. You want to reduce the number of system calls, but at a certain point, there are so few system calls that larger block sizes become pointless.
Unless you're copying terabytes of data to and from incredibly fast devices, my intuition says that a block size above about 1MB is not going to win you any measurable performance increase, since system call overhead will be much less than the I/O overhead.
EchoicSpoonman9411@reddit
The overhead on an individual system call is very, very low. A dozen instructions or so. They're all register operations, too, so there's no waiting millions of cycles for data to come back from main memory. It's likely not worth worrying too much about how many you're making.
It's more important to make your block size some multiple of the read/write block sizes of both of the I/O devices involved, so you're not wasting I/O cycles reading and writing null data.
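For reference, the kernel exposes those device block sizes via sysfs, so you can check them before picking a number (a sketch; sdX is a placeholder for the actual device name):
# Logical and physical sector sizes of the device
cat /sys/block/sdX/queue/logical_block_size
cat /sys/block/sdX/queue/physical_block_size
# The device's preferred I/O granularity, if it reports one (0 means "none reported")
cat /sys/block/sdX/queue/optimal_io_size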
That being said, I agree with your intuitive conclusion.
dkopgerpgdolfg@reddit
Sorry, but that's a lot of nonsense.
Yes, you've shown the register setup for the "syscall" instruction. You've not shown how long the context switch takes, or how much impact it has on the MMU caches. This "one instruction" (syscall) can easily cost you a five-digit number of cycles, and that's without the actual handling logic within the kernel code.
As the topic here is dd, try dd'ing 1 TB with bs=1 vs bs=4M.
Otherwise, syscall slowness is a serious topic in many other areas. Specific examples include the reasons why things like DPDK and io_uring were made, CPU vuln mitigations (e.g. Spectre), ...
EchoicSpoonman9411@reddit
That's kind of harsh, man.
That's... not a lot. It's a few microseconds on any CPU made in the last couple of decades.
Almost none of the overhead in that example will be because of system call overhead.
So, the average I/O device these days has a write block size of 2K or 4K, something like that. Let's call it 2K for the sake of argument. When you dd with bs=1, you're causing an entire 2K disk sector to be rewritten in order to change 1 byte, then again for the next byte, so each 2K disk sector gets rewritten 2048 times before dd moves on to the next one, which is also rewritten 2048 times, and so on.
Of course that's going to take a long time.
dkopgerpgdolfg@reddit
It's thousands of times more than those "12 register operations". And as syscalls aren't a one-time thing, it adds up over time.
About the dd example: Try it with /dev/zero, so you don't have any disk issues.
Btw. I just tried on that computer I'm using currently. The difference is a factor of about 29000x.
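For anyone who wants to try the same comparison, something like this works (a sketch only, scaled down to 64 MiB so the bs=1 run finishes in a minute or three; the exact ratio will vary by machine):
# one read()+write() pair per byte: ~67 million syscall pairs for 64 MiB
time dd if=/dev/zero of=/dev/null bs=1 count=$((64 * 1024 * 1024))
# one read()+write() pair per 4 MiB: 16 syscall pairs for the same 64 MiB
time dd if=/dev/zero of=/dev/null bs=4M count=16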
EchoicSpoonman9411@reddit
Of course it does. But the system call overhead under real-world conditions, meaning for bs= values which actually make plausible sense, is negligible compared to the I/O load.
What's the point of doing that? Of course if you eliminate the I/O load from the equation, the system call load becomes relevant, because the CPU isn't idle waiting for I/O to finish, but then it's not germane to the original problem.
lelddit97@reddit
just a wandering subject matter expert: the other person knows what they are talking about and you don't
EchoicSpoonman9411@reddit
There is sufficient demand for Linux kernel expertise so that SMEs don't need to live in their parents' basement.
You're that other guy's alt. You have the same rudimentary skill at reading comprehension.
lelddit97@reddit
no, i am an engineer who knows what they are talking about and you are arguing for the sake of arguing
EchoicSpoonman9411@reddit
If you're not that guy's alt, then you wandered into a discussion thread for... what, exactly? You could have read it and not commented, it's really fucking easy. So who's arguing for the sake of arguing?
You sound like a fucking toddler. Communication is an important skill for "engineers" too.
dkopgerpgdolfg@reddit
Then just forget about hard disks and look at everything else. Page faults, pipes, ...
And I once again point you to the projects etc. mentioned above. It's everywhere. ... If you see someone saying they set mitigations=off so that their computer gets faster, and that they can accept the reduced security because they only play games, then their problem was syscall overhead.
Afaik, the topic was not whether hard disk IO is slow; the topic was that a syscall takes much more than just a dozen register operations.
In any case, I said what I want, not going to fight about semantics. Bye.
DFS_0019287@reddit
My understanding is that the overhead of a system call is more than just the instructions; there's also the context switch to kernel mode and then back to user mode. A system call is probably 10x more expensive than a normal user space function call.
But as you wrote, this is still pretty negligible overhead compared to disk I/O.
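You can actually see both effects with strace's syscall summary, e.g. something like the following (a sketch; note strace itself adds a lot of per-syscall overhead, so the call counts are the interesting part, not the absolute times):
# -c prints a per-syscall count and the time spent in each; compare the
# read/write counts between a tiny and a sane block size for the same 100 MB.
strace -c dd if=/dev/zero of=/dev/null bs=512 count=200000
strace -c dd if=/dev/zero of=/dev/null bs=1M count=100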
LvS@reddit
I believe the larger problem is that you blow the CPU caches. If /u/etyrnal_ sets the size to 500M, then each read will fill up the whole L3 cache multiple times over, which means that once you start writing, the data has to be fetched back from RAM.
And avoiding the detour through RAM is kinda important for performance.
etyrnal_@reddit (OP)
isn't ram many times faster than any of the current storage media?
LvS@reddit
Yes. The problem here is that by exhausting the cache, you can also evict other cachelines - like the ones containing the application's data structures. Plus, you access the data multiple times - once for writing, once for reading, no idea if it's used elsewhere.
So you're using RAM (or caches) much more frequently, while the disk is only accessed once for reading and once for writing.
etyrnal_@reddit (OP)
Are we mixing up the terms cache and buffers?
considering the speed of RAM (the RAM the process sets up as buffer(s) for reading/writing the bs=500m chunks) and cache (the CPU cache holding the dd process's instructions), i would think that RAM would be so much faster than storage like microSD cards, hard drives etc. that this wouldn't cause noticeable slowdowns with slower media? I could understand that in a data center like Meta's, or whatever, every single processor cycle and resource becomes hyper-critical and needs forensic-level accounting... but for reading / writing images to/from microSD cards?
I remember back during 16bit cpu & floppy drive days, we used a file manager / disk copier that just read the entire floppy into RAM, it sped up copy operations a LOT.
LvS@reddit
Those numbers are a lot less different than they used to be. Since the introduction of SSDs, disks got a lot faster.
Google has this page comparing the speeds of the different layers, though it matters a lot what kind of hardware you have: SSD vs HDD vs microSD and desktop vs mobile phone vs rpi and so on.
If you're working with this, I'd recommend checking those numbers for updates every 5 or so years, because there's always new inventions that change the differences between those layers (or even introduce new ones).
Note that I don't know if I'm actually right with my assumption. It might be useful to look up your cache size and see if setting block size to half of cache size (so you're sure it fits) versus twice the cache size (so you're sure it doesn't fit) makes a big difference, compared to 1/3 and 3x respectively.
If it does, my idea sounds very plausible, if it doesn't I'm likely wrong.
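If anyone wants to test that, the cache size is easy to query; a sketch (the /dev/zero to /dev/null pair only exercises memory, which is the point here, and the 16 MiB figure is just an assumed example):
# L3 cache size in bytes (glibc getconf); lscpu shows it too
getconf LEVEL3_CACHE_SIZE
# Suppose it reports 16 MiB: compare a bs well under it with one well over it,
# moving the same ~32 GiB total in each run.
time dd if=/dev/zero of=/dev/null bs=8M count=4096
time dd if=/dev/zero of=/dev/null bs=32M count=1024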
etyrnal_@reddit (OP)
i think this can be sort of tested. somebody mentioned to me a few tools to actively test the best dd buffer size... it'd be interesting to see if that number lines up the way you are suggesting.
etyrnal_@reddit (OP)
i appreciate the additional clarifying insights
lelddit97@reddit
I do 1MB for < 1TB copied, then some multiple of two otherwise. I think I did 16MB for cloning an NVME SSD which worked well. Maybe 1MB would have worked better even then idk
triffid_hunter@reddit
In theory, some storage devices have an optimal write size, eg FLASH erase blocks or whatever tape drives do.
In practice, cat works fine for 98% of the tasks I've seen dd used for, since various kernel-level caches and block device drivers sort everything out as required.
The movement of all this write block management to kernel space is younger than dd - so while it makes sense for dd to exist, it makes rather less sense that it's still in all the tutorials for disk imaging stuff.
Yes.
Maybe you're on a device that doesn't have enough free RAM for a buffer that large.
Conversely, if the block size is too small, you're wasting CPU cycles with context switching every time you stuff another block in the write buffer.
Or just use cat and let the relevant kernel drivers sort it out.
etyrnal_@reddit (OP)
cat gives no progress indicator
fearless-fossa@reddit
Then use rsync.
etyrnal_@reddit (OP)
rsync can write images to sd cards?
fearless-fossa@reddit
Yes, why wouldn't it?
SteveHamlin1@reddit
rsync can write a file to a file system. As far as I know, rsync can't write a file to a device, which is what u/triffid_hunter was talking about.
For an unmounted device named '/dev/sdX', do "rsync testfile.txt /dev/sdX" and see if that works.
etyrnal_@reddit (OP)
i had no reason to assume it was intended to be adapted to that purpose. I was under the impression it was a file-level tool.
triffid_hunter@reddit
Then use pipeworks
etyrnal_@reddit (OP)
how does cat deal with errors?
triffid_hunter@reddit
It doesn't.
That's why I said 98% rather than 100% 😉
ConfuSomu@reddit
Or even cp your disk image to your block device!
smirkybg@reddit
Isn't there a way to make dd benchmark which block size is better? I mean who wouldn't want that?
etyrnal_@reddit (OP)
would be great if that was baked in and invokable by some cli-passed option.
natermer@reddit
'dd' was originally designed for dealing with tape drives. Some of which have very specific requirements when it comes to things like block sizes.
It isn't even originally for Unix systems. It is from IBM-land. That is why its arguments are so weird.
The block devices in Linux don't care about the "block size" argument in dd. You can pretty much use whatever is convenient, as the kernel does the hard work of actually writing it to disk.
If you don't give it an argument it defaults to a block size of 512, which is too low and causes a lot of overhead. So the use of the argument is just to make it big enough to not cause problems.
A lot of times the use of 'dd' is just because it is cargo cult command line. People see other people use it so they use it. They don't stop to think as to actually why they are using it.
Many times use of 'dd' to write images to disk can be replaced by something like 'cat' and not make any difference. Except maybe to be faster.
'dd' is still useful in some cases. Like you can specify to skip so many bytes and thus do things like edit and restore parts of images... (like if you want to backup the boot sector or replace it with something else) but it is a very niche use and there are usually better tools for it.
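The classic example of that niche use is grabbing just the boot sector (a sketch; /dev/sdX is a placeholder, and this only applies to MBR-style layouts):
# Save the first 512-byte sector (MBR: boot code plus partition table)
dd if=/dev/sdX of=mbr-backup.bin bs=512 count=1
# Restore only the boot code, leaving the partition table (bytes 446-511) untouched
dd if=mbr-backup.bin of=/dev/sdX bs=446 count=1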
Try using cat sometime. See if it works out better for you.
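For the image-writing case that usually looks something like this (a sketch; disk.img and /dev/sdX are placeholders, and the device should be unmounted first):
# Write an image straight to the device, then make sure it's actually
# flushed to the media before unplugging anything.
cat disk.img > /dev/sdX
sync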
dkopgerpgdolfg@reddit
Try working without the page cache (direct flag in dd) and see.
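With GNU dd that's something like the following (a sketch; disk.img and /dev/sdX are placeholders):
# oflag=direct bypasses the page cache for the writes; status=progress is just feedback.
# bs must be a multiple of the device's sector size or the O_DIRECT writes will fail.
dd if=disk.img of=/dev/sdX bs=4M oflag=direct status=progress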
asp174@reddit
And do that with blocks smaller than the storage systems' chunk size, where the storage has to read a chunk, change a few bits, write it back - multiple times over.
dkopgerpgdolfg@reddit
No, it doesn't do that. When O_DIRECT is used with a too-small block size, it just fails.
asp174@reddit
When you have a RAID controller that runs without write cache, it will do exactly this.
Just the same as controllers without cache have the read-before-write penalty when dealing with unaligned drive numbers for a RAID5 or 6.
dkopgerpgdolfg@reddit
Ok, if you put it that way... Afaik we were talking about Linux kernel behaviour here.
If the storage (whatever it is) wants a certain block size, because it can't handle anything else, then Linux with O_DIRECT will not help in any way.
marozsas@reddit
Controversial subject. Fact: it's an ancient tool specifically designed to handle tape drives. Fact: nowadays, the kernel and device drivers handle the specifics of reading and writing on modern devices very well.
I've abandoned the use of dd in favor of using cat and redirecting stdin and stdout, making the command line as simple as possible.
etyrnal_@reddit (OP)
and you don't care that you cannot get status/progress or control error handling that way?
marozsas@reddit
In general, no. If I badly want the progress of a large copy I use the command pv. And if there's an error, there's not much one can do about it, regardless of whether you're using dd or another equivalent command. Remember, I am talking about ordinary devices like HDDs and SSDs directly attached to a SATA interface or USB, not a fancy SCSI tape writer.
etyrnal_@reddit (OP)
I'm just cloning microSD cards to an image on the computer, and then to another microSD card later.
marozsas@reddit
Yes, I work with orangePi devices professionally and I have the same need to copy to/from USB-connected SD cards, and cp is just fine with /dev/sdX as source or destination.
etyrnal_@reddit (OP)
i'm going to try it sometime, for small copies. but for huge copies where i can't tell if something is hanging or whatever, i'll prob stick with what's familiar. I think the only reason i decided to use it this time was because some users had reported that a certain popular sd card 'burner' was somehow turning out non-working copies of the sd card. So, i did it to avoid whatever that rumor was about. It was probably some userland pebkac, but for a process that takes hours, i just didn't want to lose time to some issue like that.
I normally just use balena etcher, or rufus, or whatever app depending on the platform i'm using (windows/macos/linux/android/etc).
Thanks for the insights
marozsas@reddit
I suggest you learn about pv.
You can use it to write an image 3G in size, previously compressed with xz, to an SD card at /dev/sda with something like this:
xzcat Misc/orangepi4.img.xz | pv -s 3G > /dev/sda
If the image is not compressed, you can use pv directly, with no need to specify the size of the input; both give you the feedback you want.
pv Misc/orangepi4.img > /dev/sda
michaelpaoli@reddit
Most of the time what's notable is obs, which if not explicitly set uses bs, which if not explicitly set generally defaults to 512. So, quite depends what one is writing, but, e.g. for most files on most filesystems these days, [o]bs=4096 would be an appropriate minimum, and should generally use powers of 2 to avoid block misalignment and problems/inefficiencies thereof. If writing directly to a drive, most notably solid state rather than hard drive, generally best to pick something fair bit larger - the larger of either erase block size or physical write block size - so that would typically be the erase block size that would be larger. If unsure, an ample power of 2, e.g. [o]bs=1048576 will generally quite suffice.
No. Not only is 500M not well aligned, it's also going to eat almost half a gig of RAM and won't be that efficient: dd may well buffer that full amount before writing any of it out, and if it's not multi-threaded it's likely to be pretty inefficient and slow as it switches back and forth between such long reads and then writes. Much better would generally be a much smaller but ample block size, e.g. a suitable power of 2 between 4096 and 1048576. That will likely also be much more efficient - it swallows up a whole lot less RAM, and as the writes will generally be buffered, dd will typically switch back and forth between reads and writes quickly and efficiently, mostly limited only by I/O speeds - so probably by whatever's slower, the reads or (often) the writes, depending on media type, etc. With a much larger/excessive bs, buffers/caches will fill on the writes, so one will typically spend most of the time waiting on write I/O, but inefficiently: with such large reads, the same happens on the read side while the write side goes idle.
And if you're writing, say, to a device that's RAID-0 or RAID-10 or RAID-5 across multiple drives, you'll want an integral multiple of whatever size covers an entire "stripe". E.g. say you have 5 drives configured as RAID-5, so that's 4 data + 1 (distributed) parity. You'll want an integral multiple (minimum multiplier of 1) of whatever fully covers those 4 chunks of data - so you write that, and all of that plus the parity is calculated and written in one go. If you write less than that, at best you'll be recalculating and rewriting at least one data chunk and the parity data multiple times; likewise if you're not at an integral multiple of that size. When in doubt, pick something that's "large enough" to cover it, but not excessive.
If you're dealing with particularly huge devices, it may be good to test some partial runs first. But note also that buffering may make at least the initial bits appear artificially fast. One may use suitable (if available) dd sync option(s), and/or wait for completion of sync && sync after dd, and include that in one's timing, to be sure all the data has actually been flushed out to media.
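With GNU dd that might look like the following (a sketch; disk.img and /dev/sdX are placeholders, and conv=fsync makes dd flush the output device before it exits so the flush is included in the timing):
# Either let dd do the flush itself...
time dd if=disk.img of=/dev/sdX bs=1M conv=fsync status=progress
# ...or include an explicit sync in what you time
time sh -c 'dd if=disk.img of=/dev/sdX bs=1M status=progress && sync'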
So, yeah, [o]bs does make a difference. Pick a decent clueful one for optimal, or at least good, efficiency.
dkopgerpgdolfg@reddit
Other than the performance topic, another possibly important factor is how partial r/w is handled.
In general, if a program wants to read/write to a file handle (disk file, pipe, socket, anything) and specifies a byte size, the call might succeed but process fewer bytes than the program asked for. The program could then just make another call for the rest.
And dd has a "count" flag, so that only a specific number of blocks (of "bs" size each) is copied, instead of everything in the file etc.
If you specify such a limited "count" and dd gets partial reads/writes from the kernel, by default it will not "correct" this - it will just call read/write "count" times, period. Because of the partial I/O, you'll get fewer total bytes copied than intended.
With disk files, this usually doesn't happen. But with network file systems, slowly-filled pipes, etc., it's common. There are additional flags that can be passed to dd (at least for the GNU version) so that the full amount of bytes is processed in each case.
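With GNU dd the relevant flag is iflag=fullblock. A sketch of the difference when reading from a pipe (some_slow_producer is just a stand-in for whatever is feeding the pipe):
# Without fullblock: each read() may return less than 1M but still uses up one
# of the 100 "count" slots, so fewer than 100 MiB may end up in out.bin.
some_slow_producer | dd of=out.bin bs=1M count=100
# With fullblock: dd keeps reading until each 1M block is actually full.
some_slow_producer | dd of=out.bin bs=1M count=100 iflag=fullblock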
s3dfdg289fdgd9829r48@reddit
I literally only used a non-default bs once (with bs=4M) and it completely bricked a USB drive. I haven't tried since. It's been about 15 years. Once bitten, twice shy, I suppose. Maybe things have gotten better.
etyrnal_@reddit (OP)
i was recommended this read, and it tries to explain dd behavior. i wonder if it could explain what happened in your scenario.
https://wiki.archlinux.org/title/Dd#Cloning_an_entire_hard_disk
s3dfdg289fdgd9829r48@reddit
Since this was so long ago, I suspect it was just buggy USB firmware or something.
etyrnal_@reddit (OP)
was that on READING the device, or writing to it?
s3dfdg289fdgd9829r48@reddit
Writing to it.
etyrnal_@reddit (OP)
interesting. i am using it to clone a new microSD card that came from the OEM loaded with an operating system and files, to an image i can later use to restore onto another microSD if necessary. so this is especially interesting, since i want a working image and i do NOT want to brick devices/microSD cards.
BigHeadTonyT@reddit
https://www.baeldung.com/linux/dd-optimal-blocksize
https://github.com/theAlinP/dd-bs-benchmark
To test it.
FryBoyter@reddit
Regarding block size, I think the information at https://wiki.archlinux.org/title/Dd#Cloning_an_entire_hard_disk is quite interesting.
daemonpenguin@reddit
I don't know what you mean by "chunking", but I think you're basically correct. The bs parameter basically sets the buffer size for read/write operations.
Try it and you'll find out. Setting the block size walks a line between having a LOT of reads/writes (like if bs is set to 1 byte) versus having a giant buffer that takes a long time to fill (bs=1G).
If you use dd on a bunch of files, with different block sizes, you'll start to notice there is a tipping point where performance gets better and better and then suddenly drops off again.
e_t_@reddit
If you don't specify block size, then dd will go 512B sector by 512B sector. There are... a lot... of 512B sectors on a modern hard drive. At the same time, whatever bus you connect to your hard drive with has only so much bandwidth. You want a number that effectively saturates the bandwidth with a minimum of buffering.
MsInput@reddit
You'd be able to hold 2 big files instead of many smaller files