How Jennifer Aniston and Friends Cost Us 377GB and Broke ext4 Hardlinks

[-]

MatchingTurret@reddit

I have just one question: Why are there only 4 of the friends in the headline picture?

[-]

ottovonbizmarkie@reddit

Ross, the largest Friend has begun eating the other ones.

[-]

LousyMeatStew@reddit

It is true what they say. Men are from Omicron Persei 7. Women are from Omicron Persei 9.

[-]

Kevin_Kofler@reddit

Maybe because the developer's workaround resulted in 4 copies of the file in their test case (and the remaining ones as hardlinks to one of the copies). :-)

[-]

Infinity-of-Thoughts@reddit

Because no one really likes Ross, and Pheebs is kind of wacky.

[-]

stillalone@reddit

The image looks like it's from a specific episode where those four were outside on the balcony looking up at something (I don't remember). It was an early episode.

[-]

MatchingTurret@reddit

Seems to be from Season 4: https://friends.fandom.com/wiki/The_One_With_The_Embryos

At least that's what Google Lens says...

[-]

Striderfs@reddit

Nah, I’d say it’s when Ross prepares the questions to see who knows each other better the girls or the guys and they bet on switching apartments.

The “Okay! Somebody call it this time…”

[-]

vagrantprodigy07@reddit

I love how rather than fix the actual root cause:

When a file moves between security contexts (say, from a private message to a public post), the system creates a new copy with a randomized SHA1. The original content is identical, but Discourse treats it as a new file.

They decided instead to try to code a workaround. If the root cause if fixable (and it was here), then fix the root cause, rather than getting creative with workarounds.

[-]

vividboarder@reddit

I took this to to be intended functionality. That they want to avoid any kind of leaking between context. That said, I can't imagine what the risk would be.

[-]

It is, but I also fail to see how that intended functionality could possibly cause this. Discourse created 250,000 duplicates. There were 250,000 different security contexts? Is each individual PM thread a new context?

Honestly it seems like it'd be worth it to just disable this feature. I don't know how worthwhile it is to hide the fact that XYZ file has been uploaded before, at least in most cases.

[-]

trunicated@reddit

That said, I can't imagine what the risk would be.

For a reaction gif, probably nothing. For some specific file that might be shared between say two people working and then one of those people whistleblowing to someone else? Having that link can be an issue if it's linked in a table somewhere.

[-]

vagrantprodigy07@reddit

Oh, I'm sure it's intended functionality. But there are better ways to do what they are trying to do, especially once you determine it's causing a major problem.

[-]

cpitchford@reddit

At 65,000 hardlinks, it started failing. Turns out ext4 has a limit: roughly 65,000 hardlinks per inode.

iirc its exactly 65000 (not 65534 which is 64k) What's also interesting is folders

If you create a folder in linux ext4

test/

2 folders link to it

test/ test/.

and if you create a folder inside you get more links

test/ test/. test/other/..

So the test folder has hard-links in 3 places

This means you can't create more than 64998 folder in a folder because ".." in those folders need to link back to the folder itself... and that reaches the limit... you can add more files, but not more folders..

That blew up a project I used to work on

[-]

Salamandar3500@reddit

Not really. ".." is handled by the VFS. Here's a proof :

mkdir -p a/b mount /dev/xxx a/b

a/b/.. Points to a, but a is not stored in /dev/xxx, so it CANNOT be a hardlink.

[-]

cpitchford@reddit

you're kind of right about the root of a mounted file system. there it is virtual because it needs to be. For example, in a bind mount (where the same filesystem is mounted in multiple locations simultaneously ".." needs to be dynamic

On top of this, "." needs to be dynamic too, outside the mount, the entry (b for example) needs to be virtually replaced with "." from the mounted file system. fstat a/b needs to show the same as fstat a/b/.

This isn't the case for subdirectories.. or to admit my error, isn't the case on subdirectories in ext2/3. I've just shown my age....

# Make a empty file 
touch skip=100000000 seek=100000000 bs=1024 count=1 if=/dev/zero of=/ext3.img
# put a file system in it
mke2fs -t ext3 /ext3.img
# mount it
mkdir /ext3
mount -o loop /ext3.img /ext3
# make the folders
mkdir /ext3/test
cd /ext3/test
for num in {1..65000} ; do
    echo "Making folder $num"
    mkdir $num || break
done

....
Making folder 65497
Making folder 65498
Making folder 65499
mkdir: cannot create directory ‘64999’: Too many links

ls -dil /ext3/test
4014081 drwxr-xr-x 65000 root root 1417216 Apr 15 09:45 /ext3/test

The inode is `4014081` and the link count is 65000.. we've hit the limit

lets mount and look in debug2fs:

debug2fs: cd test
debug2fs: ls
 4014081  (12) .    2  (4084) ..    4022597  (12) 836    4030662  (12) 1222   
debugfs:  cd 64998
debugfs:  ls
 4972823  (12) .    4014081  (4084) ..

The folder really contains those inode entries. I can unlink them, or change them

If you have an old kernel (let's say 15y+ older? I can't remember) ln -d -f ./a/b ./x worked, you could hard link folders.. undoing it was via the command line tool 'unlink'

So, no... .. isn't always virtual in the VFS. It is mostly virtual in the filesystem driver (i.e ext4 adds it back in)

However, ext4 wouldn't suffer the same issue I did about 15 years ago on a really old system I helped support

[-]

Salamandar3500@reddit

Interesting, thanks for the explanation ! I've always thought . and .. were provided by the kernel itself and not the filesystem, but now I know it wasn't always the case.

[-]

moljac024@reddit

but why hard links over soft links? soft links have no limitations

[-]

we_come_at_night@reddit

soft links still create data garbage

[-]

onyx1701@reddit

Given it's Discourse, I'm not surprised.

I used to be active on a certain forum that migrated to Discourse while it was still in beta. We were basically beta testing the software for them. The big cheese founder of the project who goes by codinghorror online even joined the forum and had admin access IIRC (mod access at the very least), even though he was never a member before.

I should point out, the forum was full of IT professionals, nerds, hackers, you name it. So a large portion of people knew what they were talking about in the context of forum or web software in general.

We broke the software in so many ways it's not even funny, from accidental breakage just trying to use it, to intentional and targeted breakage. Which is all fine, that was the point, the software was unfinished and we were stress testing the thing into oblivion.

What was not fine were responses to our reports and complaints. Obvious low-hanging fruit bugs would get fixed, but any complaints about the architectural or UX failures were always dismissed, even if we demonstrated it will obviously cause problems. Mostly because out complaints didn't fit the philosophical ideas of the creators.

You have a thread with more than 1000 posts and it's breaking the forum software because Redis can't handle the load? You're doing it wrong, no thread should be longer than 100 posts anyway because it won't stay on topic for that long, you should split them because every thread ~~should~~ must be topical. No fun allowed.

You want to use more than 20 emojis in a post but the rendering engine breaks with anything more than that? You're doing it wrong, emojis should be used sparingly, more than 5 per post is just spam, not a joke or a creative use of emojis that your particular community might enjoy and engage with. No fun allowed. Et cetera, et cetera.

Seems like nothing changed in all these years, bad technical decisions are still being made and are hacked around and/or users are getting blamed.

I'd go take a peek at what's happening on the main Discourse support forum, but like most of the members of the aforementioned forum that tried to report bugs there, I'm banned until 2238 or something and I just can't be bothered.

[-]

RetroGrid_io@reddit

This post shows everything wrong with "cloud"... anything. And I say this as someone who has managed cloud resources for decades. I read this and my blood ran cold. What do they think they are doing!?!?

The user of Discourse has enabled a feature called "secure uploads". The words themselves speak something - "Hey, if I upload this file, it's secure!".

But they aren't. The Admins at discourse can read them so trivially that they can deduplicate them at will and the information to do so is built in by default! They have no trouble whatsoever downloading and viewing the uploaded file, and further, don't seem to have a problem with showing everybody how stupid the uploaded file is.

They published a "secure upload" file for all to see. Ha ha! Funny!

But what if you need the upload to be actually secure and trust Discourse to do what the words imply?

Cloud services are convenient; they are often cheap; but don't believe for a second that they are really, actually secure. "Cloud" just means you're renting somebody else's computer and it's a fool who thinks they don't have the same rights over their computer as you have over yours.

[-]

Teknikal_Domain@reddit

It's not meant to be secure from the developers, if you wanted that you could self-host your own system and lock it down to within the inch of its life.

The point of secure uploads is so that random users or unauthenticated users cannot start grabbing stuff off of your forums.

[-]

RetroGrid_io@reddit

... And it doesn't get posted openly?

[-]

SunlightScribe@reddit

Anyone remotely serious is either hosting their own infrastructure or has conditions written in a B2B contract with a third party.

I don't think anyone is under the illusion that a free third party service makes the data unreadable except the intended recipients. Especially since they are expected to be able to moderate content even within private chat rooms.

I'd lean towards either apathy or acceptance of those terms in return for being free. Very few users expect privacy without caveats on free services, especially nowadays.

[-]

mina86ng@reddit

I honestly don’t understand how the problem happened in the first place. Why is the second copy of the file created in the first place? It’s just an entry in a database mapping a random identifier¹ to the file contents. Not to mention that the proper way to deal with user uploads is by hashing the content with SHA256 and using that as identifier. You get deduplication basically for free and there’s no need for ‘secure upload’ feature.

¹ I assume by ‘randomised SHA1’ they just mean 160-bit random identifier.

[-]

vividboarder@reddit

It looks to me like the live system doesn't store multiple copies and is possibly even de-duped across customers storing data once and storing pointers to that data for each client. This is an issue with backups instead since you're backing up the pointers but need to make sure you have at least one copy of the original content for the client to restore.

[-]

lunchbox651@reddit

Most backup software is smarter than to treat pointers as new data.

[-]

vividboarder@reddit

Yes, and that's what they are doing.

They didn't document the schema, but it looks like the database has something like an ID, original ID, and a file URL. They want to download and backup the files from the URLs without duplicating.

Thus, they are proactively identifying when it's a pointer and using a link instead so they don't store a duplicate. Then they hit the problem documented in the article.

[-]

mina86ng@reddit

This is an issue with backups instead since you're backing up the pointers but need to make sure you have at least one copy of the original content for the client to restore.

How is that an issue? Back up the database with the pointers and then back up the file storage so you have copy of all the contents. The only complication is figuring out which files need to be saved in case a partial backup is performed, but that’s trivial to address.

[-]

CmdrCollins@reddit

Why is the second copy of the file created in the first place?

Probably because duplicate uploads were never a major problem for them (you'd mostly have users copying the asset link around, rather than specifically uploading another copy) until they added context specific links - at which point it was likely easier to just create duplicate objects.

Probably also why the feature appears to be specific to deployments using S3 as the storage backend - this would quickly get out of hand with local storage.

[-]

m0ntanoid@reddit

just wait... In a few years he will post another article where he will discover there is no need for hard links, soft links and multiple duplicated files and will give us amazing solution - databases!

[-]

vividboarder@reddit

Do you store your database backups in another database?

[-]

m0ntanoid@reddit

as a base64 encoded string. Don't you?!

[-]

KlePu@reddit

Ah, finally there's an opposite to incremental backups: recursive backups! <3

[-]

Martin8412@reddit

I use base58 so I don’t confuse o with 0 and l with I when typing it in.

[-]

NoTime_SwordIsEnough@reddit

These sarcastic back-and-fourth /r/Linux's version of "Lisa need braces・Dental Plan".

[-]

PracticalPersonality@reddit

So they started with a poor architectural design that allowed users to create an unlimited number of file copies without even knowing that's what they were doing. Then they combined that tragedy with a poor understanding of the underlying storage medium (of course "filesystems have opinions") and a move-fast-break-things mindset.

The future of tech will undoubtedly be riddled with easily preventable bugs that never should have made it past an ARB. I should learn to farm.

[-]

EnUnLugarDeLaMancha@reddit

With btrfs, you can run deduplication tools at any time that scan all files and deduplicate them, without dealing with hardlinks. Same for zfs, except that it does it at runtime

[-]

LousyMeatStew@reddit

Never use deduplication with ZFS unless you know what you’re doing. And if you still think you should use dedupe, you probably don’t know what you’re doing. And even if you’re 100% positive you want dedupe on ZFS, you’re probably wrong. Don’t do it.

Don’t ask me how I know.

[-]

Klutzy-Condition811@reddit

Fast dedupe is new and works much better without needing huge amounts of memory.

[-]

LousyMeatStew@reddit

The memory use is only part of it. Block-level de-dupe is rarely the right tool for the job. If it were as transparent as zfs set compression=lz4, I’d say it wouldn’t hurt but even fast dedupe is too much overhead.

For use cases like the one OP is presenting (backups with lots of redundant data), you could look at options like tar+lrzip or wimlib which will dedupe and compress more efficiently than the filesystem can.

Here’s a good blog post from the folks that wrote fast dedupe that goes into the complexity of it, with IMO the perfect title: OpenZFS deduplication is good now and you shouldn't use it

[-]

Clean_Experience1394@reddit

This actually convinced me

>So for a table of 11.7M entries, the number of those that represent something we actually managed to deduplicate is a literal rounding error. It’s pretty much entirely uniques, pure overhead. Turning on dedup would just add IO and memory pressure for almost nothing.

But the real reason you probably don’t want dedup these days is because since OpenZFS 2.2 we have the BRT (aka “block cloning” aka “reflinks”).

[...]

If you compare to the dedup simulation, I’m not saving as much raw data as dedup would get me, though it’s pretty close. But I’m not spending a fortune tracking all those uncloned and forgotten blocks.

Now yes, this is not plumbed through everywhere. zvols don’t use the BRT yet. Samba has only just gotten support for OpenZFS very recently. Offloading in Windows is only relatively new. The situation is only going to get better, but maybe it’s not good enough yet. So maybe you might be tempted to try dedup anyway, but for mine, I can’t see how the gains would be worth it even without block cloning.

[-]

LousyMeatStew@reddit

Yup, good point. BRT could be a good solution but then again, in their write-up, they say they don’t know for sure what the target filesystem is for their backups which makes me think they are accounting for people who self-host their software.

De-dupe is really hard and there’s no one-size-fits-all which is why we end up with kludgy solutions like what’s described here.

[-]

HighRelevancy@reddit

To be fair that's about general purpose workloads, which this problem is not.

[-]

LousyMeatStew@reddit

According to the source, this is a problem only impacting a single customer and the process needed to be filesystem agnostic. The specific problem may not be general purpose, but only because it’s a corner case of a larger, general purpose backup process.

[-]

Klutzy-Condition811@reddit

You dont need to use Btrfs either. If they still need the performance characteristics of ext4, XFS can do this too :)

[-]

pnutjam@reddit

Yeah, I'm of the opinion you should not use ext3/4 unless you have a reason to use it. XFS is the default and btrfs is preferred.

[-]

vividboarder@reddit

Inside a tar file backup?

[-]

wodes@reddit

Probably yes if it works at the block level?

[-]

throwaway234f32423df@reddit

This is probably better than hardlinks. With hardlinks I've had oopsies where I forget the file is hardlinked, modify one copy of the file, and all the other copies get modified as well when I would have preferred them not to be. With btrfs deduplication, modifying one copy of a file just breaks the linkage between that copy and the others, writing the new file to disk but leaving the others untouched.

[-]

dinominant@reddit

Warning! ZFS theoretically has a much higher limit. In practice it will heavily fragment, write amplify, deadlock, and the entire server will OOM and crash. Then the developers and community will tell you to restore a full backup of the entire filesystem, even if it is hundreds of TB of data.

[-]

veghead@reddit

Congratulations, you have invented single instance storage. Again.

[-]

whamra@reddit

Wouldn't this scenario also vastly benefit from compression?

[-]

Seven-Prime@reddit

most image files are already compressed.

Deduplication would probably be a better fit. But sounds like they are rolling their own storage.

[-]

whamra@reddit

It's not about compressing the content of a single file. But he mentioned backups. An archive containing multiples of the same content will be vastly smaller when compressed.

[-]

Seven-Prime@reddit

Yes that is an important nuance. Many backup solutions already can perform block level deduplication. I would make the assumption they are not using a modern backup solution. It does cost money, and could be out of reach of many opensource projects.

[-]

CmdrCollins@reddit

I would make the assumption they are not using a modern backup solution.

The files in question are stored externally (in a S3 bucket) and only downloaded temporarily so this backup script can pack them into a tarball - the tarball itself will deduplicate them (probably why this hasn't been spotted almost instantly), but only if the system to be backed up has enough spare local storage to store all of those duplicates first.

[-]

Seven-Prime@reddit

Yeah it's a bit of a pickle. Designing for the low-end for at scale is always a challenge. It's easy for me, when I wear my sys arch hat, to point out many other solutions. But that wouldn't scale down to 'make it easy for the HAM radio club.'

[-]

vagrantprodigy07@reddit

Dedup might work, but first they probably need to stop creating a new SHA1 signature for the file.

[-]

mina86ng@reddit

Theoretically, the compression algorithm might notice when compressing something that it has seen already and encode it as reference to the old data.¹ However, in this case that would require an impractically large compression window.

¹ Consequence is that if you compress a tar archive, you might get slightly better compression if you sort files in the archive by file type.

[-]

cornmonger_@reddit

or ... don't use hardlinks

[-]

fellipec@reddit

Me, looking at restic backups deduplicating automatically the same picture folder I have in 3 different computers

[-]

ImpertinentIguana@reddit

I can mail them a USB stick to store all that if they need me to.

[-]

omniuni@reddit

Hash the file. If it's a new hash, save it, otherwise, don't. Store a database record of the hash and get back an ID. To get the file, pass the ID to a file proxy script that also checks security permissions (you should do this anyway) before returning the file. No duplicate files, no filesystem dependencies or weirdness, and properly secure.

[-]

vividboarder@reddit

It looks like this is actually what they do in their live system and the reason that they have the sha1s for each of the files indicating the need to hardlink. This post is about how they perform backups for clients of that data.

[-]

Glad-Weight1754@reddit

Imagine if it was ZFS. You would never know it was 200K copies of the same sex tape until storage gave out.