Replacing duplicate files with hard links to save space?
Posted by Zarquan314@reddit | sysadmin | View on Reddit | 38 comments
Whenever I go from one computer to another, I always copy my important directories from my home folder to a backup location (separate from my standard backup solution as a sort-of snapshot of that computer when I stopped using it, which has been very useful). However, these folders often contain backups of previous computers, some of which have been unpacked and placed in the correct location on the computer I am moving out of.
For example, I looked through my backup and found 7 different copies of my entire music library. Most of the songs are exact copies, with some being added over time.
This hasn't been a problem, as storage sizes were increasing faster than my backups were (see XKCD 1718), but I've noticed that this trend has slowed down or stopped, so I wanted to go through the many generations of old computer backups and do something about the duplicate data.
My thinking is that it would be nice to have something that replaces identical copies of files with read-only hard links. That way, everything is still where I expect it in the directory tree, but there aren't a bunch of copies taking up actual disk space. And making them read-only prevents me from accidentally changing my "historical records". Is there a utility that can do that for me so I don't have to do it manually?
Or is there a better solution?
EDIT: I posted this earlier, but accidentally had the wrong title, so I deleted my first post and replaced it.
antiduh@reddit
Sounds like you need a personal cloud solution.
I use ResilioSync. You pay once for the software, run it on as many nodes as you want. They sync to each other over the internet using direct connections. I keep a master database of all my files on my personal server that all of my devices sync to and from. That way I have one canonical database of my files and it's always synced to my 7 (?) devices.
It works over a private air-gapped network just as easily as over the internet. And it can manage databases of millions of files in the terabytes, no problem.
Zarquan314@reddit (OP)
I have been wanting to do something like that for regular backups, but I do like the ability to open a folder and effectively step into the past, seeing what my computer looked like at the time it failed, with all the files exactly where I left them in the structure.
Could it help me if I had two copies of my music directory on my computer by recognizing that they are the same and consolidating them?
antiduh@reddit
No, if you want a proper backup solution you're going to want something else. I work in small scale non-profit situations, so I like Bacula, if that helps you.
glassmkr_@reddit
Two operational gotchas missing from the thread:
Backup software and hard links don't always play nice. Tools that preserve hard links often require explicit flags (rsync needs -H), and not all consumer backup software handles them at all. Many cloud-backup services upload each link as a full file, which undoes your dedup at the backup-target level. Check that your backup tool explicitly handles hard links before relying on rdfind/jdupes/rmlint.
ZFS dedup is RAM-expensive in a way the thread is glossing over. The dedup table (DDT) needs to live in RAM for performance, around 5 GB of RAM per TB of deduped data. If the DDT spills to disk, write performance crashes. For a home setup with TBs of backup data, that's a significant memory commitment. ZFS snapshots are cheap. ZFS dedup is not. The two are often mentioned together but have very different cost profiles.
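Two quick sanity checks for both points, as a sketch (the pool and path names are just placeholders):
rsync -aH /backups/ /mnt/external/backups/   # -H is what actually carries hard links across; plain -a does not
zdb -S tank                                  # simulates dedup on an existing pool and prints the table/ratio, so you can estimate the cost before setting dedup=on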
Zarquan314@reddit (OP)
I certainly would prefer a system that I can just stick on my external drives and leave in certain places. My backup solution is pretty passive and is usually unpowered and offline. I have three drives in places I go to regularly that my computer attaches to a couple times a week, so no constantly running server, but fairly regular verification that each individual backup is intact. I access this backup pretty infrequently, maybe a couple times a month to find an old file or program. This is for more historical things and not my current work. I don't intend to ever back this up to anything I or someone I personally know doesn't own.
I also use these drives for other things, as my backups aren't that big yet (but they are getting large enough to interfere with the drives' other functions).
My current plan: I think my hard link idea has been dethroned by a dedup setup, as that is more robustly supported by common file systems. I'm thinking of partitioning the drives to something like NTFS and using its built-in deduping, as that seems pretty simple, and then the rest of each drive can be for other purposes that demand speed over compactness.
Does this sound sensible to you, or would you recommend a different course? Would it be possible for the backup to be a bit more dynamic than a partition, so that it can grow and shrink with the amount of space it actually takes? And do you think I should use something other than NTFS?
TwistedStack@reddit
I've been using VDO on RHEL for years for block level deduplication. It doesn't need a lot of RAM unlike ZFS deduplication so running it on a laptop isn't a problem. In combination with LVM thin provisioning, I've allocated over 1 TB of storage backed by a 256 GB partition. My space saving is currently sitting at 48%.
On RHEL 9, it breaks at the beginning of a year when the maintainers forget to make sure it works for a new kernel. It gets fixed eventually. Meanwhile, you can run an older kernel. I hear it's not a problem on RHEL 10 since it's been merged into the kernel for that. I still haven't gotten around to upgrading.
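For anyone curious, the LVM route looks roughly like this (volume group name and sizes are made-up examples; check lvmvdo(7) for the exact syntax on your release):
lvcreate --type vdo -n backups -L 256G -V 1T myvg   # 256G of physical backing, presented as a 1T deduplicated volume
mkfs.xfs /dev/myvg/backups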
There's the equivalent ReFS deduplication on Windows but I've never tried it. I don't know how reliable it is.
pdp10@reddit
jdupes has Windows, Linux, Mac binaries. It can dedupe via hardlink, symlink, or filesystem-specific CoW mechanism.
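A minimal sketch of what that looks like (the path is a placeholder; check jdupes --help for the exact flag names in your build):
jdupes -r /path/to/backups      # report duplicate sets first
jdupes -r -L /path/to/backups   # then replace duplicates with hard links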
For this we use soft-links, or "symlinks". Symlinks can span between filesystems, whereas hard links cannot. Symlinks are considerably more obvious to the end-user than hard links.
It's less work to proactively manage storage than to reactively manage it, even with excellent dedupe software. Good luck.
LOLBaltSS@reddit
If you have a Windows Server based system floating around with storage, the deduplication built into it works fine.
https://learn.microsoft.com/en-us/windows-server/storage/data-deduplication/install-enable
sdoorex@reddit
If you use it on a file server, though, beware that Windows search cannot index deduplicated files.
coffee_poops_@reddit
This tool works for NTFS. Obviously the files need to be on the same volume to hard link. You can remove write permissions in ACLs, but that's probably more trouble than it's worth.
https://jensscheffler.de/dfhl
notR1CH@reddit
https://github.com/pkolaczk/fclones is great for this, it can make block clones on ZFS even without dedupe enabled.
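If I recall the fclones workflow correctly, it's a two-step pipeline (paths are placeholders):
fclones group /path/to/backups > dupes.txt   # find and group duplicates into a report
fclones link < dupes.txt                     # replace dupes with hard links, or 'fclones dedupe' for reflink/block clones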
Sroni4967@reddit
rmlint has saved me on this before, it can output a script that replaces dupes with hardlinks. Just be aware that hardlinks across Windows/Linux are messy since NTFS handles them differently than ext4/btrfs.
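Roughly like this, from memory (double-check the handler names in the rmlint docs; the path is a placeholder):
rmlint -c sh:hardlink /path/to/backups   # writes an rmlint.sh script instead of touching anything
./rmlint.sh                              # review it first, then run it to do the hardlinking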
Demented_CEO@reddit
Manual deduplication of files, not on the filesystem level...? I'd say consider a different structure for the data instead!
Zarquan314@reddit (OP)
I really like being able to open a folder and see what I would see if I opened that old computer when I'm looking for something old. If I just delete the duplicates, I won't have that anymore.
mattelmore@reddit
ZFS snapshots are probably what you want.
Zarquan314@reddit (OP)
This sounds like it could work. So I'd take my current backup and do a ZFS snapshot with some kind of deduplication?
I haven't used ZFS before, so I don't know much about it.
Dje4321@reddit
It's basically RAID that sits at the file system level, so it can decode the data stream and optimize it directly.
It supports file deduplication, copy-on-write, snapshots, disk striping, remote replication, etc.
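A bare-bones example of what that looks like in practice (pool and dataset names are made up, and dedup=on has the RAM cost mentioned elsewhere in the thread):
zpool create backuppool /dev/sdX               # pool on a single external drive
zfs create -o dedup=on backuppool/laptop2019   # dataset with block-level dedup
zfs snapshot backuppool/laptop2019@imported    # cheap, read-only point-in-time snapshot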
gsmitheidw1@reddit
Or btrfs which has similar concepts.
Vino84@reddit
My solution to this was:
* Set up a NAS
* Set up a backup solution for the NAS
  * I used an external HDD to start with.
  * I also have a sync to an extra HDD in my desktop.
  * I want to re-do this with ZFS snapshots at some point.
* Create starting folder structure
* Copy the latest versions onto the NAS
* Only access the files from the NAS from that point
* Use an app to find duplicate files and delete extras as needed
It takes time and it got me to where I needed.
Zarquan314@reddit (OP)
I do like being able to open a folder and see what I would have seen if I opened my computer the day I made the backup. In my mind, each computer is its own thing containing its own snapshot of a phase of my digital life, and I tend to keep them separated and only bring over what I want or need when I want or need it, so it isn't like a continuous filesystem. I would prefer there to not be holes in the "historical snapshot" of the old computer, which is why I prefer replacing them with hard links over deleting them.
But ZFS snapshots are interesting. I think I would like to learn more.
Vino84@reddit
For your use case, I'd look for a file system that supports dedupe. Most will look for duplicate blocks and use a database of sorts to link them. This would retain the file/folder structure and minimise space usage. Files are then opened from the pattern of blocks that make them up. Hard links like you have suggested will break when the destination file is moved or deleted.
I know that ZFS, NTFS, ReFS, and btrfs all support deduplication. It's up to you to work out which is best for you.
Zarquan314@reddit (OP)
The deduping features you are suggesting look like what I'm looking for. But I didn't realize hard linked files could break like that.
Do you know how to reproduce a hard link breaking? Because I created a hard link, moved both sides around, and deleted the destination, and the other one still functioned normally.
Vino84@reddit
It's been ages since I've messed around with them. I vaguely recall a drive issue, maybe. Once bitten, twice shy as they say.
Zarquan314@reddit (OP)
Yeah, I get it. I often avoid things that have hurt me like the plague. But from my understanding of file systems, the thing you are saying the file systems are doing is actually just hard linking, but maybe with an extra step where they make a copy if you change one of the duplicates instead of overwriting all the versions.
Vino84@reddit
It's block level usually. So if most of a file is similar, it will dedupe the parts that are. The unique blocks are still kept unique.
Say Files 1 and 2 each have four blocks: blocks 1, 2, and 3 are identical and get deduped, while block 4 (in File 1) and block 5 (in File 2) are unique. So 8 blocks of files take 5 blocks of space.
With hard linking, you'd need exact duplicates, so Files 1 and 2 would still take their full drive space, i.e. 8 blocks. Just another thing to consider.
Zarquan314@reddit (OP)
Ah, so deduping can be more efficient if I have multiple files that are close, as it looks at individual blocks instead of the file as a whole, whereas hard linking looks at the whole file.
Sounds like deduping is a more efficient solution for my backup use case then, but hard linking could be better if I were reading the data often and expected it to be fast (even modern SSDs prefer sequential reads IIRC).
AcornAnomaly@reddit
Deduping is also good because it can save you from accidentally (permanently) corrupting the file.
With hard links, all copies of the file point to the same data. If you accidentally do something to change one of them, ALL of them change.
With a deduping filesystem, if you change one of the files, that specific file's data is split off into a separate version. The other "copies" of the file remain unchanged.
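You can see that in about ten seconds with throwaway files:
echo v1 > a.txt
ln a.txt b.txt    # hard link: both names point at the same inode
echo v2 > a.txt
cat b.txt         # prints v2 -- writing through one name changed "both" files
(Editors that save by writing a new file and renaming it over the old one will quietly break the link instead of modifying both, which is another gotcha.)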
AcornAnomaly@reddit
As far as I'm aware, hard links won't break like that.
Symbolic links/soft links will break if the "target" file is deleted, but hard links don't have a concept of a target file. It's just another entry in the filesystem pointing to the same data.
AmazingHand9603@reddit
If you want to manage this over time, moving to something like a NAS and storing all your files with snapshots works better. Deduplication at the filesystem level like with ZFS or Btrfs will automatically handle duplicate blocks so you don’t need to deal with hard links by hand. After you do one big clean-up to dedupe your current stuff, you only need to keep a process for copying new files in and running the dedupe tool every so often. I’d only use hard links if you really need the file paths to stay the same in all your historical structures. Otherwise it’s cleaner to flatten archives and structure your data so that you have a single canonical music library with snapshots or versioning, and then thin out all those backup directories. You’ll avoid confusion and it’ll be way easier to manage going forward. For keeping things read-only, just set file permissions after deduping and you’re safe.
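For the read-only part, a one-liner along these lines is enough (the path is a placeholder):
chmod -R a-w /path/to/backups   # strip write permission from everything under the archive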
Zarquan314@reddit (OP)
Sounds like deduplication is probably the way to go, now that I understand how it works.
However, I don't really have the resources right now to have a NAS. My current backup solution is a few external drives with copies of the data (among other things). One I keep at home, one I keep at my desk at work, and one (SSD) I keep in my backpack.
Keep in mind that my music library isn't actually the focus here, it is just an example of wasted space. The different versions have not yet confused me, as I know the one in my latest laptop's Music folder (and its backup) is the most recent.
I do like going through my old computers' data by looking at their home directories to find things, and I find it pretty effective.
BarracudaDefiant4702@reddit
There are a couple of utilities that can do this. Another option is fdupes, but I have had better luck with rdfind for large sets (tens of TB) of files.
rdfind -makehardlinks true /path-to-backups
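If you want to see what it would do first, rdfind also has a dry-run mode (same placeholder path as above):
rdfind -dryrun true -makehardlinks true /path-to-backups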
Zarquan314@reddit (OP)
Cool! So I can run this and it will consolidate my identical files without disturbing the original directory/file structure?
BarracudaDefiant4702@reddit
Yes.
Zarquan314@reddit (OP)
Great! If I add another backup later, can I just run the same command again to clean up the duplicates?
BarracudaDefiant4702@reddit
Yes, it will replace any new duplicates with hard links. It doesn't really help with your read-only requirement, so you have to handle that outside of it. I don't know if there is any loss of timestamp handling (probably fine) when hardlinks are put in place, but you might want to verify if you are concerned about the 3 timestamps when it deduplicates files.
Zarquan314@reddit (OP)
Ehhh, I can make the entire thing read-only myself, and I shouldn't be altering any files in that drive anyway. I think the duplicate identification and hard linking is the hard part.
That's a good point about the time stamps. I do occasionally use time stamps to find a file from a specific year, so it would be annoying if it changed them to the present or some intermediate time from some copy. I think the ideal behavior would be for it to apply the oldest time stamp, as if it is truly identical and it hasn't been altered, the last time it was really modified was far in the past (unless I altered it then altered it back, which I don't think I've done).
BarracudaDefiant4702@reddit
I suspect it does try to preserve them, but I haven't tested, so you might want to run a test. Linux does have 4 timestamps for each file: the creation date (btime), last modified (mtime), which is the default shown by ls, the last accessed date (atime), and the change time (ctime).
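An easy way to check (the filename is just an example):
stat Music/track01.mp3   # prints the Access, Modify, Change and (where supported) Birth times
Compare the output on a duplicate before and after rdfind replaces it with a hard link.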
So_average@reddit
Git?