Backup Question
Posted by sdns575@reddit | linuxadmin | 13 comments
Hi,
I'm running my backups with rsync and a Python script that handles checksumming, file-level deduplication via hardlinks, and notifications (encryption and compression are currently handled by the filesystem). It works very well and I don't feel a need to change it. In the past I used Bacula, which also worked well, but I moved away from it because of its complexity.
Out of curiosity, I searched for alternatives and found enterprise software like Veeam Backup, Bacula, BareOS and Amanda, as well as tools like Borgbackup and Restic. Reading the documentation, I noticed that the enterprise software (Veeam, Bacula...) tends to store data as full + incremental backup cycles (full, incr, incr, incr, full, incr, incr, incr...), so restoring the whole dataset can require restoring the full backup and then every incremental up to the latest one (within a given cycle). Software like Borgbackup, Restic (if I'm not wrong) or scripted rsync does incremental backups in snapshot form (an initial backup, then snapshot of old files + increment, snapshot of old files + increment, and so on), and if you need to restore the whole dataset you can simply restore the latest backup.
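Roughly, the rsync version of that snapshot pattern is built around --link-dest, something like this (a minimal sketch, not my actual script; the paths are placeholders):

```python
#!/usr/bin/env python3
"""Minimal sketch of a hardlink-snapshot backup driven by rsync --link-dest.

Each run creates a new dated directory; unchanged files are hardlinked
against the previous snapshot, so they take no extra space, and the latest
directory is always a complete, restorable tree.
"""
import datetime
import os
import subprocess

SOURCE = "/srv/data/"            # what to back up (placeholder)
DEST_ROOT = "/backup/myhost"     # where snapshots live (placeholder)

def run_backup():
    os.makedirs(DEST_ROOT, exist_ok=True)
    today = datetime.date.today().isoformat()
    dest = os.path.join(DEST_ROOT, today)
    latest = os.path.join(DEST_ROOT, "latest")   # symlink to the previous snapshot

    cmd = ["rsync", "-a", "--delete"]
    if os.path.exists(latest):
        # Unchanged files become hardlinks to the previous snapshot.
        cmd.append("--link-dest=" + latest)
    cmd += [SOURCE, dest + "/"]
    subprocess.run(cmd, check=True)

    # Point "latest" at the snapshot we just made.
    tmp = latest + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(dest, tmp)
    os.replace(tmp, latest)

if __name__ == "__main__":
    run_backup()
```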
Seeing that enterprise software uses backup cycles (full + incr) instead of snapshot-style backups, I would like to ask:
What is the advantage of backup cycles over the "snapshot" backup method?
I hope I explained what I mean correctly.
Thank you in advance.
WildFrontier2023@reddit
TL;DR: Not sure about others, but Bacula uses full + incremental cycles for efficiency and scalability in large setups. Snapshot-style backups (like rsync) are simpler but can get resource-heavy for big datasets; they're simply not an enterprise-grade approach.
Why Bacula prefers backup cycles:
Why snapshots are different:
Bottom line:
If you’re managing large, complex environments, Bacula’s cycles make sense. For smaller, simpler setups, snapshot tools like rsync or Borg are easier to use and restore from. Stick with what works for your needs! :)
michaelpaoli@reddit
As an alternative to full + incrementals, many also do/offer full + differential: a full restore to the most current state then takes at most two sets, the full and the latest differential between that full and the current state. The advantage is fewer backups/media to load and read for a restore; the disadvantage is that each differential can grow relatively quickly, so for some that may not be feasible, or they may need to compensate by running fulls more frequently.
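To make the bookkeeping difference concrete, here's a rough sketch of which sets a restore needs under each scheme (not any particular product's logic; it assumes a cycle uses one style or the other):

```python
from dataclasses import dataclass

@dataclass
class Backup:
    when: int          # e.g. day number
    kind: str          # "full", "incr", or "diff"

def restore_chain(history):
    """Return the backup sets needed to restore to the most recent point.

    full + incrementals: the last full plus every incremental after it.
    full + differentials: the last full plus only the latest differential,
    since each differential already contains everything since that full.
    """
    history = sorted(history, key=lambda b: b.when)
    last_full_idx = max(i for i, b in enumerate(history) if b.kind == "full")
    chain = [history[last_full_idx]]

    after_full = history[last_full_idx + 1:]
    incrs = [b for b in after_full if b.kind == "incr"]
    diffs = [b for b in after_full if b.kind == "diff"]

    if diffs:
        chain.append(diffs[-1])   # only the newest differential is needed
    chain.extend(incrs)           # but every incremental is needed
    return chain

diff_style = [Backup(0, "full"), Backup(1, "diff"), Backup(2, "diff"), Backup(3, "diff")]
incr_style = [Backup(0, "full"), Backup(1, "incr"), Backup(2, "incr"), Backup(3, "incr")]
print([(b.when, b.kind) for b in restore_chain(diff_style)])  # 2 sets to read
print([(b.when, b.kind) for b in restore_chain(incr_style)])  # 4 sets to read
```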
There are various flavors of snapshot, but most work by continuously maintaining a differential against a point in time. So most of the time it's not really a full copy, per se: at snapshot time, generally at or below the filesystem layer, all changes start being tracked and recorded, usually at the block level, at least between the time of the snapshot and the present. From then on, any time a block changes on the live volume, the original block is written/added to the snapshot, except that if the original has already been written there it won't be written again; and with some implementations, if a later write happens to restore exactly what was originally there, that block may be removed from the snapshot again, since it's no longer needed. Depending on the technology, some will only do/hold one snapshot at a time (e.g. LVM), whereas others can have multiple numbers/layers of snapshots (e.g. ZFS); ZFS can even flip which dataset is the snapshot of which, i.e. swap what's the base reference and what's the snapshot of that reference, if and when desired. So when something says "snapshot", one is often well advised to read carefully and be sure exactly what type of "snapshot" one is getting, what it does and doesn't do, and how it works.
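A toy model of that copy-on-write bookkeeping, nothing like LVM's or ZFS's actual on-disk structures, just the idea:

```python
class SnapshottedVolume:
    """Toy copy-on-write snapshot over a dict of block number -> bytes.

    On the first write to a block after the snapshot is taken, the original
    contents are saved into the snapshot's store; reading the snapshot
    prefers those saved blocks and falls through to the live volume for
    anything that never changed.
    """

    def __init__(self, nblocks, blocksize=4096):
        self.blocksize = blocksize
        self.live = {n: bytes(blocksize) for n in range(nblocks)}
        self.snap = None                      # block number -> original bytes

    def take_snapshot(self):
        self.snap = {}                        # only one snapshot at a time, LVM-style

    def write(self, n, data):
        if self.snap is not None and n not in self.snap:
            self.snap[n] = self.live[n]       # preserve the original exactly once
        self.live[n] = data.ljust(self.blocksize, b"\0")

    def read_live(self, n):
        return self.live[n]

    def read_snapshot(self, n):
        if self.snap is not None and n in self.snap:
            return self.snap[n]               # block changed since the snapshot
        return self.live[n]                   # untouched block, live copy still valid

vol = SnapshottedVolume(nblocks=4)
vol.take_snapshot()
vol.write(2, b"new data")
assert vol.read_snapshot(2) == bytes(4096)    # snapshot still sees the old block
assert vol.read_live(2).startswith(b"new data")
```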
Note that what you're doing there, and how, may not protect you in some scenarios. E.g. a farm of hardlinks to the originals isn't really a backup: change the data in the original file in place, and the "backup" link to it changes too. That said, yes, hardlinks can greatly reduce redundant, unneeded storage of backups; they just shouldn't be links to the original live locations, otherwise writes there likewise change the data on the backup(s). See also: cmpln (a program I wrote that deduplicates via hardlinks very efficiently: it only reads blocks of files for as long as there's still a potential match, never past the first differing block, and never reads any block from any file more than once. Note, though, that it doesn't consider differences in e.g. ownership, permissions, etc.).
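For the curious, the general shape of that kind of dedup logic is something like the following (a rough sketch, not cmpln itself, and like cmpln it ignores ownership/permission differences):

```python
import os
from collections import defaultdict

BLOCK = 65536

def dedup_hardlink(paths):
    """Hardlink regular files with identical content (rough sketch, not cmpln).

    Files are grouped by size, then each group is repeatedly split by the
    content of the next block, so a file is only read while it still has a
    potential match, and no block is read more than once.
    """
    by_size = defaultdict(list)
    for p in paths:
        if os.path.isfile(p) and not os.path.islink(p):
            by_size[os.stat(p).st_size].append(p)

    for group in by_size.values():
        if len(group) < 2:
            continue
        candidates = [(group, [open(p, "rb") for p in group])]
        while candidates:
            names, handles = candidates.pop()
            buckets = defaultdict(list)
            for name, fh in zip(names, handles):
                buckets[fh.read(BLOCK)].append((name, fh))
            for chunk, members in buckets.items():
                if len(members) < 2:
                    members[0][1].close()          # no potential match left, stop reading
                elif chunk == b"":                 # all hit EOF together: identical
                    keep = members[0][0]
                    for dup, fh in members[1:]:
                        fh.close()
                        os.unlink(dup)
                        os.link(keep, dup)         # replace duplicate with a hardlink
                    members[0][1].close()
                else:                              # still matching, compare the next block
                    candidates.append(([m[0] for m in members],
                                       [m[1] for m in members]))
```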
sdns575@reddit (OP)
Hi, and thank you for your answer. Sorry for the late reply.
Why not? When a hardlink is replaced by a new copy of the file, the previous version is still saved in the same place, with the same content and metadata.
What are the drawbacks of a "farm of hardlinks"?
michaelpaoli@reddit
E.g.:
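Something along these lines (a minimal Python demonstration; the filenames are just for illustration):

```python
import os

# Make an "original" and a hardlinked "backup" of it.
with open("important.txt", "w") as f:
    f.write("the data you care about\n")
os.link("important.txt", "backup-of-important.txt")

# Now corrupt/overwrite the original in place...
with open("important.txt", "w") as f:
    f.write("oops\n")

# ...and the "backup" shows exactly the same damage, because both names
# point at the same inode and the same data blocks. (A tool that unlinks
# and recreates the file instead would leave the old link's data alone.)
print(open("backup-of-important.txt").read())   # prints "oops"
```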
So, tell me where you'll be restoring your important data from, hmm? Like I said: a farm of hardlinks to the live originals isn't really a backup.
Also, a farm of hardlinks over separate copies of the file(s) can be an inefficient use of space. On many filesystems, small files are stored quite inefficiently, and with plain links you get no compression of the files themselves, unless you're first compressing and then linking them. It also chews up a lot of inodes and directory space on the filesystem. So the tiny example file above eats up, on most filesystems, 4 KiB just for its data, plus space for a directory entry. Whereas if I use, e.g., tar, the additional space per file is much smaller than 4 KiB: just the data in the file itself plus a modest per-file header (though tar does take its own space for headers, and may pad the archive out to tar's block size).
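A quick way to see that overhead for yourself (rough numbers; the exact figures depend on the filesystem, assuming a typical 4 KiB block size):

```python
import os
import tarfile

# A 10-byte file...
with open("tiny.txt", "wb") as f:
    f.write(b"ten bytes\n")

st = os.stat("tiny.txt")
print("bytes of data:        ", st.st_size)          # 10
print("space on the fs:      ", st.st_blocks * 512)  # typically 4096 (one block), plus an inode

# ...versus its incremental cost inside a tar archive: a 512-byte header
# plus the data rounded up to the next 512-byte boundary, so ~1 KiB here.
per_member = 512 + -(-st.st_size // 512) * 512
print("cost inside a tar:    ", per_member)

# The archive itself is padded out to tar's record size (10 KiB by default
# with Python's tarfile), but that's a one-time cost, not per file.
with tarfile.open("tiny.tar", "w") as tar:
    tar.add("tiny.txt")
print("whole archive on disk:", os.path.getsize("tiny.tar"))
```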
esgeeks@reddit
Although snapshot-style backups look more efficient for restores, space management and integrity verification can be more complex. I would opt for full + incremental backup cycles, as they provide redundancy against data corruption and make integrity verification easier.
SurfRedLin@reddit
It saves space and time.
sdns575@reddit (OP)
Hi and thank you for your answer.
Do you mean the "snapshot" method?
SurfRedLin@reddit
No, backup cycles, as you call them.
bityard@reddit
There are multiple approaches to writing backup software and these are two of the main ones.
To put it simply, most mature "enterprise" products do full + incremental backups because it's a very straightforward process that fits into almost any infrastructure. They just copy all the files they find and save them to a self-contained archive somewhere, like rsync on steroids. This is extremely flexible, but it turns out to be pretty wasteful in terms of disk space and backup time. They support all kinds of storage, and quite often that simple storage means you can do disaster recovery even if your backup software is offline, because the underlying archives are just tarballs or something similar.
The "snapshot" style backups like Borg, Restic, and Kopia store their backup data in highly structured repositories for better speed, compression, and deduplication. Think of it as a mashup of a database and blob storage. The trade-off with these is that you can only do things that the backup software directly supports. You also have ZERO chance of recovering data from those backups "by hand" but that is less of an issue because they tend to be CLI programs instead of a big centralized server like Veeam.
meditonsin@reddit
Borg and Restic only back up to disk or disk-like online backends. You can get away with doing only incrementals after the first full backup in that model, because you always have direct access to the previous file versions and can merge everything together via hardlinks, deltas and whatnot, without wasting space.
But Bacula, Veeam and the like can also back up to tape, where you can't just merge backups together as you can in a live filesystem. An incremental backup is just the changes since the last backup, appended to the end of the tape. So if you ever want to reuse or rotate out old tapes, or send a useful set of tapes to off-site storage or whatever, you have to go cyclical.
Though they do generally have the option to make synthetic full backups, by actively merging a full backup and all of its descendants onto a new tape or whatever, so you don't have to hit your workload with an actual full backup.
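Roughly what a synthetic full amounts to (a sketch of the merge, not how any particular product stores things):

```python
def synthetic_full(full: dict, incrementals: list) -> dict:
    """Merge a full backup with its incrementals into a new full.

    Each backup is modeled as {path: file_contents}; an incremental also
    lists deletions. The merge happens on the backup server/media, so the
    client never has to be read again to produce a new full.
    """
    merged = dict(full)
    for incr in incrementals:
        for path in incr.get("deleted", []):
            merged.pop(path, None)               # file removed since the last backup
        merged.update(incr.get("changed", {}))   # new or modified files win
    return merged

full = {"/etc/motd": "v1", "/home/a.txt": "v1"}
incrementals = [
    {"changed": {"/home/a.txt": "v2"}, "deleted": []},
    {"changed": {"/home/b.txt": "v1"}, "deleted": ["/etc/motd"]},
]
# The result can be written out as a brand-new "full" set (e.g. onto a fresh
# tape), and the old cycle's tapes can then be rotated out.
print(synthetic_full(full, incrementals))
# {'/home/a.txt': 'v2', '/home/b.txt': 'v1'}
```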
sdns575@reddit (OP)
Hi and thank you for your answer.
So the difference in method comes down to writing to media like tape.
ErasmusDarwin@reddit
Bacula et al. are better if you're using tapes or otherwise have your backups spread across more than a single filesystem. They also let you better manage retention and rotation across multiple locations.
The other solutions are great until your data starts exceeding the size of a single filesystem, at which point you're kinda in trouble if you can't easily grow your filesystem or migrate to somewhere bigger. But if you aren't running into problems, then the convenience of the more casual solutions is a win.
I used to use BackupPC (essentially rsync+perl with built-in dedup, compression, and periodic integrity checks) for a single server, and it was a lot easier to deal with than the Bacula setup I use for multiple servers these days.