KVM geo-replication advice
Posted by async_brain@reddit | linuxadmin | View on Reddit | 61 comments
Hello,
I'm trying to replicate a couple of KVM virtual machines from one site to a disaster recovery site over WAN links. As of today the VMs are stored as qcow2 images on an mdadm RAID with xfs. The KVM hosts and VMs are my personal ones (still, it's not a lab, as I serve my own email servers and production systems, as well as a couple of friends' VMs).
My goal is to have VM replicas ready to run on my secondary KVM host, with a maximum lag of 1 hour between their state and the original VM's state.
So far, there are commercial solutions (DRBD + DRBD Proxy and a few others) that allow duplicating the underlying storage in async mode over a WAN link, but they aren't exactly cheap (DRBD Proxy isn't open source, nor free).
The costs of this project should stay reasonable (I'm not spending 5 grand every year on this, nor am I accepting a yearly license that stops working if I don't pay for support!). Don't get me wrong, I am willing to spend some money on the project, just not a yearly budget of that magnitude.
So I'm kind of seeking the "poor man's" alternative (or a great open source project) to replicate my VMs:
So far, I thought of file system replication:
- LizardFS: promises WAN replication, but the project seems dead
- SaunaFS: LizardFS fork, they don't plan WAN replication yet, but they seem to be cool guys
- GlusterFS: Deprecated, so that's a no-go
I didn't find any FS that could fulfill my dreams, so I thought about snapshot shipping solutions:
- ZFS + send/receive: Great solution (rough sketch after this list), except that COW performance is not that good for VM workloads (the Proxmox guys would say otherwise), and sometimes kernel updates break zfs and I need to manually fix dkms or downgrade to enjoy zfs again
- xfsdump / xfsrestore: Looks like a great solution too, with fewer snapshot possibilities (at best 9 levels of incremental dumps)
- LVM + XFS snapshots + rsync: File system agnostic solution, but I fear that rsync would need to read all data on the source and the destination for comparisons, making the solution painfully slow
- qcow2 disk snapshots + restic backup: File system agnostic solution, but image restoration would take some time on the replica side
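For reference, the snapshot shipping idea boils down to something like this (a rough sketch only; the pool/dataset names, the DR hostname and the snapshot names are made up, and it assumes an initial full send has already been done):

    # take a new snapshot of the dataset holding the VM images
    NOW="tank/vms@$(date +%Y%m%d%H%M)"
    zfs snapshot "$NOW"

    # ship only the blocks that changed since the previous snapshot
    zfs send -i tank/vms@previous "$NOW" | ssh dr-host zfs receive -F tank/vms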
I'm pretty sure I didn't think enough about this. There must be some people who achieved VM geo-replication without any guru powers nor infinite corporate money.
Any advice would be great, especially proven solutions of course ;)
Thank you.
michaelpaoli@reddit
Can be done entirely for free on the software side. The consideration may be bandwidth costs vs. how current the data stays (and latency).
So, anyway, I routinely live migrate VMs among physical hosts ... even with no physical storage in common ... most notably virsh migrate ... --copy-storage-all
So, if you've the bandwidth/budget, you could even keep 'em in high availability state, ready to switch over at most any time. And if the rate of data changes isn't that high, the data costs on that may be very reasonable.
Though short of that, one could find other ways to transfer/refresh the images.
E.g. regularly take snapshot(s), then transfer or rsync or the like to catch the targets up to the source snapshots. And snapshots, done properly, should always give at least recoverable copies of the data (e.g. filesystems). Be sure one appropriately handles concurrency - e.g. taking separate snapshots at different times (even ms apart) on the same host may be a no-go, as one may end up with problems, e.g. transactional data/changes or other inconsistencies - but if the snapshot is done at or above the level of the entire OS's nonvolatile storage, you should be good to go.
Also, for higher resiliency/availability, when copying to targets, don't directly clobber and update, rotate out the earlier first, and don't immediately discard it - that way if sh*t goes down mid-transfer, you've still got good image(s) to migrate to.
Also, ZFS snapshots may be highly useful - those can stack nicely, can add/drop, reorder the dependencies, etc., so they may make a good part of the infrastructure for managing images/storage. As for myself, a bit simpler infrastructure, but I do in fact have a mix of ZFS ... and LVM, md - even LUKS in there too on much of the infrastructure (but not all of it). And of course libvirt and friends (why learn yet another separate VM infrastructure and syntax, when you can learn one to "rule them all" :-)). Also, for the VMs, for their most immediate layer down from the VM, I just do raw images - nice, simple, and damn near anything can work well with that. "Of course" the infrastructure under that gets a fair bit more complex ... but remains highly functional and reliable.
So, yeah, e.g. at home ... mostly only use 2 physical machines ... but between 'em, have one VM, which for most intents and purposes is "production" ... and it's not at all uncommon for it to have uptime greater than either of the two physical hosts it runs upon ... because yeah, live migrations - if I need/want to take a physical host down for any reason ... I live migrate that VM to the other physical host - and no physical storage in common between the two need be present; virsh migrate ... --copy-storage-all very nicely handles all that.
(Behind the scenes, my understanding is it switches the storage to a network block device, mirrors that until synced, holds sync through the migration, and then breaks off the mirrors once migrated. My understanding is one can also do HA setups where it maintains both VMs in sync so either can become the active at any time; one can also do such a sync and then not migrate - so one has a fresh separate resumable copy with filesystems in recoverable state.)
And of course, one can also do this all over ssh.
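E.g. something along these lines (domain name and destination host are just placeholders):

    # live-migrate the VM and copy its disk images to the destination host,
    # tunnelled over ssh; no shared storage required
    virsh migrate --live --persistent --undefinesource \
        --copy-storage-all --verbose myvm qemu+ssh://otherhost/system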
async_brain@reddit (OP)
> So, if you've the bandwidth/budget, you could even keep 'em in high availability state, ready to switch over at most any time. And if the rate of data changes isn't that high, the data costs on that may be very reasonable.
How do you achieve this without shared storage ?
michaelpaoli@reddit
virsh migrate ... --copy-storage-all
Or if you want to do likewise yourself and manage it at a lower level, use Linux network block devices for the storage of your VMs. And then with network block devices, you can, e.g., do RAID-1 across the network, with the mirrors being in separate physical locations. As I understand it, that's essentially what virsh migrate ... --copy-storage-all does behind the scenes to achieve such live migration - without physical storage being in common between the two hosts.
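A rough manual sketch of that idea (export name, device paths and hostname are made up; over a WAN you'd also want a write-intent bitmap and write-mostly on the remote leg, roughly as below):

    # on the remote host, export a block device with nbd-server
    # (/etc/nbd-server/config, simplified):
    #   [vmdisk]
    #   exportname = /dev/vg0/vm-disk

    # on the local host, attach the export and build a RAID-1 across the network
    nbd-client dr-host /dev/nbd0 -N vmdisk
    mdadm --create /dev/md/vm-disk --level=1 --raid-devices=2 \
        --bitmap=internal /dev/vg0/vm-disk --write-mostly /dev/nbd0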
E.g. I use this very frequently for such:
https://www.mpaoli.net/~root/bin/Live_Migrate_from_to
And most of the time, I call that via an even higher level program that handles my most frequently used cases (most notably taking my "production" VM and migrating it back and forth between the two physical hosts - where it's almost always running on one of 'em at any given point in time, and hence often has longer uptime than either of the physical hosts).
And how quick such a live migration is, is mostly a matter of drive I/O speed - if that were (much) faster it might bottleneck on the network (I have gigabit), but thus far I haven't pushed it hard enough to bottleneck on CPU (though I suppose with the "right" hardware and infrastructure, that might be possible?)
async_brain@reddit (OP)
That's a really neat solution I wasn't aware of, and which is quite cool to "live migrate" between non-HA hosts. I can definitely use this for maintenance purposes.
But my problem here is disaster recovery, e.g. the main host is down.
The advice about no clobber / update you gave is already something I typically do (I always expect the worst to happen ^^).
ZFS replication is nice, but as I suggest, COW performance isn't the best for VM workloads.
I'm searching for some "snapshot shipping" solution which has good speed and incremental support, or some "magic" FS that does geo-replication for me.
I just hope I'm not searching for a unicorn ;)
michaelpaoli@reddit
Well, remote replication - synchronous and asynchronous - not exactly something new ... so lots of "solutions" out there ... both free / Open-source, and non-free commercial. And various solutions, focused around, e.g. drives, LUNs, partitions, filesystems, BLOBs, files, etc.
Since much of the data won't change between updates, something rsync-like might be best, and can also work well asynchronously - presuming one doesn't require synchronous HA. So, besides rsync and similar(ish), various flavors of COW, RAID (especially if they can track many changes well and play catch-up on the "dirty" blocks later), some snapshotting technologies (again, being able to track "dirty"/changed blocks over significant periods of time can be highly useful, if not essential), etc.
Anyway, I haven't really done much that heavily with such over WAN ... other than some (typically quite pricey) existing infrastructure products for such in $work environments. Though I have done some much smaller bits over WAN (e.g. utilizing rsync or the like ... I think at one point I had a VM in a data center that I was rsyncing (about) hourly - or something pretty frequent like that), between there and home ... and, egad, over a not very speedy DSL ... but it was "quite fast enough" to keep up with that frequency of being rsynced ... but that was from the filesystem, not the raw image ... though regardless, it would've been about the same bandwidth.
async_brain@reddit (OP)
Thanks for the insight.
You perfectly summarized exactly what I'm searching: "Change tracking solution for data replication over WAN"
- rsync isn't good here, since it will need to read all data for every update
- snapshot shipping is cheap and good
- block level replicating FS is even better (but expensive)
So I'll have to go the snapshot shipping route.
Now the only thing I need to know is whether I go the snapshot route via ZFS (easier, but performance-wise slower), or XFS (good performance, existing tools xfsdump / xfsrestore with incremental support, but fewer people using it, so it perhaps needs more investigation)
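For the XFS route, the incremental dump/restore cycle would look something like this (paths, labels and dump destinations are invented for the example; level 0 is the full dump, levels 1-9 are incrementals):

    # full dump (level 0), streamed to the DR host
    xfsdump -l 0 -L vms-full -M media0 - /vmstore | ssh dr-host 'cat > /srv/dumps/vmstore.0'

    # later: incremental dump (level 1) of everything changed since the level 0
    xfsdump -l 1 -L vms-inc1 -M media0 - /vmstore | ssh dr-host 'cat > /srv/dumps/vmstore.1'

    # on the DR side: restore the full dump, then replay incrementals cumulatively
    xfsrestore -r -f /srv/dumps/vmstore.0 /srv/vmstore
    xfsrestore -r -f /srv/dumps/vmstore.1 /srv/vmstore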
Anyway, thank you for the "thinking help" ;)
michaelpaoli@reddit
I believe there do exist free Open-source solutions in that space. Whether they're sufficiently solid, robust, high enough performance, etc., however, is a separate set of questions. E.g. Linux network block device (configured RAID-1, with mirrors at separate locations) would be one such solution, but I believe there are others too (e.g. some filesystem based).
async_brain@reddit (OP)
> believe there do exist free Open-source solutions in that space
Do you know some ? I know of DRBD (but the proxy isn't free), and MARS (which looks unmaintained for a couple of years now).
RAID1 with geo-mirrors cannot work in that case because of latency over WAN links IMO.
michaelpaoli@reddit
https://www.google.com/search?q=distributed+redundant+open+source+filesystem
https://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems
Pretty sure Ceph was the one I was thinking of. It's been around a long time. Haven't used it personally. Not sure exactly how (un)suitable it's likely to be.
There are even technologies like ATAoE ... not sure if that's still alive or not, or if there's a way of being able to replicate that over WAN - guessing it would likely require layering at least something atop it. Might mostly be useful for comparatively cheap local network available storage (way the hell cheaper than most SAN or NAS).
async_brain@reddit (OP)
Trust me, I know that google search and the wikipedia page way too well... I've been researching this project for months ;)
I've read about moosefs, lizardfs, saunafs, gfarm, glusterfs, ocfs2, gfs2, openafs, ceph, lustre to name those I remember.
Ceph could be great, but you need at least 3 nodes, and performance-wise it only gets good with 7+ nodes.
ATAoE, never heard of it, so I did have a look. It's a Layer 2 protocol, so not usable for me, and it does not cover any geo-replication scenario anyway.
So far I didn't find any good solution in the block level replication realm, except for DRBD Proxy, which is too expensive for me. I should suggest they offer a "hobbyist" tier.
It's really a shame that MARS project doesn't get updates anymore, since it looked _really_ good, and has been battle proven in 1and1 datacenters for years.
kyle0r@reddit
Perhaps it's worth mentioning that if you're comfortable storing your xfs volumes for your vms in raw format, and those xfs raw volumes are stored on normal zfs datasets (not zvols) then your performance concerns are likely mitigated. I've done a lot of testing around this. Night and day performance difference for my workloads and hardware. I can share my research if you're interested.
Thereafter you'll be able to use either xfs_freeze or remount the xfs mount(s) read only. The online volumes can then be safely snapshotted by the underlying storage.
Thereafter you can zfs send (and replicate) the dataset storing the raw xfs volumes. After the initial send, only the blocks that have changed will be sent. You can use tools like syncoid and sanoid to manage this in an automated workflow.
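For example, something like this (dataset and host names are illustrative; syncoid handles the incremental zfs send/receive plumbing for you):

    # replicate the dataset holding the raw xfs volumes to the DR host;
    # --no-sync-snap replicates only the snapshots sanoid (or you) already created
    syncoid --no-sync-snap tank/vmstore root@dr-host:tank/vmstore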
HTH
async_brain@reddit (OP)
It's quite astonishing that using a flat disk image on zfs would produce good performance, since the COW operations still would happen. If so, why wouldn't everyone use this ? Perhaps proxmox does ? Yes, please share your findings !
As for zfs snapshot send/receive, I usually do this with zrepl instead of sync|sanoid.
kyle0r@reddit
I've written a 2025 update on my original research. You can find the research here: https://coda.io/@ff0/zvol-performance-issues. Suggest you start with the 2025 update and then the TL;DR and go from there.
Proxmox default is zvol, unfortunately; more "utility" out of the box, easier to manage for beginners, and it supports things like live migration. Bad for performance.
async_brain@reddit (OP)
Thank you for the link. I've read some parts of your research.
As far as I can read, you compare zvol vs plain zfs only.
I'm talking about a performance penalty that comes with COW filesystems like zfs versus traditional ones, see https://www.phoronix.com/review/bcachefs-linux-2019/3 as an example.
There's no way zfs can keep up with xfs or even ext4 in the land of VM images. It's not designed for that goal.
kyle0r@reddit
Comparing single drive performance. CMR drives with certain workloads will be nearly as fast as native drive speed under ZFS... or faster thanks to the ARC cache.
Once you start using multi drive pools there are big gains to be had for read IO.
For sync heavy IO workloads one can deploy slog on optane for huge write IO gains.
async_brain@reddit (OP)
I've had (and have) some RAID-Z2 pools with typically 10 disks, some with ZIL, some with SLOG. Still, performance isn't as good as traditional FS.
Don't get me wrong, I love zfs, but it isn't the fastest for typical small 4-16KB block operations, so it's not well optimized for databases and VMs.
kyle0r@reddit
I cannot agree with your comment there.
For a read workload, if it can be handled within RAM/ARC cache then ZFS is blazing fast. Many orders of magnitude faster than single disk, like-for-like tests. Especially 4-16k databases. There is plenty of evidence online to support this, including in my research which I shared with you (it focused on 4k and 1M testing).
citing napp-it:
> Whenever your workload can be mainly processed within your RAM, even a slow HD pool is nearly as fast as an ultimate Optane pool.
For sync write workloads, add some optane slog to a pool and use sync=always and a pool is going to become a lot faster than its main disks. Many orders of magnitude faster.
citing napp-it:
I cannot personally comment on raid-z pool performance because I've never run them, but for mirrored pools, each mirrored vdev is a bandwidth multiplier. So if you have 5 mirrored vdevs in a pool, there will be circa a ~10x performance multiplier, because the reads can be parallelised across 10 drives. For the same setup, for writes it's a ~5x multiplier.
async_brain@reddit (OP)
I do recognize that what you state makes sense, especially the optane and RAM parts, and indeed having a ZIL will greatly increase write IOPS, until it's full and needs to flush to the slow disks.
What I'm suggesting here is that a COW architecture cannot be as fast as a traditional one (COW operations add IO, checksumming adds metadata read IO...).
I'm not saying zfs isn't good, I'm just saying that it will always be beaten by a traditional FS on the same hardware (see https://www.enterprisedb.com/blog/postgres-vs-file-systems-performance-comparison for a good comparison point with zfs/btrfs/xfs/ext4 in raid configurations).
Now indeed, adding a ZIL/SLOG can be done on ZFS but cannot be done on XFS (one can add bcache into the mix, but that's another beast).
While a ZIL/SLOG might be wonderful on rotational drives, I'm not sure it will improve NVME pools.
So my point is: xfs/ext4 is faster than zfs on the same hardware.
Now the question is: Is the feature set good enough to tolerate the reduced speed.
async_brain@reddit (OP)
@ u/kyle0r I've got my answer... the feature set is good enough to tolerate the reduced speed ^^
Didn't find anything that could beat zfs send/recv, so my KVM images will be on ZFS.
I'd like to ask your advice on one more thing for my zfs pools.
So far, I created a pool with ashift=12, then a dataset with xattr=sa, atime=off, compression=lz4 and recordsize=64k (which is the cluster size of qcow2 images).
Is there anything else you'd recommend ?
My VM workload is typical RW50/50 with 16-256k IOs.
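In command form that boils down to something like this (device paths and the dataset name are just placeholders):

    zpool create -o ashift=12 tank mirror /dev/disk/by-id/diskA /dev/disk/by-id/diskB
    zfs create -o xattr=sa -o atime=off -o compression=lz4 -o recordsize=64k tank/vmimages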
kyle0r@reddit
Well, as a general observation, if you are storing qcow2 volumes on ZFS you have double COW... so you might wish to consider using raw volumes to mitigate this factor. It's not a must-have, but if you're looking for the best IOPS and bandwidth possible, then give it some consideration. A side effect of changing to raw volumes is that Proxmox native snapshots are not possible, and snapshots must be handled at the zfs layer, including freezing the volume prior to snapshotting, assuming the VM is running at the time.
- A pool's ashift is related to drive geometry. Suggest you check out my cheat sheet: https://coda.io/@ff0/home-lab-data-vault/openzfs-cheatsheet-2
- Consider using checksum=edonr as there are some benefits, including nop writes.
- compression=lz4 is fine, but you might want to consider zstd as a more modern alternative.
- Regarding record size: I suggest a benchmark of default vs. 64k with your typical workload, just to verify that 64k is better than the 128k default. ZFS is able to auto adjust the record size when set to default; I'm not sure if it supports auto adjustment when set to non-default. YMMV. DYOR.
From memory, I found that leaving the zfs default with raw 4k xfs volumes performed well enough with typical workloads that it didn't justify setting the record size to 4k. This is true for zfs datasets but probably not true for zvols, which from memory benefit from the block size being set explicitly for the expected IO workload.
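A rough way to run that benchmark (dataset names and fio parameters are only an example; tune them to your real workload):

    # two datasets that differ only in recordsize
    zfs create -o recordsize=128k tank/bench-128k
    zfs create -o recordsize=64k  tank/bench-64k

    # identical mixed read/write job against each mountpoint
    fio --name=vmlike --directory=/tank/bench-128k --rw=randrw --rwmixread=50 \
        --bs=64k --size=8G --ioengine=libaio --iodepth=16 \
        --runtime=120 --time_based --group_reporting

Then repeat with --directory=/tank/bench-64k and compare the results.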
Have a browse of the cheatsheet I linked. Maybe there is something of interest. Have fun.
kyle0r@reddit
Have a look at the section: Non-synthetic tests within the kvm
This is ZFS raw xfs vol vs. ZFS xfs on zvol
There are some simple graphs there that highlight the difference.
The tables and co in the research generally compared the baseline vs. zvol vs. zfs raw.
michaelpaoli@reddit
So ... what about various (notably RAID-1) RAID technologies? Any particularly good at tracking lots of dirty blocks over a substantial period of time, so they can later quite efficiently resync just the dirty blocks rather than the entire device?
If one can find at least that, can layer that atop other ... e.g. rather than physical disk directly under that, could be Linux network block device or the like.
And one can build stuff up quite custom manually using device mapper, e.g. dmsetup(8), etc. E.g. not too long ago, I gave someone a fair example of doing something like that ... forget what it was, but for some quite efficient RAID (or the like) data manipulation they needed, that was otherwise "impossible" (at least no direct way to do the needed with higher level tools/means).
Yeah ... that was a scenario of doing a quite large RAID transition - while minimizing downtime ... I gave a solution that kept downtime quite minimal, by using dmsetup(8) ... essentially create the new replacement RAID, use dmsetup(8) to RAID-1 mirror from old to new, once all synced up, split and then resume with the new replacement RAID. Details on that earlier on my comment here and my reply further under that (may want to read the post itself first, to get relevant context).
And ... block devices ... needn't be local drives or partitions or the like, e.g. can be network block device. Linux really doesn't care - if it's a randomly accessible block device, Linux can use it for storage or build storage atop it.
Anyway, not sure how many changes, e.g. md, LVM RAID, ZFS, BTRFS, etc. can track for "dirty blocks" and be able to do an efficient resync, before they overflow on that and have to resync all blocks on the entire device. Anyway, should be able to feed most any Linux RAID-1 most any kind of block devices ... question is how efficiently can it resync up to how much in the way of changes before it has to copy the whole thing to resync.
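For md specifically, the write-intent bitmap is the relevant knob here (device names below are hypothetical; the bitmap chunk size bounds how coarsely changes are tracked):

    # add an internal write-intent bitmap to an existing RAID-1, so a re-added
    # mirror only resyncs regions marked dirty rather than the whole device
    mdadm --grow /dev/md0 --bitmap=internal --bitmap-chunk=128M

    # inspect the bitmap state on a member device
    mdadm --examine-bitmap /dev/sda1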
Sad_Dust_9259@reddit
Curious to hear what advice others would give
async_brain@reddit (OP)
Well... So am I ;)
Until now, nobody came up with "the unicorn" (aka the perfect solution without any drawbacks).
Probably because unicorns don't exist ;)
Sad_Dust_9259@reddit
Fair enough! Guess we’ll have to make our own unicorn :D
async_brain@reddit (OP)
So far I can come up with three potential solutions, all snapshot based:
- XFS snapshot shipping: Reliable, fast, asynchronous, hard to set up
- ZFS snapshot shipping: Asynchronous, easy to setup (zrepl or syncoid), reliable (except for some kernel upgrades, which can be quickly fixed), not that fast
- GlusterFS geo-replication: Is basically snapshot shipping under the hood, still need some info (see https://github.com/gluster/glusterfs/issues/4497 )
As for block replication, the only thing approaching a unicorn I found is MARS, but the project's only dev isn't around often.
Sad_Dust_9259@reddit
Nice breakdown! Have you messed around with MARS yourself, or is it more of a theory thing so far?
Sad_Dust_9259@reddit
Yeah, ZFS sounds like the way to go, even with the kernel hiccups. Trying out zrepl or syncoid? Let me know how it goes -_-
async_brain@reddit (OP)
I've only read articles about MARS, but the author won't respond on github, and the last supported kernel is 5.10, so that's pretty bad.
XFS snapshot shipping isn't a good solution in the end, because it needs a new full dump after every 9 incremental ones.
ZFS seems the only good solution here...
gordonmessmer@reddit
I understand that Red Hat is discontinuing their commercial Gluster product, but the project itself isn't deprecated
async_brain@reddit (OP)
Fair enough, but I remember oVirt: when Red Hat discontinued RHEV, the oVirt project did announce it would continue, but there are only a few commits a month now. There were hundreds of commits before, because of the funding I guess. I fear Gluster will go the same way (I've read https://github.com/gluster/glusterfs/issues/4298 too)
Still, GlusterFS is the only file system based solution I found which supports geo-replication over WAN.
Do you have any (great) success stories about using it perhaps ?
lebean@reddit
The oVirt situation is such a bummer, because it was (and still is) a fantastic product. But, not knowing if it'll still exist in 5 years, I'm having to switch to Proxmox for a new project we're standing up. Still a decent system, but certainly not oVirt-quality.
I understand Red Hat wants everyone to go OpenShift (or the upstream OKD), but holy hell is that system hard to get set up and ready to actually run VM-heavy loads w/ kubevirt. So many operators to bolt on, so much yaml patching to try to get it happy. Yes, containers are the focus, but we're still in a world where VMs are a critical part of so many infrastructures, and you can feel how they were an afterthought in OpenShift/OKD.
async_brain@reddit (OP)
Ever tried Cloudstack ? It's like oVirt on steroids ;)
lebean@reddit
It's one I've considered checking out, yes! Need the time to throw it on some lab hosts and learn it.
async_brain@reddit (OP)
I'm testing cloudstack these days in an EL9 environment, with some DRBD storage. So far, it's nice. Still not convinced about the storage, but I have a 3-node setup so Ceph isn't a good choice for me.
The nice thing is that indeed you don't need to learn quantum physics to use it, just set up a management server, add vanilla hosts and you're done.
instacompute@reddit
I use local storage, nfs and ceph with CloudStack and kvm. Drbd/linstor isn't for me. My orgs with more cash use Pure Storage and PowerFlex storage with kvm.
async_brain@reddit (OP)
Makes sense ;) But the "poor man's" solution cannot even use ceph because 3-node clusters are prohibited ^^
instacompute@reddit
I've been running a 3-node ceph cluster for ages now. I followed this guide https://rohityadav.cloud/blog/ceph/ with CloudStack. The relative performance is lacking, but then I put CloudStack instances' root disks on local storage (nvme) and use ceph/rbd based data disks.
async_brain@reddit (OP)
I've read way too many "don't do this in production" warnings on 3-node ceph setups.
I can imagine because of the rebalancing that happens immediately after a node gets shut down, which would be 50% of all data. Also when losing 1 node, one needs to be lucky to avoid any other issue while getting the 3rd node up again, to avoid split brain.
So yes for a lab, but not for production (even poor man's production needs guarantees ^^)
instacompute@reddit
I'm not arguing, as I'm not a Ceph expert. But the experts I've learnt from advised having a 3-replica pool when you only have 3 hosts/nodes; there's no real rebalancing when a host goes down - such a setup may even be production worthy as long as you have the same number of OSDs per node.
My setup consists of 2 OSDs (nvme disks of the same capacity) per node and 3 nodes, and my ceph pools are replicated with a 3 replica count. Of course my total ceph raw capacity is just 12TB. Erasure coding and typical setups needing more throughput would benefit from 10G+ NICs and a minimum of 5-7 nodes.
async_brain@reddit (OP)
Sounds sane indeed !
And of course it would totally fit a local production system. My problem here is geo-replication, I think (not sure) this would require my (humble) setup to have at least 6 nodes (3 local and 3 distant ?)
async_brain@reddit (OP)
Just had a look at the glusterfs repo. No release tag since 2023... doesn't smell that good.
At least there's a SIG that provides up-to-date glusterfs for RHEL9 clones.
async_brain@reddit (OP)
Okay, done another batch of research about glusterfs. Under the hood, it uses rsync (see https://glusterdocs-beta.readthedocs.io/en/latest/overview-concepts/geo-rep.html ), so there's no advantage for me: every time I'd touch a file, glusterfs would need to read the entire file to check checksums and send the differences, which is quite an IO hog considering we're talking about VM qcow2 images, which generally tend to be big.
Just realized glusterfs geo-replication is rsync + inotify in disguise :(
yrro@reddit
Does it necessarily checksum or can it use timestamps?
async_brain@reddit (OP)
Never said it was ^^
I think that's inotify's job.
instacompute@reddit
With CloudStack you can use ceph for primary storage multiple-site replication. Or just use the Nas backup with kvm & CloudStack.
async_brain@reddit (OP)
Doesn't ceph require like 7 nodes to get decent performance ? And aren't ceph 3-node clusters "prohibited", e.g. not fault tolerant enough ? Pretty high entry for a "poor man's" solution ;)
frymaster@reddit
I know you've already discounted it, but... I've never had ZFS go wrong in updates, on Ubuntu. And I just did a double-distro-upgrade from 2020 LTS -> 2022 LTS -> 2024 LTS
LXD - which was originally for OS containers - now has VMs as a first-class feature. Or there's a non-canonical fork, incus. The advantage with using these is they have pretty deep ZFS integration and will use ZFS send for migrations between remotes - this is separate from and doesn't require using the clustering
async_brain@reddit (OP)
I've been using zfs since the 0.5 zfs-fuse days, and using it professionally since the 0.6 series, long before it became OpenZFS. I've really enjoyed this FS for more than 15 years now.
I've been running it on RHEL since about the same time; some upgrades break the dkms modules (happens roughly once a year or so). I run a script to check whether the kernel module built well for all my kernel versions before rebooting, something like the sketch below.
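A simplified version of that check (assumes a dkms-built zfs; it just verifies a zfs module exists for every installed kernel):

    #!/bin/bash
    # warn before rebooting if any installed kernel lacks a built zfs module
    for dir in /lib/modules/*/; do
        kver=$(basename "$dir")
        if modinfo -k "$kver" zfs >/dev/null 2>&1; then
            echo "OK:      zfs module present for kernel $kver"
        else
            echo "MISSING: no zfs module built for kernel $kver"
        fi
    done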
So yes, I know zfs, and use it a lot. But when it comes to VM performance, it isn't on par with xfs or even ext4.
As for Incus, I've heard about "the split" from lxd, but I didn't know they added VM support. Seems nice.
josemcornynetoperek@reddit
Maybe zfs and snapshots?
async_brain@reddit (OP)
Did you actually read the question ? I explained why zfs isn't ideal for that task because of performance issues.
scrapanio@reddit
If space or traffic isn't an issue, do hourly Borg backups directly on the secondary host and a third backup location.
Qcow2 snapshot feature should reduce the needed traffic. The only issue I see is IP routing since the second location will most likely not have the same IPs announced.
async_brain@reddit (OP)
Thanks for your answer. I work with restic instead of borg (done numerous comparisons and benchmarks before choosing), but the results should be almost identical. The problem is that restoring from a backup could take time, and I'd rather have "ready to run" VMs if possible.
As for the IPs, I do have the same public IPs on both sites. I do BGP on the main site, and have a GRE tunnel to a BGP router on the secondary site, allowing me to announce the same IPs on both.
scrapanio@reddit
That's a really neat solution!
When you back up directly onto the secondary host you should be able to just start the VMs, or am I missing something?
async_brain@reddit (OP)
AFAIK borg does the same as restic, i.e. the backup is a specific deduplicated & compressed repo format. So before starting the VMs, one would need to restore the VM image from the repo, which can be time consuming.
scrapanio@reddit
Depending on the size decompressing should take under an hour. I know that Borg can turn off compression if needed. Also Borg does do repo and metadata tracking but the files, if compression is deactivated, should be ready to go. Under the hood it's essentially rsyncing the given files.
async_brain@reddit (OP)
Still, AFAIK, borg does deduplication (which cannot be disabled), so it will definitely need to rehydrate the data. This is very different from rsync. The only part where borg resembles rsync is the rolling hash algo used to check which parts of a file have changed.
The really good advantage that comes with borg/restic is that one can keep multiple versions of the same VM without needing multiple times the disk space. Also, both solutions can have their chunk size tuned to something quite big for a VM image in order to speed up the restore process.
The bad part is that using restic/borg hourly will make it read __all__ the data on each run, which will be an IO hog ;)
scrapanio@reddit
I am sorry, I missed the point by quite a bit.
Using VM snapshots as the backup target should reduce IO load.
Nevertheless i think zfs snapshots can be the solution.
Quick Google search gave me: https://zpool.org/zfs-snapshots-and-remote-replication/
But I think that instead of qcow2 backed images, the block devices should then be directly managed by ZFS.
I don't know if live snapshotting in this scenario is possible.
async_brain@reddit (OP)
It is, but you'll have to use qemu-guest-agent fsfreeze before doing a ZFS snapshot and fsthaw afterwards. I generally use zrepl to replicate ZFS instances between servers, and it supports snapshot hooks.
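In practice the hook boils down to something like this (domain and dataset names are just examples):

    # quiesce filesystems inside the guest via qemu-guest-agent, snapshot, thaw
    virsh domfsfreeze myvm
    zfs snapshot tank/vms@$(date +%Y%m%d%H%M)
    virsh domfsthaw myvm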
But then I get into my next problem, ZFS cow performance for VM which isn't that great.
exekewtable@reddit
Proxmox backup server with backup and automated restore. Very efficient, very cheap. You need Proxmox on the host though.
async_brain@reddit (OP)
Thanks for the tip, but I don't run Proxmox, I'm running vanilla KVM on a RHEL9 clone (Almalinux), which I won't change since it works perfectly for me.
Also, I don't really enjoy Proxmox developing their own API (qm) instead of libvirt, and making their own archive format (vma) which, even if you tick "do not compress", is still lzo compressed, which defeats any form of deduplication other than working with zvols.