Is S2D supposed to survive a crash of the cluster disk owner node?
Posted by TechGoat@reddit | sysadmin | View on Reddit | 28 comments
I'm testing out a 3-node, 3-way-mirror CSV on SAS SSDs (didn't have the budget for NVMe, unfortunately).
Enabling S2D was easy, and it's performant enough to consider putting it into production - but one thing that concerns me is that whichever node owns the cluster disk seems to be a single point of failure; i.e. the test VMs stored on the CSV across all 3 nodes don't seem to wait long enough if I simulate a crash (just hard powering off) of the S2D owner node.
If I do a proper, graceful shutdown/restart of that node - everything is fine; the ownership gets migrated smoothly and there's no problem. I'm only talking about crash/outage scenarios.
The other two nodes, the ones that don't own the S2D disk role, are fine (if annoying): when one of those crashes, only the VMs on that specific node crash too (I'll only have 3x per node anyway; losing 3 VMs and annoying their users sucks, but that's better than losing all of them). But my eventual goal is to have 12x hosts sharing the CSV - if the crash of the S2D disk role owner kills all 36 VMs, that is keeping me up at night wondering whether this is stable enough to go to prod.
I am having difficulty finding explicit documentation on this: should S2D - using a private VLAN all its own for "Cluster Communications" and a different one for "Client Communications" (we're doing this already) - be low-latency enough that, in the case of a hard crash, ownership of the S2D role moves to another node within milliseconds and the other VMs stay up?
It seems to me that when you're hyperconverged, you would want and expect a single node failure in a 3+ node cluster, even if it is the S2D owner node, to keep the cluster running. But maybe this is a single point of failure?
We're using the default Server 2019 settings for thresholds and heartbeat delays:
CrossSubnetDelay : 1000
CrossSubnetThreshold : 20
PlumbAllCrossSubnetRoutes : 0
SameSubnetDelay : 1000
SameSubnetThreshold : 10
lewis_943@reddit
A few things:
Yes, S2D is meant to tolerate the loss of the CSV owner, but it's not invisible or 100000% seamless.
The CSV owner is always a "single point of failure" in theory, as it's an active-passive relationship, not active-active, for both S2D and a traditional SAN iSCSI LUN. The nodes are dependent on having a single coordinator for file-system metadata.
It is best practice for CSV configurations (both S2D & SANs) to split your storage into multiple CSVs so that the overhead cost of CSV ownership can be split across nodes. This also mitigates the issue you're describing: a stun period during the failover of CSV ownership will only impact the storage owned by that node, so if that node only owns 30% of storage (1 of 3 CSVs) then only 30% of your VM pool (by storage consumption) is at risk. It sounds like you have all your storage presented as a single CSV, which is creating your single point of failure.
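As a rough sketch (pool name, CSV names, sizes and node names are all placeholders - the default S2D pool is named "S2D on <ClusterName>"):

# Carve the pool into one CSV per node instead of a single giant volume.
1..3 | ForEach-Object {
    New-Volume -StoragePoolFriendlyName "S2D*" -FriendlyName "CSV0$_" `
        -FileSystem CSVFS_ReFS -Size 2TB
}
# Then spread CSV ownership across the nodes:
Move-ClusterSharedVolume -Name "Cluster Virtual Disk (CSV02)" -Node "Node2"
Move-ClusterSharedVolume -Name "Cluster Virtual Disk (CSV03)" -Node "Node3"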
It's not uncommon for sysadmins to combine the above with preferred role ownership settings, binding VMs to their CSV owner nodes. PowerShell scripts also exist online that do this more dynamically by programmatically checking & live migrating VMs to their CSV owner node. That way, if the CSV owner node goes down, the VMs that relied on its storage are down too, so they're going to fresh boot anyway. This tactic only works if your VM sizing is relatively uniform and your VMs aren't straddled across multiple CSVs.
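A minimal sketch of that dynamic approach (hypothetical, and assumes each VM's files live on a single CSV):

# Live-migrate each clustered VM to the node that owns the CSV its config sits on.
foreach ($vm in Get-ClusterGroup | Where-Object GroupType -eq "VirtualMachine") {
    $path = (Get-VM -Name $vm.Name -ComputerName $vm.OwnerNode.Name).Path
    $csv = Get-ClusterSharedVolume | Where-Object {
        $path -like "$($_.SharedVolumeInfo.FriendlyVolumeName)*"
    }
    if ($csv -and $csv.OwnerNode.Name -ne $vm.OwnerNode.Name) {
        Move-ClusterVirtualMachineRole -Name $vm.Name -Node $csv.OwnerNode.Name -MigrationType Live
    }
}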
The change to S2D in an HCI scenario effectively removes the stable (battery-backed or non-volatile) memory caching layer that you would have in a SAN controller. S2D won't hold writes in memory; they go straight to the disk pool and must be written synchronously to all online members to ensure data integrity. The CSV block cache and in-memory ReFS metadata cache are read-only. This means (more of) your pending writes are now queued in the volatile memory of the VM guest, and the S2D pool is reliant on applications and in-guest filesystems for application write safety - SQL Server, for example, writes to disk using FUA by default, so it will wait for the underlying S2D storage to report that data is written before continuing, but your individual apps might behave differently.
The increased wait times for storage responses during a CSV failover can cause apps to crash. It depends on your environment, OS, apps, configuration, etc., but if you run latency-sensitive apps that trip up when there's a pause in I/O, then yes, it can break things.
Yes. Also, it's generally best practice to have a minimum of two (separate) cluster-only communication VLANs, ideally split across separate physical network cards & switches, for redundancy. Returning to that traffic "queue" point above: S2D "recommends" the use of SMB Direct and SMB Multichannel - even over a single physical NIC there should be two logical interfaces to give you multiple queues, similar to iSCSI. This is critical to lowering the latency of your storage traffic.
Microsoft also recommends disabling NetBIOS on the OS network interfaces to speed up failover. Just don't disable it on the virtual failover clustering NIC - only on your L3 network interfaces on each host. This helps speed up the time taken to detect that another node in the cluster is down.
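A hedged sketch of that per-interface approach (the vNIC names are examples; NetbiosOptions = 2 disables NetBIOS over TCP/IP via the NetBT registry keys):

# Disable NetBIOS only on the host's L3 vNICs, leaving the NetFT adapter alone.
foreach ($nic in Get-NetAdapter -Name "vEthernet (Management)", "vEthernet (Cluster)") {
    Set-ItemProperty `
        -Path "HKLM:\SYSTEM\CurrentControlSet\Services\NetBT\Parameters\Interfaces\Tcpip_$($nic.InterfaceGuid)" `
        -Name NetbiosOptions -Value 2
}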
Are you using ReFS or NTFS for your S2D CSV(s)? ReFS has better data-safety features (healing from resilient copies, copy-on-write metadata) that make it more resilient to failure and make failovers less impactful.
TechGoat@reddit (OP)
Wow, that's quite the reply - and thank you! I also appreciate your responses to the other folks in this thread too.
Yes - since this cluster was set up on Server 2019 from scratch, not upgraded from a previous version, I followed this guide. The CSV it created was CSVFS; I'm not quite sure how to look back and confirm that the original filesystem, before it was converted to a CSV, was ReFS, but I'm pretty sure it was. I assume there's a PowerShell method to check?
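(I'm guessing something like this would show it - treat it as a sketch, I haven't verified the property names:)

# CSV volumes show FileSystem "CSVFS"; FileSystemType should read
# CSVFS_ReFS if the volume underneath is ReFS.
Get-Volume | Where-Object FileSystem -eq "CSVFS" |
    Format-Table FriendlyName, FileSystem, FileSystemType, Size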
You are right - all of my storage is (right now) a single CSV. So when you say later in your post that yes, VMs should stay up - that's only with the caveat that I need to follow best practices and split my single large pool into multiple CSV disks? Then I should be able to keep the VMs-that-weren't-on-that-host running, because I'll no longer have the single point of failure?
If that's the case, that's what I'll focus on re-architecting.
To give you an idea of my workload, the only thing I'm running (and planning to run) in this cluster is just Hyper-V VMs - no SOFS, SQL, or any other Microsoft cluster-aware services.
Based on the results of my Get-SMBServerConfiguration on each host node, I'm already using SMB Multichannel.
Each host has 2x 10Gb QLogic pNICs, one on each rack's dual-switch pair for HA. I used SET several months ago to team them into a New-VMSwitch. Unfortunately, those are the only NICs I have per host - so everything on the Windows node runs across them; the Client network and Cluster network are, of course, just vNICs created with Add-VMNetworkAdapter -ManagementOS. Am I basically up a creek, then, if I can't add additional pNICs to my hosts and can't carry out your recommendation of 2 separate cluster-only communication VLANs? (Also, my network team probably won't give me a second VLAN for cluster-only communication.)
Regarding NetBIOS - thanks for digging up the Wayback Machine archive. However, it looks as if the author is saying that NetBIOS should be disabled everywhere, including on the failover clustering NIC - unless I'm completely misunderstanding his two bullet points:
Regarding the former, as the link says, this is already done by default:
Disabling NetBIOS on the individual nodes will be easy enough, but for the current owner of the cluster IP address role, the shared cluster IP and the node's own IP share the same vNIC. It doesn't seem possible to configure Windows to disable NetBIOS on just the node's TCP/IP settings while leaving it enabled for the shared cluster IP?
lewis_943@reddit
In theory, your VMs on the still-alive nodes should be able to stay up even with a single, giant CSV. You have a problem somewhere in your environment that's causing this. Splitting the storage into multiple CSVs is basically just a workaround for your issue, but it's one you should implement regardless, because even if you fix this issue, there are plenty of other scenarios that could take just one CSV offline.
Functionally, S2D HCI is just a SOFS and a HV host crammed onto the same bare metal.
That's not a valid test. That PowerShell command only checks whether SMB Multichannel is permitted, not whether it's actively in use. You can enable both SMB Direct and Multichannel on incompatible hardware; it doesn't validate your network cards or spit errors, it just silently falls back to standard, single-channel SMB because the requirements aren't met. You need to actually generate traffic and then check

get-smbmultichannelconnection

while that traffic is in flight. Microsoft has a kb for that.

What network flow control configuration do you have set up? You can run both VM and storage traffic over the same pNIC, but you must ensure they don't choke each other's share of bandwidth. That includes guaranteeing available bandwidth for the cluster heartbeat traffic too, which is a necessary prerequisite before you can make your heartbeat failover thresholds more aggressive. Otherwise the I/O traffic from a boot storm could drown your heartbeats and tank the whole cluster - especially as your node count exceeds your resilient data copy count and more and more of your I/O traffic goes to another host, not local disks.
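To illustrate both halves of that (a sketch - the QoS policies follow Microsoft's general converged-networking guidance, but the priorities and percentages are placeholders, not recommendations for your environment):

# 1. With a sustained SMB copy between nodes in flight, check what's in use.
#    Multiple connections per server (plus the RSS/RDMA capability columns)
#    mean multichannel is live; a single connection means it fell back.
Get-SmbMultichannelConnection

# 2. DCB/ETS-style bandwidth guarantees so storage and heartbeat traffic
#    can't drown each other out:
New-NetQosPolicy "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3
New-NetQosPolicy "Cluster" -Cluster -PriorityValue8021Action 7
Enable-NetQosFlowControl -Priority 3
New-NetQosTrafficClass "SMB" -Priority 3 -BandwidthPercentage 50 -Algorithm ETS
New-NetQosTrafficClass "Cluster" -Priority 7 -BandwidthPercentage 2 -Algorithm ETS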
With that in mind, how are you planning to have 12x hosts running in a single S2D cluster with only 2x 10GbE of shared networking? I'm worried that at a certain point your storage pool size will exceed the available networking capacity for storage traffic.
Are you paying by the VLAN? Tell the stingy little shits that they're going to need to upgrade to QSFP+ 40GBE switching and cards if they can't give you secondary VLANs to properly utilise parallel 10GBE ports.
Cluster networking doesn't scrape the VLAN setting on the vNIC; it just checks subnets. You could have 2x storage networks that use different subnets but the same VLAN tag. It's bad practice, but it's better than the alternative. You can add multiple cluster-only communication networks by just running another

add-vmnetworkadapter -managementos

command (see the sketch below). Though I still maintain your network guys are jerks.
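Something like this (switch/vNIC names, VLAN and subnet are examples):

# Add a second cluster-only vNIC on the SET switch, tag it, and give it an
# IP in a *different* subnet so the cluster treats it as a separate network.
Add-VMNetworkAdapter -ManagementOS -Name "Cluster2" -SwitchName "SETswitch"
Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "Cluster2" -Access -VlanId 20
New-NetIPAddress -InterfaceAlias "vEthernet (Cluster2)" -IPAddress 10.0.2.11 -PrefixLength 24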
I don't think I was clear enough earlier: I don't mean enable NetBIOS on the interface that has the cluster IP, I mean on the

Microsoft Failover Cluster Virtual Adapter

object that exists on every node (not just the owner). This is referred to as the Network Fault Tolerant (NetFT) adapter in this kb.

gamebrigada@reddit
In my PoC experience, by the time S2D is properly configured for HCI, you've made so many compromises that it's not really worthwhile. I love S2D for SQL, where you can have truly blip-free failover for most applications and can realistically plan long-term storage requirements, failover requirements, etc. But for general HCI, unlike with an iSCSI LUN, having to tie physical disks to the CSVs really hurts long-term flexibility.
lewis_943@reddit
Can you explain what you mean by this comment?
Your understanding of the topology seems broken, or you have a setup that is catastrophically misaligned with published best practice.
Physical disks have no direct relation to CSVs.
Physical drives must first be added to a storage pool. When you create your CSVs, they are allocated in 256 MB slabs as evenly as possible across the storage pool (at creation time if thick-provisioned, as consumed if thin).
The only way to tie physical drives to a CSV would be to manually create individual storage pools per CSV, which would not only neuter performance (losing parallel I/O across the whole drive pool) but also break automated storage pool management in Server 2016+.
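Easy to sanity-check on a live cluster (sketch):

# All drives should sit in the single auto-created pool, and CSVs are
# virtual disks carved from that pool, not from specific drives.
Get-StoragePool -IsPrimordial $false | Get-PhysicalDisk | Measure-Object
Get-StoragePool -IsPrimordial $false | Get-VirtualDisk |
    Format-Table FriendlyName, ResiliencySettingName, Size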
gamebrigada@reddit
Wow, my memory completely failed me. My bad - for some reason I remembered that you had to add physical disks to a CSV. Was that never the case? I haven't looked at S2D in many years.
lewis_943@reddit
I think you're mixing up the storage pools and the CSVs.
Storage pool = collection of physical drives presented to the S2D role (available storage)
Cluster Shared Volume (CSV) = Logical volume presented to node OS (useable storage) allocated from storage pool
In 2012/R2 the storage pool was manually managed and, in theory, you could have multiple per-cluster. Adding drives to the pool(s) was a manual task.
In 2016+, storage pool management (adding, replacing, retiring drives) is mostly automated via scheduled tasks. These automations necessitate a single drive pool, which is the current best practice. S2D now supports hybrid capacity tiers with cache tiers, as well as single- and multi-tier volume creation, which makes accumulating all storage under a single pool for parallelism and wear-levelling more feasible and attractive.
Density limitations of 16 nodes per S2D cluster (with real-world optimums differing by application workload) still apply though. At a certain size, a shift to either converged storage (separate compute & S2D nodes) or failover cluster sets is necessary for either economy or redundancy.
gamebrigada@reddit
Thanks for the clarification.
I set it up for a SQL cluster and really liked the tech. This must have been 2020? Clearly I misremembered. But I ended up not pursuing it for HCI because of some lack of flexibility I wasn't able to overcome, and instead went with VMware vSAN that same year, which offered everything and more. One thing I do remember is that the policy-based configuration of redundancy, applicable at the per-VM level, was a huge reason. I'm in the smaller SMB space, so we can have a large spend and then be pretty dry for years without much budget for expanding. Having the flexibility to triple-mirror your important VMs today, then tomorrow, when you need space, drop down to a double mirror at the push of a button and a rebalance, is extremely desirable. Or starting dev systems at no resiliency and expanding them once they go into production. This was a huge driver for us.
lewis_943@reddit
The "no resiliency" is the deal-breaker here, that completely defies the high availability purpose of a failover cluster, hence S2D doesn't accommodate for that. A single faulty drive could destroy your whole CSV and (more pragmatically) with the slab allocation across all nodes, you couldn't patch without taking the CSV offline.
Depending on your cluster size, parity might have been a more space-efficient alternative (albeit with a higher compute cost) to zero resiliency. Otherwise, in a mixed Hyper-V environment with sufficient SSD storage, I'd turn to dedupe to recoup some space savings from the non-SQL workloads.
gamebrigada@reddit
Yeah, that's fair, but in my mind a dev system doesn't require resilience - especially with good backups and instant restore from backup storage.
Does S2D now offer changing CSV resilience post create? I can't find anything about that.
lewis_943@reddit
Don't conflate your use-case with a product's intended use. If a dev system doesn't require resilience, then why are you putting it on a cluster? You don't take a hot-wheels car to a mechanic, then say they're a bad shop for not fixing it.
Also, and I'll admit this is more of a personal pet peeve here, a "dev" system is different from "pre-production".
Dev never goes into production. You can recreate the changes you made to dev in production, but dev is born and dies as dev.
BlackV@reddit
that's a detailed reply, Nice
Candy_Badger@reddit
Based on my experience, the cluster should survive a crash of the CSV owner node, and the VMs on other nodes should keep running. What do you see in the logs? If you're not successful in resolving this issue, consider alternative solutions. For instance, we have multiple customers running StarWind VSAN, which is simpler than S2D. Check it out here: https://www.starwindsoftware.com/storage-spaces-direct
TechGoat@reddit (OP)
Thanks for the post. One of my key requirements is that whatever solution we pick is compatible with Citrix as a remote desktop server deployment system. They have a fairly short list - Nutanix, VMware, their own XenServer, and lastly Hyper-V, but only when it's managed by VMM. It looks as if StarWind can expose an SMI-S API so that VMM can manage it, which means it might work with Citrix, but I haven't found any definitive blog posts or anything saying "yes, I use StarWind with VMM and it works with Citrix". I'll keep it in mind in case Microsoft S2D ends up not working at all.
-SPOF@reddit
The cluster is built to manage these types of failures smoothly by shifting ownership of the CSV to another node. You can adjust the SameSubnetDelay and SameSubnetThreshold settings to lower values if needed. By default, Windows Server Failover Clustering uses heartbeat signals to check the health of nodes, and your current settings allow the cluster to wait up to 10 seconds. Additionally, you can set up QoS policies on the network adapters to give priority to cluster heartbeat and storage traffic.
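For example (values are illustrative - a shorter window also means less tolerance of transient network blips):

# Delay (ms) x threshold = how long a silent node survives before removal.
Get-Cluster | Format-List *SubnetDelay, *SubnetThreshold
(Get-Cluster).SameSubnetDelay = 500      # 500 ms between heartbeats
(Get-Cluster).SameSubnetThreshold = 10   # 10 missed heartbeats = 5 s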
I don’t have much experience with 3-node S2D clusters. Usually, I work with 4-5 nodes where S2D manages things without issue. For a setup like yours, you might look at Starwind VSAN. It’s a very stable solution for 2-3 nodes, though I wish it could scale for larger clusters.
Bighaze98@reddit
Hi!
The S2D cluster with a witness disk tolerates a disk failure - or at least it should. On top of the heartbeat, the nodes also cast a vote to work out whether a node is actually dead or whether it's just a problem with the witness's reachability. That said, the quorum disk is no longer supported on the new version of S2D on Azure Stack HCI, so it goes without saying that the recommendation is to use either blob storage or an SMB file share (even just a folder you create on an existing server) as the witness.
Brilliant-Advisor958@reddit
When a cluster host goes down, its VMs don't fail over and keep running. What happens is the cluster detects that the VM is no longer running anywhere in the cluster and starts it on another host; from the guest's perspective, it behaves as if it crashed.
If you need 100 percent uptime, then you need other clustering mechanisms to keep the services running, like SQL Availability Groups or load-balanced web servers.
disclosure5@reddit
I think you're misunderstanding the issue.
If VM1 is running on host1 but host2 crashes, in an S2D environment you will frequently see VM1 crash too. It will then often not simply start on another host as you describe, because the disk will be marked offline.
lewis_943@reddit
For a disk to be marked as offline, it needs to have failed automated restart more than the defined retry threshold. Same as a VM.
Those thresholds can be modified to essentially configure the cluster to persistently retry both the disk and the VM.
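For example (resource/group names are placeholders; these are the standard failover clustering properties, with illustrative values):

# Let the CSV's disk resource retry aggressively instead of going offline...
$disk = Get-ClusterResource "Cluster Virtual Disk (CSV01)"
$disk.RestartThreshold = 10      # restarts allowed within RestartPeriod
$disk.RestartPeriod = 900000     # ms
# ...and likewise for the VM's cluster group:
(Get-ClusterGroup "MyVM").FailoverThreshold = 10
(Get-ClusterGroup "MyVM").FailoverPeriod = 2   # hours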
If the CSV has failed to come online, that's not a slow or extended handover/seizure of CSV ownership - that's an architecture issue. Either the system isn't HA-capable or something else is breaking things (3rd-party AV is entirely possible).
disclosure5@reddit
I feel you're not familiar with S2D in making this statement.
lewis_943@reddit
I feel you're not helpful in making any of your statements.
In the absence of a technical explanation (which I assume you don't have), you haven't given any specifics about the cluster on which you've observed this error, let alone how to recreate it. No one reading your comments can know if your setup or use case is of any relevance to them.
disclosure5@reddit
It is supposed to be tolerant of this failure, but that's never been my experience and no doubt some MVP will inform you that it's fixed in the next Preview release just like it has been for eight or so years.
BlackV@reddit
Note that word, OP - "tolerant" does not mean VMs will necessarily stay up.
disclosure5@reddit
I think the clarification is that it should not completely pants itself the way it does. You have about a 50% chance of disk corruption and VMs not booting after this activity.
lewis_943@reddit
Without going in to defend S2D too hard - it absolutely has weaknesses - this sounds like your workloads have either very high storage needs or very poor storage safety mechanisms.
Unexpected power loss and I/O pauses/stalls can still occur on bare metal and direct-attach storage. If your OS or app doesn't have the necessary safety mechanisms to cope with that, then that's an issue with the software, not the hypervisor. If your app has such a high volume of requests that those interruptions are show-stoppers, then that's an issue with architecture and scale, not the individual hypervisor software.
BlackV@reddit
yes indeed, I'm agreeing with you
Nettts@reddit
Stay away from S2D. That's my advice.
lewis_943@reddit
The issue OP is describing doesn't seem specific to S2D; the common CSV best practices haven't been followed. Their issue could occur on either a SAN or an S2D setup, or even a virtual guest cluster running on one of your listed alternatives.