Is S2D supposed to survive a crash of the cluster disk owner node?
Posted by TechGoat@reddit | sysadmin | View on Reddit | 28 comments
I'm testing out a 3-node, 3-way-mirror CSV on SAS SSDs (didn't have the budget for NVMe, unfortunately).
Enabling S2D was easy, and it's performant enough to consider putting it into production - but one thing that concerns me is that whichever node owns the cluster disk seems to be a single point of failure; i.e. the test VMs stored on the CSV across all 3 nodes don't seem to wait long enough if I simulate a crash (just hard powering off) of the S2D owner node.
If I do a proper, graceful shutdown/restart of that node - everything is fine; the ownership gets migrated smoothly and there's no problem. I'm only talking about crash/outage scenarios.
The other two nodes, the ones that don't own the S2D disk role, are fine (if annoying): when one of those crashes, only the VMs on that specific node crash too (I'll only have 3x per node anyway; losing 3 VMs and annoying their users sucks, but that's better than losing all of them). But my eventual goal is to have 12x hosts sharing the CSV - if the crash of the S2D disk role owner kills all 36 VMs, that is keeping me up at night wondering whether this is stable enough to go to prod.
I am having difficulty finding explicit documentation on this: should S2D - using a private VLAN all its own for "Cluster Communications" and a different one for "Client Communications" (we're doing this already) - be low-latency enough that, in the case of a hard crash, ownership of the S2D role moves to another node within milliseconds and the other VMs stay up?
It seems to me that when you're hyperconverged, you would want and expect a single node failure in a 3+ node cluster, even if it is the S2D owner node, to keep the cluster running. But maybe this is a single point of failure?
We're using the default Server 2019 settings for thresholds and heartbeat delays:
CrossSubnetDelay : 1000
CrossSubnetThreshold : 20
PlumbAllCrossSubnetRoutes : 0
SameSubnetDelay : 1000
SameSubnetThreshold : 10
lewis_943@reddit
A few things:
Yes, S2D is meant to tolerate the loss of the CSV owner, but it's not invisible or 100000% seamless.
The CSV owner is always a "single point of failure" in theory, as it's an active-passive relationship, not active-active, for both S2D and a traditional SAN iSCSI LUN. The nodes are dependent on having a single coordinator for file-system metadata.
It is best practice for CSV configurations (both S2D & SANs) to split your storage into multiple CSVs so that the overhead cost of CSV ownership can be split across nodes. This also mitigates the issue you're describing: a stun period during the failover of CSV ownership will only impact the storage owned by that node, so if that node only owns 30% of storage (1 of 3 CSVs) then only 30% of your VM pool (by storage consumption) is at risk. It sounds like you have all your storage presented as a single CSV, which is creating your single point of failure.
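As a rough sketch (pool name, CSV names, sizes and node names are all placeholders - the default S2D pool is named "S2D on <ClusterName>"):

# Carve the pool into one CSV per node instead of a single giant volume.
1..3 | ForEach-Object {
    New-Volume -StoragePoolFriendlyName "S2D*" -FriendlyName "CSV0$_" `
        -FileSystem CSVFS_ReFS -Size 2TB
}
# Then spread CSV ownership across the nodes:
Move-ClusterSharedVolume -Name "Cluster Virtual Disk (CSV02)" -Node "Node2"
Move-ClusterSharedVolume -Name "Cluster Virtual Disk (CSV03)" -Node "Node3"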
It's not uncommon for sysadmins to combine the above with preferred role ownership settings, binding VMs to their CSV owner nodes. PowerShell scripts also exist online that do this more dynamically by programmatically checking & live migrating VMs to their CSV owner node. That way, if the CSV owner node goes down, the VMs that relied on its storage are down too, so they're going to fresh boot anyway. This tactic only works if your VM sizing is relatively uniform and your VMs aren't straddled across multiple CSVs.
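A minimal sketch of that dynamic approach (hypothetical, and assumes each VM's files live on a single CSV):

# Live-migrate each clustered VM to the node that owns the CSV its config sits on.
foreach ($vm in Get-ClusterGroup | Where-Object GroupType -eq "VirtualMachine") {
    $path = (Get-VM -Name $vm.Name -ComputerName $vm.OwnerNode.Name).Path
    $csv = Get-ClusterSharedVolume | Where-Object {
        $path -like "$($_.SharedVolumeInfo.FriendlyVolumeName)*"
    }
    if ($csv -and $csv.OwnerNode.Name -ne $vm.OwnerNode.Name) {
        Move-ClusterVirtualMachineRole -Name $vm.Name -Node $csv.OwnerNode.Name -MigrationType Live
    }
}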
The change to S2D in an HCI scenario effectively removes the stable (battery-backed or non-volatile) memory caching layer that you would have in a SAN controller. S2D won't hold writes in memory; they go straight to the disk pool and must be written synchronously to all online members to ensure data integrity. The CSV block cache and in-memory ReFS metadata cache are read-only. This means (more of) your pending writes are now queued in the volatile memory of the VM guest, and the S2D pool is reliant on applications and in-guest filesystems for application write safety - SQL Server, for example, writes to disk using FUA by default, so it will wait for the underlying S2D storage to report that data is written before continuing, but your individual apps might behave differently.
The increased wait times for storage responses during a CSV failover can cause apps to crash. It depends on your environment, OS, apps, configuration, etc., but if you run latency-sensitive apps that trip up when there's a pause in I/O, then yes, it can break things.
Yes. Also, it's generally best practice to have a minimum of two (separate) cluster-only communication VLANs, ideally split across separate physical network cards & switches, for redundancy. Returning to that traffic "queue" point above: S2D "recommends" the use of SMB Direct and SMB Multichannel - even over a single physical NIC there should be two logical interfaces to give you multiple queues, similar to iSCSI. This is critical to lowering the latency of your storage traffic.
Microsoft also recommends disabling NetBIOS on the OS network interfaces to speed up failover. Just don't disable it on the virtual failover clustering NIC - only on your L3 network interfaces on each host. This helps speed up the time taken to detect that another node in the cluster is down.
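A hedged sketch of that per-interface approach (the vNIC names are examples; NetbiosOptions = 2 disables NetBIOS over TCP/IP via the NetBT registry keys):

# Disable NetBIOS only on the host's L3 vNICs, leaving the NetFT adapter alone.
foreach ($nic in Get-NetAdapter -Name "vEthernet (Management)", "vEthernet (Cluster)") {
    Set-ItemProperty `
        -Path "HKLM:\SYSTEM\CurrentControlSet\Services\NetBT\Parameters\Interfaces\Tcpip_$($nic.InterfaceGuid)" `
        -Name NetbiosOptions -Value 2
}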
Are you using ReFS or NTFS for your S2D CSV(s)? ReFS has better data-safety features (healing from resilient copies, copy-on-write metadata) that make it more resilient to failure and make failovers less impactful.
TechGoat@reddit (OP)
Wow, that's quite the reply - and thank you! I also appreciate your responses to the other folks in this thread too.
Yes - since this cluster was set up on Server 2019 from scratch, not upgraded from a previous version, I followed this guide. The CSV it created was CSVFS; I'm not quite sure how to look back and confirm that the original filesystem, before it was converted to a CSV, was ReFS, but I'm pretty sure it was. I assume there's a PowerShell method to check?
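(I'm guessing something like this would show it - treat it as a sketch, I haven't verified the property names:)

# CSV volumes show FileSystem "CSVFS"; FileSystemType should read
# CSVFS_ReFS if the volume underneath is ReFS.
Get-Volume | Where-Object FileSystem -eq "CSVFS" |
    Format-Table FriendlyName, FileSystem, FileSystemType, Size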
You are right - all of my storage is (right now) a single CSV. So when you say later in your post that yes, VMs should stay up - that's only with the caveat that I need to follow best practices and split my single large pool into multiple CSV disks? Then I should be able to keep the VMs-that-weren't-on-that-host running, because I'll no longer have the single point of failure?
If that's the case, that's what I'll focus on re-architecting.
To give you an idea of my workload, the only thing I'm running (and planning to run) in this cluster is just Hyper-V VMs - no SOFS, SQL, or any other Microsoft cluster-aware services.
Based on the results of my Get-SMBServerConfiguration on each host node, I'm already using SMB Multichannel.
Each host has 2x 10Gb QLogic pNICs, one on each rack's dual-switch pair for HA. I used SET several months ago to team them into a New-VMSwitch. Unfortunately, those are the only NICs I have per host - so everything on the Windows node runs across them; the Client network and Cluster network are, of course, just vNICs created with Add-VMNetworkAdapter -ManagementOS. Am I basically up a creek, then, if I can't add additional pNICs to my hosts and can't carry out your recommendation of 2 separate cluster-only communication VLANs? (Also, my network team probably won't give me a second VLAN for cluster-only communication.)
Regarding NetBIOS - thanks for digging up the Wayback Machine archive. However, it looks as if the author is saying that NetBIOS should be disabled everywhere, including on the failover clustering NIC - unless I'm completely misunderstanding his two bullet points:
Regarding the former, as the link says, this is already done by default:
Disabling NetBIOS on the individual nodes will be easy enough, but for the current owner of the cluster IP address role, the shared cluster IP and the node's own IP share the same vNIC. It doesn't seem possible to configure Windows to disable NetBIOS on just the node's TCP/IP settings while leaving it enabled for the shared cluster IP?
lewis_943@reddit
In theory, your VMs on the still-alive nodes should be able to stay up even with a single, giant CSV. You have a problem somewhere in your environment that's causing this. Splitting the storage into multiple CSVs is basically just a workaround for your issue, but it's one you should implement regardless, because even if you fix this issue, there are plenty of other scenarios that could take just one CSV offline.
Functionally, S2D HCI is just a SOFS and a HV host crammed onto the same bare metal.
That's not a valid test. That PowerShell command only checks whether SMB Multichannel is permitted, not whether it's actively in use. You can enable both SMB Direct and Multichannel on incompatible hardware; it doesn't validate your network cards or spit errors, it just silently falls back to standard, single-channel SMB because the requirements aren't met. You need to actually generate traffic and then check

get-smbmultichannelconnection

while that traffic is in flight. Microsoft has a kb for that.

What network flow control configuration do you have set up? You can run both VM and storage traffic over the same pNIC, but you must ensure they don't choke each other's share of bandwidth. That includes guaranteeing available bandwidth for the cluster heartbeat traffic too, which is a necessary prerequisite before you can make your heartbeat failover thresholds more aggressive. Otherwise the I/O traffic from a boot storm could drown your heartbeats and tank the whole cluster - especially as your node count exceeds your resilient data copy count and more and more of your I/O traffic goes to another host, not local disks.
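To illustrate both halves of that (a sketch - the QoS policies follow Microsoft's general converged-networking guidance, but the priorities and percentages are placeholders, not recommendations for your environment):

# 1. With a sustained SMB copy between nodes in flight, check what's in use.
#    Multiple connections per server (plus the RSS/RDMA capability columns)
#    mean multichannel is live; a single connection means it fell back.
Get-SmbMultichannelConnection

# 2. DCB/ETS-style bandwidth guarantees so storage and heartbeat traffic
#    can't drown each other out:
New-NetQosPolicy "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3
New-NetQosPolicy "Cluster" -Cluster -PriorityValue8021Action 7
Enable-NetQosFlowControl -Priority 3
New-NetQosTrafficClass "SMB" -Priority 3 -BandwidthPercentage 50 -Algorithm ETS
New-NetQosTrafficClass "Cluster" -Priority 7 -BandwidthPercentage 2 -Algorithm ETS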
With that in mind, how are you planning to have 12x hosts running in a single S2D cluster with only 2x 10GbE of shared networking? I'm worried that at a certain point your storage pool size will exceed the available networking capacity for storage traffic.
Are you paying by the VLAN? Tell the stingy little shits that they're going to need to upgrade to QSFP+ 40GBE switching and cards if they can't give you secondary VLANs to properly utilise parallel 10GBE ports.
Cluster networking doesn't scrape the VLAN setting on the vNIC; it just checks subnets. You could have 2x storage networks that use different subnets but the same VLAN tag. It's bad practice, but it's better than the alternative. You can add multiple cluster-only communication networks by just running another

add-vmnetworkadapter -managementos

command (see the sketch below). Though I still maintain your network guys are jerks.
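Something like this (switch/vNIC names, VLAN and subnet are examples):

# Add a second cluster-only vNIC on the SET switch, tag it, and give it an
# IP in a *different* subnet so the cluster treats it as a separate network.
Add-VMNetworkAdapter -ManagementOS -Name "Cluster2" -SwitchName "SETswitch"
Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "Cluster2" -Access -VlanId 20
New-NetIPAddress -InterfaceAlias "vEthernet (Cluster2)" -IPAddress 10.0.2.11 -PrefixLength 24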
I don't think I was clear enough earlier: I don't mean enable NetBIOS on the interface that has the cluster IP, I mean on the

Microsoft Failover Cluster Virtual Adapter

object that exists on every node (not just the owner). This is referred to as the Network Fault Tolerant (NetFT) adapter in this kb.

gamebrigada@reddit
In my PoC experience, by the time S2D is properly configured for HCI, you've made so many compromises that it's not really worthwhile. I love S2D for SQL, where you can have truly blip-free failover for most applications and can realistically plan long-term storage requirements, failover requirements, etc. But for general HCI, unlike with an iSCSI LUN, having to tie physical disks to the CSVs really hurts long-term flexibility.
lewis_943@reddit
Can you explain what you mean by this comment?
Your understanding of the topology seems broken, or you have a setup that is catastrophically misaligned with published best practice.
Physical disks have no direct relation to CSVs.
Physical drives must first be added to a storage pool. When you create your CSVs, they are allocated in 256 MB slabs as evenly as possible across the storage pool (at creation time if thick-provisioned, as consumed if thin).
The only way to tie physical drives to a CSV would be to manually create individual storage pools per CSV, which would not only neuter performance (losing parallel I/O across the whole drive pool) but also break automated storage pool management in Server 2016+.
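Easy to sanity-check on a live cluster (sketch):

# All drives should sit in the single auto-created pool, and CSVs are
# virtual disks carved from that pool, not from specific drives.
Get-StoragePool -IsPrimordial $false | Get-PhysicalDisk | Measure-Object
Get-StoragePool -IsPrimordial $false | Get-VirtualDisk |
    Format-Table FriendlyName, ResiliencySettingName, Size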
gamebrigada@reddit
Wow, my memory completely failed me. My bad - for some reason I remembered that you had to add physical disks to a CSV. Was that never the case? I haven't looked at S2D in many years.
lewis_943@reddit
I think you're mixing up the storage pools and the CSVs.
Storage pool = collection of physical drives presented to the S2D role (available storage)
Cluster Shared Volume (CSV) = Logical volume presented to node OS (useable storage) allocated from storage pool
In 2012/R2 the storage pool was manually managed and, in theory, you could have multiple per-cluster. Adding drives to the pool(s) was a manual task.
In 2016+, storage pool management (adding, replacing, retiring drives) is mostly automated via scheduled tasks. These automations necessitate a single drive pool, which is the current best practice. S2D now supports hybrid capacity tiers with cache tiers, as well as single- and multi-tier volume creation, which makes accumulating all storage under a single pool for parallelism and wear-levelling more feasible and attractive.
Density limitations of 16 nodes per S2D cluster (with real-world optimums differing by application workload) still apply though. At a certain size, a shift to either converged storage (separate compute & S2D nodes) or failover cluster sets is necessary for either economy or redundancy.
gamebrigada@reddit
Thanks for the clarification.
I set it up for a SQL cluster and really liked the tech. This must have been 2020? Clearly I misremembered. But I ended up not pursuing it for HCI because of some lack of flexibility I wasn't able to overcome, and instead went with VMware vSAN that same year, which offered everything and more. One thing I do remember is that the policy-based configuration of redundancy, applicable at the per-VM level, was a huge reason. I'm in the smaller SMB space, so we can have a large spend and then be pretty dry for years without much budget for expanding. Having the flexibility to triple-mirror your important VMs today, then tomorrow, when you need space, drop down to a double mirror at the push of a button and a rebalance, is extremely desirable. Or starting dev systems at no resiliency and expanding them once they go into production. This was a huge driver for us.
lewis_943@reddit
The "no resiliency" is the deal-breaker here, that completely defies the high availability purpose of a failover cluster, hence S2D doesn't accommodate for that. A single faulty drive could destroy your whole CSV and (more pragmatically) with the slab allocation across all nodes, you couldn't patch without taking the CSV offline.
Depending on your cluster size, parity might have been a more space-efficient alternative (albeit with a higher compute cost) to zero resiliency. Otherwise, in a mixed Hyper-V environment with sufficient SSD storage, I'd turn to dedupe to recoup some space savings from the non-SQL workloads.
gamebrigada@reddit
Yeah, that's fair, but in my mind a dev system doesn't require resilience - especially with good backups and instant restore from backup storage.
Does S2D now offer changing CSV resilience post create? I can't find anything about that.
lewis_943@reddit
Don't conflate your use-case with a product's intended use. If a dev system doesn't require resilience, then why are you putting it on a cluster? You don't take a hot-wheels car to a mechanic, then say they're a bad shop for not fixing it.
Also, and I'll admit this is more of a personal pet peeve here, a "dev" system is different from "pre-production".
Dev never goes into production. You can recreate the changes you made to dev in production, but dev is born and dies as dev.
BlackV@reddit
that's a detailed reply, Nice
Candy_Badger@reddit
Based on my experience, the cluster should survive a crash of the CSV owner node, and the VMs on other nodes should keep running. What do you see in the logs? If you're not successful in resolving this issue, consider alternative solutions. For instance, we have multiple customers running StarWind VSAN, which is simpler than S2D. Check it out here: https://www.starwindsoftware.com/storage-spaces-direct
TechGoat@reddit (OP)
Thanks for the post. One of my key requirements is that whatever solution we pick is compatible with Citrix as a remote desktop server deployment system. They have a fairly short list - Nutanix, VMware, their own XenServer, and lastly Hyper-V, but only when it's managed by VMM. It looks as if StarWind can expose an SMI-S API so that VMM can manage it, which means it might work with Citrix, but I haven't found any definitive blog posts or anything saying "yes, I use StarWind with VMM and it works with Citrix". I'll keep it in mind in case Microsoft S2D ends up not working at all.
-SPOF@reddit
The cluster is built to manage these types of failures smoothly by shifting ownership of the CSV to another node. You can adjust the SameSubnetDelay and SameSubnetThreshold settings to lower values if needed. By default, Windows Server Failover Clustering uses heartbeat signals to check the health of nodes, and your current settings allow the cluster to wait up to 10 seconds. Additionally, you can set up QoS policies on the network adapters to give priority to cluster heartbeat and storage traffic.
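For example (values are illustrative - a shorter window also means less tolerance of transient network blips):

# Delay (ms) x threshold = how long a silent node survives before removal.
Get-Cluster | Format-List *SubnetDelay, *SubnetThreshold
(Get-Cluster).SameSubnetDelay = 500      # 500 ms between heartbeats
(Get-Cluster).SameSubnetThreshold = 10   # 10 missed heartbeats = 5 s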
I don’t have much experience with 3-node S2D clusters. Usually, I work with 4-5 nodes where S2D manages things without issue. For a setup like yours, you might look at Starwind VSAN. It’s a very stable solution for 2-3 nodes, though I wish it could scale for larger clusters.
Bighaze98@reddit
Hi!
The S2D cluster with a witness disk tolerates a disk failure - or at least it should. On top of the heartbeat, the nodes also cast a vote to work out whether a node is actually dead or whether it's just a problem with the witness's reachability. That said, the quorum disk is no longer supported on the new version of S2D on Azure Stack HCI, so it goes without saying that the recommendation is to use either blob storage or an SMB file share (even just a folder you create on an existing server) as the witness.
Brilliant-Advisor958@reddit
When a cluster host goes down, its VMs don't fail over and keep running. What happens is the cluster detects that the VM is no longer running anywhere in the cluster and starts it on another host; from the guest's perspective, it behaves as if it crashed.
If you need 100 percent uptime, then you need other clustering mechanisms to keep the services running, like SQL Availability Groups or load-balanced web servers.
disclosure5@reddit
I think you're misunderstanding the issue.
If VM1 is running on host1 but host2 crashes, in an S2D environment you will frequently see VM1 crash too. It will then often not simply start on another host as you describe, because the disk will be marked offline.
lewis_943@reddit
For a disk to be marked as offline, it needs to have failed automated restart more than the defined retry threshold. Same as a VM.
Those thresholds can be modified to essentially configure the cluster to persistently retry both the disk and the VM.
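For example (resource/group names are placeholders; these are the standard failover clustering properties, with illustrative values):

# Let the CSV's disk resource retry aggressively instead of going offline...
$disk = Get-ClusterResource "Cluster Virtual Disk (CSV01)"
$disk.RestartThreshold = 10      # restarts allowed within RestartPeriod
$disk.RestartPeriod = 900000     # ms
# ...and likewise for the VM's cluster group:
(Get-ClusterGroup "MyVM").FailoverThreshold = 10
(Get-ClusterGroup "MyVM").FailoverPeriod = 2   # hours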
If the CSV has failed to come online, that's not a slow or extended handover/seizure of CSV ownership - that's an architecture issue. Either the system isn't HA-capable or something else is breaking things (3rd-party AV is entirely possible).
disclosure5@reddit
I feel you're not familiar with S2D in making this statement.
lewis_943@reddit
I feel you're not helpful in making any of your statements.
In the absence of a technical explanation (which I assume you don't have), you haven't given any specifics about the cluster on which you've observed this error, let alone how to recreate it. No one reading your comments can know if your setup or use case is of any relevance to them.
disclosure5@reddit
It is supposed to be tolerant of this failure, but that's never been my experience and no doubt some MVP will inform you that it's fixed in the next Preview release just like it has been for eight or so years.
BlackV@reddit
Note that word, OP - "tolerant" does not mean VMs will necessarily stay up.
disclosure5@reddit
I think the clarification is that it should not completely pants itself the way it does. You have about a 50% chance of disk corruption and VMs not booting after this activity.
lewis_943@reddit
Without going in to defend S2D too hard - it absolutely has weaknesses - this sounds like your workloads have either very high storage needs or very poor storage safety mechanisms.
Unexpected power loss and I/O pauses/stalls can still occur on bare metal and direct-attach storage. If your OS or app doesn't have the necessary safety mechanisms to cope with that, then that's an issue with the software, not the hypervisor. If your app has such a high volume of requests that those interruptions are show-stoppers, then that's an issue with architecture and scale, not the individual hypervisor software.
BlackV@reddit
yes indeed, I'm agreeing with you
Nettts@reddit
Stay away from S2D. That's my advice.
lewis_943@reddit
The issue OP is describing doesn't seem specific to S2D; the common CSV best practices haven't been followed. Their issue could occur on either a SAN or an S2D setup, or even a virtual guest cluster running on one of your listed alternatives.