Hyper-V, VMware, or other, which would you choose?
Posted by jedimaster4007@reddit | sysadmin | View on Reddit | 128 comments
I'm curious what y'all would choose to do in my situation.
We're a small org, currently have a 4-node VXrail VMware cluster running about 50 VMs. The cluster's been running since 2020, but support just ran out in December. For the vast majority of the cluster's life it has been rock solid, but with no support and aging hardware it feels risky to keep using it.
My predecessor wanted to transition to Hyper-V, so they bought three Server Datacenter 2022 nodes and two Dell PowerStore appliances, so that's the new cluster I inherited. For some reason they only included a 2-port NIC on each host, so each host only has one path for management and one path for iSCSI. Because of that we've lost the cluster twice due to unannounced switch firmware upgrades which brought down too many nodes at once, and for some reason even if I brought all but one node offline and tried to force quorum, I could never restore the cluster. In both cases I had to destroy the cluster and build a new one. It wasn't too devastating because we had only migrated a couple of non-critical VMs to test performance, and I just had to restore those from backups after building the new cluster.
The redundancy issues are easily fixed, but I'm more concerned about the cluster's resiliency. I've spent almost six months now trying to figure out why the cluster can't be restored after quorum loss, it's too complicated to get into all the details but even with expert consultation it's still a mystery. Having to build a new cluster isn't so bad when it's just a couple of non-critical VMs that go down, but the idea of having to build a new cluster with all of production completely down is nightmare fuel. So that leads us to a difficult choice.
Do we just add extra NICs to fix the redundancy issues and continue with the existing Hyper-V cluster hoping for the best? Or, do we take advantage of an optional (up to) $500k one time fund to buy a replacement VXrail VMware stack? Or a third option like Nutanix/Proxmox? Fixing the redundancy issues makes it less likely that the cluster would ever go down, we have really nice backup UPS and generator power as well, but I want to plan for the worst case scenario. We can always repurpose the PowerStores as file share servers, but I'm not sure what we would use the existing Hyper-V host servers for if we choose to pivot away from Hyper-V. I suppose we could try to convert the existing hosts to ESXi assuming that's possible, but since these hosts were intended for iSCSI storage they don't have enough storage for VXrail HCI. Although I suppose purchasing more storage for the existing hosts might be cheaper than buying brand new hosts especially with the cost of memory right now.
frosty3140@reddit
we have approx 35 VMs and we replaced our 3-host VMware (Dell R740s with SC5020 direct-attached storage) with a 2-host HyperV (Dell R660s with ME5024 direct-attached storage) for about 100K. very happy with result. no need for complex networking or storage for us. when we patch hosts we can run entire workload on a single host no problems. significant cost savings. but we did need to get the HyperV cluster built for us, as we didn't have those skills in-house.
Prudent_Cod_1494@reddit
Not sure if this is going to be controversial or not but at your size ProxMox might be a legitimate option.
Stonewalled9999@reddit
do not give any more money to Broadcom
jedimaster4007@reddit (OP)
I would prefer not to, but I'm torn because the alternatives are either sticking with Hyper-V which has given us lots of issues so far or learning an entirely new hypervisor, which risks just trading one set of problems for another.
TeddyRoo_v_Gods@reddit
ProxMox is fairly easy to move to if you are familiar with VMware. I moved my home lab to it years ago and almost convinced my boss to move our prod to it as well, but they lack US-based support, if that applies to you.
Vichingo455@reddit
Homelab ≠ business.
Sufficient_Prune3897@reddit
Yep, because in business you're unlikely to use most features. So business is easier.
Burgergold@reddit
Learning cost a lot less than broadcom
CharcoalGreyWolf@reddit
HyperV really isn’t bad, so I ask —what are your issues?
_bx2_@reddit
If you know one or two hypervisors, Proxmox is very easy to understand.
ZmajevaMuda@reddit
Nutanix?
iTinkerTillItWorks@reddit
Yes, though it is a "Hyperconverged Infrastructure" product and won't support the older OSes many companies still run.
Stonewalled9999@reddit
Hyper-V is not that bad either. Put it on some desktops with 64GB RAM and learn it, or hire someone (I could volunteer) to help you sort it. The sticky part is the setup, and you do that once a decade.
iTinkerTillItWorks@reddit
POC some other products, shop around a lil. Proxmox, Nutanix, OpenShift.
Imhereforthechips@reddit
Because you don’t have enough networking, you lose majority quorum. Not being able to automatically recover is intentional - prioritizing data integrity over availability. You can, in some cases, force start the node with -fixquorum.
You need more networking for redundancy.
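For reference, the documented force-quorum recovery path can be sketched in PowerShell. Node names here are placeholders, and -FixQuorum is just an alias of -ForceQuorum on Start-ClusterNode:

```powershell
# On the one node you want to keep (all other nodes shut down),
# force the cluster service to start without quorum:
Start-ClusterNode -Name "HV-NODE1" -ForceQuorum

# Equivalent legacy form:
# net start clussvc /forcequorum

# When the remaining nodes come back, start them in prevent-quorum mode
# so they join the forced node's view of the cluster instead of
# forming their own partition:
Start-ClusterNode -Name "HV-NODE2" -PreventQuorum
```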
jedimaster4007@reddit (OP)
Totally agreed. The redundancy issue is easily fixed, but I'm more concerned that even after restoring connectivity, I couldn't get the cluster to start even with -forcequorum. I made sure only one node was online, all network paths up, DNS and AD running and accessible, still no dice. Two different consultants haven't been able to figure out why the cluster won't start, and in both cases I've had to destroy the cluster and build a new one. So, even if we fix the redundancy issue, it still doesn't alleviate my concern about simply being able to restore the cluster as it could go down for lots of reasons no matter how unlikely. Both switches could simultaneously go down, and then I'd be in the same boat despite having NIC redundancy. I either have to 100% guarantee that nothing could ever bring the cluster down (impossible), or try to figure out why the cluster doesn't want to start after going down, which we've been trying to figure out for months.
Imhereforthechips@reddit
Some thoughts on that:
Can the node that is still online see cluster volumes?
Cluster enabled networking has to be online.
Is the cluster database accessible and functioning?
And end result, forcing quorum is just bypassing votes, so if the condition that broke still exists to a degree, it will not recover.
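Those preconditions can be checked from the surviving node; a rough sketch using default Windows paths and log names (this is a diagnostic checklist, not a fix):

```powershell
# Can the node see the cluster disks at the block level?
Get-Disk | Format-Table Number, FriendlyName, OperationalStatus, IsClustered

# Is the cluster service installed, and what state is it in?
Get-Service ClusSvc

# Does the local copy of the cluster database exist?
Get-Item C:\Windows\Cluster\CLUSDB

# Recent failover-clustering events often name the exact failure reason:
Get-WinEvent -LogName Microsoft-Windows-FailoverClustering/Operational -MaxEvents 50 |
    Format-Table TimeCreated, Id, Message -AutoSize
```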
jedimaster4007@reddit (OP)
Once the switches came back online, at first all three hosts were still online, and all three showed all MPIO paths connected and could see the CSV and disk witness in Disk Management. However, both were in the Offline state, and when I tried to bring them online it gave me an error that only the cluster can manage those disks. The problem was, I couldn't get the cluster to start on any node even if I brought all but one offline and tried to force quorum. Thus, 99% of functions in failover cluster manager weren't accessible and 99% of cluster-related PowerShell commands would fail due to the CNO being offline. DNS was working as the CNO hostname would resolve, but because the cluster was down none of the nodes were running the CNO resource, so it was unresponsive. ClusDB appeared to be present on all nodes under C:\Windows\Cluster, but when trying to start the cluster the error would say that the cluster configuration couldn't be found. I'm wondering if it was trying to access the cluster configuration stored on the disk witness and failed because it couldn't bring that volume online? If so, I couldn't figure out a way to force the node to use a specific path for the configuration to use. In that case maybe switching to a cloud or file share witness might avoid the issue, but that would suggest a major design flaw in disk witnesses, but I wasn't able to find any information supporting that theory in my research. Just some vague references to file share witnesses being "more reliable" but also somewhat less functional due to file share witnesses supposedly not storing configuration backups.
Imhereforthechips@reddit
I should have mentioned that a file share witness does not store a backup copy of CLUSDB, and the cluster does not let you point it at a specific configuration copy. A disk witness can store config data and help in recovery.
Imhereforthechips@reddit
The cluster was definitely not trying to read CLUSDB from the disk witness, but I could see how the error would seem that way.
Possibly it was refusing to trust any copy of CLUSDB after shared storage was reintroduced without authoritative membership.
I think a cloud or file share witness would very likely have avoided the deadlock you hit.
First_Slide3870@reddit
Hyper-V is good shit. I would add another NIC to your servers and get a dedicated iSCSI switch, something simpler than Meraki so you don't have to worry about "surprise" upgrades.
Other commenters have given solid advice about quorum, I've deployed many clusters but never had these issues.
illicITparameters@reddit
If money and broadcom weren't an issue, I would pick vmware 11/10 times. However, fuck Broadcom, Hyper-V all the way.
MrSanford@reddit
Proxmox
cantstandmyownfeed@reddit
If you're running Windows VMs, and you already have Datacenter licenses, Hyper-V is a no-brainer. It's a very good platform: mature, supported, reliable, and the cost is fixed.
jedimaster4007@reddit (OP)
I honestly agree and really like Hyper-V. I've just lost a lot of confidence in Windows failover clustering due to the major issues we've had being unable to restore the cluster after quorum failure, so I'm torn.
cantstandmyownfeed@reddit
You lost confidence in a product, because another component of your stack is poorly managed or setup?
Hyper-V / failover clustering wasn't the issue there, and a failed quorum should never require rebuilding the cluster; you've got something else going on.
But, if your nodes maintained internet connectivity, move your quorum to an Azure Cloud Witness.
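If you go that route, the change itself is one cmdlet; a sketch assuming the cluster service is up and an Azure storage account already exists (account name and key below are placeholders):

```powershell
# Replace the current witness with an Azure Cloud Witness.
Set-ClusterQuorum -CloudWitness -AccountName "mystorageacct" -AccessKey "<storage-account-key>"

# Verify the new quorum configuration:
Get-ClusterQuorum
```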
jedimaster4007@reddit (OP)
I don't deny that the lack of NIC redundancy is what led to the cluster going down, and that problem is easily fixed. If we had proper NIC redundancy, in both of these cases the cluster never would have gone down. My concern is that we can never fully 100% guarantee that the cluster will never go down, crazy coincidences can happen no matter how unlikely. Even if we fix this redundancy issue, something else could happen which causes two or more nodes to lose connectivity and then the cluster will go down. If that happens, I'll be back in the same situation where I can't get the cluster to come back online, even using the officially documented procedure to start a single cluster node with -ForceQuorum. Months of research and two different consultants couldn't figure out why that's happening, so I can either:
Ben22@reddit
Please start with fixing the NIC issue. It seems obvious from the outside. I'm sure you have a reason for not doing it, but ripping out Hyper-V really seems like burning down the house because you have a leak.
frankv1971@reddit
We've been running Windows Hyper-V, and now S2D, for almost 20 years (yes, we started with Windows Server 2008). In that time we've not once had a cluster failure. The key is redundancy. We once had an iSCSI storage failure where one node got destroyed (during an all-disk replacement due to a bad disk batch, the datacenter employee swapping disks one at a time pulled one disk too many before the rebuild finished), and even then our Hyper-V cluster kept working.
Having said that, in the past we worked with iSCSI storage and found that once we moved to all-flash S2D, performance was much better. We have our third S2D cluster running now.
Ruachta@reddit
We're an MSP and manage a lot of Hyper-V and VMware clusters. Hyper-V is fine for small setups.
Your configuration was set up poorly. Add NICs, and hire a partner to review and validate your setup.
firedocter@reddit
I think that is part of the problem with Hyper-V. It is easy to configure wrong and not know it for a good while, whereas VMware will absolutely kick up a fuss if you set it up without redundant NICs.
SN6006@reddit
Proxmox has been good. I would go hyperconverged with at least 3 hosts on different switches, because those outages are rough
Vichingo455@reddit
I would go Hyper-V in this case, as you already have the licenses for Windows Server. Otherwise I would take a look at Proxmox (even though, in my experience, it's less stable than everything else).
koshia@reddit
Stick with Hyper-V and add NICs. Integrate Azure Arc; your Datacenter licensing (if it has Software Assurance, which it usually does) comes with Azure Update Manager, inventory/change tracking, and some other useful services.
I quickly pivoted from VMware to XCP-ng due to their BS, but I'm starting to realign and get back to a full Microsoft stack.
rubbishfoo@reddit
Do none of you read anything but the headline?
Yes, extra NICs to solve the redundancy issue.
Lots of people here recommend Proxmox; I have no practical experience with it, perhaps a homelab at some point. I've heard good things about KVM too, but I don't know that either.
Always plan for a worst-case scenario, and a recommendation would be to balance the workloads across the available cluster members. Make sure you know what it looks like with everything running on 2 hosts... and you'll probably want to know whether a single host could hold all the water.
Another suggestion: make sure your witness object (FSW?) is on a reliable system and that the cluster name object (CNO) in DNS can survive any node failure transitions.
jedimaster4007@reddit (OP)
I appreciate you.
We do have a 1GB disk witness served by a dedicated LUN in the primary PowerStore. From my research, since we have three nodes + witness we can tolerate one node failure, then a second after some time, but not two nodes at once. We have two linked switches which both have direct connection to the core, but because of the lack of NIC redundancy, two of our hosts are connected to one switch while the third is connected to the other. Even when our network guy upgraded the switches one at a time, once the switch with two hosts connected went down, the cluster lost quorum and failed. This is where it gets weird.
Once the cluster is down, the CNO is basically dead. The IP doesn't respond because none of the nodes want to take it, I assume. With the CNO being down, the vast majority of functions in Failover Cluster Manager are inaccessible. Similarly, almost all PowerShell commands pertaining to the cluster do not work, because the CNO is unreachable. Both the CSV and the disk witness volume are offline but still visible in the hosts' Disk Management, but we can't bring them online because they are managed by the cluster. Basically it tells us we have to use failover cluster manager to do anything with those volumes, but those functions within failover cluster manager are inaccessible. It's like a complete gridlock where everything depends on each other but everything is down, so nothing can be restored. As soon as I destroy the cluster, the CSV and witness volume come back online, but by then it's too late (I assume).
So, yes if we solved the redundancy issue the cluster never would have gone down in both of these cases. Even so, it's impossible to completely prevent the possibility of the cluster failing, and I don't like the fact that the cluster seemingly can't be restored once it goes down like that. I don't know if a cloud or file share witness might help?
rubbishfoo@reddit
I'm not sure if this would fit your use case or not... but perhaps a small hardware DC that could communicate on the same subnet and then plug into the switch (and then point the interface's DNS to it)... that way it'd always (hahah, "always"...) be available. That would persist the CNO's supporting AD/DNS so it essentially exists outside of the solution.
IE, a hardware standby DC running DNS/AD roles is ALWAYS a good thing. Ask Maersk!
RedGobboRebel@reddit
This is almost exactly how I setup Hyper-V FCs.
A separate low-power/low-resource node (or nodes) outside the cluster, on the cluster's management subnet, dedicated to the infrastructure resources the FC needs to run successfully. On a Server Core install, it hosts AD, DNS, and the FC witness for the cluster's subnet. Personally, I do use a virtualization layer rather than placing those roles on bare metal, both because the AD/DNS core VM can bounce extremely quickly versus a bare-metal reboot, and because this node is also where I place a "jump box" VM with RSAT and other tools. Then I don't need to jump into a VM in the cluster to manage the cluster.
Again, Server Core with AD/DNS and a tiny SMB share needs very few resources. You just want typical redundancy: mirrored drives at a minimum. I prefer hardware with dual PSUs; older hardware you're decommissioning works well if you're on a budget.
jedimaster4007@reddit (OP)
Oh yeah I've learned the hard way the risks of not maintaining a physical DC, we've got that covered for sure.
RedGobboRebel@reddit
I'd suggest considering a witness outside of this cluster's resources. A witness disk on the same array and storage network as your cluster means that any issue on that storage network or array will land you in these kinds of cluster quorum issues again.
I've heard good things about the Azure witness option, but a cheap, reliable SMB setup on completely separate hardware works well, too. The witness doesn't need to be on high-speed storage; it just needs to be reliable.
jedimaster4007@reddit (OP)
I'd like to hope that a file share witness will solve our problem instead of using a disk witness. It especially makes sense, now that I'm thinking about it, as from what I've researched the file share witness doesn't store a backup of the cluster configuration. I assume that would force the cluster to use the configuration on one of the nodes rather than trying to use the disk witness volume that refuses to come online. I'll try to test that.
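If it helps with the test, the switch itself is a single cmdlet; the share path below is a placeholder, and the share just needs to be reachable by all nodes and writable by the cluster name object's computer account:

```powershell
# Swap the disk witness for a file share witness on storage
# outside the cluster's own array:
Set-ClusterQuorum -FileShareWitness "\\FS01\ClusterWitness$"

# Confirm the new quorum configuration:
Get-ClusterQuorum | Format-List
```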
Civil_Asparagus25@reddit
Proxmox all the way
sleepmaster91@reddit
Get as far as possible from Broadcom(vmware)
nwmcsween@reddit
Unlimited technical-knowledge budget: Talos or Kairos Linux + KubeVirt with TrueNAS storage. To the naysayers: TrueNAS is literally used in 60% of F500 companies.
Ozwulf67@reddit
Just stick with VMware, but not VxRail, as Dell is ending sales on those. Get a small vSAN-ready node cluster from Dell and pay them to deploy a small VCF 9 cluster. I don't care what anyone says: if your business relies heavily on your VM servers and you can't suffer downtime, you are left with pretty much two choices, VMware or Nutanix. And the notion that Nutanix is cheap is a joke. Spend the money, get some peace.
I work for a large academic healthcare org (6 hospitals and 150 clinics). When Broadcom hit the university (they carry the master ELA) with a 300% increase, we looked long and hard at other solutions. KVM requires twice as much hardware, Hyper-V was not as reliable or flexible, and Nutanix was not that much cheaper and still niche in our minds. We paid and will continue to do so.
Good luck
ksteink@reddit
Why not Proxmox?
Expensive_Recover_56@reddit
NEVER EVER use VMware again, unless you want to pay thousands and thousands of dollars in license fees.
There are better options, but never VMware.
jedimaster4007@reddit (OP)
Back in 2020 the total cost of our 4-node VXrail cluster was just over $150k. With the Broadcom fuckery, tariffs, and AI causing memory cost to skyrocket, I assume the cost will be at least double to replace it. That said, if the cost is under $500k for all the hardware and five years of support in a one time payment, I'm willing to spend the coin just for the sake of having more confidence in the cluster's resiliency.
AudreyML3@reddit
I’d price it out before making a decision. My 2020 costs are now 5x. They doubled just in the last 90 days and hardware jumped a ton during Covid.
tsaico@reddit
Man, my small orgs budgets for server hardware is like 20k to 50k. I think I am in the wrong small orgs
jedimaster4007@reddit (OP)
I guess we're technically midsize? 300 users, municipal government. Government has a reputation for being on the low end budget-wise, but we occasionally benefit from things like grants, Cares Act (which paid for the last cluster), ARPA, etc. In this case, our state requires that we keep a certain amount of money in emergency funding stashed away, and we figured out we were saving way more than we were required to, so a lot of unexpected funding is now available. It's competitive as all departments are submitting their proposals to use the funding, but we are confident that maintaining the software platforms we use for daily operations will be treated as a high priority.
JibJibMonkey@reddit
"we are confident that maintaining the software platforms we use for daily operations will be treated as a high priority."
I wish this were the case but I have yet to see it.
jedimaster4007@reddit (OP)
Ordinarily I would agree. Our case is a bit unique because before I got here, this org had a ridiculously incompetent IT team that made life a living hell for everyone. The current team gets a lot of special treatment because we've fixed almost all the issues from the previous regime and these people know just how bad it can get when the technology fails, they're highly motivated to make sure that never happens again. I don't expect that to last forever though.
ashimbo@reddit
Just add more NICs to the hyper-v hosts and setup a brand new cluster from scratch.
weed0z@reddit
Nutanix or Proxmox cluster
Kill3rT0fu@reddit
Proxmox
corruptboomerang@reddit
Proxmox or Hyper-V.
Just be aware, if you're entertaining both options, moving from anything TO Proxmox is fairly easy, moving FROM Proxmox to anything is fairly easy. Moving out of Hyper-V is much more difficult. So if you want to try before you buy, I'd suggest trying Proxmox first.
KavalierMLT@reddit
Hyper-V if windows, redshift if linux.
smoothvibe@reddit
Proxmox
Pantheonofoak@reddit
HPE VME
SandboxIsProduction@reddit
VMware burned everyone with Broadcom licensing. Hyper-V works until you need anything beyond basic. Proxmox is where smaller shops are landing, but no enterprise support contract scares the CIO. Pick your pain.
basec0m@reddit
Running my hyper-v cluster for 6 years... rock solid.
smellybear666@reddit
If you're mostly Windows: Hyper-V.
If you're mostly or entirely Linux: Proxmox.
NICs are cheap. I get used HPE 25Gb NICs for less than $50. Get as many as you think you need.
urb5tar@reddit
Proxmox for the win. We have the same number of VMs and the system is blazing fast: a three-node cluster, hyperconverged with Ceph.
TireFryer426@reddit
We were fully prepared to go Hyper-V/Netapp until Nutanix came in with numbers we couldn't ignore.
We are also coming off of VXRail and VMWare.
Nutanix has a lot going for it. We felt like we were compromising going with Hyper-V, but knew it would work. Once the Nutanix numbers dropped below what the Dell front ends and the NetApp were going to cost, it almost became a no-brainer. They will dig deep on initial pricing, but keep in mind that they make their money on software renewals, so look at the long-term TCO. It still made sense for us.
Sintarsintar@reddit
anything but vmware
JLee50@reddit
VMware is dead to me.
anonpf@reddit
Sounds like you already have a solution but need to fix the redundancy issue with the network. Is it not possible to add a quad-port NIC to the servers to create multipath?
jedimaster4007@reddit (OP)
Absolutely, that part is easy. My concern is the fact that after the cluster goes down and all paths are restored, the cluster refuses to come back online. Following the documented procedure to restart the cluster with forced quorum just doesn't work, and even with two consultants helping we haven't been able to figure out why that's happening.
So, yeah we can fix the redundancy issue easily and reduce the chance of the cluster going down to near zero. However, there's no way to completely prevent the cluster from going down due to some other failure no matter how unlikely. It's possible, even if extremely unlikely, that both switches could go down simultaneously and the cluster would lose heartbeat. If that happens after we've moved all our production VMs to this cluster, and I'm not able to bring the cluster back up without completely destroying and rebuilding, that's going to be an absolute nightmare.
mydoorisfour@reddit
Proxmox easily. We're even moving all of our Hyper-V servers to Proxmox
FactMuch6855@reddit
Add NICs. You should be able to fix the cluster goofiness. Hyper-V is fine, especially if you like it and are comfortable with it. I've used Hyper-V and VMware forever but I'm kicking Broadcom to the dirt over the next few months.
jedimaster4007@reddit (OP)
Worth a shot, I'll add NICs and switch from disk witness to file share witness, then once all that is set up I'll yank some cables and see if the cluster wants to cooperate.
xSchizogenie@reddit
4 switches, 4 NIC paths.
OinkyConfidence@reddit
Hyper-V. With Broadcom not caring about VMware the way they should, Hyper-V is a strong recommend these days.
jedimaster4007@reddit (OP)
I've felt the same way for years, but I've lost a lot of confidence in Windows failover clustering because of the issues I've experienced with this cluster. If I could figure out why we're unable to restore the cluster after loss of quorum I would feel a million times better about sticking with Hyper-V, but even with multiple expert consultations we have been unable to figure it out.
TechSupportIgit@reddit
...I think you've already scoured this page.
https://learn.microsoft.com/en-us/windows-server/failover-clustering/recover-failover-cluster-without-quorum?tabs=failover%2Cautofailover
Make sure the cluster volumes are visible, iSCSI is properly connected to each node, blah blah. I'm assuming the consultations have already tried this.
jedimaster4007@reddit (OP)
The CSV and disk witness LUNs are definitely visible after the switches come back online, I can see them in Disk Management on each host, but they're offline. If I try to online them, it basically says the disk is managed by the failover cluster and I have to use the cluster to manage them. The problem is, because the cluster is down, I can't access the vast majority of functions in failover cluster manager. In both cases I tried what was recommended in that article, power down all but one node and start-cluster -forcequorum. I tried that for each node one at a time but no matter what, the cluster would not start. It's so weird and frustrating.
TechSupportIgit@reddit
I'm curious, what happens when you try to run this?
Start-ClusterNode -Name "YourNodeName" -FixQuorum
jedimaster4007@reddit (OP)
It tries to start but ultimately fails. I've since destroyed the cluster and built a new one so I can't test it anymore, but from what I recall it was failing because it couldn't access the cluster configuration backup. I believe by default the cluster configuration backup is stored on the disk witness volume, but the disk witness volume was offline. I tried to bring the disk witness online through Disk Management, but it said I had to use failover cluster manager. I couldn't access 99% of the functions within failover cluster manager because I couldn't get the cluster to start, with every imaginable permutation of having one node online, running every variation of Start-ClusterNode -Fix/ForceQuorum, but the cluster would never start. Since I can't access almost anything in failover cluster manager and basically none of the cluster-related PowerShell commands work, I couldn't do anything to the disk witness. Tried unmapping and remapping the LUNs multiple times, unmounting just the CSV, when mapped they just stay offline in Windows. Interestingly, as soon as I destroy the cluster, I can online the LUNs in Disk Management no problem. I assume by then with the cluster destroyed it's too late to do anything with the disk witness, but I could be wrong.
TechSupportIgit@reddit
Yeah, not much else you can try doing anymore now.
Your config, though, is highly suspect, and I'd look into getting SFP NICs for your nodes. Have the nodes connect directly to your PowerStores, then manage the PowerStores through OOB. But that's just where my brain is going; I've never looked much into PowerStores or done the hardware install myself. Our environment took the white-glove install from Dell for our cluster.
jedimaster4007@reddit (OP)
Connect directly as in no switches for iSCSI? Right now we just have two top-of-rack switches connecting both management and iSCSI on different VLANs, I'm leaning towards getting a separate pair of SFP switches specifically for iSCSI since the existing switches have very few SFP ports available. But yes we're definitely getting a four port SFP card for each host and might even add a second four port card later on.
Jhamin1@reddit
Running a whole Hyper-V cluster on 2 nics per chassis is pretty iffy. Best practice is to have a management network, an iscsi network, a "heartbeat" network, a live migration network, and a network to attach the Hyper-V virtual switch (aka network access for the VMs). I'm just guessing, but it's possible that those are all running as vlans on your 2 nics, but that puts a level of abstraction onto all this that may be a source of your issues.
Again, just spitballing here, but I wonder if you couldn't get the clusters back up because the heartbeat network isn't coming back correctly. Likely due to a misconfig, maybe the heartbeat traffic is going out the wrong vlan?
If you are talking about 3 nodes & 2 storage targets you should either directly wire the nodes to storage skipping a switch, or (Ideally) you should have *two* dedicated iSCSI switches and each node should have at least two ports connected to each of them. Hopefully from different NICs in the node chassis.
The idea is that any given piece of the chain can fail & the node will retain a valid network path to storage. If a NIC fails, you have another NIC with all the same connections. If a switch fails, you have another switch that has all the connections needed.
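As a rough sketch of that layout on the Hyper-V side (adapter names below are placeholders): a Switch Embedded Teaming (SET) vSwitch for management/VM/migration traffic, while the iSCSI ports stay un-teamed and rely on MPIO for path redundancy:

```powershell
# Team two NICs (ideally on different physical cards) into a SET vSwitch
# for management and VM traffic:
New-VMSwitch -Name "SETswitch" -NetAdapterName "NIC1","NIC3" `
    -EnableEmbeddedTeaming $true -AllowManagementOS $true

# iSCSI NICs are NOT teamed; enable MPIO and claim iSCSI devices instead:
Enable-WindowsOptionalFeature -Online -FeatureName MultiPathIO
New-MSDSMSupportedHW -VendorId "MSFT2005" -ProductId "iSCSIBusType_0x9"
```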
jedimaster4007@reddit (OP)
Currently there are only two networks configured in the failover cluster, one for iSCSI and one for (I'm assuming) all the rest. I don't see heartbeat and live migration specifically defined, but Cluster Network 1 is for iSCSI and under cluster use it says None. Cluster Network 2 is for management/Internet and under cluster use it says Cluster and Client. Maybe there are some PowerShell commands that will give me more detail. Eventually I would love to have each of those networks split up into dedicated interface pairs, at the moment we have a major shortage of available SFP ports on the existing switches, but I do love the idea of getting some dedicated SFP switches for storage, migration, and heartbeat. Would probably just need to add an additional 4-port NIC to each host, then have two interfaces for each network.
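For more detail than Failover Cluster Manager shows, something like the following should work (network names are placeholders; the Role values are 0 = None, 1 = Cluster only, 3 = Cluster and Client):

```powershell
# List cluster networks and how the cluster is allowed to use each one:
Get-ClusterNetwork | Format-Table Name, Role, Address, State

# Dedicate a network to cluster-internal traffic such as heartbeat
# and CSV redirection (once dedicated interfaces exist for it):
(Get-ClusterNetwork "Cluster Network 3").Role = 1
```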
TechSupportIgit@reddit
This. Thiiiiiiiiiiiiiiiiiiiiis.
TechSupportIgit@reddit
My mistake, I had it in my head that there were enough SFP ports on the PowerStores for 6 connections.
But yes, get a dedicated switch and don't let whoever does the switch firmware updates touch it. Use it only to interface the power stores with your nodes.
TechSupportIgit@reddit
You need to reconfigure the quorum disk once the first node is force started. Then make sure each node is allocated with a vote. Then bring back up the other nodes.
If nothing else, might have to rebuild the whole cluster from scratch.
jedimaster4007@reddit (OP)
Yeah ultimately in both cases we had to destroy and rebuild from scratch. That's what worries me, if we can't figure out why the cluster doesn't want to start after going down, it feels risky to continue moving all our production VMs to the cluster. There's no way to 100% guarantee that the cluster will never go down for any reason, I want to know that I'll be able to restore the cluster and not have to destroy and rebuild, because then I would have to restore every VM from backup and reconfigure all the backup jobs with 300 users screaming at me.
sryan2k1@reddit
Go with what you know and can support. People hate Broadcom, rightfully so, but VMware just fucking works. Microsoft support is beyond non-existent if you ever need help.
chefkoch_@reddit
Drop $400 for 4 10Gb NICs; how didn't you do this 6 months ago?
jedimaster4007@reddit (OP)
I hadn't looked too closely at the new cluster hardware; at the time there were many other high-priority projects and fires to put out from the previous IT team. We had just set up a SOW for a consultant to go over the whole setup top to bottom right before the first time the cluster went down. The SOW was restricted to evaluating the configuration, with no troubleshooting (to save cost), so we had to delay that project until I could get the cluster restored. Since only a couple of non-critical VMs had been migrated, I restored those to a standalone Hyper-V host outside the cluster and put it on the back burner while we finished our other projects. Now that I have more time, I've looked into it and discovered the redundancy issue. I made the mistake of assuming the previous team wasn't so stupid as to order Hyper-V hosts with only two NICs each.
Lost_Term_8080@reddit
You need to address the network path redundancy issue, but your hyper-v cluster has some other serious issues that are probably configuration problems in the cluster or RTFM problems in how the hosts were built. It takes A LOT to seriously screw up a failover cluster instance, they can be badly misconfigured, have some degree of DNS/AD problems and normally still work reliably. On that note, it could be some serious DNS/AD problems as well, but they would have to be so severe that you would be noticing it everywhere in the network.
jedimaster4007@reddit (OP)
As far as I can tell DNS and AD seem fine after the outage, our virtual DCs are still in the old VXrail cluster and we have a physical DC as well. I can ping the CNO hostname from each node and it resolves, but it's just unresponsive because the cluster is down and none of the nodes are running the resource. MPIO is running although arguably it doesn't matter since, for the moment, each host only has one path to the storage network. That will definitely be fixed easily.
There are so many little things that could be contributing. The theory that seems to have the most promise has to do with us using a disk witness on the same PowerStore as the CSV. The disk witness is not added to the CSV, it's just in cluster available storage, but the theory is that the witness may be needed to bring the cluster online, and since the cluster is down, the witness volume can't be brought online and is therefore inaccessible, a catch-22. Even then, I feel like forcing quorum on a single node should bypass the need for the witness to be online, yet we could never get it to work. But still, perhaps if we switch to a file share or cloud witness, the witness would remain accessible with the cluster down. Otherwise who knows, maybe it's a random MPIO driver issue, some deep-level setting in the PowerStore that isn't visible in the UI, tons of possibilities. I've verified that MTU settings are correct everywhere, but maybe some other obscure switch-related setting is causing issues. I've gone down dozens of rabbit holes and haven't nailed down the answer yet, unfortunately.
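If it comes to testing that theory, swapping the disk witness for a file share or cloud witness is a one-liner with `Set-ClusterQuorum` (the share path and storage account names below are placeholders):

```powershell
# Show the current quorum configuration, including witness type and resource
Get-ClusterQuorum

# File share witness on a server that is not a VM inside this cluster
Set-ClusterQuorum -FileShareWitness "\\fileserver\ClusterWitness"

# Or an Azure cloud witness (Server 2016+)
Set-ClusterQuorum -CloudWitness -AccountName "mystorageacct" -AccessKey "<storage-key>"
```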
joeykins82@reddit
Your Hyper-V cluster is unreliable because it's poorly designed/spec'd.
2 high-speed NICs would likely be fine if you were using internal storage, but the minute you add iSCSI to the mix you can't be running like that. Everything needs to be designed with redundancy & resilience in mind: 2 NICs for management & OS traffic plus 2 NICs for iSCSI is the minimum here. You also need at least 1 Domain Controller running outside the cluster, DNS resolution for the hypervisors that functions even if all VMs are stopped, and an external quorum mechanism (either a file share witness on a non-VM host, or an Azure cloud witness).
Fix those design issues and you'll be able to pull the plug on the cluster and then have it recover just fine.
jedimaster4007@reddit (OP)
Totally agreed, and we will definitely add NICs to ensure that both management and iSCSI have at least two cables running to different linked switches. However, I'm still worried, because even with the existing lack of redundancy, once the switches came back online all paths were accessible. We have a physical DC outside the cluster; DNS, AD, and network connections were all working. All hosts could see the CSV and disk witness, but they wouldn't come online because the cluster was down, and I couldn't start the cluster even following the official procedure of bringing all nodes offline except one and running Start-ClusterNode -FixQuorum. No matter what, I couldn't get the cluster to start, in both cases, and two different consultants couldn't figure it out either. My concern is that even if the redundancy issue is fixed, if something else down the road causes the cluster to go down, such as both switches unexpectedly failing (no matter how unlikely that is), there's effectively no difference in the resulting situation. We get the switches back online and all networks are restored, but since I couldn't get the cluster to start in the exact same situation before, it's likely that it still wouldn't start in this new scenario. Until I can figure out why the cluster won't start after going down, I essentially have to just hope that the cluster never goes down, that nothing unexpected or unpredictable will ever happen. The only thing I can think of is: if we switched to a file share or cloud witness instead of a disk witness, would that witness come back online even with the cluster down? If so, that would seem to imply that disk witnesses are a bad idea, if by design they won't come online unless the cluster is running.
joeykins82@reddit
Windows Server has a ton of guardrails. All of the cluster shared disks are probably being kept offline or set to read-only mode because of something like MPIO not being set up, or because the SAN policy is incorrect and needs to be manually changed from OfflineShared to OnlineAll.
I'm pretty much certain that with an appropriate number of NICs for what you're doing you'll see the problems you've encountered thus far go away.
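Checking and changing that SAN policy is quick with the Storage module (or diskpart on older builds):

```powershell
# Show the default behavior for newly discovered disks (OfflineShared is the
# usual server default and keeps shared SAN disks offline on boot)
Get-StorageSetting | Select-Object NewDiskPolicy

# Allow shared disks to come online automatically
Set-StorageSetting -NewDiskPolicy OnlineAll

# diskpart equivalent:  DISKPART> san policy=OnlineAll
```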
jedimaster4007@reddit (OP)
Worth testing for sure; we should be ordering the new NICs soon. Once that's all configured, I'll temporarily move the couple of non-critical VMs to local storage and bring the switches down one at a time to see if the cluster wants to come back online.
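Once the extra NICs are in, MPIO's view of the paths can be confirmed per LUN before any failover testing; these are read-only checks:

```powershell
# Devices MPIO has claimed on this host
Get-MPIOAvailableHW

# Per-disk summary: number of paths and load-balance policy
mpclaim -s -d

# Vendor/product IDs the Microsoft DSM will claim
Get-MSDSMSupportedHW
```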
DailonMarkMann@reddit
That VXRail stuff is expensive. Do you have the hardware to migrate off of it and keep the hosts running VMware? As for going to Hyper-V, a lot of people are doing it, but I tried it once and it just didn't work as well. I've been told it is better now, but the juice isn't worth the squeeze at this point.
jedimaster4007@reddit (OP)
I'd be willing to try converting the existing Hyper-V hosts to ESXi and just doing a basic vSAN setup. I figure it's gotta be cheaper to buy the licensing and extra hardware as opposed to whole new hosts, especially with the cost of memory right now. Plus I never really saw the value-add of VXrail over standard vCenter, so maybe that could save us some cost as well. I've just never done a conversion from Windows Server Datacenter to ESXi before.
DailonMarkMann@reddit
The VMconverter tool used to work really well; it didn't care what the host was, lol! I was at a meeting the other day with a reseller. The customers on VXRails were complaining about the price increases. I rolled my eyes because everyone complains about VMware pricing now, but when I saw the pricing I couldn't believe it. It was like an order of magnitude more expensive to keep the VXrail license. To heck with that! As for the conversion, find a good guinea pig and start migrating!
Assumeweknow@reddit
Xcp-ng...
thesals@reddit
Same, migrated to it a year and a half ago and haven't looked back.
Assumeweknow@reddit
Been dealing with it since Xen days and honestly it's so stable. Driver support sometimes lacks, but the stability level is pretty insane.
thesals@reddit
Only issue I'm currently facing is that I can't update 2 of my pools to 8.3 as their Dell PERC H355 controllers aren't supported yet... But 8.2 works flawlessly on them
dengar69@reddit
More NICs with redundant switches and redundant APCs.
AmiDeplorabilis@reddit
From my position, it first depends on how many VMs are involved and the server I plan to use.
For example, I manage a couple of small clients: I just installed a Hyper-V server (no CALs involved) at a small clinic and I'm running one ESXi server (no vSphere) at another.
In the end, it boils down to the number, OSes, and use of the VMs one has to manage, and the costs one is willing to incur to run them, to say nothing of relative platform stability.
Gary_harrold@reddit
Failover clustering can be a bit of work to do correctly, and it sounds like there was some "learning on the job" when your cluster was built. I second the sentiment that it is actually morally wrong to give Broadcom more business.
I really liked MicroCloud and Proxmox as alternatives. At your scale, both options would work fine. There are cheaper alternatives than VMWare that perform the same function.
It really just depends on your internal skill levels and appetite for responsibility.
In my experience at small orgs, your biggest issue is going to be getting enough hardware and time to actually build out a solution. I would bet that there are whole areas of your infra that should be reimagined to fit into modern standards.
jedimaster4007@reddit (OP)
Oh yeah we've got a lot of problems to fix from the previous IT team, but we've made a ton of progress in the last year. Being behind the times is kind of unavoidable for government orgs, but we're doing what we can. How is the support experience for MicroCloud and Proxmox?
sheep5555@reddit
I wouldn't go back to Broadcom/VXrail; we were in the same situation as you, including the VXrail, and Dell quoted us $500k as well.
At a minimum you should get extra NICs for your hosts and try to sort that out; if that doesn't work, you can repurpose your hardware for Proxmox. Get a consultant's help (Proxmox partners).
DarthJarJar242@reddit
Based solely on the tech... VMware all day everyday and it's not even close.
Include the shady shit Broadcom has gotten up to recently and their insane pricing structure, and Hyper-V has been looking better and better, with Proxmox a close second.
leadout_kv@reddit
Agree, VMware based solely on tech. But then add that it's a very small org, 4 nodes with 50 VMs, and VMware is probably the last choice. Hyper-V or even an open-source solution like Proxmox should be a serious consideration.
Temporary-Library597@reddit
You talked yourself into your path before your OP here. Why ask if your answer is "We've had trouble with Hyper-V" to every single response here?
jedimaster4007@reddit (OP)
I guess I just want to make sure I'm making the right choice. And honestly I've always liked Hyper-V, I've argued against people who insist that Hyper-V is garbage because they haven't used it since before 2012. Part of me hopes that someone will magically have the answer for why our cluster doesn't want to restore, almost six months of googling and consultants and we've been unable to figure it out. If I could just figure that out, I would feel a million times better about staying with Hyper-V.
JerikkaDawn@reddit
Everyone's ignoring the fact that OP is already planning on fixing the redundancy issues, but even with those fixed, OP needs to plan for the worst, and everyone's ignoring OP's issue with that part. People are telling OP to set up witnesses they already have and to add networking they're already going to add, while not addressing the cluster rebuild situation.
YouDoNotKnowMeSir@reddit
Proxmox or Openstack
Ytijhdoz54@reddit
As comfy as I feel with VMware, Hyper-V is the future imo. Broadcom really knows how to turn its client base against them.
Possible-Shelter-800@reddit
Hyper-v
TNO-TACHIKOMA@reddit
It's really up to how much money you have.
If you have VMware kinda money but, yeah, fuck Broadcom, then go Nutanix.
Else, gotta suck it up with VMware. Proxmox is just not up to par with its over-decade-old QEMU.
If you can accept Chinese stuff, look at Huawei and Sangfor HCI appliances.
Stonewalled9999@reddit
Do the hosts still have the ability to NPAR? We split our 2-port 10G into 2G management, 2G VM, and 6G iSCSI, then teamed them in the OS and presented the one NIC to the VMs.
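For the teaming half of that setup, the usual Hyper-V approach on 2016+ is Switch Embedded Teaming rather than LBFO; a minimal sketch, where the adapter and switch names are made up and the iSCSI interfaces are deliberately left out of the team so MPIO handles those paths:

```powershell
# One converged vSwitch over both physical ports (SET team)
New-VMSwitch -Name "ConvergedSwitch" -NetAdapterName "NIC1","NIC2" -EnableEmbeddedTeaming $true

# Host-side vNICs for management and live migration traffic
Add-VMNetworkAdapter -ManagementOS -Name "Mgmt" -SwitchName "ConvergedSwitch"
Add-VMNetworkAdapter -ManagementOS -Name "LiveMigration" -SwitchName "ConvergedSwitch"
```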
jedimaster4007@reddit (OP)
I'm not familiar with NPAR, but mainly I just want to make sure MPIO has at least two paths to the storage network.
Stonewalled9999@reddit
Google Dell NPAR; it will do what you need.
Aggressive_Common_48@reddit
Just started working with Hyper-V as we had the Datacenter license; otherwise I would have gone with Proxmox.
Sensitive_Scar_1800@reddit
lol still loving the dream on VMWare, but I guess that’s an unpopular opinion
valenx@reddit
We're replacing VMware with Proxmox.
SofterBones@reddit
Used to have VMware, moved to Proxmox due to Broadcom fuckery. Hyper-V decent choice that some of my colleagues have moved to.
jedimaster4007@reddit (OP)
I actually like Hyper-V a lot. I guess it's just that the multiple cluster failures we've had erode my confidence in Windows failover clustering.
tarvijron@reddit
Hyper-V, and buy two more network cards, you silly goose. Why would you put yourself INTO the bear trap of VMware at this point, especially given that they do not give one thin, runny crap about customers who aren't doing millions of dollars of business with them a year?
netsysllc@reddit
do not give Broadcom money
Azuras33@reddit
If mostly Windows, Hyper-V, If mixed, Proxmox.