VMs slow on Dell server

Posted by ntuner@reddit | sysadmin | 40 comments

Any ideas what could cause VMs to run slow on a Dell server (R650) all of a sudden? It was running the latest vSphere 8, and one day the VMs started being slow; guest VM CPU hovers around 95% without anything running. I have a couple of hosts like this, so it's not a single-host issue. All firmware is up to date. I already wiped and reinstalled ESXi and even tried Hyper-V: same issue with both hypervisors. The VM is on local SAS storage, and all Dell diagnostics return no issues. esxtop shows normal values. Tried a different drive for the OS, same issue. Running out of ideas for what to check next.

[-]

Mehere_64@reddit

What is happening with the VM causing high cpu?

[-]

ntuner@reddit (OP)

It's pretty much useless, very laggy. A user simply opens Explorer or browses local files and the VM CPU spikes to 95-100%. Then, if the user doesn't do anything, it eventually comes down to maybe 50%, but that takes a while, and it keeps spiking very often even with nothing going on.

[-]

MitochondrianHouse@reddit

What process is consuming the most CPU, specifically? Add the CPU Time column to Task Manager; it can make it easier to see long term.

[-]

dinominant@reddit

I've seen a Dell R730xd spontaneously slow to a crawl, as if the CPU were throttled to 100 MHz, yet all diagnostics showed everything was fine. It would take up to an hour to gracefully shut down.

The only change in metrics was that the power consumption increased to 100% on both power supplies for no apparent reason.

I still don't know what the root cause is, but it seems to impact long running systems (online for 3-6 months or more), and the only indicator is a massive sustained drop in performance.

[-]

ntuner@reddit (OP)

The host itself seems fine, only the VMs are slow. But nothing was changed or updated, which is odd.

[-]

Lets_Go_2_Smokes@reddit

Shut down the host, pull power, and drain all flea power. Leave it off for 30 seconds and turn it back on. Same issue?

[-]

ntuner@reddit (OP)

Hmm interesting, I have not tried this

[-]

Lets_Go_2_Smokes@reddit

We had a similar issue with a new Dell server. Can't find the article now but Dell sent us those steps and it solved it for us.

[-]

ntuner@reddit (OP)

Was it the same issue or something else?

[-]

Lets_Go_2_Smokes@reddit

The VMs became crazy slow. If I'm not mistaken, either the host or the VMs were capping at a certain MHz. iDRAC was crazy slow to respond also, plus a few other weird things. We did reboots, updates, etc. It was not until we drained all flea power that it went away. Could be different from yours, but worth a shot.

[-]

ntuner@reddit (OP)

I see. The host itself appears to work fine, iDRAC etc. The main issue is that CPU usage on the VM is high, which makes the VM slow.

[-]

Lets_Go_2_Smokes@reddit

Does the max frequency in the VM match the host, or match what you see on other hosts? Ours were capping at a low number. For example, the VM on one host showed something like 100 MHz, running at 100%, but when moved to another host it showed a normal GHz value.
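If you want to make that comparison repeatable across hosts, you could express the observed clock as a fraction of the rated clock. A minimal Python sketch: the function name, the 50% threshold, and the sample readings are all made up for illustration and do not come from Dell or VMware tooling.

```python
def is_frequency_capped(observed_mhz: float, rated_mhz: float, threshold: float = 0.5) -> bool:
    """Return True when the observed clock is below `threshold` of the rated clock.

    The 0.5 default is an arbitrary assumption; tune it to your environment.
    """
    if rated_mhz <= 0:
        raise ValueError("rated_mhz must be positive")
    return (observed_mhz / rated_mhz) < threshold


# Hypothetical readings: a VM reporting 100 MHz on a 2.8 GHz part vs. a healthy one.
print(is_frequency_capped(100, 2800))   # True
print(is_frequency_capped(2750, 2800))  # False
```

Feed it whatever the guest and the host each report as current and rated speed; a VM that flags on one host but not after a move points at the host rather than the guest.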

[-]

CPAtech@reddit

Sounds like firmware updates are needed.

[-]

cjcox4@reddit

Any type of VM, or Windows VMs?

[-]

ntuner@reddit (OP)

Windows VMs are all I have.

[-]

cjcox4@reddit

So, Windows 11 in particular (though this may be baked into earlier versions too) has a fairly easy "device driver" attack, and the mitigation can be especially painful on CPUs that don't meet the Windows 11 requirements, where, VM-wise, I think you pay at least a 30% performance penalty (maybe more). It might also happen even on compliant CPUs. The Windows (11?) Security feature you can "play with" disabling is Windows Security -> Device Security -> Core Isolation -> Memory Integrity. Change this to Disabled and retest.
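To check the current state across several guests without clicking through the UI, something like the Python sketch below might help. It assumes the setting is reflected in the `Enabled` DWORD under the `HypervisorEnforcedCodeIntegrity` scenario key; policy-managed machines may store it elsewhere, so treat a `None` result as "unknown", not "off". The function name is hypothetical.

```python
try:
    import winreg  # standard library, but only present on Windows
except ImportError:
    winreg = None

# Assumed location of the Memory Integrity (HVCI) toggle under HKLM.
HVCI_KEY = r"SYSTEM\CurrentControlSet\Control\DeviceGuard\Scenarios\HypervisorEnforcedCodeIntegrity"


def memory_integrity_enabled():
    """Return True/False from the 'Enabled' DWORD, or None if it cannot be read."""
    if winreg is None:
        return None  # not running on Windows
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, HVCI_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, "Enabled")
    except OSError:
        return None  # key or value missing, or access denied
    return bool(value)


print(memory_integrity_enabled())
```

Run it inside the guest on both the slow cluster and the good one, and compare the results before and after toggling the setting.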

[-]

modder9@reddit

You're thinking of Spectre/Meltdown and about a dozen other speculative execution CVEs.

[-]

cjcox4@reddit

I don't believe this one was associated with the old Spectre/Meltdown. This one is "newer"... but maybe it's just something missed by Microsoft (?)

I noticed the performance hit on multiple hypervisor types. One was running a "lesser" supported CPU (one of those arbitrarily supported enterprise CPUs that is pretty much the "same" as other CPUs that were deemed unsupported). So I figured it was due to the lack of the "commands" needed for proper performance of the mitigation. However, on a different hypervisor with a much newer CPU that is fully supported by Windows 11, I noticed the same huge performance hit.

[-]

modder9@reddit

I was just remediating these low/medium Tenable alerts yesterday. It's about 12+ vulns going back to 2017. The latest are Spectre/Meltdown, but they are solved by exactly what you described. In the WORST case, with very specific workloads, it can have a 30% CPU performance impact.

Actually, I have this in my recent notes still:

Recommended Settings 3: -

SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\FeatureSettingsOverrideMask: 0x00000003 (3)

SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\FeatureSettingsOverride: 0x00800048 (8388680)

CVEs Covered: CVE-2017-5715, CVE-2017-5753, CVE-2017-5754, CVE-2018-3615, CVE-2018-3620, CVE-2018-3639, CVE-2018-3646, CVE-2018-11091, CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, CVE-2019-11135, CVE-2022-0001

Note: Hyper-Threading enabled.

Note: Most protections enabled by default on clients.

Required combined mitigation for CVE-2022-0001.
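Since these values get pasted around in both hex and decimal, here is a small Python sanity check that the two notations in the notes above agree, and which bit positions are set. It only does arithmetic on the two numbers quoted above; it does not claim what each bit means, and the helper names are just for illustration.

```python
# The two DWORD values quoted in the notes above.
SETTINGS = {
    "FeatureSettingsOverrideMask": 0x00000003,
    "FeatureSettingsOverride": 0x00800048,
}


def format_dword(value: int) -> str:
    """Render a DWORD the way the notes do: zero-padded hex, then decimal."""
    return f"0x{value:08X} ({value})"


def set_bits(value: int) -> list:
    """Return the positions (0 = least significant) of the bits set in a 32-bit value."""
    return [bit for bit in range(32) if value & (1 << bit)]


for name, value in SETTINGS.items():
    print(name, format_dword(value), "bits set:", set_bits(value))
```

Comparing the set bits against what is actually in the registry on a slow guest and a healthy one is a quick way to spot a difference in mitigation settings.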

[-]

excitedsolutions@reddit

Compare the current metrics to the historical baseline - oh wait… no one does this. In all seriousness, it has to be something with parts that age, like storage, and it's much less likely to be CPU or RAM related. Assuming you don't see anything egregious on the host, I would investigate IOPS for storage and see if you can make an educated guess as to whether it is underperforming.
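For anyone who does keep a baseline, the comparison can be as simple as asking how many standard deviations the current reading sits from the historical mean. A rough Python sketch, with a hypothetical function name and invented latency numbers purely for illustration:

```python
from statistics import mean, stdev


def deviation_from_baseline(baseline, current):
    """Return how many baseline standard deviations `current` sits from the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    spread = stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variation")
    return (current - mean(baseline)) / spread


# Invented example: historical disk latency samples (ms) vs. a reading taken today.
baseline_latency_ms = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(deviation_from_baseline(baseline_latency_ms, 25.0))
```

Anything several deviations out is worth a closer look; a result near zero suggests that metric is behaving the way it always has.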

[-]

benuntu@reddit

I'd check the network and see if perhaps the adapters have a driver issue of some sort. Odd that it would be with both ESXi and Hyper-V but only on that one host. Does it work fine if a user in a VM just accesses the internet, or is it specifically with Explorer? If it's just Explorer, it sounds like a local networking issue with the config, the Windows driver, or the adapter firmware.

[-]

ntuner@reddit (OP)

It's a cluster of 5 hosts and all have the same issue. Moving the same VM to another cluster works fine. It doesn't seem to matter what the user does, the guest OS response is just laggy.

[-]

BlackV@reddit

"reinstalled esxi and even hyper v same issue with both hypervisors."

Then it's a VM issue, start there.

[-]

_Robert_Pulson@reddit

Deploy a very basic Windows Server OS guest VM with no apps installed and see if you get the same poor performance.

I would start with the OS level of the guest VM. Sounds like something changed there. I've run into issues where a new real-time scan policy magically scanned on every read/write disk IO, or a Windows update has a memory leak problem and just consumes all RAM...

If you have downtime at night or on the weekend, you could move your guest VMs and put a host in maintenance mode. Contact Dell to get their stress test ISO and run it. It can last 1-2 days or so. Otherwise, deploy Iometer on that host and do stress tests. This is only to confirm whether your host(s) can handle the workloads. Maybe you need to add more hosts to the cluster...

[-]

ntuner@reddit (OP)

Unfortunately support ended not long ago. The same VM works fine on other hosts/clusters. The same VM originally ran fine on this particular cluster.

[-]

_Robert_Pulson@reddit

I don't think it's your hosts. Look at the guest OS level

[-]

Some_Team9618@reddit

What’s the disk utilization look like? I’ve seen spikes happen and things slow to a crawl because things were waiting for I/O.

[-]

BowelEruption@reddit

System BIOS > System Profile Settings: set to high performance?

[-]

vermyx@reddit

How many vms do you have configured to use as many vcpus as you have cores on the hardware? Also it usually is recommended to add a vcpu when you get to around 80% cpu usage in a vm.

[-]

ntuner@reddit (OP)

1 right now. Nothing was changed or rebooted for at least 6 months prior, then one day the VMs were slow on this particular cluster.

[-]

vermyx@reddit

That VM will cause all VMs to hiccup due to how the ESXi scheduler works. They have probably always had this issue, but it wasn't noticed until a VM pegged a CPU, which causes scheduling conflicts.

[-]

ntuner@reddit (OP)

OK, could be, but so far the issue doesn't seem to be specific to one VM. Any VM is slow, even when it's the only VM running. The same VM moved to another cluster is fine.

[-]

vermyx@reddit

The "slowness" would be from scheduling CPU time for the VM that has all CPUs assigned. Since all CPUs are being used, the scheduler itself does not have a "free" CPU, so it is fighting with the VMs to get CPU time, which in turn causes the other VMs to wait in line until that one VM is done. It isn't a "could be"; it is slowing things down.
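To find VMs sized like this, you could compare each VM's vCPU count against the host's logical CPU count. A minimal Python sketch, assuming you already have the inventory as a simple name-to-vCPU mapping; the VM names, the counts, and the function name are invented for illustration:

```python
def oversized_vms(vm_vcpus: dict, host_logical_cpus: int) -> list:
    """Return names of VMs whose vCPU count meets or exceeds the host's logical CPU count."""
    if host_logical_cpus <= 0:
        raise ValueError("host_logical_cpus must be positive")
    return sorted(name for name, vcpus in vm_vcpus.items() if vcpus >= host_logical_cpus)


# Invented inventory: one VM has been given every logical CPU on a 64-thread host.
inventory = {"vm-app01": 4, "vm-sql01": 64, "vm-rds01": 8}
print(oversized_vms(inventory, host_logical_cpus=64))  # ['vm-sql01']
```

Any VM this returns is a candidate for trimming vCPUs before retesting.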

[-]

Helpjuice@reddit

What do the performance logs that you send to your SIEM say?

What do you mean by different drive? How many drives, what types of drives, what is the RAID controller being used, is it battery backed?

Are you properly making sure you have enough physical resources for the kernel to operate normally?

Are the drives failing?

What happens when you migrate to a different physical server?

[-]

ntuner@reddit (OP)

Nothing out of the ordinary, or at least nothing that means anything. This is a vSAN host with multiple drives. Same issue using vSAN storage, iSCSI, or local. Tried rebuilding the host without vSAN and outside of a cluster, no difference. No issues with the same VM in another cluster. iDRAC doesn't show any issues.

[-]

Helpjuice@reddit

OK, do you have all the metrics being forwarded to a different machine so you can graph them and visually see what is actually happening over time? This is not something you will be able to solve without metrics.

Also, what do you mean by "nothing out of the ordinary"? How much free memory, storage, and CPU is available to the vSAN host? Are the network links saturated, or are they large enough to handle the throughput? Are the network switches maxed out bandwidth- and throughput-wise?

[-]

R0B0t1C_Cucumber@reddit

You don't happen to have a snapshot running on it, do you?

[-]

seannyc3@reddit

Has virtualisation-based security (VBS) become enabled in the VMs recently?

[-]

ntuner@reddit (OP)

But it works fine in another cluster/host.

[-]

seannyc3@reddit

Have you checked that VBS stays enabled when it's on the other host?