Question about an older Dell Server mysterious reboot during power flicker...

Posted by thegreatcerebral@reddit | sysadmin | 19 comments

I have an old Dell R730. It is running vSphere and hosts our VMs. It has redundant PSUs, each going to a separate UPS. Both UPSs report "good batteries", as they have passed a self test. They are not network monitored, so info out of them is pretty much non-existent.

Today the power flickered, literally as fast as blinking your eyes. Things plugged straight into the wall rebooted, but nobody across the building went down. The PCs didn't go down either, as they are also on UPS units.

All of a sudden it seems the server rebooted (it did not SOUND like a reboot). I did not press any button, although the BIOS may be set to power back on after power is restored.

I have vSphere telling me: "Agent can't send heartbeats, host is down." I'm not sure how it logged that, since this is a solo system running ESXi, and vSphere itself lives in a VM on top of it.

I am in iDRAC now, and looking at the logs I see nothing about power after March 9th. Then today it shows "System CPU Resetting":
Detailed Description:

System is performing a CPU reset because of system power off, power on or a warm reset like CTRL-ALT-DEL.

There is no keyboard attached to the server. Nobody was in the room except me, and when I went in to check, the system had lights on etc., so I didn't touch it.

I have SYS1003, SYS1001, SYS1000, and then a SYS1003:

SYS1003 - System CPU Resetting
SYS1001 - System is turning off
SYS1000 - System is turning on

The UPSs were always on as well. They did not have any errors on them and the batteries were still pegged at 100%.

The only thing I can gather is that the power dropped so fast that it somehow triggered the system to reboot itself?!? I do not have PowerChute or any similar software enabled, and there is no tie-in to any UPS at all other than power cables.

I'm honestly baffled. If the unit had just died from power loss, I would expect to see something other than what I see in the iDRAC logs. From an earlier incident, when I corrected an issue with where one of the PSUs was plugged in, the logs show PSU1 power loss, redundancy lost, etc., but there is nothing like that this time.
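For anyone wanting to reproduce the log reasoning, here is a rough Python sketch. It takes the message IDs quoted in this post, typed in by hand, and checks two things: whether the sequence shows a full power off/on, and whether any power-supply or redundancy entries sit alongside it. The `summarize` helper is hypothetical, and the `PSU`/`RDU` ID prefixes are an assumption about how such entries are labeled, so treat this as an illustration rather than a diagnostic tool.

```python
# Message IDs and descriptions as quoted in the post, in the order listed there.
EVENTS = [
    ("SYS1003", "System CPU Resetting"),
    ("SYS1001", "System is turning off"),
    ("SYS1000", "System is turning on"),
    ("SYS1003", "System CPU Resetting"),
]

# Assumption: power-supply and redundancy log entries use these ID prefixes.
POWER_PREFIXES = ("PSU", "RDU")


def summarize(events):
    """Summarize a list of (message_id, description) log entries."""
    ids = [msg_id for msg_id, _ in events]
    return {
        # Both a "turning off" and a "turning on" entry are present.
        "power_cycled": "SYS1001" in ids and "SYS1000" in ids,
        # Number of CPU reset entries.
        "cpu_resets": ids.count("SYS1003"),
        # Any entries that look like PSU input loss or lost redundancy.
        "psu_events": [i for i in ids if i.startswith(POWER_PREFIXES)],
    }


if __name__ == "__main__":
    print(summarize(EVENTS))
```

With the entries from this post, the summary shows a power cycle and two CPU resets but an empty `psu_events` list, which is exactly the mismatch described above.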

By my account it should never have gone down.

Anyone come across anything like that before?