Question about an older Dell Server mysterious reboot during power flicker...
Posted by thegreatcerebral@reddit | sysadmin | 19 comments
I have an old Dell R730. It runs vSphere and hosts our VMs. It has redundant PSUs, each going to a separate UPS, and both UPSs report "good batteries" because they passed a self test. They are not network monitored, so info out of them is pretty much non-existent.
Today we had the power literally flash like blinking your eyes. There were things plugged into the wall that rebooted but across the building nobody went down. Even initially the PCs didn't go down as they are on UPS units also.
All of a sudden it seems like the server rebooted (it did not SOUND like a reboot) and I did not press any button (although the BIOS may have it to resume power after it is restored).
I have vSphere telling me: Agent can't send heartbeats, host is down. I'm not sure how it logged that, since this is a solo system running ESXi and vSphere itself lives in a VM on top of it.
I am in iDRAC now, and looking at the logs I see nothing about power after March 9th. Then today I see "System CPU Resetting":
Detailed Description:
System is performing a CPU reset because of system power off, power on or a warm reset like CTRL-ALT-DEL.
There is no keyboard attached to the server. Nobody was in the room except me, and I did not touch it: when I went in to check, the system had lights on etc., so I left it alone.
I have SYS1003, SYS1001, SYS1000, and then a SYS1003:
SYS1003 - System CPU Resetting
SYS1001 - System is turning off
SYS1000 - System is turning on
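To make the "nothing about power this time" observation checkable, here is a minimal Python sketch that scans exported log entries for power-supply events. The `events` list just mirrors the four entries above in the order listed. The `PSU` and `RDU` ID prefixes are an assumption about how power-supply and redundancy entries are labeled, so compare them against your own older PSU entries before relying on this.

```python
from typing import Iterable, List, Tuple

Event = Tuple[str, str]  # (message_id, message)

# The four entries from the log above, in the order they are listed.
events: List[Event] = [
    ("SYS1003", "System CPU Resetting"),
    ("SYS1001", "System is turning off"),
    ("SYS1000", "System is turning on"),
    ("SYS1003", "System CPU Resetting"),
]

def has_power_supply_event(
    entries: Iterable[Event],
    prefixes: Tuple[str, ...] = ("PSU", "RDU"),  # assumed prefixes, not confirmed
) -> bool:
    """Return True if any entry's message ID starts with one of the prefixes."""
    return any(msg_id.startswith(prefixes) for msg_id, _ in entries)

if __name__ == "__main__":
    if has_power_supply_event(events):
        print("Power-supply events were logged around the reset.")
    else:
        print("Reset logged without any power-supply events.")
```

If this reports no power-supply events, that matches what I am seeing: the host reset, but the PSUs never logged a loss of input.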
The UPSs were always on as well. They did not have any errors on them and the batteries were still pegged at 100%.
The only thing I can gather is that the power dropped so fast that it somehow triggered the system to reboot itself?!? I do not have PowerChute or any similar software enabled, and there is no tie-in to any UPS at all other than power cables.
I'm honestly baffled. If the unit had just died from power loss, I would expect to see something other than what I see in the iDRAC logs. The older entries, from when I corrected an issue with where one of the PSUs was plugged in, show PSU1 power loss, redundancy lost, etc., but there is nothing like that this time.
By my account it should have never gone down.
Anyone come across anything like that before?
stufforstuff@reddit
You have unmanaged old UPS's - what error message did you think they would show?
er1cAtWork2@reddit
Didn’t Dell have a run of servers and desktops that shipped with bad capacitors? They would swell up and any power fluctuations would cause a reboot… That was a long time ago….
stufforstuff@reddit
So were R730's
Arudinne@reddit
The only capacitor plague I know about was from the late 90s or early 2000s. R730s started shipping in 2014.
That being said, the server is ~10 years old or older (depending on when it was purchased), so it's possible some of the components have degraded due to age.
thegreatcerebral@reddit (OP)
Interesting. I am not sure. I've only been here just over 2.5 years and was an HP guy prior.
Sweet-Sale-7303@reddit
You probably need a pure sine wave UPS.
thegreatcerebral@reddit (OP)
Could be. We are upgrading soon anyway but today marks the start of the hurricane season and we are bound to get this a lot even before we can upgrade.
Sweet-Sale-7303@reddit
How old are the batteries in the ups?
thegreatcerebral@reddit (OP)
I'm not sure what the APC 3000 is. It doesn't say online. The front panel has an LCD lit up and it is underneath a sine wave icon. Not sure what that means but at least it has that.
CharacterUse@reddit
What kind of UPS do you have? Offline UPSs pass through mains power to the device directly, Line Interactive pass through mains but with voltage regulation on the output, only Online UPSs generate clean AC and don't pass any mains through.
Offline UPSs can pass through a large voltage spike or drop before triggering the switch over to battery, and I've seen some Line Interactive UPSs allow quite large voltage swings before the AVR kicks in. Combine either of those with an old power supply with tired capacitors and you could trigger a reset before the UPS kicks in.
Without having info from the UPS it's hard to say more than that.
Ideally, replace your UPS with an online one or one with a better AVR, or put an AVR upstream of the UPS, and inspect the Dell PSUs for bloated caps.
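The three topologies described above can be summarized in a small lookup. This is only a sketch of the distinction as stated in this comment, not an electrical model; the `Topology` class and `transient_can_reach_load` function are names made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Topology:
    name: str
    passes_mains: bool      # load normally runs on mains passed through the UPS
    regulates_output: bool  # AVR, or output fully regenerated from the inverter

# Simplified summary of the description above.
TOPOLOGIES = {
    "offline": Topology("offline", passes_mains=True, regulates_output=False),
    "line-interactive": Topology("line-interactive", passes_mains=True, regulates_output=True),
    "online": Topology("online", passes_mains=False, regulates_output=True),
}

def transient_can_reach_load(kind: str) -> bool:
    """A fast mains transient can only reach the load if mains is passed through."""
    return TOPOLOGIES[kind].passes_mains

if __name__ == "__main__":
    for kind in TOPOLOGIES:
        print(f"{kind}: transient can reach load = {transient_can_reach_load(kind)}")
```

Only the online entry isolates the load from mains, which is why it is the safest choice in front of an old PSU.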
xendr0me@reddit
Technically speaking this is correct. Are we talking about a "UPS" or a "battery backup"? Those are two totally different things. From a technical standpoint, I am assuming you mean battery backup.
thegreatcerebral@reddit (OP)
I'll have to look at what each says. I have two separate units.
Both are UPSs. One is an APC 3000 UPS. The other is a Tripp Lite SMART1500LCD. The APC shows 6% load and the TL shows about 16% load. The TL shows 45 min runtime. I didn't look at the APC's runtime, but with a 3000 at 6% load and the battery showing 100%, I would expect it to show over 100 minutes if I looked through the menu.
CharacterUse@reddit
Both of those are line-interactive. There might have been a blip fast enough to get through the AVR and surge protection, especially if the PSUs are old in the machine (making it more sensitive).
I don't know about the Tripp Lite, but the APC can be monitored. You should always monitor and log UPSes if possible, even if just locally, specifically to diagnose this kind of thing, and have them set up to auto shut down the machine if necessary.
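As a rough idea of what local logging could look like, here is a minimal Python sketch. It assumes the APC is cabled to some always-on machine running apcupsd, so that `apcaccess status` prints `KEY : VALUE` lines. That setup is an assumption, since nothing is connected today. The `SAMPLE` text is made-up illustrative data, not output from these units.

```python
import subprocess
from datetime import datetime
from typing import Dict

# Made-up sample in the KEY : VALUE style that apcaccess prints.
SAMPLE = """\
STATUS   : ONLINE
LINEV    : 121.0 Volts
LOADPCT  : 6.0 Percent
BCHARGE  : 100.0 Percent
"""

def parse_status(text: str) -> Dict[str, str]:
    """Split each 'KEY : VALUE' line at the first colon into a dict."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status

def log_line(status: Dict[str, str], timestamp: str) -> str:
    """Build one CSV row: timestamp, UPS state, input voltage, load."""
    fields = ("STATUS", "LINEV", "LOADPCT")
    return ",".join([timestamp] + [status.get(f, "") for f in fields])

def read_status() -> str:
    """Query the local apcupsd daemon (requires apcupsd to be installed)."""
    return subprocess.run(
        ["apcaccess", "status"], capture_output=True, text=True, check=True
    ).stdout

if __name__ == "__main__":
    # Uses the sample text; swap in read_status() on a host running apcupsd.
    print(log_line(parse_status(SAMPLE), datetime.now().isoformat(timespec="seconds")))
```

Run on a schedule and appended to a file, rows like this give a record of whether the UPS saw the input voltage sag or switched to battery at the moment of a reset.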
thegreatcerebral@reddit (OP)
I do not believe it is the "active" one, which is what I believe those are called. It is the kind that switches over to battery. It typically is fast enough to handle it, though, or else they wouldn't sell them lol. I was wondering though if it was a blip fast enough that it triggered the system into thinking someone pressed an old school reset switch. lol.
CharacterUse@reddit
Could be such a blip, yes.
xendr0me@reddit
Is the UPS connected via USB to the server? Maybe something is signaling it to restart? Did the uptime in the OS show reset?
thegreatcerebral@reddit (OP)
No, nothing from UPS to server save for power.
ESXi does show locally the system coming back online. I cannot find an "uptime" anywhere that I am looking but I do show it coming back "on".
xendr0me@reddit
Is a Windows-based guest OS running in it? Task Manager > Performance > CPU shows uptime, or from a cmd prompt: systeminfo | find "System Boot Time"
thegreatcerebral@reddit (OP)
Yes. I already know those went down and came up. I wanted to make sure I knew if ESXi went down itself or the VMs rebooted. Both of those matter.