What's your worst "horrible coincidence" experience?
Posted by joshuamarius@reddit | sysadmin | View on Reddit | 151 comments
I was transitioning a client with two locations to brand new Firewalls. I remote into Site A's Firewall and copy the config to the new Firewall locally (which I have in my home office). I then do the same with Site B. However, when I click Logout on the Firewall for Site B...Site A's firewall goes down completely! I then check my remote management app and I can see ALL workstations and Servers offline - mind you this is a super busy surgery center, which hosts EHR software and a phone system for Site B...so I am completely freaking out. To top it off, 10 minutes passed and nothing was coming back online 😱
I review my steps...check my browser history...I'm going crazy..."What did I do or click on...what am I missing??". It was 2 AM and I was dreading the possibility of having to drive down there. After about 15 mins and nothing coming up, I decided to check Down Detector...and also tried to remote into another client's Firewall, luckily, in the same zip code; it was also offline.
What happened? Literally at the same time I clicked "Logout", Spectrum had a massive outage in the area that lasted until 5 AM. Down detector had 300+ reports. That feeling of your stomach sinking...horrible!
So what was your worst horrible coincidence as a sysadmin? I know some of you have crazy stories!
timconradinc@reddit
On Linux, the 'killall' command will kill all the processes of a given name. On Solaris, 'killall' will kill all processes.
After the second outage, I realized the difference.
PercyFlage@reddit
Likewise with AIX, as I discovered to my chagrin.
SperatiParati@reddit
I blundered into this one myself!
Solaris is long since retired at work, but I still do "pkill" rather than "killall" on Linux systems.
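Roughly the difference, from memory (Linux killall comes from psmisc, pkill from procps; the Solaris one lives in /usr/sbin and is there for the shutdown scripts) - process name made up:

    # Linux (psmisc): signals only processes matching the given name
    killall httpd
    # pkill does the same job, and the name doesn't collide with anything scarier
    pkill -x httpd
    # Solaris: killall with no arguments signals every process it can reach --
    # it's meant to be called from the SysV shutdown sequence, not by hand
    killall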
Burgergold@reddit
It reminds me of a coworker who did a quiesce on z/OS
turisto@reddit
somehow, both of those sound dumb in their own way
Valheru78@reddit
Well, the Linux version is quite useful often, like I want to kill all threads of a certain program or I have replaced program X with a new version and want to make sure all old versions are no longer running.
Octa_vian@reddit
If you try to be fancy and iterate through a directory tree to remove files and use the output of "ls -a", you end up at /
And this was done on multiple systems. Our guy saw more and more systems turn red on the monitoring dashboards.
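The trap, roughly sketched (don't run this) - "ls -a" includes "." and "..", so a hand-rolled recursive cleanup climbs right out of the tree you started in:

    # recursing on "ls -a" output follows ".." up toward /
    nuke() {
        for f in $(ls -a "$1"); do
            if [ -d "$1/$f" ]; then
                nuke "$1/$f"        # "." loops forever, ".." walks up the tree
            else
                rm -f "$1/$f"
            fi
        done
    }
    # the boring version that stays inside the tree you named
    find /path/to/tree -type f -delete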
timconradinc@reddit
I had a manager who failed his RHCE exam because he had the habit of doing ! (or whatever it was) in the shell - which would recall the last command and run it. The command got recalled and run halfway through the test, and it was rm -rf /
taint3d@reddit
Why/how was the last command rm -rf /?
bussche@reddit
!123 runs whatever command has the number 123 in the bash history.
He probably typo'd the number.
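A couple of guard rails for that, if I remember the syntax right:

    history | tail -5      # check the numbers before trusting them
    !123:p                 # the :p modifier prints the expansion instead of running it
    set +H                 # or turn history expansion off entirely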
Fox_and_Otter@reddit
There would not be a bash history if rm -rf / was run
Moocha@reddit
As long as that bash process is still running, the runtime history will be available -- the one loaded from .bash_history at process init, plus whatever else was executed since it was started.
Fox_and_Otter@reddit
I mean sure, you can imagine a scenario where rm -rf / is in bash history, but running it from !! or !# isn't the real issue. The issue is having run the command in the first place.
Moocha@reddit
Absolutely, that's a whole different layer of WTF :)
Also, this suggests it was a rather ancient system; on modern coreutils, you'd need --no-preserve-root for that to do anything anyway. Maybe it was rm -rf /* or something, but that's yet another level of WTF.
brianozm@reddit
Actually for some time now, "rm -rf /" will ask you if you really want to render your server inoperative.
jimicus@reddit
There would be if rm -rf ./ was run. Only unlike last time, he was in the root directory.
bussche@reddit
Oh yeah, good point lol
Stiefeljunge@reddit
Maybe / was set in a variable set in env
timconradinc@reddit
As I think about it more, it was effectively rm -rf / - not actually rm -rf /, I think it was a command, something like find . -exec rm {}, except his cwd was / and not whatever other directory.
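A rough sketch of the safer habit (paths and pattern made up):

    pwd                                                  # look before you leap
    # name the tree explicitly instead of trusting whatever "." happens to be,
    # and print the list before letting -exec anywhere near rm
    find /var/tmp/build -type f -name '*.tmp' -print
    find /var/tmp/build -type f -name '*.tmp' -exec rm -- {} +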
Generico300@reddit
Ok, but if rm -rf / is in the command history, pretty sure the system is already hosed. Doesn't the command have to have actually run to show up in history?
Fallingdamage@reddit
Course, a sentient, intelligent developer would have looked at killall and thought "I should probably put a confirmation prompt on that one."
meditonsin@reddit
It's part of the SystemV shutdown process and I'm decently sure predates the version of killall that only kills specific processes. Command line utilities from that era often do not have any guard rails.
jimicus@reddit
That's not really how Unix works, though. It assumes that if you tell it you want to do something - you want to do something.
(And just to confuse things - killall does something different in Solaris versus Linux)
NotMedicine420@reddit
LMAO wtf.
Unnamed-3891@reddit
ELI5 what would be the point of actually killing ALL processes? Is it an unconventional way of halting a system? I am assuming that also kills init.
Drywesi@reddit
maybe part of shutdown/reboot prep?
timconradinc@reddit
It was how Solaris did the init scripts for shutdown.
AZSystems@reddit
Learning in the fire, you are now the Solaris SEM. 😂
timconradinc@reddit
Dealing with bugs in bash's emulation of Solaris's Bourne shell was annoying.
TheGoodspeed15@reddit
Had a bad cable so I went to replace the cable by opening up a brand new package. It still didn't work. So I spent another 2 hours troubleshooting.
The problem ended up being a bad cable. Never in my life have I ever opened up a cable from a package and had that cable be defective
elitexero@reddit
Back in high school, a buddy bought some RAM from a local shop. Didn't work.
Brought it back for an exchange, brought it home, didn't work. When he brought it back again they figured he was doing something wrong and breaking it so they only agreed to an exchange on the terms that they opened it in store and tested it with the RAM tester. He agreed.
They opened a new pack, tested it, dead. Another pack, dead. Half the shipment was all dead RAM.
greenie4242@reddit
I've had that happen with batteries. One night my mates were all pumped up to play a few group Wii games on the rare night we were all free at the same time. Unfortunately only one controller out of four worked and there were no spare batteries in the house. No biggie, my mate walked across the road and bought an 8 pack of AA batteries from the convenience store. None of them worked though.
All the controllers worked when we swapped in the single pair of good batteries so all the new batteries must have been old or faulty. The person who sold it refused to swap them for another pack without a receipt. Mate bought them from the same guy 15 minutes beforehand and wasn't given a receipt, then the manager came over and made a big deal about it, mate just wanted to play Wii so just ignored the drama and bought another pack.
The second pack was also dead. Walked back and the manager accused my mate of lying and trying to return used batteries etc, mate came back upset because he just wasn't in the mood to deal with all that crap. We'd all been drinking so nobody could drive elsewhere to buy batteries.
Eventually I walked over with all four Wii remotes (one working) and said I wanted to buy batteries but only if I could test them first. I demonstrated all four Wii remotes working with our single pair of working AAs to show I wasn't lying, and asked if the manager could put fresh ones into all the remotes for me, then I was happy to hand over the cash.
Manager opened I think six packs of batteries hanging behind the counter and they were all dead. I was dreading going back to my mates to explain the bad news but luckily some other employee appeared and said "I think there's another box out the back". Thankfully the batteries from the new carton of fresh stock worked immediately.
Manager seemed pissed about it but eventually refunded both dead packs of batteries my mate had purchased.
Then we discovered the clock battery was dead in my mate's Wii so we couldn't set the time and date which meant we couldn't save anything, so we all ran around my mate's house looking for things that might have 2032 batteries. Also needed to find a screwdriver small enough to open the battery slot, think I used the tip of a knife. Lots of failures from completely flat RGB light and Bluetooth speaker remote controls, but eventual success using the battery from their car's spare keyfob! Found out afterwards that it might have worked with no battery installed at all, but having the completely flat battery in it caused some kind of issues, it might have gone reverse voltage or something.
Of course it happened on the one rare weekend when I hadn't packed my multimeter or Swiss Army knife in my backpack, deliberately leaving them at the office so they wouldn't remind me about work over the weekend.
elitexero@reddit
Sometimes the universe just says not today.
joshuamarius@reddit (OP)
That's horrible! Cat5/6? I will tell you I have come across specific cables not working with specific brands of network cards no matter how hard I tried. One of those weird ones...
TheGoodspeed15@reddit
It was a cat5e I believe. My boss thought it was hilarious.
Valheru78@reddit
I've had this happen several times, I've learned to always test with a third or even fourth cable.
Although to be fair these days I mostly just carry a cable tester and test cables, even when they are new from the packaging.
WendoNZ@reddit
We've just had exactly this with a fibre patch lead, you really do start questioning your sanity
Vesalii@reddit
We had a batch of patch cables with a few suss in them. My colleague to this day refuses to buy those again. It's all Digitus now.
Swiftrun57@reddit
Talk about winning the worst lottery ever lmao.
TheGoodspeed15@reddit
Yeah my boss thought it was hilarious 😂
I'm like it doesn't make sense! It got to the point that I thought people were just fucking with me. Like every time I would test it I thought somebody was turning something off just to play a prank on me or something lol
tremens@reddit
Probably not the worst but my most recent...Mandated to push out a CrowdStrike update to about 140 industrial thin clients across the plant. These are all write filtered, locked down, critical to operations of the machines, etc. Tested about a dozen of the less critical ones, no issues, so next shutdown I decide to rip it off.
So over the weekend I fire up the laptop, remote into my office PC, fire up PDQ and push out the disable write filter and reboot script, go wander off and wait for them to come back up. The ones that failed the script deployment I pruned off the list; not unusual for some to be powered off or whatever. Wait about 10-15 minutes and yolo'd out the update to the remaining 120 or so.
About 90 go through but about 30 fail. Again, no big deal. Happens, and all the errors were just like product already installed or whatever, so in some cases likely they'd actually already gotten the update and didn't need it. I spot check about a dozen of the successes and then I start checking the fails.
Every. Single. One. Is entirely off the network. I can't remote to it, I can't ping it, nothing. A cold sweat starts to form as I realize what my Sunday.. and Monday.. and Tuesday... are going to look like. I'm thinking all ~30 have bombed and failed to reboot. They'd probably have to be yanked and/or PXE booted and reimaged - a massive impact to production across the board.
What's the commonality? Let me fire up our auditing tools and see if there's something alike about all these, Windows Update level or software or something...and the tools won't load either... Wait. Oh. I can't get to anything.
My firewall session just happened to have expired. That's all. I was just being blocked by segmentation, but it expired right in that exact moment when I switched from working down my "Succeeded" list checks to my "Failed" list checks.
PumpkinNo4869@reddit
Two major parts of a key manufacturing process needed to be in communication near constantly (or at least on a set interval of a few dozen milliseconds; the engineer wasn't that clear, and this was not a network I was responsible for or really knew anything about, but I helped regardless). Ran around and checked various panels containing PLC equipment, and upon opening one of the doors the connection resumed, and stayed working for like 5 mins. We closed the door, and it dropped. Open the door, good to go. Utterly convinced it was something being pinched in that panel, but then it stopped working entirely independent of the angle of the panel door. Eventually found a burned-out RJ45 on a non-PoE switch in another data closet. It was unplugged to be traced and repaired, but with it unplugged everything was fine. Plugging it back in after replacing the end and everything died. We still have no idea where it goes but I imagine another vendor accidentally created some loopback and/or sent PoE down something that did not like it.
DariusWolfe@reddit
Had an outage caused by an AC tech hitting a switch he shouldn't have touched. When we came back up, several jump boxes were still unavailable, so I went in one by one with a kb/m and a monitor to find that all of the non-working boxes had static IPs without a reservation. I set them all to DHCP since there was no specific reason for static IP, and figured everyone was good.
Except it wasn't. Got back to my desk and even MORE machines were unavailable. Went back in, everything looked fine, but the whole batch was down.
Spent an hour, called in an external network tech, only to find that I'd jogged a loose Ethernet cable so that it still looked like it was plugged in.
bobs143@reddit
Made a change to some user accounts in AD. Five minutes afterwards I got a call that computers are down.
Turns out there was a power flicker so a few PC/laptops shut down. Others didn't so they rebooted and they were back to normal.
I was sweating for a bit.
ncc74656m@reddit
A number of times I've encountered Microsoft outages after making the "could lock you out of your tenant" kinds of changes. Like, "No no, bowels, not now, I need to try to fix this if I can before you can empty completely!"
thecomputerguy7@reddit
I rebooted a VM over SSH and the second I hit enter, the whole building lost power.
I just sat there and thought “there’s absolutely no way…” since it was just a SMTP relay.
PaleoSpeedwagon@reddit
You reciting the OSI layers in your head like they're Hail Marys
thecomputerguy7@reddit
“Is there any possible way this was my fault” was one of the things that popped in my head
joshuamarius@reddit (OP)
And even though you say "there's absolutely no way" you still doubt yourself 💔 That's the worst feeling.
thecomputerguy7@reddit
Yup. I saw “client sent disconnect” in my terminal window for a split second and then everything went dark for roughly 10-15 seconds. Of course it felt like hours though.
Server room was on its own UPS, and I was on my laptop with a docking station so I was never “down” but it still made me panic.
reelieuglie@reddit
Not tech, but I worked at a pool on the east coast a while back. It was my second week.
I open up one of the influent pipes for a water test, take the sample, and shut the pipe.
Immediately, everything in the room starts violently rocking and shaking, including the 700 gallon vat of chlorine. I thought I broke something and was going to be explosively bleached.
I ran outside to find out it was only an earthquake.
Igot1forya@reddit
It always seems to happen that the VPN max logon expiration timer matches perfectly with our maintenance windows. Just as we kick off a critical upgrade or launch event everything drops and I piss myself. Every single time.
A_Curious_Cockroach@reddit
Got called to do a restore at around 2am. While I'm getting out of bed and logging in, the guy calls back in like 2 min. I'm like, yeah dumbass, give me time to actually log in. It's not the helpdesk. It's the storage team, who also owned backups. They hadn't renewed whatever application we were using at the time for backups, so I couldn't actually do the restore. The storage guy was up doing some change and he saw in the chat some other guy telling the helpdesk to call the on-call because his RHEL upgrade had shit the bed and needed a restore.
All was not lost, luckily - backups ran that night, and while we could not use the restore function on a VM, we could copy a backed-up datastore. So we just did that and added it back to the VC, then copied the VM from that datastore from last night to the regular production datastore, then added the VM to inventory. Time consuming but not that bad.
To top it off, after we got the restore done, the linux admin let us know he was patching a server he forgot to patch earlier in the week, so he didn't have a ticket, change, nothing. Thought he could just hop on, patch it, and be done. Production server as well, so he was shitting bricks the whole time. After we got it restored he didn't even patch it - he took it as a sign and opened a change to do it next week.
Even funnier...the storage team apparently still did not immediately renew the backup application, because a few weeks later whoever was on call called me asking how I did the whole datastore copy/move stuff cause he had to do it.
SimplifyAndAddCoffee@reddit
The one and only time I tried to join an early morning all-hands meeting from bed because I was too tired to get up, my cat turned on my phone's camera on the nightstand without my knowing and the CIO saw it.
elitexero@reddit
That's the work equivalent of someone walking in on your comedy movie during the 8 second sex scene.
PaleoSpeedwagon@reddit
The gasp I just gasped
mister_wizard@reddit
i had a guy join a morning call like this in bed and didn't realize his camera was on. His bedroom was very cluttered (think hoarders) and he had what looked like a damp towel next to his bed he used to wipe his face a few times.
LabDazzling6315@reddit
Half my motivation for joining meetings is the chance to witness gloriously embarrassing moments like this.
R3luctant@reddit
Totaled my car on my way to work the day the VM cert expired.
avlas@reddit
The sysadmin of my company got two flat tires while driving to the office on the day of the Cloudflare big DNS poopoo last year. It was a rough morning for everyone involved
Resident_Role_2815@reddit
That's the type of thing that happens to techs who read r/sysadmin before leaving for work and see the OMG ITS ALL FUCKED post from '30 minutes ago' with 400 comments.
Sorry boss, what can you do?!
avlas@reddit
That would have been a great explanation lol
Except the poor guy is actually one of the owners… He was trying to figure out why DNS was fucked, while sitting on a sidewalk waiting for the tow truck, laptop on knees and phone hotspot on.
As I was saying… rough morning
guitpick@reddit
Tire technician's notes: "It's always DNS."
InstrumentCombustion@reddit
Which was worse, the cert renewal or the crash?
R3luctant@reddit
After I climbed out of the ditch a cop gave me a ride to a gas station and I called my boss saying I wasn't going to be in, when I was informed that the entire org was calling in asking why they couldn't log in. Ended up ubering to the Avis lot to rent a car, and by the time I got to work my coworkers had gotten things taken care of. Shit morning all around.
matroosoft@reddit
"Honestly, this car accident wasn't too bad after all!"
fnordhole@reddit
5/7. Recommend.
TheLexoPlexx@reddit
Perfect score
matroosoft@reddit
At least it wasn't DNS!
ms6615@reddit
One time I made a change to the phone system and minutes later we were told the water supply for our 32 story high rise had shut off unexpectedly and we had to evacuate and I somehow still for a moment thought it could have been me
Secret_Account07@reddit
I mean, unless the RCA says otherwise can we really know for sure?
itishowitisanditbad@reddit
The old water pump was set up as a phone that needed to constantly ring to be powered.
You did ruin it.
aaiceman@reddit
Bob set it that way so the POTS line would show in use and not get disconnected. Bob also retired 7 years ago.
Fadore@reddit
I was hired as the single onsite IT resource at a company that had an MSP, working alongside the MSP. I told them for years that their on premise VOIP system was dated and a major risk of failing, but they didn't want to spend on a new system.
One day they decided they wanted to eliminate my position and go back to just the MSP. The day after I was let go the phone system went down. Apparently, my old boss was on the phone with the MSP for half the day trying to figure out if I was able to get back in after my accounts had been locked out...
I laughed when I found out they thought I'd done anything. Wonder how much it cost them to outsource and expedite a replacement system. I'm not a malicious person - being right was all the satisfaction I needed.
CarrotBusiness2380@reddit
Not tech, but when I was in college I worked on the grounds crew. I was sent out to spray a wasp nest above the entrance to a building. The nest was 15 feet up or so. I sprayed it real good, then as I turned around there was a boom and the nearby windows shattered.
It turned out to be a gas leak in a nearby house finding a spark, but I was convinced for a moment I had done it with the wasp spray.
joshuamarius@reddit (OP)
LOL! That happens!
Secret_Account07@reddit
I created a ticket that powered down a Windows server. We have custom actions for everything, including restarts, but nothing to cmd power off a VM. I had a Windows server that was up but couldn't RDP, WMI, remote exec, etc etc etc, but it was still talking with our mgmt software, so I knew the client could retrieve the action and shut down. I think it was compromised, as SecOps requested it.
Anyways I created a new simple cmd to hard shut down. Then deployed it to just that server. Walked downstairs and my phone is blowing up with P1s showing servers down. I'm thinking "FUCK, I PUSHED IT TO THE WHOLE ENVIRONMENT! THOUSANDS OF SERVERS!" I ran back to my desk and was going to stop the action hoping it hadn't deployed everywhere. BUT FUCK, I CAN'T REACH OUR TERMINAL SERVER TO GET TO CONSOLE.
I'm about to go tell my manager. I'm devastated. This would have sooooo many people upset. Thousands. Then someone said everything's back up. I'm thinking "How? It wasn't a restart cmd. And you'd have to manually power back on physical servers, PowerCLI the VMs...ton of work."
Anyways there was a switch that went down and blocked only our OOB network. So mgmt tools were down, but not actual prod servers.
I was absolutely drenched in sweat and went to bathroom to clean myself up. I never told anyone, until now.
joshuamarius@reddit (OP)
Thanks for sharing that ✌🏻😁
nousername1244@reddit
Nothing like a perfectly timed outage to make you question your entire career for 15 minutes straight.
masheduppotato@reddit
Did a full network rewire and clean up for a very popular and famous cathedral years ago. Worked from like 9pm till 4am getting everything perfect and tested. Drive a little over an hour back home. Jump into bed around 6am to get some sleep before heading into the office.
Around 6:30 I get woken up by a call from the cathedral. The network is being weird. I can't access the file shares. Then another call comes in from the business director saying something seems off with the network.
I get dressed, grab a coffee and head back into the city, now in traffic. I finally get there around 8:30 and start investigating. Everything seems in order in the server room. Go to the network area and wire in with my laptop and I pick up a strange address. Start walking into the business area and checking computers that are acting up.
All have an IP on a different subnet. Decide I’m going to try going to the gateway IP and hit a login page for a home router. Go find the business director and ask him if any new priest had arrived.
He tells me that one arrived last night and had woken up early to get some work done, and asks me how I knew. I ask for his room and head up.
Get in the room and I see a router plugged into the network and he's using it as an AP, but the thing is pumping out IPs as leases expire. I unplug it and within a few minutes everything is back to normal. I then ask him why he did that and he tells me so he can have WiFi.
I get him on the priests network and take the router with me telling him he can have it back when he leaves and drop it off to the business director before driving back home to try and sleep.
Just my dumb luck that all that had to happen at the same time.
anotherkeebler@reddit
We were in the middle of a production database migration from Alphas to Linux/x86 and right when I pushed the new DB paths to the app servers the command hung and suddenly we could reach absolutely nothing. Turns out the data center itself had a power failure
Master-IT-All@reddit
Anytime I'm onsite, someone inevitably brings up a problem with a printer.
RandoReddit16@reddit
100% printers are always not working... I've been in IT 10ish years and printers are still the thing I hate the most.
joshuamarius@reddit (OP)
It's gotten worse with all these web based apps that have in-between drivers/apps for both printers and scanners.
junkhacker@reddit
I'm not sure that counts as a coincidence. Printers having problems at any time has a slightly lower probability than the odds of water being wet.
joshuamarius@reddit (OP)
Correct. Now if the printer problem was that it keeps printing random pictures of the wife...🤓
timbotheny26@reddit
I got anxious just from reading that.
joshuamarius@reddit (OP)
😂😂 Imagine how I felt... Started at the top of my neck and went down my spine. It's just a horrible feeling.
Cipher-i-entity@reddit
We set up a Starlink as a redundant emergency backup connection. A few weeks later we had an ISP outage on our main line. I was an entry-level network admin at my first real job, finally given my first critical task, which was to configure our firewall to fail over to the Starlink. Didn't work, but no biggie because, as luck would have it, the ISP barely seconds later said their circuit was back up. I go to undo my changes to revert back to our main line, didn't work.
I tried using the Starlink circuit again, didn't work. Eventually I checked Downdetector and saw there was a huge Starlink outage. And a truck had crashed into the pole carrying our ISP's main circuit to our site, which caused a second outage right after the first one got fixed. This was the first time I experienced an outage and I was terrified it was entirely my fault. I should've bought a lottery ticket that day.
intoned@reddit
Long time ago... working in an office "Computer Room" that had an electrostatic air filter that would, about once a week, let out a giant bug-zapper-type "Buzzap!". Even if you knew it was there you would still jump half the time at the sound of it.
One time after maintenance on the UPS that fed the room, as I rotated the startup handle into the ON position the Air Filter decided to fire. Thought I just killed myself and all our gear at the same time.
mc_it@reddit
We were swapping out UPS batteries.
I picked one up to set on the rolly-cart to move it to and install in a different data closet and felt a stabbing pain in my back.
Thought I pulled or tore something.
Turned out to be a kidney stone that decided right then was when it wanted to rear its ugly spikes.
lungbong@reddit
I was decommissioning an old server, shut down the Java process, checked the monitoring to make sure there were no alerts then typed shutdown -h now and the second I pressed enter the fire alarm went off.
Fortunately it was just a drill.
Generico300@reddit
This happened in my college days, but it was a tech demo so I think still relevant.
My group had been working on a program to demonstrate SOAP integration between a Flash app (yes, I'm that old, shut up) on a browser and a backend. Well, the backend was using Google's API, and right as we went to demonstrate the integration Google had a complete outage. Main site, API, everything, for like 2 hours. We were like deer in the headlights staring at the professor like "I don't understand! It worked 15 minutes ago." while scrambling through the code to figure out what caused the problem.
joshuamarius@reddit (OP)
Your life Flashed before your eyes!
TheVillage1D10T@reddit
I just had to push some JRE updates the other day….our primary site ESXi servers ALL tapped out the CPU. Like 6 hosts just completely maxed out all at the same time. Freaked out for a second.
loupgarou21@reddit
Probably not the worst, but it's the freshest. I had some sites connected via an MPLS that were having rolling outages. I logged into the core switch that they were all connecting back to, and as soon as I did a completely different site, not a part of the MPLS, but also connected to that same core switch, went entirely offline, as did our environmental monitoring.
The environmental monitor just happened to die (it was scheduled for replacement), and the site that went offline was scheduled for maintenance, those going offline had absolutely nothing to do with the rolling outages on the MPLS, but both happened seconds after I logged into the core switch.
Parlett316@reddit
The one that pops out was doing a site survey for a potential client which was a private school. I'm talking to my contact and I'm in their network closet getting serial numbers from current production equipment.
Me: "This is the part I hate, touching equipment that is not mine, not knowing if the power adapter is fully seated in."
Client: "Ha I get it."
And as I grab their Sonicwall, the entire goddamn building's power goes out.
Me: No fucking way.....
Client: HAHA, that shit happens a lot!
And the end.
joshuamarius@reddit (OP)
Came close to home. I was working on TZ370s in my post 😁
Talkie123@reddit
I walked into the server room of a major medical facility in my area. I hadn't even gotten my other foot in the door when all the power in the building suddenly went out. I had every supervisor and boss poking their head in asking what I did. It wasn't until a patient walked in and said they were late because all the traffic signals had quit working that everyone let up. Turns out a car had struck a power line and killed power for the entire neighborhood. Unfortunately for me, that wasn't the last time something like that happened.
GrizellaArbitersInc@reddit
Started the Exchange Online and OneDrive setup for a new client. Ten minutes into setup, MS portal outage on certain blades. Had to leave them mostly configured and await new user credentials once it was back up. Sorted them and they were happy.
6 months later they spin up a new business unit and a couple more machines. Need a new tenant and more new accounts. This time it was 30 minutes in that they went down again.
Easily the most incompetent I’ve looked doing one of these.
joshuamarius@reddit (OP)
WOW!
stonecoldcoldstone@reddit
working in the suspended ceiling up to my elbows in cables struggling to find the right cable run, fire alarm goes off, cables for fire are also in the ceiling... 15 minutes later after thinking I caused it my colleague radios over that it was catering
matroosoft@reddit
We deployed a few test setups for multiple monitors with built-in docking station. Worked like a charm.
Then we deployed a completely new office with 15 new desks with the new setup. Spent weeks on preparing, ordering, network, switches, APs, cabling, desks, monitor setups etc. And finished just in time for the deadline, as the next day they would move in. I just needed to test that all monitors worked properly using DisplayPort Alt mode.
Guess what, none of them did, and my heart sank. Tested with different laptops, cables, unplugging the second monitor, updating drivers, rechecking I ordered the right ones, looking online for similar issues..
Then at some point I replugged the cable and it just worked. I looked again - there were 3 USB-C ports, one of them was labeled 'up' while the other ones had no label. All that time I had assumed the 'up' one was the one I'd need to use. It wasn't.
Just by accident I used one of the other ones, which was the only actual docking port. 🙄
agent_fuzzyboots@reddit
worked at an isp, plugged in a firewall at a customer, hmm it doesn't work, hmm, i have a signal on my phone but can't make a phonecall, shit did i do something?
we got ddosed bad.... (not my fault)
but once one of my customers caused a broadcast storm and took down everything (our config was shit) hey let's plug in these sonos speakers, let's wire them for maximum efficiency, oops they also speak wireless with each other and loop the traffic since their spanning tree was different from our spanning tree
Wild-Plankton595@reddit
Working L1 support at a school, so I was cleaning up the server room, pulling cables out of the back of the rack that were left hanging and not tagged. NBD, routine, cleaning up after vendors and the server team.
Suddenly network goes down on the entire campus. I call the network guy, ofc share what I’d been doing, I don’t hide from my mistakes. I’m getting chewed out while he’s investigating what I took down.
Turns out it was a switch in the networking lab. A networking student plugged a cable into two switch ports in the same clearly labeled production switch, rather than whatever he intended to do on the clearly labeled lab switches. Misconfigured spanning tree allowed a loop to take the entire site down. Students got a real life example that day.
I own my mistakes and hold others to the same standard even if they are three levels above me. In the debrief, to everyone's surprise, I got one of the few apologies this admin ever made in his 25 years at this place. The guy always had complaints against him because he had a dry sense of humor and was prickly to boot, but I liked his sense of humor and I never experienced the issues everyone else had with him, so after that incident others would ask me to work with him on their behalf.
joshuamarius@reddit (OP)
A production switch in a Networking Lab full of students.
So let me tell you about Murphy's Law.
CaptainZippi@reddit
Early Sun Fibre Channel array (3510) had an “interesting” cmd line syntax. Array was live and so we triple checked the command to create a new lun and mapping.
So the three of us were clustered round the guy doing the typing - and we confirmed that the command was correct and accurate.
“Confirm” “Concur” “Yup, do it”
…at that exact second the weekly 3pm Wednesday fire alarm test went off.
Took 15 minutes for my heart rate to return to normal.
joshuamarius@reddit (OP)
Ha! Ha! Ha! This is awesome.
fsweetser@reddit
Ran into this one just earlier this week.
Got a call from one of our users that a bunch of wireless temperature sensors all went offline (they monitor -60C freezers full of biomedical research supplies, so pretty critical). Look at the timestamps, and dammit - they line up with us changing wireless vendors in the building. We start going through various wireless troubleshooting rituals, but can't find anything obviously wrong, so I decide to step back and troubleshoot end to end, rather than assuming we know where the fault is.
Eventually I look at the firewall logs. Hey! Why are the sensors talking to a whole different bunch of servers than before the changeover? Go through the vendor support docs, and yeah - the vendor decided to swap their API endpoints to a whole new set of servers on the same damn day we swapped out the building wireless.
Luckily we only wasted a couple of hours on what turned out to be a three minute fix, but yeah, sometimes it just feels like the universe is screwing with you...
joshuamarius@reddit (OP)
Definitely that's how it feels sometimes.
Brekkjern@reddit
Once I had to make a change to the router config of the port I was connected through via SSH at a very remote site. The change would force the connection to bounce, so I expected to lose connection for a little bit. I make the change, lose connection, and wait a little bit before trying to connect again, but I just can't make a connection. I try again a few more times but nothing. I start getting a bit panicked, reviewing the changes I just did, but I don't understand why it doesn't work.
After running around in panic for a while I realize I will have to call the ones who manage the "local" techs. "Local" is doing a lot of heavy lifting as they might still be several hours away by car. I explain my problem and they tell me they will call back later when they have any info.
A few hours go by and I am anxiously waiting for news when I get a call back. "Yeah, the site is down. We actually had a tech on site and he was clearing trees next to the antenna, and one of them fell on the dish."
joshuamarius@reddit (OP)
Ha! What are the chances...right?
Stonewalled9999@reddit
1AM-5AM maintenance windows are common for coax companies
joshuamarius@reddit (OP)
On which days?
CeC-P@reddit
UPS A and B in the rack, running critical medical tracking hardware. UPS A we swap live after moving over some dual PSU items. UPS B that we didn't even touch just decides to die randomly because of the vibration or something. We're on the phone scheduling night time downtime for the heart monitors and we get the emergency call about them being down right now. We immediately move plugs to the mounted but not turned on UPS B.
Wild-Plankton595@reddit
I'd been messing around with conditional access policies, applied to my test account, but it was the end of the day so I left it for the next day. Had a doctor's appointment first thing in the morning. I'm sitting on the exam table waiting for the doctor when my phone starts blowing up. No one can sign in to anything Microsoft. They're getting an error that they don't meet requirements. Immediately think I messed up and applied my test policy to all accounts and would have some 'splaining to do. Break glass account to the rescue, sign in on my Galaxy S8 to view sign-in logs. It's not my test policy that's blocking sign-ins, it was the policy we have to block sign-ins from out of the country and anonymous IPs; it'd been in effect and working well for years, and no changes had been made. Thank the Elders of the Internet. For some reason MS or the ISP were treating all IPs as out of the country/anonymous. I disable that policy and within a couple of minutes all is working again. At this point the doctor has been waiting for me for a change; luckily her next appt was a no show so she had time.
Geodude532@reddit
Waited to the last minute to do my admin access training. Got sick and missed a week of work. Takes a lot longer to get access back than it does to keep it....
suicideking72@reddit
I was working for an MSP. I had an owner of a small business (website design) bring in his PC to get rid of minor malware. So all went well and I had him pick it up after a few hours.
So he calls, frantic, an hour later. YOU DELETED ALL MY WORK! WHERE IS IT?
Ok, where is your work kept? He put EVERYTHING that was in progress in C:\temp...
One of my cleaner apps clears C:\temp and all his current work was gone, no backups. Lost a customer and he didn't want to hear it when I told him DON'T NAME A FOLDER TEMP! Temp work, TempW, anything but TEMP.
Happy_Macaron5197@reddit
not sysadmin level but i was doing a late night deployment, pushed to prod, site went blank, full panic mode, rolled back, still blank, spent 45 mins convinced i had nuked the database
turns out my ISP had a 1 hour outage and the site was fine the whole time. i just couldn't reach it.
the worst part is i had already drafted the "i am so sorry" message to the client and everything. never sent it but still haunts me.
the spectrum timing on yours is genuinely evil though. clicking logout and watching an entire surgery center go dark at 2am is the kind of thing that takes years off your life
fearless-fossa@reddit
Had an Ansible playbook running against prod servers for deploying an emergency hotfix for our application. While running suddenly it turns off and monitoring goes all red.
The guys in the DC tested the redundancy of the power supply. It was not redundant.
Connir@reddit
Not even IT related.
In high school I worked on the drama club, doing construction for play sets. I went into the store room for all of the lumber and turned on the light. This is the first time I ever went into this room alone, I was a freshman. The light switch had a pipe running up to a fire alarm, and over to a light bulb. The instant I turned on the light, the fire alarm starts blaring. The whole school has to evacuate.
I kept my mouth shut and waited outside with the crowd of students watching. Wondering what the hell I did wrong.
At the same instant I turned on that light switch, some other students moving play set stuff through the cafeteria knocked a sprinkler head off, kicking off the alarm.
We spent the rest of the evening mopping up the cafeteria. I don't think I told anyone about it until years later.
paishocajun@reddit
NGL that sounds like something you'd see in a slice of life anime and go "psh, like that'd actually happen" lol
Winter-Swimmer-3000@reddit
I was building a high speed internet service for a customer at fairly short notice; they'd insisted on taking the full internet routing table which imo they didn't really need. No problem- I had route-maps ready to go for this, merely some editing needed in notepad and we're off to the races. This would probably have been a dedicated 10Mbps or so service, managed CE which was probably Cisco.
I had 2 shells open; one to the PE, one to the CE. It was 2am or so by that point and I decided to get coffee before proceeding. A token gesture really but hey.
I came back to the consoles, had a look- that .txt is wrong, I've transposed a couple of things there, idiot. Blaming my tiredness, I did a fresh edit, reviewed- all looks good. We're ready to go, push it to the CE.
Why have I suddenly lost connectivity?- actually, why am I now not seeing the routes I was seeing just before I made the changes?
TL;DR- in my tired state, I pasted the CE changes into the PE. I made the news the next day :/
Fallingdamage@reddit
I've had that happen many times in my career actually. Correlation vs causation??
Ultimately, things feel like they go down while you're working on them because, well, you're always working on something. So when an outage happens your first thought is "why did that action result in that??"
Just coincidences.
ioa94@reddit
Went to a tax office to swap out their firewall right at the tail end of tax season. Firewall guy told me everything was configured and it would be a quick swap. Hooked it all up, couldn't get out, so after a few mins of checking everything over I plugged the old firewall in, STILL no dice. Come to find out, between the time I unplugged the old firewall and plugged the new one in, there was an ISP outage at the same time as my firewall maintenance window. Still wild to think back on my luck to this day.
tdhuck@reddit
These are the worst because you are almost certain that plugging in the old gear will get them back online and you have time to figure out the issue with the new gear. Then once the old doesn't work, now you have to think...was there a setting changed on the old fw that wasn't 'written', and nobody knows what that change is?
I've had similar close calls and I am always paranoid about backups and changes. I will manually login and take a backup (assuming backup automation is in place). If the device has web and CLI access I will also grab the running and startup configs and print them out on paper and have them accessible via txt file. Of course I will compare them to make sure they are the same, but both are still readily available. The prep might seem overboard, but if you ever need it you are glad you have it nearby.
joshuamarius@reddit (OP)
Ha! Welcome to the club... Sit down and have a drink 😅
Orionsbelt@reddit
2 failed AC units, 3 different failed heat sensors at the same time, in the server room at a chemical plant. Overheated SAN that was running HOT for hours and unresponsive. 5 am Monday I'm pulling the whole thing apart and letting it cool as much as I can before trying to see if it's going to come up or if prod is going to be down for days as we restore.
burnte@reddit
I was working in a network closet on one of two core switches at about midnight Friday night. Suddenly reports come in that the whole facility is out. The secondary core switch died ten minutes into replacing the first one. The whole site was down for a while as we finished the swapout.
joshuamarius@reddit (OP)
You should have played the lotto that day as well. That's a 1 in a million chance.
Greerio@reddit
We made a change to an entra app, the same morning the site went down for the app. People couldn’t login. We spent too much time troubleshooting the app lol.
joshuamarius@reddit (OP)
Gotta love Microsoft 💚
Vast_Resolve_8354@reddit
Did a firmware upgrade on our Firewall at 6am before anyone else came in. It came back up like normal, but both our primary Ethernet line and SoGEA lines were down and everything had failed over to our cellular connection.
For some reason our SIP lines would not connect to the provider.
I spent a couple of hours panicking that I or the Firewall vendor had fucked up, before I found out the local fibre exchange had caught fire, and our SIP provider had not added our cellular IP to the whitelist of allowed connections.
Phones were down for about 30 business minutes before I could get hold of someone to actually add the IP.
joshuamarius@reddit (OP)
Ouch!
Hobo_Slayer@reddit
In high school I got to work part time as a paid employee doing L1 tech work for the school district's IT department.
One day they had me over at the district's central office using a vacuum to clean dust out from some old workstations. After about 5-10 seconds of vacuuming, the power went out. After a short bit it came back up, so I went back to vacuuming workstations. Again, maybe 5-10 seconds later, the power to the building went out a second time. Power comes back. The cycle of "start vacuuming, power dies, power comes back" then repeats one more time.
This vacuum was also insanely loud and sounded like it maybe had issues, so I wondered if maybe it was tripping a breaker or something, though that wouldn't really make sense for the power for the entire building to be dying instead of one circuit.
Other people had this theory too because even though I was behind a closed door and down a long hallway, it was loud enough to be heard around the building, and a bunch of people having a meeting down the hall noticed the effect of "vacuum starts, power dies" and came to the room and told me I had to leave because I kept killing power to the building.
While I'm arguing with them about the whole thing, the power goes out again, and this time stays out. It turned out there was something going on with the power for the whole town, and it dying and coming back just happened to be a perfectly timed coincidence to when I was turning on the vacuum, to the point where everyone in the building thought it was my fault.
joshuamarius@reddit (OP)
This is exactly how nicknames such as "Vacuum boy" get started in IT Teams 😂 Great story.
manofsticks@reddit
Luckily I'm just a dev, so I only got to witness this and didn't have to fix it.
One time our primary system had a hardware failure I don't remember the details of. No big deal, we have an off-site backup and we switched it over.
A few hours later, a car hit a telephone pole down the road from the off-site backup. No big deal, there's a generator backup.
But the week before, we had the generator inspected, and the inspector left it off when they were done... so the generator didn't kick on.
LUCKILY by the time that happened, the primary server hardware had been fixed, so we only had a very brief outage.
BoltActionRifleman@reddit
Final Destination - IT Infrastructure
evilcreedbratton@reddit
Exchange Online outage right after a migration
joshuamarius@reddit (OP)
Ouch! Felt that one! 🥺
SenTedStevens@reddit
One time at an old job, I was tasked to swap out the old batteries in a UPS. I disconnected the old batteries, loaded up the new ones, and ran the battery test. Seconds after doing that, the lights went out. The whole building lost power and went on generator. I damn near shat my pants thinking I messed up something. The building manager discovered that a squirrel did parkour on a transformer and took out our grid as I swapped out the batteries.
joshuamarius@reddit (OP)
☝🏻☝🏻😂😂😂 Now that's what I'm talking about!!
TheDawiWhisperer@reddit
i used to work at an MSP and for one customer we had a little environment that had DR of sorts by running duplicate servers on two vmware hosts.
one day we had to shut host1 down so HPE could swap out something on it so i shut all the VMs down first, the plan was that the VMs on host2 would assume the workload.
so after the VMs on host1 were all shutdown gracefully i went on the iLO and shut down the host.
a couple of minutes later Nagios lit up like a Christmas tree when all the VMs that were not in downtime alerted as being down...i was like "wtf, that's not right" and i don't really believe in coincidences so was pretty sure i'd fucked something up.
Turns out the iLO IPs on the two hosts had been mixed up on the asset register.
Doh.
I mentioned it to the solution design guy who put it all together and the absolute prick tried to make out that it was my fault for not double checking the serial number on the host. Get fucked, knobhead.
destructornine@reddit
I was onsite at a client troubleshooting a Wi-Fi issue. Talked with the HR manager for a few minutes. She needed Wi-Fi for an upcoming interview call. I suggested she take the call from the coffee shop on the ground floor of the office building. About half an hour later, we got evacuated for a bomb threat. The bag with the bomb was at the table next to the HR manager. She jokingly asked if I was trying to get rid of her.
(It didn't blow up and turned out to be a duffle bag of books someone planted as a prank).
joshuamarius@reddit (OP)
Murica!!
JimTheJerseyGuy@reddit
Back in ISDN WAN link days, I was working for a telecommunications company. I ran a script to do some updates on a router at a major office site on the other side of the country. I think we had around 200-300 employees there. Moments after I kick the script off - complete loss of connectivity. 20 minutes of scrambling to get the backup modem number (because the wrong one was in our directory) only to connect and find out the entire link was down because our own people were working on the ISDN circuit and failed to notify us.
joshuamarius@reddit (OP)
I think I speak for many when I say the biggest pet peeve in our field is lack of notification/communication. Literally it turns into friendly fire.
BrentNewland@reddit
I set up Microsoft Authenticator for our firewall VPN MFA (we already had DUO). Everything tested and working. Which MFA you got was based on AD group membership.
I tell the rest of the I.T. team, within minutes the VPN is down. Internet at the office is down. People asking what I did to break things.
Turns out a UPS failed, and it took down a switch, which prevented failover to the second firewall.
joshuamarius@reddit (OP)
That's a bad feeling when they turn the investigation to you 🥴 At least I was alone at 2 AM during my situation. Glad it was a UPS ✌🏻
RoyalTranslators@reddit
I was cleaning up a bunch of old unused DNS records for our domain one afternoon a few months ago. The next morning all of a sudden people are calling me saying they haven't received email they were expecting, some are saying they can't send out either... oh. no.
At this point I'm thinking I definitely deleted a critical record for 365 and things only just now propagated. I log into the DNS dashboard - everything seems to still check out there. I refresh r/sysadmin and lo and behold, huge Microsoft outage. I should have known -_-
Velocireptile@reddit
One of the recent AWS hub outages occurred shortly after I had completed an overnight firewall firmware update I'd been advising people about for a week. "Whatever that update was you were talking about, can you undo it? The whole Internet went down!"