Sometimes, the solution can be so simple...
Posted by JudgementMaker123@reddit | talesfromtechsupport | View on Reddit | 82 comments
So, this post has gotten removed from other places for being fake, as this obviously would never happen in real life; it has literally gotten me banned from one sub for 'creating an unbelievable story'. So, I hope it fits here and that there are actually people who will believe me. I have added a Q&A at the end with the most common questions I got asked before the post was removed from the other subs. I have also translated this from another language using Google Translate (I know, terrible); I did check it, but if there is anything I missed, sorry.
This all happened about 10 years ago, but I'm still in contact with the people from the company and I hear that, unfortunately, things haven't improved, not even in 10 years. This story takes place during my first month as a trainee at the company, working in IT, mainly providing first-level support, but only the easy stuff, like telling people how to turn on the computer and where to plug in the USB stick.
Wednesday, 11:55 a.m., just before lunch. Back then, I was still a trainee with very little knowledge, but when suddenly 50 tickets from 10 countries landed in the ticketing system basically at once, all with the same message, "SAP is down," I knew we had a huge problem.
The troubleshooting began, and after almost 20 minutes, we hadn't made any progress. We had tried practically everything we could. The mood was terrible, everyone was hungry, everyone was frustrated, but the problem had to be solved now: production was at a standstill in 10 countries. In my youthful innocence, I joked, "Maybe someone just pulled the plug." Man, if looks could have killed. I got yelled at: "If you have nothing productive to contribute, then shut up." Intimidated, I sat in the corner and watched while the others frantically tried to find a solution. The phone kept ringing, new tickets kept coming in. All I could do was answer the phone, say, "We're aware of the situation," and respond to tickets in the same way.
At 12:45, one of my colleagues returned from his lunch break; he had left at 11:45. He came in and saw how everything was going wrong. He asked what was going on. We explained. He just looked worried and asked, basically to himself, "Could this have something to do with the Telekom guys I left in the server room before my lunch break?"
Silence, dead silence. Everyone just stared at him. "You did what?" someone managed to ask, while two others had already started sprinting towards the server room. "They were supposed to be here, we knew they were coming." "Yeah, at 2:30 PM. You can't just leave strangers unattended in the server room and then go on your lunch break!" "Okay, sorry, it won't happen again."
Suddenly, the connection to SAP is re-established. Relief. The two colleagues return from the server room. They both look down at the floor. The boss asks, "So, guys, what was the problem?" "Well, he had to plug in a device for work and unplugged it. He said he didn't think it was important because all the other plugs were labeled, only that one wasn't."
Dead silence again. No one looked at me. After what felt like 10 minutes, but was probably only a few seconds, the boss simply said, "How about I order pizza for everyone? You all worked through your lunch break." People nodded and walked back to their desks. I was still sitting at the trainee desk in the corner, the worst possible spot. The boss came over and asked what kind of pizza I wanted. I answered, and he kept walking. No one spoke to me for a good hour. I just kept working, processing the tickets related to the incident and eating my pizza.
In the five years I was with the company, the incident was never mentioned again. However, every time there was another major incident at the company (and there were far too many, they were so awful), I was taken seriously and given a chance to speak before being yelled at.
Q&A
Why was there no emergency plan in place?
I don't know, they probably didn't think it would happen. I see plenty of companies in my current line of work that don't have an emergency response plan and would probably panic the same way if their critical system went down.
Why didn't anyone check the server room?
Again, I don't know, probably because it was very improbable that it was coming from there. Only we had access to the server room. 2 people were working from home, 1 guy had left for lunch and people had seen him leave, half of the team was supposed to leave for lunch at noon, the other half at 1 p.m., so we didn't expect anyone to even be in the server room, let alone unsupervised.
Why did you keep on receiving tickets, why wasn't a master ticket created, why did you not post anything on the intranet?
We kept on receiving tickets because people were panicking about production having come to a standstill, and as you know, we will work faster the more tickets there are (this is a joke, by the way). And I didn't know what a master ticket was; I was less than a month in, I had no idea what I was doing. And I definitely didn't have access to the intranet to put a message on there.
Why did the phone keep ringing, why didn't you put up a message saying you were working on the problem?
Again, not my domain, I was working there for less than a month at this point, I was just told to pick up the phone, say we are working on the problem and hang up.
Why wasn't there any failover in place?
There were, but nobody had tested whether they actually worked in like 2 years. If one system failed (this includes pulling the plug on one system), it was supposed to automatically switch over to another system, it just didn't.
Why wasn't electricity being monitored?
I don't know, there were failovers in place so everyone just assumed that something like this couldn't happen.
Why were people left alone in the server room?
I don't know, the guy was probably hungry, wasn't thinking straight and thought they couldn't do much damage.
Why wasn't this shown on the monitoring tool?
I don't know, I was a trainee, I wasn't even looking at the monitoring tool and if I had, I probably wouldn't have understood anyway, but I assume if it had said 'plug A was pulled', someone would have gone to check.
I hope I have answered most questions and that this doesn't get me banned, it really is a true story, I have many others like this because that company was chaos but the pay was excellent for a trainee.
aj4000@reddit
100% believe this. One time I had a site offline and unable to trade for almost a week because one of the venue managers who should not have been in the comms room decided that he actually should be. He saw an ethernet patch lead that was unplugged and hanging loose, so he just plugged it into a random empty port in one of the patch panels despite having absolutely no idea what he was doing. This connected our closed and isolated network into their venue network, so our hardware was getting an incorrect IP and couldn't reach the host. I attended 4 times to try replacing different hardware, and each time I asked if anyone had been in there for any reason. The first 3 times he told me no, absolutely not, no one at all had been in there. On the 4th visit he finally told me what he'd done, but of course couldn't remember which cable he touched. I ended up having to rewire our whole side of the network to isolate it again.
ToBioRob@reddit
Everybody who said it's fake was in your manager's position.
Margenin@reddit
?
Ben725@reddit
In other words, the people calling it fake were like the managers he used to work for, wanting to deflect and pass the blame onto OP.
Yodo9001@reddit
Why do you say that the situation hasn't improved at the company? You were taken seriously after that.
If this was 10 years ago, how do you remember that it happened at 11:55 on a Wednesday?
The rest sounds believable.
LurksWithGophers@reddit
Eh, if it's big enough you tend to remember. I remember another department breaking DNS for every customer at 4am on a Thursday a decade ago.
syntaxerror53@reddit
Least it wasn't 5pm on a Friday.
JudgementMaker123@reddit (OP)
I mean, it's easy to remember that it was just before I went on my lunch break, and I was only in the office on Wednesday, Thursday and Friday, and it was my first day that week.
Also, yeah, I got taken seriously, but that was about it, the plug remained unlabeled, no emergency plan was put in place, the failover still didn't get tested regularly, so overall, nothing improved and there were enough other incidents (major and minor) that could have easily been prevented.
LeahInShade@reddit
It's a very easy time to remember - 5 minutes before lunch break.
It's pretty possible also that while OP was listened to from then on before being screamed at, it only stopped at listening, not progressing to doing. OP also hadn't worked there for a long while, so fairly likely the crazy culture just carried right on. Big organizations are often impressively inert, and even after some short-term improvement may devolve right back to default chaos.
shell_shocked_today@reddit
So, I'm not saying I don't believe it, but here's the part I have trouble with.
Any system running SAP should have dual power supplies. While it certainly is possible, and has happened, that only one PS is in use, that certainly shouldn't be the case. Unplugging one of the PSs should have generated an alert, but not an outage.
But, I've worked in places where that could have happened. And I've had cleaners unplug production servers. It happens.
OldGeekWeirdo@reddit
It might have been one of those situations where a P/S failed and was on back order. But somehow the crew doing the troubleshooting didn't know or forgot.
The part I have trouble with is if the server was completely dead, why did it take them an hour to go look and make sure it didn't catch fire or something?
I'm wondering if it's a case that the plug ran part of the server (like an external disk rack) and the server was talking (and complaining).
syntaxerror53@reddit
Can't ping it, go check it in person.
oxmix74@reddit
Yes. Unplugged server wouldn't be at the top of my list, but it shouldn't take long to realize the server is unresponsive in every way you try to hit it. Which should lead to taking a look at it.
shell_shocked_today@reddit
Yeah, I mean we've all seen stupid power setups. I mean, the setup is just stupid enough to be real. Heck - one site I worked on only had one PS plugged in because they ran out of power cords with the right ends for their PDU.
OldGeekWeirdo@reddit
"It runs, and it's quitting time. I'll tell the boss later."
Even if they tell the boss, what's the chances it will be on their priority list.
bmwiedemann@reddit
I remember a case of a Fujitsu server that I only plugged in with one cord, because it was only for a quick HDD benchmark. It crashed under load... And later I found that the crashiness went away when plugging in both PSUs. Apparently, a single PSU was not enough to sustain its maximum power usage. Maybe the PSUs were already aged.
ThunderDwn@reddit
You're assuming the company actually put some thought into the deployment and hardware procurement.
The best laid plans are often derailed by penny-pinching manglement. I worked in one place that had a fairly large Exchange server - back in the days when Exchange was mostly on-prem, and had just gone from version 3.5 (with one datastore) to whatever the next version was which allowed multiple data stores.
The executives were in rapture. No more deleting email.
Trouble is, they only approved the hardware with single power supplies (this was for storage and server), despite the system architect, the system administrators, the vendor and anyone else who had half a brain telling them this was a Very Bad Idea - because single-supply units were cheaper.
I'll leave to the imagination what happened the first time there was a power surge (oh yes, no UPS either) that took out several power supplies in the hardware.
Backups? We don't neeeed no steenkin' backups!
shell_shocked_today@reddit
Even better is when they have backups - but didn't spring for the Exchange connector license.....
Rathmun@reddit
I've seen dual power supplies plugged in right next to each other. I've also seen cleaners unplug two things at once because "If I don't it doesn't work." (If they didn't they'd pop the breaker.)
I haven't seen the two happen at the same time, but that doesn't mean much.
Drew707@reddit
I've seen PSUs plugged into the same PDU.
Rathmun@reddit
That tracks. Haven't seen it with my own eyes, but I believe it.
OcotilloWells@reddit
Both supplies were probably plugged into the same cheap power strip.
XkF21WNJ@reddit
Someone somewhere would be stupid enough to plug both of those into the same extension cord.
ryanlc@reddit
Other subs thought this shit was too unbelievable!? This happens all the fucking time!!
JudgementMaker123@reddit (OP)
One said that it was unbelievable because no one checked the server room, anybody with common sense would have first checked the server room. The post was removed before I could reply that it is also common sense to not randomly unplug things in a server room XD
Old-Class-1259@reddit
Sorry to hear of your experience on other subs, I would think most people have heard similar or even have their own story along these lines.
Loading_M_@reddit
Also, the server room is access controlled - someone can't just walk in and unplug things.
Nyssa314@reddit
I know, right? 1st step: does it have power? 2nd step: turn it off and back on. 3rd step: do whatever they were doing after they skipped the 1st 2 steps.
Own-Cupcake7586@reddit
Yeah, I buy this. I've spent more time on stupider things.
Turbojelly@reddit
People tend to forget just how stupid people truly are.
Also, due to Murphy's Law, if someone says "That isn't the issue, don't check it," you are all but guaranteed to find that it actually is the issue.
Fwoggie2@reddit
Me too. We nearly fried our warehouse system across seven sites (delivering beer and wine to 1/3 of the UK's pubs and stadiums) back in 2006. Someone in the IT department pulled out a plug in the server room late one afternoon because they needed to plug in whatever and they didn't check what they pulled out. It was the air con system for the whole server room.
By the time the first techy woke up at 6am to discover a bunch of SMS alerts the servers were dangerously close to catching fire. He floored it to the office, turned on the Aircon to max, opened the door, stole every gangplug and desktop fan he could find in a desperate attempt to vent the heat.
After that incident the air con got rewired so you couldn't unplug it and the rotating night shift second line support got added to the server alert SMS list plus access to the monitoring dashboards.
theoldman-1313@reddit
The people saying this is fake are probably interns themselves. They will learn. Probably the hard way.
fatmanwithabeard@reddit
Yeah.
I've been in rooms like that.
Only thing that bugs me is that the group didn't immediately rally around you when it turned out you were right.
15 years ago I had clients that were the wild west like that place sounds, and clients that could see the state of every outlet in the room. Operational maturity was available, but even today it's not universal.
noO_Oon@reddit
Welcome home, you've found your peers! This is very believable!
Steve061@reddit
This reminds me of when I worked for a large government department that had systems that had to be available 24/7 internationally. There were all sorts of redundancies, including generators and battery UPS that would give supply while the gensets kicked in (they normally took a couple of minutes).
We had a power outage in HQ (an animal fried itself on the mains feed). Everything worked as it should. Batteries took over and the gen-set kicked in while we waited for the mains to come back. After 15 minutes the mains came back and the generators started shutting down. Five minutes later someone in the server room asked why the generators had shut down and why we were back on battery, because there was no mains feed into the servers.
Turns out all the testing had never actually cut mains power to the main relays. Two units the size of cigarette packets and costing less than a couple of hundred dollars had gone open circuit when they lost the mains power and just would not close the contacts when re-energised. It took three weeks for one of the key databases to be rebuilt because it was an old mainframe with only a single daily backup (since replaced!).
jamoche_2@reddit
Things like someone pulling a plug because they don’t know it’s important are classics for a reason. Anyone who thinks it’s fake just hasn’t been around very long. Same for perfect storms of multiple things going wrong at once when only the new person is around — or in my case, I wasn’t new but I also was an expert in something only tangentially related to the problem.
commentsrnice2@reddit
I was just reading a story yesterday about a power connection that was screwed into the base plate for security and someone still unplugged it in favor of something else
MrJingleJangle@reddit
Fully believe it. At a place I worked at in the 90s, we had one (Novell) file server that would mysteriously reboot every evening. The cleaner had to plug her vacuum cleaner in somewhere, so she unplugged the server rack. That rack was a half-height rack in a convenient position on the floor with a LaserJet on top. Most of the other server racks had their power and Ethernet from a floor box under the rack, so tamper-proof, but this one was adjacent to a wall on a hard floor, so wall socket.
Best-Conclusion5554@reddit
I have instant recall of (I hope) all the really idiotic things I did in customer server rooms over a 40 year IT career ... thank goodness I've retired.
Original_Flounder_18@reddit
FWIW, whenever anything is malfunctioning, my boss tells me IT is aware of it and working on it-but put in a ticket anyway.
GreyWoolfe1@reddit
27 years in field service. The first 5 working for a DELL service provider and the next 22 working on government contracts. It happens more than you think.
Drew707@reddit
This is 100% believable. I have lived a very similar scenario with a colo provider of all people. They had three backbones into their center and assured us they had all the redundancy. One day they let one of the providers in to do some work, and the tech kept unplugging and plugging in some fiber connection. It was generating latency and packet loss that was nuking our phone system hosted at the colo, but apparently not enough to trigger their failover to another carrier. All because a third-party was left alone in the center.
Another time, our landlord essentially did the same thing to one of the other tenants in our building. That tenant was still using some legacy copper T1s or some shit for their phones that was routed through an empty unit and not the normal common telco closet. Some contractors came in to remodel the empty unit for a new tenant, and when they asked about the punchdowns, our landlord told them it was old stuff and everyone went through the telco closet and main riser. They completely ripped out all the phone lines and that tenant (call center) was dead in the water for something like a couple of weeks while they struggled to upgrade their entire system to VoIP. After that, I made sure no matter who was working in the shared telco closet, I had someone monitoring the project since that's where both our fiber lines came through.
OcotilloWells@reddit
I was in a government building that got re-cabled for telephone service. The cabling company had a totally separate crew for ripping out the old cabling after they were done with the new cabling. They asked the on-site government guy, who had no idea what they were doing other than what I just said, which wires to pull. He told them all the new wiring was black, so anything else was okay. All the new wiring was not black.
The worst part was that guy made me the temp contact person at first because he went on vacation on the start date. Didn't bother to tell me, I didn't even know the wiring was going to be worked on. If he'd said something to me when the wrecking crew came in, I would have recommended a full stop, and insisted a supervisor from the cabling company come in, as that second crew clearly had no clue what they were doing. The funny thing is, my dad actually knew the owner of the company. I had asked the contracting office if my dad could spot check the job after hours, unofficially. If he saw any problems, we could just pretend that I had seen the issues. They gasped in horror, and said no.
MiloBard@reddit
There was a video I saw somewhere, like a short film or student film or maybe a commercial, where one guy spent his whole life laying down railroad tracks the old-fashioned way, and a few miles behind him another guy's job was tearing up the tracks, so in the end they were never actually used. Corporate miscommunication and lack of coordination: one department doesn't know what the other is doing.
SoItBegins_n@reddit
Shoulda done it without asking them.
OcotilloWells@reddit
I would have needed something so we knew what their scope of work was supposed to be. Which is why I'm still kind of pissed that I was the POC for about 2 weeks; I had no idea exactly what they were supposed to do.
freddyboomboom67@reddit
I've been the guy doing the unplugging, so I totally believe it.
Not on purpose, but accidentally.
Story time: I was a field service engineer in the early 2000's and was let into the customer's server closet. The guy that let me in left immediately. The back of the rack had the cascade of power and network cables you see in bad memes. While removing the drive from a tape library from the rear (ATL P1000, IYKYK), I accidentally pulled the plug out of a Sun server. Quickly plugged it back in and turned it on, but nobody ever came to investigate. Finished replacing the tape drive and went on my merry way. Nobody even checked my bags as I left.
gogozrx@reddit
It's always layer 1. If it's not, check again, because it's always layer 1.
Stormdanc3@reddit
My uncle has a story about an exec who wanted to test how effective the absolute worst-case-scenario backup procedures were on their main server. He did this by essentially pulling the server power plug in a way that bypassed the multiple redundant UPS units and the normal low-power shutdown procedure…with no warning to anyone.
Turns out, they did have a robust backup system, but it still took them two full days to get everything back online. Spendy.
Ill_Cheetah_1991@reddit
Yup - sounds possible
Actually - anyone who has worked in IT for more than 10 years and says they have NOT seen something like this is either lying or not paying attention!
My record was taking out 5000 users by leaning back in my chair and the chair leg knocking the off switch on a plug!
resonantfate@reddit
It blows my mind reading a bunch of old accounts of how the EPO switches in data centers and server rooms used to just be exposed without a protective cover. People were just expected to not hit them by accident. In my day, I've never seen an EPO switch that didn't have a protective cover.
K-o-R@reddit
What is "SAP" in this context?
resonantfate@reddit
SAP is an enterprise business management software product. I believe it can handle everything from inventory to industrial process tracking to finance. Probably a lot more.
https://en.wikipedia.org/wiki/SAP
ThunderDwn@reddit
I don't know why people think this is fake - I've seen similar circumstances at least a dozen times in my career.
The more infuriating one was a cleaner - who specifically did NOT have access to the server room, but who somehow managed to convince someone to give him a master key to every lock, overriding the keycard lock - would unplug one of the core switches every....single...weeknight...to plug in his vacuum cleaner.
Yes, there were massive process failures in that one too.
OldTimeConGoer@reddit
Doing contract IT service a long time ago I got called out to a small bank office to deal with a back-office server that was throwing errors. Remote diagnostics were, at that time, quite crude so I was sent to try and figure out what was going wrong. Fixing or replacing it would be done by an infrastructure team if needed.
I got to the bank office and was given access to the "server room", which was a locked cupboard, typical for a small IT setup like this office needed. The servers were standalone boxes, not rack mount -- "Compaq Great White Whales" IIRC. One of them was blinking error codes which I diagnosed as thermal issues from the documentation, which was, for a change, actually on hand. On inspection I found the PSU intake grill choked with shiny ribbony stuff. I pulled it clear, did a reset, and after a few minutes of rebooting the fans spun up and everything settled down. I stayed on site for an hour as the system support team logged in remotely and checked out the server, then I was told I could hand the system back to the office and leave.
As I was signing out I saw someone heading towards the server room with a couple of boxes of Christmas decorations, lots of shiny ribbony stuff... Yup, it turned out that they kept all that crap in the "server room" 350 days of the year but they had cleared it out temporarily to make it easier for the tech (me) to get at the servers. Thanks guys.
db48x@reddit
Oof, that ought to be illegal. Like obstruction of justice, or destruction of evidence, or witness intimidation, or…
BoulderNerd@reddit
Plugs get detached for all kinds of reasons. This is a very believable situation. The classic case is the cleaner needs to plug in a vacuum.
laz10@reddit
What's so unbelievable?
Dranask@reddit
Back in the 80's I worked for an IT company that sold systems to Estate Agents [Realtors]. They supplied an AMOS server to the main office and then dumb terminals at the outlying offices, all connected with modems.
Offices that were down in the morning were usually down because the cleaner had removed the power to the comms equipment to hoover the office and of course never put it back.
Johnnysoul33@reddit
The whole story sounds like it could have happened and I have seen similar stuff. I just have to say I can't believe one thing, and that is that you instantly got 50 tickets. When production stands still nobody cares to write a ticket; that shit instantly gets escalated by phone call to the highest IT guy in place, who then informs the team. At least that's what would have happened in every company I have worked for so far.
JudgementMaker123@reddit (OP)
Might have been over the course of 5-10 minutes. But 50 tickets is nothing; it was 10 sites, so about 5 tickets from each site. And production was at a standstill; yeah, we were also called when that happened, but I wasn't part of those calls, I was just looking at the ticket system. SAP was also used in other areas where an outage of SAP for an hour wasn't as important; those people could just go to lunch, and if it was fixed by the time they came back, it would have been fine. So those people didn't necessarily call in, they just wrote a ticket, and since they were not all in the same room, they each wrote a ticket individually. That's how 50 tickets at basically the same time came together, all with the same message, but for some areas it wasn't as urgent as for production.
ManWhoIsDrunk@reddit
I've seen similar things several times. Nothing here is unbelievable.
WinginVegas@reddit
I've posted a similar situation here previously. Had a location with a regular, twice a week system disconnect that killed all access to a major part of the network.
Two weeks after this was reported, I just stayed in the area and then watched a cleaner unplug the main plug from a rack to plug in their vacuum, work for about 15 minutes, then plug it back in. This was the small rack that had the main firewall and router as they came into the building.
No idea who set it up originally or why all that wasn't in the server room or on a UPS; it predated us taking over their support.
5thhorseman_@reddit
Reality is unrealistic, I guess...
Random-Mutant@reddit
Server down?
Can you ping it? No?
Jump on the ILO/DRAC, what do you see? Nothing?
Go to the server room and look.
The fact the server stayed down and physically unchecked for all that time is where my suspension of disbelief checks out.
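For what it's worth, that checklist (ping, then the out-of-band controller, then feet) is easy to turn into a first-pass script. Below is a minimal sketch in Python; the hostname is a placeholder and the ping flags assume Linux, so treat it as an illustration of the idea rather than anything from OP's environment.

```python
# Minimal "can you even reach it?" first-pass check, as described above.
import subprocess
import sys

def is_reachable(host: str, count: int = 3, timeout_s: int = 2) -> bool:
    """Return True if the host answers ICMP echo requests."""
    # Uses the system ping binary; '-c' (count) and '-W' (per-reply timeout,
    # in seconds) are the usual Linux flags. Windows would need '-n' and '-w'.
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    # Placeholder hostname -- substitute the real SAP host or its iLO/DRAC address.
    host = sys.argv[1] if len(sys.argv) > 1 else "sap-prod.example.com"
    if is_reachable(host):
        print(f"{host} answers ping - the problem is probably further up the stack.")
    else:
        print(f"{host} is not answering at all - try the iLO/DRAC next,")
        print("and if that is dark too, someone needs to walk to the server room.")
```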
OldGeekWeirdo@reddit
Some people just don't have troubleshooting chops.
But I also wonder if the plug only ran part of the server, like perhaps the disk array. Server still talks, but very unhappy.
faithfulheresy@reddit
Nah, I believe it. So many people just jump past the simple stuff assuming it has to be a bigger problem.
Harry_Smutter@reddit
People calling this fake are dense as fuck. This happens FAR too often...We literally had contractors cut AN ENTIRE bundle of data and power cables because they're morons.
frac6969@reddit
I believe you because I’ve seen far worse.
rossumcapek@reddit
Step 0, always check the cables.
faithfulheresy@reddit
Anyone who actually believes that this story is "unbelievable" has never worked in a server environment. Idiots unplugging critical equipment is one of the most common causes of sudden, unexplainable outages. XD
So yeah, I believe this happened exactly as you've explained it.
hymie0@reddit
There's a very good reason why the first question we ask is "Is it plugged in?"
ryanlc@reddit
Because Roy taught us to ask...
K1yco@reddit
Definitely not 100% the same, but our office bought a robot vacuum. They neglected to account for the one desk in each row that houses all the network wires for the row.
The solution? They just gave each desk at least 2 zip ties. Not a proper cover or a barrier, nor did our internal tech support bother to require everything to look nice. Just two zip ties.
LadyA052@reddit
Cleaning lady needed to vacuum.
cruiserman_80@reddit
Don't know why this would get you banned unless other subs are enforcing word limits.
I've met so many people in this field that never even think about anything below layer 3 and never consider that a piece of hardware could fail or be powered down as part of their diagnosis.
Chocolate_Bourbon@reddit
I worked front-line support years ago for a debit card processor. Our clients were banks. Towards the end my client was our biggest and most complex. They bought every product we had. Their CEO and ours appeared on the cover of a trade magazine, holding hands while exuding leadership. They were a super big deal VIP.
And they knew that too. They were well aware of their status and, even before the deal, were convinced they were the most important bank in the world, although they weren't even the number one bank in their state in the US. In terms of our client list, by size I don't think they were even in the top 50. But they would routinely make absurd demands of us and my company would routinely acquiesce. A couple of times they refused a global change because they believed we hadn't tested it enough. But over time I managed to gain their trust and we settled into a rhythm.
Towards the end of my handling them we put in some new equipment to increase our processing capacity. These things were beasts. I remember hearing them characterized as CPUs, which seemed odd to my ears as supposedly each one cost over 1 million dollars. Regardless, going from 12 to 16 was going to be a huge upgrade.
I was told that the change would be seamless, as we would not touch the current infrastructure. We would install the CPUs, then bring them online gradually. It would be like adding more cashiers at a supermarket. So I pitched this change and the client agreed to let it happen.
Then our techs made the change very early Sunday morning in the US. And there was chaos right at the end of the change window. Parts of the environment went offline. Some authorizations failed. I got paged that morning. I called the client and explained what happened. I was told, and I assured them, that this was a short temporary outage with no lasting implications. They accepted that, grudgingly.
That Monday morning I gradually discovered that our automatic reconciliation process was failing. This was the process that matched one part of a transaction to another. Like when you pay for an Uber and the app asks if you want to add a tip: something like that can be, and often is, two separate transactions. The first part is the authorization to the bank for the ride, the second part is the completion which includes the tip. The two should match up for the books to balance.
So I had to explain to the client what happened. Fixing this required them to manually review and match the thousands of transactions that occurred during the outage. It took them over a week of people sitting in a room doing the mapping. (A lot of our clients had systems in place that would handle such an event, but they didn't.) They weren't as nonchalant this time. The CEO went ape shit. I got screamed at. He screamed at their account executive. I think he even called our CEO. They reverted to their old posture of being paranoid. (which they had some justification for, boo-boos like this were uncommon, but not as uncommon as they should have been.)
I heard a couple weeks later what had happened. During the upgrade the techs had a bunch of power cords strung around the server room. They were of course temporary and would be removed after everything was in place and basic housekeeping was done. Then one of the techs stumbled over one of the cords, unplugging it. Boom, outage.
Head_Razzmatazz7174@reddit
I worked in tech support. There were many times we got calls. "My computer won't turn on!"
Is it plugged in? Yes ... oh wait....
Risherenow44@reddit
Totally believable, I worked in IT for over 40 years and had all kinds of crazy shit happen.
BigWhiteDog@reddit
Seen things like this before so not a surprise at all.
quadralien@reddit
I believe you!
__wildwing__@reddit
If your back up isn’t tested, you don’t have a back up.