What have been your costliest admin mistakes?
Posted by msic@reddit | linuxadmin | 64 comments
For me it would be not actually recording credentials and then needing them later. Might remember them eventually, but there is no excuse not to put them somewhere they can be retrieved, hehe.
On the hardware side, assuming all modular PSU cables were interchangeable (they are not).
tekbill@reddit
Probably suggesting and implementing SolarWinds for NASA then finding the Russians hacked the code base - boy was my face red
ogn3rd@reddit
Solarwinds123 lmao. They were never serious.
Competitive-Sky-9541@reddit
I screwed up the config of the front-facing HTTP proxy at a SaaS company managing lots of IoT devices: one customer got every customer's devices on his dashboard, and could control them.
Line-Noise@reddit
Potentially costliest:
I was working for Weta Digital on The Lord Of The Rings.
We upgraded our internet connection, which required installing a new router. Forgot to transfer over the firewall rules that were supposed to block SSH into the FTP server in our DMZ.
There was an old vulnerable version of SSH on there that got popped almost immediately.
Luckily I had Tripwire running on there. I saw the notification email the next morning and went straight into the server room and yanked the network cable from the box.
We did some analysis and determined that the hacker didn't realise what they had found and was just using it as a jump box to try hacking other things.
The reason why it was potentially costly? This server had test renders of Gollum and a bunch of other stuff that we regularly sent to New Line Cinema. If they had leaked to the Internet we would have been screwed.
butrosbutrosfunky@reddit
That's hilarious, your box containing hundreds of millions in cinema IP gets rooted by some loser who didn't even use it to send spam
punklinux@reddit
Mishandled some git commands and did a rebase on the master repo (which, ultimately, I had no business even touching). Undid about a week's worth of updates for about 5 developers. Did not realize I had done this until some developers, who were always complaining otherwise, started complaining. One of the developers immediately started blaming another developer for sabotaging his code intentionally. That other developer ended up going to his desk, and threatened to take him outside and beat the shit out of him for the accusation. A manager separated them. This created a huge drama storm, and eventually, my manager asked in a meeting if anyone "rolled back" a week of changes, but I wasn't in that meeting because I was dealing with an unrelated issue in the data center.
Eventually, the sysadmin team was discussing the drama, and I realized it was me. So I went to my boss, and he was NOT pleased, because he thought I had done it without authorization and then tried to hide it. I asked, "If I tried to hide it, why did I come to you?" and he didn't have an answer for that. In the end, I was not called out on it and we were able to get some of the code back from restores. But things with that boss had soured, and eventually I left and got a new job because I always felt like nobody trusted me after that.
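For anyone who lands in the same spot: an accidentally rewritten branch can usually be pulled back out of the reflog on the clone where the rebase happened, since the old commits aren't gone yet. A minimal sketch, not the recovery path described above (which went through restores); the branch names are illustrative:
git reflog master                    # find the tip from just before the rebase
git branch rescue master@{1}         # park it on a temporary branch (assumes the rebase
                                     # was the last thing that moved master)
git checkout master
git reset --hard rescue              # put master back where it was
git push --force-with-lease origin master   # restore the shared branch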
butrosbutrosfunky@reddit
Gotta say, you could have just shut the fuck up about that one
evild4ve@reddit
I assumed a mission-critical service installed on a thin client server supported thin clients, and that the previous admin hadn't just got lucky with the version of it they had installed.
butrosbutrosfunky@reddit
Jesus christ
arkham1010@reddit
I made a mistake once a number of years ago that was pretty big.
As in, it was on the front page of cnn.com big. No, I'm not going to tell you what it was exactly or whom it was for. Yes, I actually did keep my job because I immediately informed my boss of what happened, but my mistake caused a cascade of other issues that no one realized were a problem, and while fixing my mistake took about 5 minutes, the cascade lasted days with other teams making massively larger fuckups than mine was.
The only thing I'll say is 'It involved DNS'.
butrosbutrosfunky@reddit
You wanna fuck up a company, DNS is a fine tool for that end. However, if you wanna fuck up an ISP or an entire nation-state, buddy, that's gonna require BGP
arkham1010@reddit
Well, here's ONE fun thing we found after my (really) minor screw-up. (I had transposed two numbers: instead of, say, 192.168.14.28 I had 192.167.41.28.)
HOWEVER, various application teams were hard-coding IP addresses into their applications. Not scripts like Python, I'm talking C++ code that needed to be recompiled. Then there were issues with compiler versions and... yeah, it became a giant shitshow.
xouba@reddit
It's always DNS.
butrosbutrosfunky@reddit
Except when it's BGP, then you have entire ISPs and nation-states going dark because some guy fucked up some Cisco updates
romprod@reddit
GoDaddy would be my guess if it wasn't Facebook
AmSoDoneWithThisShit@reddit
I remember one similar... My wife was at the gym and she saw the company on the morning news with the headline "Massive System Outage"
When she got home she made me a cup of coffee, came up to the bedroom and set it on the nightstand, and gently woke me saying "Honey....you're going to have a REALLY shitty day today."
She wasn't wrong, I woke up and didn't sleep again for 72 hours. (Apparently, I'd slept through the cellphone call at 3am.)
Wasn't me though...
dustinduse@reddit
Been there done that. 58 hours straight after a coworker forgot to apply a critical security patch, and about another week worth of late nights before everything was running again.
Line-Noise@reddit
Hey everyone! We found the person that worked for CrowdStrike!
arkham1010@reddit
LOL!
Linux admin, now Wintel!
Carribean-Diver@reddit
It's always DNS. Or certificates.
lariojaalta890@reddit
FB?
arkham1010@reddit
No
lariojaalta890@reddit
I figured it was worth a shot. Do you remember the one from a few years ago?
AmSoDoneWithThisShit@reddit
I absolutely remember that... It's why I'm glad I'm a storage guy and not a network guy. ;-)
renaissance_man__@reddit
That was BGP iirc
lariojaalta890@reddit
It was the root issue; it made the DNS servers unreachable.
StaticDet5@reddit
Hahahha lol.
The number of times that I've heard "I screwed up, it's a big deal and people seem cool about it... It involved DNS"... I feel like this is a thing
arkham1010@reddit
Let's just say that me performing the documented backout plan was fine.
Someone panicking and failing over the AD system, without asking and without it being tested, wasn't.
linuxunix@reddit
This was at a major financial institution. The corporate office wanted each conference room to have a PDA for reserving the space, to make better use of the rooms. They were just tablets running Linux. The interesting part: when they first powered up, the hostname was supposed to be unset. In reality, the hostname file contained the name 'NULL'.
So the fuckup is: when a device was plugged in, the DHCP server asked for its hostname, the device replied NULL, and the domain controller registered the name null.bank.com (fake, to protect the reputation), which got interpreted as .bank.com, or simply bank.com... So internal traffic from 100 offices across 40 countries was redirected to this conference room device. We were just lucky that it was only 'internal' traffic, and not actual internet.
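If you ever need to rule out the same failure mode, here's a minimal sketch of what to check on the device and in DNS, assuming an ISC dhclient setup; the hostname and domain below are illustrative:
hostnamectl set-hostname meeting-room-04    # give the device a real hostname before it joins the network
grep host-name /etc/dhcp/dhclient.conf      # Debian's stock config sends whatever the hostname is:
                                            #   send host-name = gethostname();
dig +short meeting-room-04.bank.com         # confirm what actually got registered in DNS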
butrosbutrosfunky@reddit
haha kickass
Hxcmetal724@reddit
rm -f .* does NOT delete the hidden files
butrosbutrosfunky@reddit
Folks still getting fucked by rm -f in this the year of our lord 2025, haha it's good some shit never gets old
Parker_Hemphill@reddit
You can pass “-i” for it to prompt you when doing possible risky operations. I have that aliased for my users. It’s easy to override in your own .bash_alias or simply pass “command rm” when you don’t want to be prompted.
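A minimal sketch of that setup; ~/.bash_aliases is just the conventional spot, use whatever file your shell sources:
# In a file sourced by your shell, e.g. ~/.bash_aliases:
alias rm='rm -i'          # prompt before every removal

# When you really do want the non-interactive behaviour for a single invocation:
command rm -rf ./build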
Hotshot55@reddit
You can just throw a backslash in front of the command to ignore the alias, e.g.
\rm
.Anticept@reddit
Ok this one is good
Amylnitrit3@reddit
One morning I arrived at work and the boss was waiting for me in the parking lot. He asked me if he should send everyone home. I then trotted down to the IT basement, where my colleague was sweating and running around in despair. He had activated our latest software version last night (he likes to get up early) and had forgotten to apply the version migration code beforehand. This meant the database was now empty. And he hadn't made a backup beforehand either.
korsten123@reddit
Probably deleting mysql database files of a primary database server of a production system
dodexahedron@reddit
One of those fuckups where you know you fucked up before your finger hits the key, but it's too late to stop the motion?
doubled112@reddit
Tab completion is not always your friend.
xouba@reddit
Or Control-R: you type 5 characters of the command and bash shows you the right one; you type the 6th character and bash decides you want another, wrong command; but by the time you realize it, your finger is already on a collision course with the Enter key and it's too late.
dodexahedron@reddit
God damn muscle memory. 😩😅
That one bites me at least once a week, though never on anything high-stakes, since change controls are copy-paste jobs from a list of commands that's been scrutinized to hell and back, for that reason among plenty of others.
doubled112@reddit
Relatable. I may have muscle memoried my way through fdisk on the production Nagios server one time. The units were not the units I expected but it was too late.
What are you doing? Testing backup restores for audit...
dodexahedron@reddit
Just say you were testing an in-house equivalent of Chaos Monkey!
Good on you for "ensuring your business continuity plan was viable." 😅
doubled112@reddit
Don't look at me, I was just summarizing (stealing) my director's joke.
He was always really chill about this sort of thing.
whamra@reddit
A customer had problems. Somehow I had logged into the wrong customer's data. While discussing the problem with the customer, he showed zero indication that we were not discussing the same issue or that I was seeing different data. I even included screenshots of the log files. He told me to just wipe his account clean and start over. I happily obliged.
I deleted another customer's data, and it took me 10 minutes to figure it out.
That was two years ago, and to this day I double- and triple-check, then cross-check, every ID, email address, and IP address when performing such tasks.
ShoneBoyd@reddit
Shouldn’t this go through a CAB first? We can acknowledge the request and keep them information of their request progress, but the final decision has to be done from the department manager via written email.
jgo3@reddit
Attempting to upgrade the C libraries on a production web server while it was in production.
ClumsyAdmin@reddit
Killed a production DB for an application that had 30k+ people working in it. My boss was watching as it happened, said something like "Oh well, it happens a couple of times a year". No consequences at all somehow.
Shot-Document-2904@reddit
Enabling “show last logon properties” on clients before applying it to the domain controllers. Prevented hundreds of engineers on a multi-million-dollar program from logging in.
HTDutchy_NL@reddit
Production went TITSUP for 24 hours due to 3TB RDS/MariaDB shenanigans while I was on vacation. Forgot to check the IOPS limits before going away. Of course it was at the limit and couldn't scale up in size before raising IOPS. And note that restarting this thing took 4 hours to get back to usual performance.
Well, this time I had to restart it to raise the IOPS limit, followed by 20 hours of waiting for some background IOPS optimization BS before we could get more storage space and get back up.
It wasn't a fun phone call, but I took responsibility.
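For the curious, a minimal sketch of checking and raising provisioned IOPS on RDS with the AWS CLI; the instance identifier and the numbers are illustrative:
aws rds describe-db-instances --db-instance-identifier prod-mariadb \
    --query 'DBInstances[0].{Storage:AllocatedStorage,Iops:Iops,Status:DBInstanceStatus}'

aws rds modify-db-instance --db-instance-identifier prod-mariadb \
    --iops 12000 --apply-immediately
# The instance then sits in "storage-optimization" for a while, and further
# storage changes have to wait until that finishes.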
AmSoDoneWithThisShit@reddit
Something along the lines of:
tar -cv /dev/rmt0 --remove-files archive.tar . /
Note the space between . and /
It was on the 3rd tape before I realized what I'd done....
This was the same day we found out our Brand New Veritas NetBackup system wasn't worth a shit...
Yanked the system out from under 5 running marketing databases on a Sun UE10k...
amazingly I didn't get fired. Heckled, Poked, Prodded, made endless fun of...yes...but not fired.
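For anyone puzzling over why that space mattered, a hedged reconstruction (GNU tar assumed; only the tape device comes from the comment above, the rest is illustrative):
# What was presumably intended: archive the current directory to tape, pruning as it goes
tar -cvf /dev/rmt0 --remove-files ./

# With the stray space, "/" becomes a second path operand, so tar also walks the whole
# root filesystem and, because of --remove-files, deletes each file it archives
tar -cvf /dev/rmt0 --remove-files . /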
bohoky@reddit
I forgot to mount a scratch monkey
dodexahedron@reddit
OK, I misread this one as "coolest admin mistakes," and couldn't open the post fast enough to see what crazy responses there were to that.
Then I was disappointed that it was a more normal question. 😔
meagainpansy@reddit
The Executive Director was giving me the stink eye for not keeping our monitoring software on the latest version like their marketing machine told her I should. I told her I like to stay a few major versions behind because the software she insisted on was a piece of shit.
Guess who saved a large org of like 50k users from the SolarWinds hack. Not Miss Bleeding Edge over there. *smugly polishes fingernails on shirt*
Being lazy's pretty cool.
dodexahedron@reddit
Was about to ask if it was a product that rhymes with Butts Cup Mold. But that'll do, as well. 😆
meagainpansy@reddit
I was a minor version behind the hacked one lol. I really just lucked out there. But SolarWinds was already a mess before any of that happened.
msic@reddit (OP)
Perhaps we should start another thread on that, lol
dodexahedron@reddit
I almost did. Then something else shiny appeared. C'est la ADHD. 😅
Cherveny2@reddit
Along the lines of recording credentials: not documenting a process you do very rarely but that is critical when it happens, be it how to configure some odd bit of software, or how to reinstall some critical software if, for some reason, you need to move it to another host.
Not verifying that backups actually ARE taking place, and that the data they contain is VALID.
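A minimal sketch of the kind of check that catches both problems, assuming nightly tarballs land in /backups; every path and filename here is illustrative:
#!/bin/sh
latest=$(ls -1t /backups/*.tar.gz 2>/dev/null | head -n 1)
[ -n "$latest" ] || { echo "ALERT: no backups found at all"; exit 1; }

# The archive should at least be readable end to end
tar -tzf "$latest" > /dev/null || { echo "ALERT: $latest is corrupt"; exit 1; }

# Spot-check real content, not just the listing: restore one known file and compare
mkdir -p /tmp/restore-test
tar -xzf "$latest" -C /tmp/restore-test etc/fstab
cmp -s /tmp/restore-test/etc/fstab /etc/fstab || echo "WARN: restored etc/fstab differs from the live copy"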
Caduceus1515@reddit
Once took out an entire subnet of production servers at a major financial institution when a typo and a badly-timed network pause resulted in my hitting return a few times to "wake" the connection, but the input was getting through to the other side and inadvertently set a device to the IP of the gateway...
Hrafna55@reddit
Formatted a physical server that was still in use. That was on me. The fact it had no backups wasn't my fault.
This was long ago. Communication error more than anything. I should have asked for clarification and my manager should have directed me better in the first place.
fubes2000@reddit
I forgot that using AWS Client VPN was supposed to be a temporary solution, and only realized it had been costing us $700-800/mo [multiple gateways, multiple users] after a couple of years.
mylinuxguy@reddit
Not costly.. but semi-painful....
1) I didn't like vi or know how to use it very well. I did a !q to quit editing the /etc/passwd file and saved an empty /etc/passwd file to disk. Managed to restore from backup fairly quickly, but that was not fun.
2) I fired up a 'devbox' dhcp server to test out some VM auto install scripts and took out production for about 1/2 of the building since I was on the production lan and passing out private ip addresses and routes that didn't work for the production users. Took IT a bit to track me down and have me turn off my DHCP server.
Now, my ex-wife lost $156 million when she worked at Mobil Oil. It wasn't lost... just misplaced for a day or so. That's a fun memory. ;)