Tech Horror Stories: When things go wrong

ryanmaynard

Administrator
Staff member
It is a rite of passage to take prod offline or wipe a db. Tell us your best worst stories.
 
I was tasked with compacting an Exchange database. I backed up the database, but thought the backup had a different extension -- compacted the production database, then deleted both the production copy and the backup with a `del *.edb` command. I just stared at my monitor, realizing what I'd done. We came up with a hacky way to restore data from Outlook's cached data, but that was a huge pain -- I had tried to recover the deleted files, but >2GB files were stored in a way that made most off-the-shelf recovery software barf. Oh well, the database was compacted. Then we find out their tape backup hadn't run for months.

I double-check destructive operations now, and make sure backups are in place and actually working.



We had a project where we had to process a few hundred million data points from one format into another. To be fair, we designed the system to do it sort of "at random" instead of one customer at a time, and we knew we'd have some "downtime" during this processing window at the beginning of the year (it was running through New Year's). Customer service gets a ticket that "data is missing" and they panic -- and I panic, and pull the plug on the conversion.

Yeah, no, everything was fine: the system was in the middle of processing that customer's data, and that behavior was expected. We had talked about this, but the panic made us knee-jerk because support had CC'd all of the C-level. We were multiple days into the process and now had a corrupt DB, with the only option being to start over.

We stepped back for a bit, made the system more "intelligent" so it processed customers linearly and each customer's data was offline for a shorter period, reminded customer support what would happen, and took a few days to rerun the process (some of our tweaks made it significantly faster too).


People ask me why I'm so calm about issues -- it's because panic makes us do dumb things.



Another time I was in a server room. We had a chest-height APC UPS unit that was hard-wired to the wall, and our Director of IT at the time was really adamant about moving it around for some reason (I forget why). I noticed the rear panel was loose and the giant power feed didn't look too safe, so I told him, "Yeah, I wouldn't push on that, this looks really flaky" -- at which point he insisted on pushing it against the wall pretty hard. Power went out in the racks, killing all of our servers; the room goes silent and he just looks at me.

"Pull it away from the wall and don't touch it again"

Power back on, bloody lucky he didn't electrocute himself, but that was a pain to get everything back online.



We were getting alerts nearly daily from our Nagios system that a server was >90% full, and I watched it slowly climb. I kept forwarding those e-mails to the IT department (I was technically not in charge of this) and telling them to please remedy the situation (expand the VM disk during downtime, clear some files, spit on it, I don't care, do something), but it fell on deaf ears for weeks.

On my last day working there (I'd gotten another job; this type of crap is why), I get woken by a panicked phone call: our main file server has filled up, all services on it have stopped, and everything is offline.

"Okay, power it off, expand the disk, bring it back online, extend the disk, restart all services, we should be back online, like 5 minute process, I'll head over to work and be there"

I work 5 minutes from work (a blessing and a curse)

I get in, things are still offline, so I ask one of the sysadmins what's going on:

"I decided to just restore from backup"

Bruh, that's a multi-terabyte system (it's why we couldn't hot expand the disks, VMware didn't allow it for disks that large at the time)... in lieu of a few minutes of downtime, you started a 4+ hour full system restore from backup.

I walk over to the operations head and tell them, "Yeah, our IT department failed to follow directions again, so your downtime just got extended by another 4 hours."

I had a talk with the Director of IT at the time, who asked "is this over that machine you kept sending us alerts for?" I replied with "Yes" and he was just like "well, okay then"

I honestly could have just logged in and done it myself -- it would have taken a few minutes -- but I wanted the IT department to handle things on their own. Ultimately, though, I still could have prevented it. They also complained constantly that they needed an expensive monitoring system and that Nagios "didn't cut it" -- it had worked fine for me for years.

OK, like 90% of this kind of stuff is why I have my own team these days. It's hard to put up with a crap team you didn't build that's just going to do a crap job.
 
Oh I got another one:

I was handed a code base and told "this is the production code on our web server" -- cool, I made a change per my boss's request. We shipped it. I roll in the next morning:

"Will, our site is showing our wholesale pricing publicly to all of the internet, what did you do?"

"NOTHING!"

The change I made was completely unrelated to pricing; it was on the same page, but it was a minor adjustment! We slam the backup we have from the server back into place.

...

I run a diff between what was handed to me as "production" code and what was actually on the server -- they don't match. The code I was handed had a bug that always showed wholesale pricing, and we can't even determine what that codebase was supposed to represent. We shelve it, copy the code from the server, and treat that as the master copy.

I wrote up a really nice post-mortem for the C-levels and they loved it, though @PatriotBob was mad that I basically broke their site and got praised for it after.


If someone isn't handing me a repo, I pretty much 100% don't trust that code. Companies that don't store their code adequately rarely have the correct "production" code.
 
One time I deleted the firewall rule that allowed incoming traffic to the main prod app. Turns out I was logged into the wrong AWS data center.
 
I'm sure I've done something stupid that's messed stuff up in production, but I'm struggling to think of a time right now. The one I recall vividly wasn't even a big deal, but:

I was at my first job and had just been upgraded to a different machine. I installed Ubuntu on it, got my dev environment set up, and got connected to the dev database we all shared, the production machine, everything.

I worked through the morning, then at lunch I ate and was like "ok, time to get this machine properly configured the way I want now." I wanted to switch off the Unity desktop to some tiling window manager; I don't remember which one I was using at the time, probably `dwm`. So I tabbed over to my terminal emulator and typed:
Code:
sudo apt-get install dwm
Completed the install, set up a `.xinitrc`, and logged out of my X session. But running `startx` took me back to Unity. So I checked my `.xinitrc` again and it wasn't there. Weird... and then I realized I'd installed all of that on the production server. Went to the sysadmin, hat in hand, and he just laughed at me, said he'd fix it, and told me to pay attention to which terminal I'm typing in next time. I was very embarrassed.
 
I once broke the front page of my company's e-commerce website for all guest users. I'd called `request()->user()->id` without checking to see if `request()->user()` actually existed. It was fixed within a few minutes, but yeesh, not fun.
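The fix boiled down to guarding for a missing user -- roughly like this, just a sketch in Laravel-flavored PHP, not the actual code (for guest visitors there's no authenticated user, so `user()` returns null):
Code:
// guests aren't logged in, so user() comes back null and ->id blows up
$user = request()->user();
$userId = $user ? $user->id : null;   // or request()->user()?->id on PHP 8+

if ($userId !== null) {
    // logged-in behavior
} else {
    // guest behavior -- don't assume an id exists
}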

The bigger horror story was how many poor dev practices there were, leading to situations like that -- zero automated tests, no issue tracking software. And I didn't realize those were issues because I was very new at the time :(
 
... I've been fighting for hours to get this new Docker container to run correctly. I'm digging into a bunch of bullshit Java WildFly documentation and obscure issues, modifying configuration files that should be fine as-is, completely stumped, before I take a closer look at the documentation:

[screenshot of the documentation]


Hey, you should uh.... add that "-c standalone-apelondts.xml" onto the damn default docker image startup command


Everything works. 😖
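For anyone else who runs into this, the actual change is tiny -- something like the following, where the image name is a placeholder and the paths assume the stock WildFly image layout (yours may differ):
Code:
# override the image's default startup command so WildFly loads the right config
docker run -p 8080:8080 my-wildfly-image \
  /opt/jboss/wildfly/bin/standalone.sh -b 0.0.0.0 -c standalone-apelondts.xml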


Not a huge horror, but I'd been putting off "fixing" this because it's such a pain, and the real fight was me not reading the documentation closely enough.
 