I was tasked with compacting an Exchange database. I backed up the database first, but I thought the backup had a different extension -- I compacted the production database, then deleted both the production copy and the backup with a del *.edb command. I stared at my monitor, just then realizing what I'd done. I tried to recover the files, but files over 2GB were stored in a way that made most off-the-shelf restoration software barf. We eventually came up with a hacky way to restore data from Outlook's cached data, but that was a huge pain -- oh well, the database was certainly compacted. We then found out the tape backup hadn't run for months.
I double-check destructive commands now, and I make sure backups are in place and actually working.
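The lesson can be sketched in a few lines of shell. All paths and names here are made up for illustration; the point is to back up to a location and extension a destructive wildcard can't reach, and to verify the copy before doing anything irreversible:

```shell
#!/bin/sh
# Hypothetical paths -- illustration only, not the original setup.
set -eu

db=/tmp/edb_demo/priv1.edb
backup=/tmp/edb_demo/backup/priv1.edb.bak   # different directory AND extension

mkdir -p "$(dirname "$db")" "$(dirname "$backup")"
printf 'fake database contents' > "$db"     # stand-in for the real database

cp "$db" "$backup"
# Refuse to proceed with anything destructive unless the copy verifies.
cmp -s "$db" "$backup" || { echo "backup verification failed" >&2; exit 1; }
echo "backup verified"

# A later "del *.edb" (or rm *.edb) in the working directory
# now can't take the backup down with it.
```

Verifying with cmp (or checksums) before the destructive step is the part that would have caught the wrong-extension assumption.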
We had a project where we had to convert a few hundred million data points from one format into another. To be fair, we designed the system to process them sort of "at random" instead of one customer at a time, and we knew we'd have some "downtime" during this processing at the beginning of the year (it ran through New Year's). Customer service got a ticket that "data is missing" and they panicked -- and I panicked and pulled the plug on the conversion.
Yeah, no, everything was fine: the system was in the middle of processing that customer's data, and that behavior was expected. We had talked about all of this beforehand, but the panic made us knee-jerk because support had CC'd the entire C-level. We were multiple days into the process and now had a corrupt DB whose only fix was to start over.
We stepped back for a bit, made the system more "intelligent" so it processed customers linearly and each customer's data was offline for a shorter period, reminded customer support what to expect, and took a few days to rerun the process (some of our tweaks made it significantly faster, too).
People ask me why I'm so calm about issues -- it's because panic makes us do dumb things.
Another time I was in a server room. We had a chest-height APC UPS unit that was hard-wired to the wall, and our Director of IT at the time was really adamant about moving it around for some reason (I forget why). I noticed the rear panel was loose and the giant power feed didn't look too safe, so I told him, "Yeah, I wouldn't push on that, this looks really flaky." He then insisted on shoving it against the wall pretty hard. Power went out in the racks, killing all of our servers, and as the room went silent he just looked at me.
"Pull it away from the wall and don't touch it again"
Power came back on, and he was bloody lucky he didn't electrocute himself, but it was a pain to get everything back online.
We were getting alerts nearly daily from our Nagios system that a server was over 90% full, and I watched it slowly climb. I kept forwarding these e-mails to the IT department (I was technically not in charge of this) and asking them to please remedy the situation -- expand the VM disk during downtime, clear some files, spit on it, I don't care, do something -- but it fell on deaf ears for weeks.
Last day I'm working there (I'd gotten another job; this type of crap is why), I get awoken by a panicked phone call: our main file server has filled up, all services on it have stopped, and everything is offline.
"Okay, power it off, expand the virtual disk, bring it back online, extend the filesystem, restart all services, and we should be back up -- like a 5-minute process. I'll head over and be there shortly."
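For a Linux guest, those steps would look roughly like the fragment below. This is a sketch, not runnable as-is: device names, filesystem type, and service names are all illustrative, and the actual disk growth happens in the hypervisor first.

```shell
# After growing the virtual disk in the hypervisor (VM powered off,
# since hot-expand wasn't available for disks that size):
growpart /dev/sdb 1              # grow partition 1 into the new space (cloud-utils)
resize2fs /dev/sdb1              # grow an ext4 filesystem to fill the partition
systemctl start smbd nfs-server  # restart whatever file services died (names vary)
```

The whole thing is minutes of downtime, which is what makes the full restore from backup so maddening.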
I live 5 minutes from work (a blessing and a curse).
I get in, things are still offline, and I ask one of the system admins what happened:
"I decided to just restore from backup"
Bruh, that's a multi-terabyte system (which is why we couldn't hot-expand the disks -- VMware didn't allow it for disks that large at the time)... in lieu of a few minutes of downtime, you started a 4+ hour full system restore from backup.
I walk over to the operations head and tell them, "Yeah, our IT department failed to follow directions again, so your downtime just got extended by another 4 hours."
I had a talk with the Director of IT at the time, who asked, "Is this over that machine you kept sending us alerts for?" I replied "Yes," and he was just like, "Well, okay then."
I honestly could have just logged in and done it myself -- it would have taken a few minutes -- but I wanted the IT department to handle things on their own. Ultimately, though, I still could have prevented it. They also complained constantly that they needed an expensive monitoring system because Nagios "didn't cut it" -- it worked for me for years.
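For what it's worth, the disk alerting that supposedly "didn't cut it" is a one-liner with the standard check_disk plugin. The thresholds, mount point, and plugin path below are illustrative, not from the original setup:

```shell
# Warn when a partition drops below 10% free, go critical below 5%.
# Plugin path varies by distro; this is the common Debian location.
/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -p /srv
```

Wire that into a service check and you get exactly the daily ">90% full" e-mails from the story -- the alerts worked; acting on them was the missing piece.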
Okay, like 90% of this kind of stuff is why I have my own team these days. It's hard to put up with a crap team you didn't build that's just going to do a crap job.