Black Friday/Cyber Monday Horror Stories

StrangeWill

Administrator
Staff member
So, a thread to tell your woes of BFCM -- the days all of our e-commerce sites (and related services) get kicked in the teeth, day after day.

So mine actually started on the 27th. We get a massive traffic spike after noon, the system is holding fine, some events are backing up in RabbitMQ, but that's its job, it'll be fine.

Then RabbitMQ's hard drive fills up and all hell breaks loose. Yeah, we shoved 64GB of overflow messages into the thing in under 2 hours -- annoying, but an easy fix. Once that was done, everything has been relatively happy.
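
For anyone wondering what the guardrail looks like: RabbitMQ lets you cap how much a single queue can buffer so a runaway backlog can't eat the broker's disk. A minimal sketch with Python/pika -- the queue name and values are illustrative, not what we actually run:

```python
import pika

# Hypothetical guardrail: cap how many bytes one queue may buffer so an
# overflow backlog can't fill the broker's disk. "reject-publish" pushes
# back on publishers (they get a nack when using confirms) instead of
# silently dropping messages from the head of the queue.
conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = conn.channel()
channel.queue_declare(
    queue="order.events",                    # made-up queue name
    durable=True,
    arguments={
        "x-max-length-bytes": 32 * 1024**3,  # ~32GB ceiling on queued bytes
        "x-overflow": "reject-publish",
    },
)
conn.close()
```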

At one point we were at 36 million items in RabbitMQ ready for pickup (I only caught the screenshot at 30M)
[screenshot: RabbitMQ queue graph, caught at roughly 30M messages ready]

Some of our jobs were doing this bullshit:
[screenshot: job processing graph]

I had been fine-tuning our workers to try to get them processing consistently; I've pushed 17 releases to production since Wednesday 🫠. It's hard to explain that the queue can accept a bajillion messages, but that doesn't matter if the consumers are bottlenecked by a not-so-great query.
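
Back-of-envelope with made-up numbers (not our actual figures): if every message kicks off a ~50 ms query and you run 20 consumers, your ceiling is about 20 / 0.05 ≈ 400 messages/s no matter how fast the broker can deliver -- and at that rate a 36M backlog takes roughly 25 hours just to drain.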

Meanwhile the CEO was showing off how many people were having meltdowns about other platforms blowing up. 😂 Funnily enough, ours blew up the day prior, so we got ahead of all the traffic, and it was a quick enough fix that no one actually complained.
 
17 releases during BFCM is intense!

Never seen queue volumes spike quite like that before. Was there one deployment that cracked it, or did you end up tuning things gradually?

Curious if you landed on queue prefetch adjustments or if you had to break out the bigger guns like DLX or shovel.
 
17 releases during BFCM is intense!
Yeah, but it's mostly tiny tweaks: fiddle with a line or two, push, watch.

Never seen queue volumes spike quite like that before. Was there one deployment that cracked it, or did you end up tuning things gradually?
Not really. Part of it is just a "we're taking in more volume than the current DB design can handle" situation and the queue backing up, which is fine, it's designed to be a buffer. The changes I made let it process about 3-4x faster -- which still isn't enough to keep up with peak, but was enough to make the backlog come down pretty rapidly after peak.

Curious if you landed on queue prefetch adjustments or if you had to break out the bigger guns like DLX or shovel.

Funnily enough, most tweaks had little to do with RabbitMQ itself. We moved one job away from batch consumers because the job was basically "bring in the batch, process each item in a loop", which doesn't take much advantage of batching and actually reduces parallelism. Later we want to make the batch process more effectively (allow the entire batch to do a "plan" in a single query, then execute), but it was short notice to try to figure that out -- easier to just move to parallel items.
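
Very roughly, the shape of that change, sketched in Python/pika (our actual consumer, queue name, and job are different):

```python
import pika

def handle_item(body: bytes) -> None:
    # The per-item work -- the not-so-great query lives in here. In the old
    # batch consumer this ran in a loop over the whole batch on one worker.
    ...

def on_message(ch, method, properties, body):
    handle_item(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = conn.channel()

# Take messages one at a time with a modest prefetch so many consumer
# processes can chew on items in parallel, instead of one worker
# serializing an entire batch.
channel.basic_qos(prefetch_count=50)   # illustrative value
channel.basic_consume(queue="order.events", on_message_callback=on_message)
channel.start_consuming()
```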

Upped the default minimum thread count in the thread pool to try to "kickstart" things so they start/ramp up quicker.
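
A hypothetical sketch of the same idea in Python (our stack and numbers differ): Python's ThreadPoolExecutor has no minimum-threads knob, so you raise the floor by parking one blocking warm-up task per worker behind a barrier at startup.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MIN_WORKERS = 32  # made-up floor, not our real number

pool = ThreadPoolExecutor(max_workers=MIN_WORKERS)

# ThreadPoolExecutor spawns threads lazily, so right after a deploy the pool
# ramps up while traffic is already pouring in. Submitting one blocking
# warm-up task per worker forces all MIN_WORKERS threads into existence
# before the real jobs arrive; the barrier releases once they all exist.
barrier = threading.Barrier(MIN_WORKERS)
for fut in [pool.submit(barrier.wait) for _ in range(MIN_WORKERS)]:
    fut.result()
```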

Little query optimizations that I could make.
 
That's awesome! Thanks for answering. I'm not ultra experienced there and it's cool to hear/see what it's capable of (far beyond what I've ever needed to build).
 
That's awesome! Thanks for answering. I'm not ultra experienced there and it's cool to hear/see what it's capable of (far beyond what I've ever needed to build).

Oh man, I mean, RabbitMQ is pretty powerful. It gives you a lot more options vs. the more high-throughput "pipe A to B as fast as humanly possible" platforms like Kafka (which I hate to even mention these days because I have people that keep coming back to Kafka for message queuing when it's an event streaming system -- yes, they act similar, but they're different; yes, Kafka is a firehose of immense flow, but at the end of the day it also doesn't tackle certain problems well because of it).

I think @PatriotBob will have a stroke if he hears Kafka come up again :nervouslaughter:
 