I’ve been a little out of touch with this blog in the last month or so. Ever since Thanksgiving things have been crazy, especially at work with the busy season.
Over the last year we have worked hard to improve our stability and availability by adding redundancy and removing single points of failure. This happened on many levels. At the networking layer we introduced an HA firewall pair and an HA load balancer pair. We also built out our server infrastructure by putting 3 web servers behind the load balancers and clustering both our database hardware and our application server hardware. All of this was meant to let us easily handle the load of the retail busy season, between Thanksgiving and New Year’s weekend. To really know how much we could handle, we wanted to load test the infrastructure top to bottom.
The point of the load test was to prove the scaled design, as well as give us some concrete numbers that we could publish to prospective customers. We had originally gotten a quote for a 1600 user load test from a professional services firm through our hosting provider.
This load test would run for an hour and would ramp up the users in increments of 1/10 of the total every 6 minutes. Originally we thought this would be OK, but after some discussion we were not confident that it would be enough load to show us the tipping point in our systems. We wanted to push the system to the breaking point so that we would know just how much it could take, not just when it would slow down. We decided to increase the concurrent users from 1600 to 10,000. We got the contract all squared away and were about to sign it when we got a little bit of a surprise.
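As a quick aside, here is a rough sketch of what that ramp looks like in numbers. It assumes the first increment starts at minute 0 and that full load is reached at the start of the last 6-minute window; the quote didn’t spell that out, and the function name `ramp_schedule` is just something made up for illustration.

```python
def ramp_schedule(total_users, steps=10, step_minutes=6):
    """Return (start_minute, active_users) pairs for a stepped ramp.

    Assumption: users are added in equal increments, the first
    increment starts at minute 0, and the final step holds full load.
    """
    per_step = total_users // steps
    return [(i * step_minutes, per_step * (i + 1)) for i in range(steps)]


# Compare the originally quoted test with the larger one we settled on.
for total in (1600, 10000):
    print(f"{total} user test:")
    for minute, users in ramp_schedule(total):
        print(f"  minute {minute:2d}: {users} concurrent users")
```

Under those assumptions the 1600 user test adds 160 users per step, while the 10,000 user test adds 1000 per step.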
We had recently brought a large national merchant online and things were going really well. We immediately noticed the new volume, and were keeping a close eye on things. We use Google Analytics on a good part of our system so it was easy to tell when this new volume hit. The new ‘Real Time’ reports in Analytics are pretty slick and give you a great view of what’s going on right now. We had been hovering between 100 and 200 concurrent users for most of the day.
One morning I was heading to the kitchen and glanced at the reporting screen on the wall. We had gone from ~150 concurrent users to ~1000 in about 5 minutes. At first I didn’t believe it. How could it be right?! It turned out that it was not an error, and it kept climbing, FAST. We ended up with a peak of about 1800 concurrent users and hovered between 1000 and 1400 for several hours. The new merchant had posted a promotion to Facebook, and that was driving all the increased volume to our system.
The following week the same merchant sent out a huge SMS text message blast that once again slammed us with volume. We’re told the blast went to the same number of people, but this time our volume was much higher. In about the same amount of time we peaked at ~4000 concurrent users! For the bulk of the day we were between 2000 and 3000 users.
We were keeping a close eye on things the entire time. In the end we saw a slight increase in utilization, but things didn’t come crashing down on us, and there were no complaints of unresponsiveness or even slowness.
With this new data in hand we decided to hold off on the load test. In the end it saved us a ton of money, plus the time it would have taken to get it set up just right. We now have some pretty good figures to give to our sales team, and some pretty good peace of mind that we are doing some things right!