We’d like to shed more light on the responsiveness issues experienced by our redirection services on February 6th. First, we sincerely apologize that this incident occurred. The performance of our services over this period does not reflect our goal of 100% uptime, and we recognize that it affected our customers negatively. We hope the following information shows that we understand what happened and that we have made plans to reduce the likelihood of it happening again.
February 6, 2020 @ 20:01 MST (2020-02-07 @ 03:01 UTC)
February 6, 2020 @ 20:33 MST (2020-02-07 @ 03:33 UTC)
The redirection services on 220.127.116.11 and 18.104.22.168 responded to requests very slowly, and in some instances requests were dropped.
A distributed denial-of-service (DDoS) attack on a customer website.
Provision enough capacity to handle the full load of the attack while ensuring all traffic was processed within our typical response times.
The EasyRedir redirection services are hosted on AWS across multiple availability zones (AZ) within the US-West-2 region. There is an AWS Network Load Balancer (NLB) that has an interface in each AZ, and a fleet of EC2 instances in each AZ that actually processes the redirection requests. This architecture has proven to be highly reliable and easily scales to very high traffic levels.
On February 6, 2020, our monitoring tools alerted us to high load on our redirection servers. We immediately began an investigation and determined that the servers were receiving vastly more traffic than we typically process at any given time. At peak, our servers were processing 44x our typical traffic levels. It’s important to note that although our systems were under much heavier load than usual, we were still responding to this traffic within our typical response times.
Our systems have a variety of tools at their disposal to mitigate attacks from bad actors. Our AWS NLB has a variety of DDoS mitigation functions built into it (which typically operate at the IP or TCP layers of the network stack). Our redirection servers also have a variety of tools to handle this level of traffic: highly tuned Linux kernel parameters, iptables-based IP blocking, request and connection limits built into the web server configuration, and, crucially, a carefully constructed series of RAM-based caches that hold redirect configuration information.
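To make the idea of a request limit concrete, here is a minimal sketch of a token bucket, one common way to cap request rates. It is an illustration only: the class name, the parameters, and the choice of algorithm are assumptions made for this example and do not describe our actual web server configuration.

```python
class TokenBucket:
    """Hypothetical per-client request limiter (illustration only)."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.last = 0.0           # timestamp of the previous check

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        # Spend one token per request; reject when the bucket is empty.
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=1, capacity=3)
burst = [bucket.allow(0.0) for _ in range(4)]  # four requests at the same instant
later = bucket.allow(2.0)                      # one more request two seconds later
```

In this sketch the first three requests in the burst are allowed and the fourth is rejected; after two seconds the bucket has refilled enough to allow another request. Timestamps are passed in explicitly so the behavior is easy to reason about.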
The customer-visible impact began when we changed our web server configuration to block this traffic at an earlier point in our processing pipeline. This change required a reload of our server configurations. What we did not fully understand at the time was how much our caches were contributing to our low (and typical) response times. When each server configuration was reloaded, its cache was cleared. This had a knock-on effect throughout our processing pipeline: connections to backing cache servers had to be reestablished, and RAM caches had to be rebuilt. It was this action that started the customer-visible incident, as our systems struggled to respond to client requests in a timely manner.
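The effect of clearing a warm cache can be illustrated with a small sketch of a read-through cache. Everything here is hypothetical and simplified for illustration (the class, the `backing_lookup` function, and the hostnames are assumptions, not our production code), but it shows why a reload sends every lookup back to the slower backing store until the cache is rebuilt.

```python
class ReadThroughCache:
    """Hypothetical in-memory cache in front of a slower backing store."""

    def __init__(self, backing_lookup):
        self._lookup = backing_lookup
        self._store = {}
        self.misses = 0

    def get(self, key):
        if key in self._store:
            return self._store[key]   # fast path: served from RAM
        self.misses += 1              # slow path: go to the backing store
        value = self._lookup(key)
        self._store[key] = value
        return value

    def clear(self):
        # Stands in for what a configuration reload did to the cache.
        self._store.clear()


def backing_lookup(host):
    # Placeholder for a lookup against a backing cache server.
    return "https://example.com/" + host


def serve(cache, hosts):
    """Serve a batch of requests and return how many missed the cache."""
    before = cache.misses
    for host in hosts:
        cache.get(host)
    return cache.misses - before


hosts = ["site%d.example" % (i % 50) for i in range(1000)]
cache = ReadThroughCache(backing_lookup)

cold_misses = serve(cache, hosts)          # first pass fills the cache
warm_misses = serve(cache, hosts)          # steady state: no backing lookups
cache.clear()                              # simulated configuration reload
after_reload_misses = serve(cache, hosts)  # every distinct key misses again
```

In the warm state no request touches the backing store. Immediately after the simulated reload, every distinct key has to be fetched again, and under a heavy traffic load those slow lookups all happen at the same time.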
We immediately began to provision additional EC2 instances and added them to the NLB. Once this capacity started to come online, response times began to drop back toward normal levels. Normal response times and traffic processing capabilities were fully restored 32 minutes into the customer-visible incident. It’s important to note that during this time many requests were serviced successfully, albeit sometimes much more slowly than we typically process a request.
The redirection services were fully restored within 32 minutes of the start of the customer-visible event.
This failure was regrettable on both a corporate and personal level. The decision to initiate the actions that led to this incident was taken by our staff - this was not a failure of our architecture or technology. This has been felt personally by us, and we are sincerely sorry.
We have already taken a number of actions as a result of this incident, and plan to take many more in the days to come.