How To Build Battle-Tested Websites
It doesn't matter whether your e-commerce D-Day is Black Friday, tax day, or some random Thursday when a post goes viral. Your websites need to be ready.
Building a website that can survive the slings and arrows of visitors armed with outrageous bandwidth is hard work. That work takes discipline, but fortunately the higher-level strategy behind it is not particularly complicated. In this article, we will explore the two main things you need to do to build battle-tested web applications: eliminate single points of failure and conduct accurate load testing.
We will discuss the architectures that most modern battle-tested websites use, highlight thoughts from engineers who have scaled some of the largest sites on the Internet, and finally walk through some "exception cases," where companies have flouted the modern conventions and are doing just fine with more traditional architectures.
Part 1: Eliminate single points of failure
When Amazon.com goes down, the company loses $66,240 per minute, based on its quarterly revenue averaged over each minute. A one-second delay in page response can cut conversions by 7%, according to surveys from Akamai and Gomez. And according to Jeremy Edberg, Reddit's former chief architect and now a reliability architect at Netflix, a 10% improvement in response time produced 10% more traffic at Reddit. If you want to maximize revenue (and user happiness), you can't go down.

The best way to stay up is to have no single points of failure, and the simplest way to have no single points of failure is to have lots of redundancy: within individual machines, redundant hardware (power supplies, hard drives, network interfaces), and beyond them, redundant machines and redundant data centers. Unfortunately, even this simple redundancy is much easier said than done, and you will need more than redundancy alone. The subsections here walk through the other high-level factors you need to think about in order to avoid single points of failure.
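To make the redundancy idea concrete, here is a minimal sketch in Python of failing over across redundant endpoints. The hostnames and ports are placeholders, not real infrastructure, and a real deployment would usually put a load balancer or connection pooler in front of the replicas rather than iterating in application code:

```python
import socket

# Hypothetical redundant database endpoints -- placeholders, not real hosts.
REPLICAS = [
    ("db1.example.com", 5432),
    ("db2.example.com", 5432),
    ("db3.example.com", 5432),
]

def connect_to_any_replica(endpoints, timeout=2.0):
    """Return a connection to the first reachable endpoint, trying each in turn."""
    for host, port in endpoints:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError:
            continue  # this replica is down or unreachable; try the next one
    raise ConnectionError("all replicas failed -- no healthy endpoint found")
```

The point of the loop is that no single endpoint's death can take the application down; only the failure of every replica does.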
How to think about failure
Before we can eliminate single points of failure, we must understand what failure means. Many people mistakenly think about it only as a server that dies, but that is only one type of failure.
You need to think about failure as any improper response: either a failure to respond at all, or a failure to return an appropriate response. If your database server is up and running but takes more than 30 seconds to answer requests from your application, or returns errors rather than confirmations of successful transactions, then you have a failure, even if your developers and system administrators tell you that, technically, all of the servers are up and running.
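A health check that takes this broader definition seriously must treat a slow answer or an error answer the same way it treats no answer at all. Here is a minimal sketch; the URL and the one-second threshold are assumptions you would tune for your own service:

```python
import time
import urllib.request
import urllib.error

def is_healthy(url, max_seconds=1.0):
    """Only a fast, successful response counts as 'up'."""
    start = time.monotonic()
    try:
        # urlopen raises URLError/HTTPError for no response or an error
        # response -- both of which count as failure here.
        with urllib.request.urlopen(url, timeout=max_seconds):
            pass
    except (urllib.error.URLError, OSError):
        return False
    # A correct answer that arrives too slowly is still a failure.
    return (time.monotonic() - start) <= max_seconds

# Usage (the URL is a placeholder):
# print(is_healthy("https://example.com/health"))
```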
Application developers today often wrongly assume that requests made by an application will succeed in an appropriate amount of time. Case in point: Developers assume that network connections are reliable, and that once a network connection is established (say, between an application server and a database server), that connection will stay alive as long as the application needs it (which is often over multiple user sessions). But we should not assume that network connections are reliable, even though they may appear to be in application testing.
Perhaps the best explanation of why network connections are unreliable comes from Kyle Kingsbury and Peter Bailis (two respected engineers who spend a lot of time thinking about distributed systems and failure), and it should be required reading for all software developers, systems administrators, and network engineers. Here's the short summary:
Every company running a large number of servers reports significant failures related to network connectivity.
Failures happen in the networking wires, computer hardware, and also on the running server (e.g., a taxed server can drop network connections).
Connectivity problems are especially bad on the public cloud.
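Given that a connection can die at any of those layers, application code should never assume a long-lived connection is still valid. A minimal sketch of one common defense, reopening the connection and retrying on failure, might look like the following; `make_connection` and `operation` are stand-ins for whatever your driver and workload actually provide:

```python
def run_with_reconnect(make_connection, operation, retries=1):
    """Run operation(conn); if the connection has silently died, reopen and retry."""
    conn = make_connection()
    for attempt in range(retries + 1):
        try:
            return operation(conn)
        except OSError:  # ConnectionError and friends are subclasses of OSError
            if attempt == retries:
                raise  # out of retries; let the caller's exception handling take over
            conn = make_connection()  # never trust the old connection; open a fresh one
```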
To cope, good application development uses a concept called the "try" block. The idea is that in your code, you "try" to do something (like write data to a database), and then either it will succeed (in which case your code keeps going), or it will throw an exception (in which case you have a separate section of code to handle that exception). To avoid single points of failure, you must be able to jump to exception-handling quickly if your original "try" isn't working properly, and you need your exception-handling code to be acceptable to the user (e.g., don't just tell the person, "Sorry, something is wrong").
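In Python, for example, the pattern might look like the following sketch. The `db.write` call and the order-queueing fallback are assumptions for illustration; the point is that the exception handler hands the user an acceptable answer instead of a raw error:

```python
import queue

retry_queue = queue.Queue()  # holds writes to replay once the database recovers

def save_order(db, order):
    """'Try' the primary action, and degrade gracefully if it throws."""
    try:
        db.write(order)  # hypothetical driver call -- an assumption, not a real API
        return "Your order is confirmed."
    except Exception:
        # Exception handler: give the user an acceptable answer,
        # not "Sorry, something is wrong."
        retry_queue.put(order)
        return "We've received your order and will email your confirmation shortly."
```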
The problem with the "try" block approach is that it's hard to handle all of the various problems that come up while running large-scale websites. If the database server is under a lot of load, you might not hear anything back for 10-plus seconds, and in the meantime your code is stuck waiting to learn whether it can keep going or must handle an exception. In the next sections, we will talk about methods that overcome the common types of failure afflicting websites today.
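One widely used defense is to put a hard deadline on every "try," so that you find out quickly, rather than after 10-plus seconds, that it is time to run the exception handler. Here is a minimal sketch using Python's standard concurrent.futures module; the two-second deadline is an illustrative assumption:

```python
import concurrent.futures

# A small shared pool: the deadline bounds how long *we* wait,
# though a timed-out call keeps running in its worker thread.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def call_with_deadline(fn, *args, seconds=2.0):
    """Wait at most `seconds` for fn(*args), then treat the call as failed."""
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=seconds)
    except concurrent.futures.TimeoutError as exc:
        # No timely answer is a failure -- jump to exception handling now.
        raise TimeoutError("backend did not answer within the deadline") from exc
```

Note that the deadline bounds how long your request-handling code waits, not how long the backend takes; the slow call itself continues in the background.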
Read the rest of the story in the new issue of InformationWeek Tech Digest (free registration required).