How To Build Battle-Tested Websites - InformationWeek


9/17/2014
11:00 AM

How To Build Battle-Tested Websites

It doesn't matter whether your e-commerce D-Day is Black Friday, tax day, or some random Thursday when a post goes viral. Your websites need to be ready.


Building a website that can survive the slings and arrows of visitors armed with outrageous bandwidth is hard work. That work takes discipline, but fortunately the higher-level strategy you need to follow is not particularly complicated. In this article, we will explore the two main things you need to do to build battle-tested web applications: eliminate single points of failure and conduct accurate load testing.

We will discuss the architectures that most modern battle-tested websites use, highlight thoughts from engineers who have scaled some of the largest sites on the Internet, and finally walk through some "exception cases," where companies have flouted the modern conventions and are doing just fine with more traditional architectures.

Part 1: Eliminate single points of failure

When Amazon.com goes down, the company loses $66,240 per minute, a figure you get by dividing its quarterly revenue by the number of minutes in a quarter. A one-second delay in page response can result in a 7% reduction in conversions, according to surveys from Akamai and Gomez. According to Jeremy Edberg, Reddit's former chief architect and currently a reliability architect at Netflix, a 10% improvement in response time resulted in 10% more traffic at Reddit.

If you want to maximize revenue (and user happiness), you can't go down. And the best way to stay up is to have no single points of failure. The simplest way to have no single points of failure is to have lots of redundancy. Within individual machines, use redundant hardware (power supplies, hard drives, network interfaces); beyond that, run redundant machines and redundant data centers.

Unfortunately, even this simple redundancy is much easier said than done, and simple redundancy alone is not enough. The subsections here walk through the other high-level factors you need to think about in order to avoid single points of failure.

How to think about failure

Before we can eliminate single points of failure, we must understand what failure means. Many people mistakenly think about it only as a server that dies, but that is only one type of failure.

You need to think about failure as any improper response: either a failure to respond at all, or a failure to return an appropriate response. If your database server is up and running, but it takes more than 30 seconds to respond to requests from your application, or it returns errors rather than confirmation of successful transactions, then you have a failure, even if your developers and system administrators tell you that technically all of the servers are up and running.
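One way to make this definition concrete is to have the application classify every reply itself, rather than rely on whether the server is "up." The sketch below is illustrative only: the `Response` record, the `is_failure` function, and the 30-second constant (borrowed from the example above) are assumptions, not part of any particular framework.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed threshold, taken from the 30-second example in the text.
SLOW_RESPONSE_SECONDS = 30.0


@dataclass
class Response:
    """Hypothetical record of what the application observed for one request."""
    elapsed_seconds: Optional[float]  # None means no response arrived at all
    ok: bool                          # True if the reply confirmed success


def is_failure(response: Response, limit: float = SLOW_RESPONSE_SECONDS) -> bool:
    """Treat silence, slowness, and error replies all as failure."""
    if response.elapsed_seconds is None:
        return True   # failure to respond at all
    if response.elapsed_seconds > limit:
        return True   # the server is up, but too slow to be useful
    return not response.ok  # a fast reply that reports an error still fails
```

Under this definition, a database that answers correctly after 45 seconds counts as failed, even though a health check would report it as running.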

Application developers today often wrongly assume that requests made by an application will succeed in an appropriate amount of time. Case in point: Developers assume that network connections are reliable, and that once a network connection is established (say, between an application server and a database server), that connection will stay live as long as the application needs it (which is often over multiple user sessions). But we should not assume that network connections are reliable, even though it may appear that they are in application testing.

Perhaps the best explanation of why network connections are unreliable comes from Kyle Kingsbury and Peter Bailis, two respected engineers who spend a lot of time thinking about distributed systems and failure. Their explanation should be required reading for all software developers, systems administrators, and network engineers. Here's the short summary:

  • Every company running a large number of servers reports significant failures related to network connectivity.
  • Failures happen in the network wiring, in the computer hardware, and on the running server itself (e.g., a taxed server can drop network connections).
  • Connectivity problems are especially bad on the public cloud.

To cope, good application development uses a concept called the "try" block. The idea is that in your code, you "try" to do something (like write data to a database), and then either it succeeds (in which case your code keeps going), or it throws an exception (in which case a separate section of code handles that exception). To avoid single points of failure, you must be able to jump to exception handling quickly if your original "try" isn't working properly, and the result of your exception handling must be acceptable to the user (e.g., don't just tell the person, "Sorry, something is wrong").
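As a minimal sketch of that pattern in Python, where the construct is spelled `try`/`except`: the `DatabaseError` class, the `save_order` function, and the two callables passed into it are hypothetical names chosen for illustration, and queueing the write for later is just one assumed way of giving the user something better than a generic error.

```python
class DatabaseError(Exception):
    """Hypothetical error raised when a database write cannot complete."""


def save_order(order, write_to_database, queue_for_retry):
    """Try the write; if it fails, fall back to something the user can accept.

    `write_to_database` and `queue_for_retry` are caller-supplied functions
    (assumptions for this sketch, not a real library API).
    """
    try:
        write_to_database(order)
    except DatabaseError:
        # Exception path: keep the order safe and tell the user what happens next.
        queue_for_retry(order)
        return "We have your order and will email a confirmation shortly."
    # Success path: the code simply keeps going.
    return "Your order is confirmed."
```

The point of the fallback branch is that the user still gets a meaningful outcome rather than a bare apology.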

The problem with the "try" block approach is that it's hard to handle all of the various problems that come up while running large-scale websites. If the database server is under a lot of load, you might not hear anything back for 10-plus seconds, and in the meantime you're stuck waiting to learn whether you can keep going or need to handle an exception. In the next sections, we will talk about methods that will work to overcome the common types of failure that afflict websites today.
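One common way to avoid waiting indefinitely is to put a deadline on the call, so that a slow reply becomes an exception the handling code can act on. The following is a sketch under stated assumptions, not a prescription: `call_with_deadline` is a hypothetical helper built on Python's standard `concurrent.futures` module, and the pool size of four is arbitrary.

```python
import concurrent.futures

# Shared worker pool for this sketch; the size is an arbitrary assumption.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_deadline(func, timeout_seconds, *args):
    """Run func(*args) in a worker thread, giving up after timeout_seconds.

    Raises the built-in TimeoutError if no result arrives in time.
    """
    future = _executor.submit(func, *args)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # cancel() only stops a call that has not started yet; a call that is
        # already running keeps going in the background.
        future.cancel()
        raise TimeoutError(f"no response within {timeout_seconds} seconds")
```

A caller could wrap `call_with_deadline(write_to_database, 1.0, order)` in a "try" block and treat `TimeoutError` like any other failure. Note that the deadline bounds only how long the caller waits; it does not stop work already in progress on the server, so a timeout setting in the database client itself, where one is available, is usually a better first choice.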

Read the rest of the story in the new issue of InformationWeek Tech Digest (free registration required).
Joe Emison is a serial technical cofounder, most recently with BuildFax, the nation's premier aggregator and supplier of property condition information to insurers, appraisers, and real estate agents. After BuildFax was acquired by DMGT, Joe worked with DMGT's portfolio ...
