


Brookhaven Lab Finds AWS Spot Instances Hit Sweet Spot

When Brookhaven National Lab needed compute power to meet peak demand, it turned to the Energy Sciences Network and Amazon Spot Instances.


Last September the Brookhaven National Laboratory discovered a way to expand its compute power for particle research without blowing up its research budget. As its needs outgrew its own facilities, it opted to use Amazon Spot Instances -- the virtual servers that customers can use for as long as their low bid isn't topped by someone else's.

It was a choice that seemed risky at the time. Scientists were lined up to run their research systems against mountains of data generated by the CERN Large Hadron Collider in Geneva, but neither Brookhaven nor participating university departments had enough compute capacity to satisfy their demands.

Furthermore, "science is highly competitive," observed Michael Ernst, lead computer scientist for Brookhaven's ATLAS team, which is tied into CERN, in an interview with InformationWeek.

Ernst and the ATLAS team decided to test Spot Instance use in the cloud over a five-day period last September. Such a move might not sound like rocket science to major enterprises already tapping AWS virtual servers liberally. But particle research workloads are extremely large-scale, 1,500 researchers were waiting in line, and Spot Instances come with known drawbacks.

A given particle research system might need to run continuously for 24 hours. Just because the research team lined up the Spot Instances it needed at the outset didn't mean they'd still be available as the research ground into its 24th hour. Spot Instances are a bargain in the middle of the night, but as demand rises with the business day, prices can climb past a customer's bid, pushing workloads toward higher-priced Spot or even On-Demand instances.

"If a system has run 23 hours and 57 minutes, and the Spot Instance goes away, you lose everything," Ernst noted in an interview. That was one of the hazards of selecting what was, by definition, a temporary resource. Spot Instances are unused compute power in the Amazon cloud that is available at whatever price a customer cares to bid for them. They attract the low bidders and typically cost one-quarter to one-tenth of the AWS On-Demand class of servers, Ernst said.
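The bidding mechanics Ernst describes can be sketched in a few lines: an instance keeps running only while the customer's bid meets the fluctuating market price. A minimal illustration with hypothetical prices (this is not the AWS API, just the pricing logic):

```python
# Minimal sketch of Spot-style bidding: an instance runs only while the
# customer's bid meets the market price. All prices here are hypothetical.
def surviving_hours(bid, hourly_market_prices):
    """Count consecutive hours the instance survives before being outbid."""
    hours = 0
    for price in hourly_market_prices:
        if bid < price:          # outbid: the instance is reclaimed
            break
        hours += 1
    return hours

# Overnight prices are low; demand (and price) climbs with the business day.
prices = [0.03, 0.03, 0.04, 0.04, 0.06, 0.09, 0.12]
print(surviving_hours(0.05, prices))  # survives 4 cheap hours, then loses out
```

The sting is exactly the one Ernst describes: the hours already consumed are paid for, but an unfinished computation is lost when the reclamation comes.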

(Image: Andrey Prokhorov/iStockphoto)


But Brookhaven needed large numbers of them in one location to deal with the terabytes of data being generated by the Large Hadron Collider. For its first major test, Ernst sought the equivalent of 50,000 physical cores to power the Spot Instances needed. The rub was that 99% of them would need to remain available throughout the five-day test period.

All 50,000 wouldn't need to be continuously available. Ernst could afford to have 1% shifted to higher bidders at any one time by pre-arranging for jobs to fail over to other virtual servers. But if demand for Spot Instances surged during his trial, too many servers would be lost for many of the running computations to finish.

"Nodes acquired on the Spot market can be terminated at any time, meaning applications need to tolerate disruptions," said Ernst. If the disruptions exceeded the ability of the applications to failover, there were going to be many disappointed researchers, he said.
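One common way to tolerate such disruptions is periodic checkpointing, so a reclaimed node costs only the work done since the last checkpoint rather than the full run. The article doesn't detail ATLAS's actual failover mechanism, so the file-based scheme below is purely illustrative:

```python
import json
import os

# Illustrative checkpoint/resume loop: if a Spot node disappears mid-run,
# a replacement node picks up from the last saved step instead of hour zero.
# (Hypothetical scheme; ATLAS's real failover machinery is not detailed here.)
CHECKPOINT = "job.ckpt"

def run_job(total_steps, interrupt_at=None):
    start = 0
    if os.path.exists(CHECKPOINT):                 # resume after a termination
        start = json.load(open(CHECKPOINT))["step"]
    for step in range(start, total_steps):
        if interrupt_at is not None and step == interrupt_at:
            return "terminated"                    # simulated Spot reclamation
        # ... do one unit of analysis work here ...
        json.dump({"step": step + 1}, open(CHECKPOINT, "w"))  # save progress
    os.remove(CHECKPOINT)
    return "done"

print(run_job(24, interrupt_at=23))  # first attempt loses the node at step 23
print(run_job(24))                   # replacement finishes the final step only
```

In practice, AWS also publishes an interruption notice through the instance metadata service shortly before reclaiming a Spot Instance, giving an application a brief window to write a final checkpoint.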

As Brookhaven prepared its test run on Amazon, it was a rare event to have sufficient data from Hadron/ATLAS loaded into the cloud to host hundreds of research explorations at one time. It takes a trillion proton collisions in the collider to produce evidence of a single Higgs boson particle's decay. Nevertheless, understanding the Higgs boson -- the goal of many ATLAS research workloads -- promises to provide the next refinements in our understanding of the universe, possibly unlocking the secrets to gravity.

[Want to learn more about AWS 2015 results? See Amazon, AWS Post Strong Results, Fail to Please Wall Street.]

Brookhaven was able to load the data into Amazon over the Energy Sciences Network, operated by the US Department of Energy, at 100 Gbps. Moving vast amounts of data -- 50 PB -- at the slower speeds available over the public Internet would not have been tolerable, he said.
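Back-of-the-envelope arithmetic shows why the link speed matters. Assuming one fully saturated link and no protocol overhead (a simplification; in reality the data would be staged incrementally), the transfer times work out as follows:

```python
# Back-of-the-envelope transfer times for 50 PB. Assumes a single fully
# saturated link with no protocol overhead -- a deliberate simplification.
def transfer_days(data_bytes, link_bits_per_sec):
    return data_bytes * 8 / link_bits_per_sec / 86_400  # 86,400 s per day

data = 50e15                                 # 50 petabytes
print(round(transfer_days(data, 100e9), 1))  # 100 Gbps ESnet link: 46.3 days
print(round(transfer_days(data, 1e9)))       # 1 Gbps commodity link: 4630 days
```

Even at 100 Gbps, 50 PB is weeks of sustained transfer; over a typical commodity link it would stretch past a decade, which is why the dedicated research network was a precondition for the experiment.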

(Michael Ernst)


In some cases, the workloads use vast amounts of data to simulate what should happen in the proton collisions, then search through mountains of ATLAS detector data looking for evidence that the theories are correct. It's a compute-intensive task, Ernst explained.

When everything was ready, Brookhaven launched the five-day Spot Instance run. "Less than 1% of the instances were terminated," Ernst said, leaving operations with a margin of safety. Afterward, his view of Spot Instances changed from a risky experiment to "an ideal resource for deploying our peak demand."

Instead of investing in new data center capacity, Brookhaven was able to gain capacity for its peak demand for $45,000 for the five-day run.
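Assuming all 50,000 cores ran for the full five days (an upper bound on utilization, so this is a floor on the true per-core rate), the $45,000 bill works out to well under a penny per core-hour:

```python
# Rough cost per core-hour for the five-day run. Assumes all 50,000 cores
# were busy the entire time, so the true rate was at least this much.
cores, days, total_cost = 50_000, 5, 45_000
core_hours = cores * days * 24          # 6,000,000 core-hours
rate = total_cost / core_hours
print(f"${rate:.4f} per core-hour")     # prints "$0.0075 per core-hour"
```

At roughly three-quarters of a cent per core-hour, the figure is consistent with Ernst's estimate that Spot Instances run one-quarter to one-tenth the price of On-Demand servers.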

"AWS has superb availability," Ernst said. "It appears to have unlimited capacity at competitive prices."

Even if that was true last September, it's not necessarily guaranteed for all future large-scale users of Spot Instances. With AWS revenue growing a rapid 71.7% in 2015, compute capacity that's available now might not be in the future.

Nevertheless, Ernst is preparing a second experiment on Amazon this month, relying once again on Spot Instances. He's seeking to establish once and for all that the cloud can serve as "a practical, production-grade, 100,000-core compute platform for doing science." It will span Amazon's three major North American regions: US East in Northern Virginia, US West in Northern California, and US West in Oregon.

Brookhaven has conducted a smaller, 4,000-core, month-long experiment on Google Compute Engine, but hasn't done any yet on Microsoft Azure. Ernst doesn't rule out use of any cloud site in the future.


Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive ... View Full Bio

Newest First  |  Oldest First  |  Threaded View
Charlie Babcock
User Rank: Author
2/2/2016 | 3:35:38 PM
The Dept. of Energy's Energy Sciences Network does double duty
Brookhaven not only has the Energy Sciences Network over which to connect to Amazon at 100 Gbps; courtesy of that network, it also has two 320 Gbps lines to Europe over which it can exchange data with European partners. The high-speed links make collaborative research much more feasible.
Charlie Babcock
User Rank: Author
2/2/2016 | 3:28:07 PM
Brookhaven liked AWS memory-intensive virtual servers
The Brookhaven Lab gravitated toward memory-intensive virtual server types for its Spot Instances. They included the R3 double-extra-large, R3 quadruple-extra-large, and R3 eight-extra-large instances; the latter comes with two 320 GB solid-state disks. Also used was the M3 double-extra-large, a balanced compute, memory, and network server used with many different applications, lead scientist Ernst reported.