How Ancestry.com Manages Generations Of Big Data - InformationWeek



How Ancestry.com Manages Generations Of Big Data

Over the past year, the genealogy site's repository of family historical data has more than doubled in size. Here's how Ancestry managed its growth.

Businesses often use -- or overuse -- the term "big data" to describe all sorts of data-related products and services, but the buzzword certainly applies in the case of Ancestry.com, a popular genealogy service that helps people dig up their family roots.

A little over a year ago, Ancestry was managing about 4 petabytes of data, including more than 40,000 record collections with birth, census, death, immigration, and military documents, as well as photos, DNA test results, and other info. Today the collection has quintupled to more than 200,000 record collections, and Ancestry's data stockpile has soared from 4 petabytes to 10 petabytes.

According to Bill Yetman, senior director of engineering at Ancestry.com, the big data explosion led to growing pains. "We measured every step in our process pipeline," Yetman said in a phone interview with InformationWeek. "We started with academic algorithms that people are using at universities, and they work great at smaller scales."

[How can K-12 education help train a new generation of data scientists? Read How Educators Can Narrow Big Data Skills Gap.]

But, he added, these algorithms were breaking down as the database got bigger and bigger and bigger. "There's a very specific algorithm we use in matching [DNA]. It's called Germline, and it was created by some very, very bright people at Columbia University," Yetman told us.

To analyze its growing stockpile of DNA data, Ancestry had to re-implement Germline using Hadoop and HBase. This process involved storing the data in HBase, and then using two map functions to run comparisons in parallel. "There are two MapReduce steps we use, and then we use HBase to hold the results, which makes it easy for us to do the [DNA] comparisons. If we couldn't run these things in parallel, we couldn't get it done nearly as fast."
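The two-phase pattern Yetman describes can be sketched in miniature without Hadoop itself. The following is an illustrative toy, not Ancestry's actual Germline port: phase 1 plays the role of the map step, emitting (window, word) keys for fixed-size windows of each genotype so that samples sharing a word land in the same bucket (the shuffle, standing in for the HBase table); phase 2 plays the role of the reduce step, turning those buckets into candidate match pairs. The window size and sample data are made up.

```python
# Toy sketch of a Germline-style two-phase match (map -> shuffle -> reduce),
# using plain Python dicts in place of Hadoop and HBase.
from collections import defaultdict
from itertools import combinations

WINDOW = 4  # markers per "word"; purely illustrative

def phase1_map(samples):
    """Map: emit ((window_index, word), sample_id) for every window."""
    buckets = defaultdict(list)  # stands in for the shuffle / HBase table
    for sample_id, genotype in samples.items():
        for w in range(0, len(genotype), WINDOW):
            word = genotype[w:w + WINDOW]
            buckets[(w // WINDOW, word)].append(sample_id)
    return buckets

def phase2_reduce(buckets):
    """Reduce: samples sharing a word in the same window are candidate matches."""
    matches = defaultdict(int)  # (id_a, id_b) -> number of shared windows
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            matches[(a, b)] += 1
    return matches

samples = {
    "s1": "AACCGGTTAACC",
    "s2": "AACCGGTTTTTT",
    "s3": "GGGGGGGGAACC",
}
matches = phase2_reduce(phase1_map(samples))
# s1 and s2 share the first two windows; s1 and s3 share only the last one
```

Because each (window, word) bucket can be processed independently, the comparisons parallelize naturally across cluster nodes, which is the property the quote above is pointing at.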

Hadoop's vaunted expandability also helped Ancestry manage its growth. "If I need to improve my [performance] times, I can scale horizontally," said Yetman. "Just add more nodes to the cluster, and we can handle the growth."

Future growth, however, will require more innovation to keep things flowing smoothly. "You can't just say, 'OK, I've gotten over this 200,000 hump, and I can make it to 5 million.' I know there are going to be challenges all along the way, and I'm going to be looking for them."

Obviously, hardware performance must be monitored closely. "We've got to watch the memory in each node, how we're using memory, and how we're using the CPU." Ancestry is also in the process of optimizing its Germline implementation to greatly reduce its memory usage, and it may team up with a cloud provider to boost its processing capacity.

The cloud option gained credence when Ancestry recently updated the algorithm behind its ethnicity test. "We had to go back to those 200,000 people to rerun their ethnicity," said Yetman. "We did that using machines in our datacenter." But local hardware won't be enough as the number of users climbs to 500,000 -- or 1 million. Ancestry is currently evaluating several cloud providers, but Yetman acknowledges that privacy issues add a degree of complexity to the move. "It gets really tricky because DNA data is so sensitive. That's one of the things that we as a company are careful with."

One potential solution: "I'm looking at bursting to the cloud… to do these calculations," Yetman said. But rather than leaving the data in the cloud, he might "pull it all back" to local storage to alleviate customers' privacy concerns.
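The burst-then-pull-back workflow Yetman floats can be summarized in a few lines. Everything here is hypothetical: the cloud_* functions are in-memory stand-ins for a real provider API, and the point is only the lifecycle -- upload, compute remotely, retrieve the result, and delete the cloud copy so no sensitive data lingers off-premises.

```python
# Hypothetical sketch of "bursting to the cloud" for a heavy calculation,
# then pulling everything back on-premises. cloud_storage is a stand-in
# for a provider's object store, not a real API.
cloud_storage = {}

def cloud_upload(key, data):
    cloud_storage[key] = data

def cloud_compute(key):
    # stand-in for the expensive DNA calculation run on rented capacity
    return cloud_storage[key].upper()

def cloud_delete(key):
    del cloud_storage[key]

def burst(local_store, key):
    """Upload, compute remotely, pull the result back, wipe the cloud copy."""
    cloud_upload(key, local_store[key])
    result = cloud_compute(key)
    cloud_delete(key)  # nothing sensitive stays in the cloud afterward
    local_store[key + "_result"] = result
    return result

local = {"sample42": "acgt"}
burst(local, "sample42")
```

The deletion step is what distinguishes this from ordinary cloud storage: the provider holds the data only for the duration of the computation.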


12/14/2013 | 1:27:20 AM
Re: Cloud bursting and privacy
Excellent question. I think if privacy issues could harm the customer, then moving to the cloud to get access to more efficient hardware and frameworks will be difficult. Having said that, if DNA analysis can pre-flag, for example, being lactose intolerant or having a greater chance of going into shock from something as minor as a bee sting, then customers will rethink their definition of privacy.

And I feel cloud security and privacy can already handle a middle ground, even if privacy attitudes do not change.
12/10/2013 | 3:17:12 PM
cloud burster
Ancestry has got the ideal problem for cloud bursting -- if there is such a thing as an ideal problem. PCI-compliant transaction handlers send the data into the cloud but retain the identifier, the name. As results come back, they can match names to transactions on-premises. Couldn't Ancestry do something like that?
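The split-and-rejoin pattern this comment describes is easy to sketch: keep the identifying field on-premises, ship only the payload to the cloud under a random correlation ID, and rejoin results locally. All names and functions below are illustrative, not any real PCI system.

```python
# Sketch of "retain the identifier, ship the payload": the cloud only ever
# sees (correlation_id, dna) pairs; the name-to-ID index never leaves the
# datacenter. Illustrative only.
import uuid

on_prem_index = {}  # correlation_id -> name; stays on-premises

def send_to_cloud(records):
    """Strip names before upload; return the anonymized outbound batch."""
    outbound = []
    for name, dna in records:
        cid = str(uuid.uuid4())
        on_prem_index[cid] = name
        outbound.append((cid, dna))  # the cloud never sees the name
    return outbound

def receive_results(results):
    """Rejoin cloud results with names using the on-premises index."""
    return {on_prem_index[cid]: analysis for cid, analysis in results}

records = [("Alice", "acgt"), ("Bob", "ttaa")]
outbound = send_to_cloud(records)
cloud_results = [(cid, dna.upper()) for cid, dna in outbound]  # pretend cloud job
final = receive_results(cloud_results)
```

Whether this satisfies DNA-privacy requirements is a separate question -- a genome is arguably identifying on its own -- but it captures the PCI-style mechanics the commenter is proposing.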
Li Tan
12/10/2013 | 3:00:00 AM
Re: Privacy, cloud and big data
Privacy is a concern not only for Ancestry but for all big enterprises. Companies are seeking ways to improve their IT capability and efficiency, and going to the cloud is one necessary step. But privacy and other security issues are a real concern. Starting with a private cloud sounds promising, but in fact you just start to create data silos, which is not good in the long run.
Ulf Mattsson
12/9/2013 | 3:14:26 PM
Privacy, cloud and big data
I agree that "It gets really tricky because DNA data is so sensitive" and that the hard part is to "alleviate customers' privacy concerns".

Many organizations are looking to cloud and outsourcing solutions for massive processing, but international privacy laws are escalating and organizations are desperately looking for effective ways to comply with these new, stringent regulations. Europe and the US are leading with very stringent privacy laws.

I studied one interesting project that addressed the challenge of protecting sensitive information about individuals in a way that could satisfy European cross-border data security requirements. This included incoming source data from various European banking entities, and existing data within those systems, which would be consolidated in one European country. The project achieved targeted compliance with EU cross-border data security laws, Datenschutzgesetz 2000 (DSG 2000) in Austria, and the Bundesdatenschutzgesetz in Germany by using a data tokenization approach.
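At its core, tokenization swaps each sensitive value for a random surrogate, with the mapping held only in a vault inside the protected zone. A minimal sketch of that idea, with made-up names and no connection to the project described above:

```python
# Minimal token-vault sketch: sensitive values are replaced by random
# tokens; only the vault (kept in the protected zone) can reverse them.
import secrets

class TokenVault:
    def __init__(self):
        self._forward = {}  # value -> token (reuse the token for repeats)
        self._reverse = {}  # token -> value

    def tokenize(self, value):
        if value in self._forward:
            return self._forward[value]
        token = "tok_" + secrets.token_hex(8)  # random, carries no information
        self._forward[value] = token
        self._reverse[token] = value
        return token

    def detokenize(self, token):
        return self._reverse[token]

vault = TokenVault()
t = vault.tokenize("1990-05-17")  # e.g. a birth date crossing a border
```

Unlike encryption, the token has no mathematical relationship to the original value, which is why tokenized data can often cross borders while the vault stays put.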

I recently read an interesting report from the Aberdeen Group that revealed that "Over the last 12 months, tokenization users had 50% fewer security-related incidents(e.g., unauthorized access, data loss or data exposure than tokenization non-users". Nearly half of the respondents (47%) are currently using tokenization for something other than cardholder data The name of the study, released a few months ago, is "Tokenization Gets Traction".

Aberdeen has also seen "a steady increase in enterprise use of tokenization as an alternative to encryption for protecting sensitive data".

Ulf Mattsson, CTO Protegrity
12/9/2013 | 2:47:37 PM
Cloud bursting and privacy
Interesting discussion of cloud bursting. We tend to discuss bursting in terms of capacity problems -- more power on a busy shopping day, for example -- but the privacy angle deserves examination. What do you think, cloud community? Is this approach advantageous for privacy?