Scalability Challenges for Large Databases
When it comes to databases, popularity breeds challenges. Even as they amass more data, both OLTP and decision-support systems must get faster and more responsive to intense query activity. In a report on the only survey of its kind, we learn how key trends are shaping technology choices.
It's a miracle the lights don't dim, the earth doesn't shake, and the sun isn't occasionally darkened by data clouds erupting from a system that just can't take it anymore. Across the globe, every minute of every day, enormous databases are running thousands, if not millions, of simultaneous operations to support online transaction processing (OLTP). Meanwhile, other databases take on massive data warehouses as they respond to the complex number-crunching needs of strategic business applications and their users.
Especially for the largest database-management systems, scalability is forever a work in progress. Boundaries are moving up and out constantly. For some, scalability is about the number of users, simultaneous queries or transactions, and query speed; for others, it is about the sheer amount of raw data the system must store and manage. And for many, it is about all of the above.
For the past 10 years, Winter Corp. has been tracking the world's largest and most heavily used databases. We have developed quantitative insights concerning the demographics, operating characteristics, and practices used to develop and manage large databases. In this article, we discuss trends and directions in large databases, as revealed by our most recent survey, the 2003 TopTen Program (conducted from May to October 2003), as well as other surveys and interactions with user organizations.
The TopTen Program received more than 300 survey entries, originating from 23 countries. Qualifying Windows-based databases had to contain more than 500 GB of data, while databases on all other platforms had to exceed 1 TB. Note that we define database size as the sum of user data, summaries, aggregates, and indexes — not the total capacity of attached disks. We required respondents to validate their entries by running queries developed by Winter Corp. and associated industry experts. We sorted the entry databases by usage and operating system, and assessed them according to four metrics: size, uncompressed data volume, number of rows/records, and workload. Based on these metrics, we developed category lists of the world's leading databases, along with the data points presented in this article.
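The size definition above can be made concrete with a short sketch. The catalog structure, table names, and figures below are hypothetical illustrations, not data from the survey:

```python
# Sketch: compute "database size" per the definition above --
# user data + summaries/aggregates + indexes, NOT raw disk capacity.
# All names and numbers here are made up for illustration.

catalog = {
    "orders":         {"user_data_gb": 850, "indexes_gb": 240},
    "orders_summary": {"aggregate_gb": 35,  "indexes_gb": 8},
    "customers":      {"user_data_gb": 120, "indexes_gb": 40},
}

attached_disk_gb = 4_000  # total attached disk -- deliberately NOT counted

def database_size_gb(cat):
    """Sum every component that counts toward database size."""
    return sum(v for table in cat.values() for v in table.values())

print(f"database size: {database_size_gb(catalog):,} GB")   # 1,293 GB
print(f"attached disk: {attached_disk_gb:,} GB (excluded)")
```

The point of the distinction is that a system with 4 TB of disk attached may hold well under 2 TB of actual database content once free space, mirroring, and scratch areas are set aside.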
The most compelling findings from the data concern the growth in database size and workload. This growth has several dimensions, the most visible and perhaps most daunting of which is size. Large databases may contain thousands of tables and support as many users, who are producing and demanding access to more data than ever before. These users are scattered across the enterprise in geographically diverse locations, and their numbers are rising steadily. Database management software is expected to process data volumes whose sizes dwarf what it handled previously.
Our research shows that database growth is accelerating. Figure 1 shows the range in size of the 10 largest decision-support and transaction-processing databases in the 2003 and 2001 Winter Corp. surveys. In the two years since the 2001 program, the largest transaction-processing database nearly doubled in size, from 10.5 to 18.3 TB. On the decision-support system (DSS) side, the 10th largest database is nearly the size of the biggest database in the 2001 program. And the largest decision-support database in 2003, at 29.2 TB, is almost triple the size of the 2001 leader.
FIGURE 1 In two years, the largest databases have increased in size two- and threefold.
Increases in row count provide another example of the growth in database scale. Figure 2 shows the growth in number of rows over the past two years. Row count on every platform for both decision support and transaction processing grew significantly since 2001. Unix DSS databases experienced the greatest growth. Their average number of rows grew six times, propelling them past transaction processing z/OS systems to become the new champions of row count.
FIGURE 2 The Unix DSS database average is now almost 40 billion rows.
Windows-based databases also grew voraciously. Windows DSS systems contained an average of 11,112 million rows, up from 1,176 million just two years ago. Average row count for Windows OLTP systems exploded, undergoing a fourteenfold increase since 2001. In the 2003 program, AT&T's system led the pack with a whopping 496 billion rows.
Another indication of the stretch in database scale is the increase in system workloads. Data marts are growing into data warehouses, some of which are expected to perform in near-real time. Everyone along the consumer chain — the customer, the retailer, the distributor, the manufacturer, and the materials supplier — needs to exchange information quickly and reliably, 24x7. The result is that workloads are spiraling ever upward. Our survey data shows that the average DSS database in the 2001 program processed 156 concurrent in-flight queries. By 2003, this figure had jumped to 224, a 44 percent rise. Transaction-processing workload climbed even more sharply, from an average of 2,094 transactions per second (tps) in 2001 to 3,223 tps two years later — an increase of 54 percent.
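The percentage increases quoted above follow directly from the survey averages in the text; a quick arithmetic check:

```python
# Verify the workload growth figures quoted in the text.
def pct_increase(old, new):
    """Percent growth from old to new."""
    return (new - old) / old * 100

dss_growth = pct_increase(156, 224)     # concurrent in-flight DSS queries
oltp_growth = pct_increase(2094, 3223)  # transactions per second

print(f"DSS query growth: {dss_growth:.0f}%")  # 44%
print(f"OLTP tps growth:  {oltp_growth:.0f}%")  # 54%
```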
The storage systems that house this burgeoning data provide more evidence of the boom in database scalability. Storage solutions are becoming more complex, offering a wide range of products, platforms, operating costs, capabilities, and performance levels. Today's terabyte implementations may require thousands of storage devices, some even tens of thousands. With our data tracking the storage requirements of large databases since 1997, we can see that the escalation of storage requirements shows no signs of abating (see Figure 3).
FIGURE 3 Average storage capacity has grown 250% in two years.
In the aggregate, the 2001 survey pool reported 632 TB of storage. Just two years later, those surveyed were using almost 2 petabytes (2,000 TB) of storage. Based on the number of survey respondents, the average large database — whether used for decision support or transaction processing — increased its storage requirements three and one-half times in just two years.
Growth shown by such metrics tells us that demands on computing resources are becoming more intense. Public and private organizations expect systems to offer greater speed, availability, and reliability. The challenges facing IT personnel, from database administrators all the way up to the CIO, are formidable. They must meet increasing requirements even while reining in overall spending.
Many IT departments have turned to open systems and/or lower-cost hardware to hold spending in check. Findings from Winter Corp.'s survey programs disclose that these shifts are readily apparent in large database implementations. In Figure 4, we can see that Unix has become the clear platform leader for very large DSS databases. In fact, the 10 largest DSS databases in our 2003 TopTen Program were all running on Unix. At the head of the list is France Telecom, with more than 29 TB of data. France Telecom uses Oracle, Hewlett-Packard PA-RISC Superdome servers, and HP's StorageWorks systems.
FIGURE 4 Unix DSS databases now capture the high end.
Our data also indicates that Intel-based servers running Windows have become viable for large-scale decision support. The number of "Wintel" DSS systems rose 40 percent from 2001 to 2003; our 2003 data tells us that roughly one in five participating DSS databases is running on Wintel platforms. Moreover, the largest Windows-based database grew by nearly six times between 2001 and 2003, from 1.5 TB to 8.9 TB. ComScore Networks Inc. submitted the largest Windows-based database in both the 2001 and 2003 programs; Sybase IQ manages the database, running on Dell servers and EMC storage systems.
The most striking evidence of the emergence of Windows as a platform for large database processing, however, occurred among transaction-processing databases. Unix dominated Windows in OLTP in 2001, accounting for nearly 60 percent of the databases that qualified for our measurements (a quarter of the submissions ran on z/OS and OS/390). However, by 2003, Windows-based databases accounted for more than two out of five OLTP databases (43 percent), propelling Windows into a virtual tie with Unix as the most widely used transaction-processing platform, according to our program data. Moreover, the largest Unix OLTP system held 5.4 TB of data, barely more than the 5.3-TB Windows database reported by Verizon Communications. This implementation, which runs Microsoft SQL Server on HP ProLiant servers and Symmetrix disk arrays, signifies a twelvefold jump in the size of the largest Windows transaction-processing database that we assessed.
Despite the advance of Windows systems, we should note that the largest OLTP systems were still found on IBM mainframes running z/OS. These accounted for six of the 10 largest transaction-processing systems in the program. The list includes the largest OLTP database in the program, found at the Land Registry. Its DB2 for z/OS implementation runs on IBM eServer zSeries machines and Hitachi storage arrays.
The previous paragraph notwithstanding, there is little doubt that the floodgates for transaction processing on Wintel platforms have been opened. The drive to replace expensive older systems with lower-cost hardware is gaining momentum. We expect the next Winter Corp. survey to substantiate this shift even more strongly.
You might be wondering how Linux systems fared in the 2003 program. Linux is a hot topic in database circles, but the operating system has thus far barely penetrated the world of large production databases, according to our survey program. Of the four Linux entries we received, only one met our minimum size requirements, and that respondent did not validate database size, which eliminated the entry from the final respondent pool. However, given the perceived benefits of Linux, including lower cost of ownership, portability, openness, and global support, we expect to see a stronger Linux presence in future results.
Program results also demonstrate shifts in server architecture preferences, both over time and in relation to database size. In Figures 5 and 6, you can see our results for server architectures in 2003 and 2001 for decision-support and transaction-processing databases, respectively.
FIGURE 5 The largest DSS databases favor cluster and MPP.
FIGURE 6 SMP and clusters dominate as OLTP architectures.
Regardless of usage, symmetric multiprocessing (SMP) prevailed as the leading architecture for databases with less than 2 TB of data in both programs. Among DSS databases, SMP systems strengthened their position among midsize databases. Cluster systems have established a noticeable presence at the largest DSS database sites reporting in our program, penetrating territory previously outside their comfort zone.
You can see in Figures 5 and 6 that massively parallel processing (MPP) systems, whose demise has been predicted periodically by analysts, competing vendors, and some in the media, remain one of the principal architectural choices for the largest DSS databases. Their ongoing presence brings to mind Mark Twain's famous saying, "Reports of my death are greatly exaggerated." Although MPP platforms no longer play in the OLTP arena, it is abundantly clear from the survey data that MPP continues to enjoy vibrant growth for high-end decision support.
Among transaction-processing databases, SMP systems are used extensively, regardless of database size. They are the leading choice for the largest OLTP databases, weighing in at more than 5 TB, as well as for emerging large databases, which we measure as less than 2 TB. Cluster systems are the architecture of choice for midsize OLTP databases of 2 to 5 TB.
Overall, use of SMP and cluster architectures for both decision support and transaction processing grew significantly between 2001 and 2003. Uniprocessing, which barely had a presence in 2001 and even then was used only for transaction processing, has all but disappeared as a choice for large database architectures.
Putting all this together, just what do we foresee for large databases in the near future? Most assuredly, they will become bigger, hopefully better, and definitely more powerful. In the next couple of years, we expect respondents to our survey with DSS databases to have 100- to 200-TB databases at the inquisitive fingertips of a wide range of users. We could see processing in the range of 1,500 concurrent in-flight DSS queries. The accompanying sidebar interview with Ken Collins, director of information management at Amazon.com, testifies to increased business-critical demand for enormous DSS databases.
Meanwhile, transaction processing will reach 8,000 tps against 25 TB of operational data. While these DSS and OLTP numbers may have IT executives shaking their heads, they are supported by the projections of our survey respondents. Our program will soon be gathering and compiling data for 2004, and there is little doubt that we will see databases bursting out of their confines, making scalability of every sort a critical issue for IT managers going forward.
Authors' Note: The Winter Corp. 2003 TopTen Program was sponsored by HP, Microsoft, Oracle, Sybase, and Teradata. Sponsorship brought about promotional benefits but did not impact the program findings whatsoever. The authors gratefully acknowledge the efforts of Vic Goldberg and Rick Burns, who contributed to the collection, management, and analysis of the data employed in the TopTen Program.
RICHARD WINTER is president of Winter Corp. He leads an international consulting practice on the technology and implementation of large databases.
KATHY AUERBACH is research program director of Winter Corp., where she has managed the TopTen Research Program and related research programs for the past seven years.
For more information about the survey reported in this article, see: http://www.wintercorp.com/VLDB/2003_TopTen_Survey/TopTenwinners.
Scaling for Innovation: Amazon.com's Data Warehouse
Ken Collins is director of information management at Amazon.com, which "opened its virtual doors" in 1995 and is now one of the world's most well-known consumer destinations on the Web. Practically since its founding, Amazon also has carried a strong reputation for its leading-edge business intelligence, analytics, data warehousing, and database operations. Amazon is perennially among the leaders in the Winter TopTen Program and other database-scalability rankings.
We talked to Collins about Amazon's data-warehouse database and the scalability challenges it presents.
Q: How is Amazon.com using its very large database?
Collins: Amazon has an integrated, central corporate data warehouse (DW). We use it broadly for applications that span daily business metrics reporting, analytic queries, and data mining. The primary users are analysts in finance, supply chain, and seller and vendor management. The DW database size is about 15 TB (raw data), and we currently manage it with Oracle running on Hewlett-Packard servers and the HP-UX operating system.
Q: How are you using the detailed information in the database to create business value?
Collins: Rather than evaluating each user's request for information, our philosophy is to allow anyone in the company to ask any question of any of the data in the DW. Our only up-front evaluation concerns what OLTP data sets need to be reflected in the data warehouse. Amazon is full of smart people, so we assume that the majority of these queries will be delivering real business value.
So, taking this approach, we try to understand users' needs by looking at what they are doing, rather than what they want to do. We don't try to figure out the business value of each query before we allow it to run against the data. Our approach provides a self-service environment for users; they can experiment and invent with the data set they care about.
Q: What requirements most make database scalability a challenge?
Collins: Well, the approach I've just described presents a scalability challenge. We support predominantly ad hoc queries, running against a large, complex schema that combines third-normal-form and denormalized structures. The database is available 24 hours a day worldwide. We do not try to prevent expensive queries from running; we catch them after they have run and then educate the user about how to improve the query so that the next one performs better.
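Collins's catch-it-after-it-runs approach can be sketched as a periodic audit of a finished-query log. The log format, field names, and 30-minute threshold below are hypothetical, not a description of Amazon's actual tooling:

```python
# Sketch of post-execution query auditing: let every query run, then
# flag the expensive ones and follow up with the user who ran them.
# Log layout and the threshold are hypothetical illustrations.

EXPENSIVE_SECONDS = 30 * 60  # flag anything that ran longer than 30 minutes

query_log = [
    {"user": "analyst_a", "sql": "SELECT ... FROM orders ...",    "elapsed_s": 42},
    {"user": "analyst_b", "sql": "SELECT ... unfiltered scan",    "elapsed_s": 5_400},
    {"user": "analyst_c", "sql": "SELECT ... FROM customers ...", "elapsed_s": 610},
]

def audit(log, threshold=EXPENSIVE_SECONDS):
    """Return (user, sql, elapsed) for finished queries worth a follow-up."""
    return [(q["user"], q["sql"], q["elapsed_s"])
            for q in log if q["elapsed_s"] > threshold]

for user, sql, elapsed in audit(query_log):
    # In practice this would open a ticket or send tuning advice to the
    # user; here we simply print the finding.
    print(f"{user}: ran {elapsed / 60:.0f} min -- review query: {sql[:25]}")
```

The design trade-off is the one Collins names: no query is blocked up front, so users keep full autonomy, and the cost of an occasional runaway query is paid once and converted into user education.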
Q: What are you doing to meet scalability requirements?
Collins: The biggest challenges, and the most interesting problems, relate to ensuring that the DW is self-service. What is the right set of products we can build to ensure that users have autonomy, reasonable performance, and the capability to explore new areas at the pace Amazon's business model dictates?
The second area is integration of the DW into the operational fabric of the company. We need to ensure that end users can work seamlessly with data that ranges from intraday near-real time to long-term, integrated, and analytic data. We will spend a great deal of time looking at ways to give users a seamless view into Amazon information this year.
Q: Do you have any concrete measures of the business value realized by the database?
Collins: We maintain extensive detail and aggregate metrics related to business value, but do not share this data outside of Amazon. I can say that the DW continues to be an extremely good investment.
Q: How do the database's large size and heavy use affect your work?
Collins: We work hard and like what we do. As with any organization in an industry such as ours, the challenge is to organize ourselves to scale with rapid increases in demand. The more we can empower users with strong, self-service analytic tools and processes, the more they will innovate. The more we focus on the product side of DW versus the service side, the more value our team will bring to Amazon.
Q: Where do you see the database scalability challenge going in the future, and how do you plan to meet the challenge?
Collins: Tactically, we feel we have solved many problems with self-service warehousing in the 10- to 30-TB range. As I said earlier, our issues relate to information integration and aggressive extensions to our self-service model.
We are considering data problems that encroach on the petabyte range that we need help solving. We also envision an information management environment in which development teams can build and deploy new analytic information systems in a matter of hours and days versus weeks and months. This is also an area that we are looking to the IT industry to help us solve.