Big Data Governance - Metadata Is the Key - InformationWeek

Aroop Maliakkal Padmanabhan, Senior Manager, and Tiffany Nguyen, Senior Software Engineer, eBay

Big Data Governance: Metadata Is the Key

A new approach to data governance is needed in the age of big data, when data is scattered throughout the enterprise in many formats, and coming from many sources.

As the volume, variety and velocity of available data all continue to grow at astonishing rates, businesses face two urgent challenges: how to uncover actionable insights within this data, and how to protect it. Both of these challenges depend directly on a high level of data governance.

The Hadoop ecosystem can provide that level of governance using a metadata approach, ideally on a single data platform.

A new approach to governance is needed for several reasons. In the age of big data, data is scattered throughout the enterprise in structured, semi-structured, unstructured and various other formats. Furthermore, the sources of the data are not under the control of the teams that need to manage it.

In this environment, data governance includes three important goals:

  • Maintaining the quality of the data
  • Implementing access control and other data security measures
  • Capturing the metadata of datasets to support security efforts and facilitate end-user data consumption

Solutions within the Hadoop ecosystem

One way to approach big data governance in a Hadoop environment is through data tagging. In this approach, the metadata that will govern the data’s use is embedded with that data as it passes through various enterprise systems. Furthermore, this metadata is enhanced to include information beyond common attributes like file size, permissions and modification dates. For example, it might include business metadata that would help a data scientist evaluate a dataset’s usefulness in a particular predictive model.

Finally, unlike enterprise data itself, metadata can be centralized on a single platform.
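As a minimal sketch of this idea, assume an in-memory dictionary stands in for the central metadata platform. The Python below attaches both technical and business metadata to datasets and lets an analyst search by tag. The class names, fields, tags and dataset paths are all hypothetical and are not taken from any Hadoop component.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """Hypothetical record combining technical and business metadata."""
    path: str
    size_bytes: int
    owner: str
    tags: set = field(default_factory=set)            # e.g. {"PII", "sales"}
    business_notes: dict = field(default_factory=dict)


class MetadataCatalog:
    """Hypothetical central store: one place to register and search datasets."""

    def __init__(self):
        self._entries = {}  # dataset path -> DatasetMetadata

    def register(self, meta):
        self._entries[meta.path] = meta

    def find_by_tag(self, tag):
        # Return matching dataset paths in sorted order for stable results.
        return sorted(p for p, m in self._entries.items() if tag in m.tags)


catalog = MetadataCatalog()
catalog.register(DatasetMetadata(
    path="/data/sales/orders", size_bytes=10_000, owner="sales_team",
    tags={"sales", "PII"},
    business_notes={"refresh": "daily", "grain": "one row per order"},
))
catalog.register(DatasetMetadata(
    path="/data/web/clicks", size_bytes=50_000, owner="web_team",
    tags={"clickstream"},
))

pii_datasets = catalog.find_by_tag("PII")
```

The data itself stays wherever it lives; only the descriptive records are gathered in one place, which is what makes the search possible.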

The standard Hadoop Distributed File System (HDFS) has an extended attributes capability that allows enriched metadata, but it isn’t always adequate for big data. Fortunately, an alternative exists. The Apache Atlas metadata management system enables data tagging and can also serve as a centralized metadata store, one that offers “one-stop shopping” for data analysts who are searching for relevant datasets. In addition, users of the popular Hadoop-friendly Hive and Spark SQL data retrieval systems can do the tagging themselves.
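To make the extended attributes option concrete, the sketch below builds the HDFS shell commands that set and list an extended attribute on a path. It only assembles the argument lists and does not run them, so it works without a cluster. The attribute name user.sensitivity, its value and the dataset path are illustrative assumptions, as is the choice to restrict tags to the user namespace.

```python
import shlex


def setfattr_command(path, name, value):
    """Build the argument list for `hdfs dfs -setfattr` (not executed here)."""
    if not name.startswith("user."):
        # Assumption for this sketch: application tags live in the
        # "user." extended-attribute namespace.
        raise ValueError("attribute name must start with 'user.'")
    return ["hdfs", "dfs", "-setfattr", "-n", name, "-v", value, path]


def getfattr_command(path):
    """Build the argument list that dumps all extended attributes of a path."""
    return ["hdfs", "dfs", "-getfattr", "-d", path]


cmd = setfattr_command("/data/sales/orders", "user.sensitivity", "PII")
print(shlex.join(cmd))
```

On a machine with a configured Hadoop client, passing either list to subprocess.run would execute the command. Attributes set this way stay attached to individual files and directories, which is one reason a central store such as Atlas is attractive for search.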

For security, Atlas can be integrated with Apache Ranger, a system that provides role-based access control for Hadoop platforms.
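The general idea of combining tags with roles can be sketched as follows. This illustrates the concept only; it is not Ranger’s policy model or API, and the policy table, role names and tags are hypothetical.

```python
# Hypothetical policy table: tag -> roles allowed to read datasets carrying it.
TAG_POLICIES = {
    "PII": {"privacy_officer", "fraud_analyst"},
    "finance": {"finance_analyst"},
}


def can_read(user_roles, dataset_tags):
    """Allow access only if the user satisfies the policy of every restricted tag."""
    for tag in dataset_tags:
        allowed = TAG_POLICIES.get(tag)
        if allowed is None:
            continue  # no policy is defined for this tag, so it does not restrict
        if not (set(user_roles) & allowed):
            return False  # user holds none of the roles this tag requires
    return True
```

Because the check depends on tags rather than on individual dataset names, tagging a new dataset as PII is enough to bring it under the same policy.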

Platform loading challenges

Both the initial loading of metadata into the Atlas platform and the incremental loading that follows present significant challenges. For large enterprises, the sheer volume of data will be the main problem in the initial phase, and it may be necessary to optimize some code in order to carry out this phase efficiently.

Incremental loading is a more complex issue, because tables, indexes and authorized users change all the time. If these changes aren’t quickly reflected in the available metadata, the ultimate result is a reduction in the quality of the data available to end users. To avoid this problem, event listeners should be included in the system’s building blocks so that changes can be captured and processed in near real time. A real-time solution means more than better data quality; it also improves developer productivity, because developers don’t have to wait for a batch process.
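The listener pattern described above can be sketched with a small publish/subscribe hub. Everything here is hypothetical: the event names, the event dictionary layout and both classes are invented for illustration and do not reflect the hook or notification interfaces of Atlas, Hive or any other system.

```python
class MetadataStore:
    """Hypothetical stand-in for the central metadata repository."""

    def __init__(self):
        self.tables = {}  # table name -> list of column names

    def handle(self, event):
        kind = event["type"]
        name = event["table"]
        if kind == "CREATE_TABLE":
            self.tables[name] = list(event["columns"])
        elif kind == "ADD_COLUMN":
            self.tables[name].append(event["column"])
        elif kind == "DROP_TABLE":
            self.tables.pop(name, None)
        else:
            raise ValueError(f"unknown event type: {kind}")


class EventBus:
    """Minimal publish/subscribe hub; listeners run as soon as an event is published."""

    def __init__(self):
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def publish(self, event):
        for listener in self._listeners:
            listener(event)


store = MetadataStore()
bus = EventBus()
bus.subscribe(store.handle)

# Each change is applied to the metadata store the moment it happens,
# rather than waiting for a periodic batch scan.
bus.publish({"type": "CREATE_TABLE", "table": "orders", "columns": ["id", "total"]})
bus.publish({"type": "ADD_COLUMN", "table": "orders", "column": "currency"})
bus.publish({"type": "CREATE_TABLE", "table": "tmp_scratch", "columns": ["x"]})
bus.publish({"type": "DROP_TABLE", "table": "tmp_scratch"})
```

After these four events the store holds only the orders table with its three columns, so anyone querying the metadata sees the current state without waiting for a batch process.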

The foundation of digital transformation

As businesses pursue digital transformation and seek to be more data-driven, senior management needs to be aware that no results in this direction can be achieved without quality data, and that requires strong data governance. When big data is involved, governance based on enhanced metadata that resides in a central repository is a solution that works.

Aroop Maliakkal Padmanabhan is a Senior Manager on the Platform Engineering team at eBay. He leads the Hadoop team, which owns one of the biggest Hadoop clusters in the world. He has been actively working in the Hadoop space since 2008.

Tiffany Nguyen is a senior software engineer at eBay and has been a data enthusiast since 2015. She currently leads the data governance initiative on the big data platform at eBay.
