Is Your Data Lake Becoming a Swamp? Keep that Lake Clean

Understanding past mistakes that organizations have made with their data can play a key role. Taking an actionable, aware approach to this process can help set your organization up for success.

Robin Das, Executive Director, DataBee

November 1, 2024


Organizations are drowning in data, including security data. Data lakes can help manage this data tsunami thanks to their ability to store diverse data types (including unstructured data), their ease of scaling, the flexibility of schema-on-read, and their potential cost-management benefits.

However, it’s important to ensure these lakes don’t turn into data swamps. Understanding past mistakes that organizations have made with their data can play a key role. Taking an actionable, aware approach to this process can help set your organization up for success.  

The Murky History of Data Lakes  

Enterprise data science teams were some of the first to transition from traditional data warehouses to data lakes in the early 2010s, but they quickly found that the lakes’ flexibility was also their downfall. This brings to mind the “Tragedy of the Commons,” an economics idea that states that when people have unlimited access to a limited shared resource, they will likely overuse it, destroying its value. These data lakes turned into swamps as multiple teams dumped in data of varying quality and reliability.

As the data lakes got bogged down, teams could no longer determine which data was good, how reliable it was, or who owned it. Further challenges emerged at the organizational level: how to manage growing data storage and compute costs, and how to handle the security and compliance issues that arose from not knowing what rogue or orphaned data was stored. Where this happened, implementing controls after the fact was impossible: data was created faster than the organization could keep up with it, and teams were constantly playing catch-up.


From Security Data Lake to Data Swamp: A Problem 

Security data is growing exponentially, with the average enterprise using anywhere from 40 to over 100 different security tools, all of which produce data using unique semantics and file formats that create data silos and limit the data’s usefulness. Collecting all that data into a centralized security data lake makes it more usable, enabling more insights. 

Data lakes are gaining prominence in cybersecurity, in part because of their flexibility; you can put any kind of data with any schema in a data lake. However, just like standard data lakes, a security data lake can get muddied by data of differing types and quality that serves vastly different use cases, from scaled production to open-ended experimentation, bogging down the potential and usefulness of the data in the lake.

For security data lake users, the data you need may well be there, but finding it is a different story, especially when it is labeled with names like “Dave’s pen test data Friday” or when there are 300 files all labeled “Critical SOC Alerts.” Even once you find useful data, is it reliable? And whom do you contact if it stops being updated or the feed is lost entirely?


As the swamp swallows more data, compute and storage costs will rise and performance will suffer organization-wide. Beyond operations, if no one is responsible for tracking what gets put into the lake -- and, more importantly, what gets deleted -- complying with privacy regulations becomes an issue. Finally, if you don’t know what data you have, how can you protect it or notice if it has been stolen or tampered with?

3 Steps for a Clean Data Lake 

Security teams can learn from the challenges that their enterprise data counterparts have had with data lakes and apply these valuable lessons to their approach.  

1. Create an intentional data strategy. It’s important to be able to answer the “why” versus the “what for” and the “how” of the data going into your security data lake. A security lead needs to understand what data should be ingested and what outcomes the team is trying to drive, rather than just putting the data in the lake and hoping an outcome emerges.


2. Create strong data governance. Once the strategy is set, the next step is strong data governance, ownership and accountability. This goes beyond access controls, such as who can access the data and who can’t. It means clarity on who owns and is responsible for the quality and integrity of the data sets, as well as having tools that can monitor them to ensure consistent, timely, and accurate data.

If, for instance, you’re talking about streaming log data into the security data lake, who is responsible for making sure that log data is consistent? Who makes sure it keeps coming in? Then, who is responsible for managing the data once it is in the lake? It’s important to collaborate with whoever has ongoing day-to-day governance of, and responsibility for access to, the data lake.

3. Pay attention to the metadata. An often-overlooked aspect of keeping the data lake clean and navigable is the metadata. Clean, consistent metadata makes it easier for the different teams and stakeholders that are trying to consume that data.

Data tagging with consistent file structures and naming conventions is just the minimum. Incorporating data lineage helps answer ownership questions easily, and usage metrics help validate a data set’s usefulness, further keeping the data lake from getting murky.
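To make steps 2 and 3 a little more concrete, here is a minimal sketch of what a catalog check might look like. Everything in it is an assumption made for illustration: the `DatasetRecord` fields, the `source_type_environment` naming pattern, and the `audit` function are hypothetical and do not come from any particular data lake product or catalog tool.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical naming convention: <source>_<datatype>_<environment>, lowercase.
NAME_PATTERN = re.compile(r"^[a-z0-9]+_[a-z0-9]+_(prod|dev|test)$")


@dataclass
class DatasetRecord:
    """Illustrative metadata a catalog could hold for one data set."""
    name: str
    owner: str                 # team accountable for quality and integrity
    upstream: list[str]        # lineage: where the data comes from
    last_updated: datetime     # when the feed last delivered data
    max_age: timedelta         # how stale the data set is allowed to get
    reads_last_30_days: int = 0  # simple usage metric


def audit(record: DatasetRecord, now: datetime) -> list[str]:
    """Return the governance and metadata problems found for one record."""
    problems = []
    if not NAME_PATTERN.match(record.name):
        problems.append("name does not follow the naming convention")
    if not record.owner.strip():
        problems.append("no owner assigned")
    if not record.upstream:
        problems.append("no lineage recorded")
    if now - record.last_updated > record.max_age:
        problems.append("data is stale")
    if record.reads_last_30_days == 0:
        problems.append("no recent usage")
    return problems


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    swampy = DatasetRecord(
        name="Dave's pen test data Friday",
        owner="",
        upstream=[],
        last_updated=now - timedelta(days=90),
        max_age=timedelta(days=7),
    )
    for problem in audit(swampy, now):
        print(f"{swampy.name}: {problem}")
```

Running a check like this at ingestion time, rather than after the fact, is one way to keep unowned or unlabeled data from entering the lake in the first place. The specific rules matter less than having someone accountable for defining and enforcing them.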

Avoid the Swamp 

Data lakes were introduced to help organizations make better sense of their data by offering a better way to store and organize the ever-growing volumes of data they’re drowning in today. But too often, these pristine data lakes become dank, viscous swamps as many teams use this shared resource for vastly different reasons. When that happens, a data lake becomes a burden, not an enabler. As security teams embrace security data lakes to collate and integrate data from multiple security platforms and tools, it’s important to ensure these do not become data swamps. Learn from past mistakes and create a security data lake that serves your organization well.

About the Author

Robin Das

Executive Director, DataBee

Robin Das is executive director, market growth strategy for DataBee, Comcast Technology Solutions’ cybersecurity business unit. In this role, Robin is responsible for defining DataBee’s unique value proposition in the market, long term strategy and product vision, and business development opportunities via outreach to strategic targets, partnerships, alliances, and other investments to continue to drive overall growth. 
