Taking the Plunge With Synthetic Data

Fake data has its uses. Learn how it can help your organization.

John Edwards, Technology Journalist & Author

January 30, 2024

At a Glance

  • Simulated data constructed algorithmically has numerous real-world use cases and benefits.
  • Properly developed synthetic data is the key to getting started.
  • Using synthetic data in AI applications allows for creative use without risking sensitive data.

Unlike conventional data, which is created by real-world activities, synthetic data is entirely artificial. Constructed algorithmically, synthetic data is frequently used as a substitute in test datasets, as well as to validate mathematical models and train AI and ML models.

Synthetic data is relatively inexpensive to create, easily accessible, and allows testing without any human impact concerns, says Viveca Pavon-Harr, chief data officer at Accenture Federal Services, in an email interview. Synthetic data can also facilitate faster model testing and evaluations and, depending on the type of work an organization does, can allow quicker data acquisition and data documentation.

Synthetic data is prized for its ability to create balanced and unbiased datasets, a significant challenge in machine learning, observes Woody Zhu, an assistant professor of data analytics at Carnegie Mellon University’s Heinz College of Information Systems and Public Policy, via email. “By simulating data, we can address issues of bias and fairness, particularly in high-stakes fields like healthcare, power systems, finance, and education,” he explains. “This leads to the development of more trustworthy and inclusive machine learning models.”

Numerous Benefits

It’s frequently difficult to achieve a high degree of model accuracy when only limited data is available, says Olga Kupriyanova, principal consultant with global technology research and advisory firm ISG, via email. “Organizations can leverage synthetic data to train models that would otherwise not reach the necessary levels of performance,” she explains.

Perhaps the most typical synthetic data use case is fraud detection. “Fraudulent events are rare, yet models need to be trained to detect them,” Kupriyanova says. “The best way to do this is to generate synthetic events data to expand training opportunities.”
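
As a rough illustration of expanding a rare class, the sketch below creates new "fraud" rows by interpolating between pairs of real ones. This is only one simple approach, not a method described by the sources quoted here; the function name `synthesize_minority`, the two-feature record layout, and the sample values are all hypothetical.

```python
import random


def synthesize_minority(samples, n_new, seed=0):
    """Return n_new synthetic rows, each interpolated between two real rows."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)  # pick two distinct real rows
        t = rng.random()               # interpolation weight in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


# Hypothetical fraud records: [transaction amount, hour of day]
fraud_rows = [[950.0, 2.0], [1200.0, 3.0], [870.0, 1.0]]
synthetic_fraud = synthesize_minority(fraud_rows, n_new=5)
```

Because every synthetic row lies between two real rows, the output cannot stray outside the range of the observed fraud events. That keeps the example safe but also means it adds no genuinely new patterns, which is one reason real projects often turn to richer generators.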

Synthetic data shines when real data is scarce, sensitive, or too risky to use. “In scenarios where gathering ample and diverse data is impossible, challenging, or unethical, synthetic data steps in as a reliable alternative,” Zhu says. “It allows organizations to model complex situations without compromising privacy or safety.”

Synthetic data becomes easily attainable and inexpensive to create when generative AI is used. “The data is not only easily generated, but it can also have embedded annotations already included,” Pavon-Harr notes. “This is a huge benefit for organizations, given that it reduces the labor-intensive task of going through data and identifying features and metadata.”

Yet another benefit is that data can be generated in a way that removes or limits biases and vulnerabilities. This can help avoid producing unintended information, or information that isn’t truly representative of a particular group. “If we think about the medical space, for example, using patient information could violate privacy concerns,” Pavon-Harr observes. With synthetic data, private information about individuals can be completely removed. “This provides great opportunities for research and scenario building without exposing negative events or consequences.”

Any model-generated content, whether a prediction or a set of synthetic variables or outputs, can be subject to bias or inaccurate content. “This is especially a risk for synthetic data, which by its nature is tied to the rules set to it by the model creator,” Kupriyanova says. “It’s important to remember that synthetic data effectively generates data via generative AI capabilities, which means it can hallucinate when given the direction to create something it doesn’t have enough context for.” In other words, all of the risks associated with generative AI also exist for synthetic data.

Getting Started

Synthetic data initiatives should be driven by need. “If you have a business use case that requires an AI solution, but you can’t get enough data to generate the right kind of behavior, then it’s time to consider ways the model can be improved,” Kupriyanova says. “One of your options will be synthetic data.”

On the downside, if the synthetic data isn’t correctly developed, the resulting models won’t perform as expected. “If the data created isn’t a true representation of what’s being evaluated, the models will not converge,” Pavon-Harr says.
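
One basic way to check whether synthetic data resembles the real thing is to compare summary statistics before training on it. The sketch below is a minimal example of that idea, assuming a single numeric column; the function name `fidelity_report`, the 25% tolerance, and the sample values are illustrative assumptions, and a real evaluation would look at far more than a mean and standard deviation.

```python
from statistics import mean, stdev


def fidelity_report(real, synthetic, tolerance=0.25):
    """Compare mean and stdev of two samples; flag relative drift above tolerance."""
    report = {}
    for name, fn in (("mean", mean), ("stdev", stdev)):
        r, s = fn(real), fn(synthetic)
        if r:
            rel_diff = abs(r - s) / abs(r)
        else:
            # Real statistic is zero: only an exact match counts as no drift.
            rel_diff = 0.0 if s == 0 else float("inf")
        report[name] = {"real": r, "synthetic": s, "ok": rel_diff <= tolerance}
    return report


# Hypothetical transaction amounts
real_amounts = [20.0, 35.0, 50.0, 65.0, 80.0]
close_synthetic = [22.0, 33.0, 51.0, 66.0, 78.0]
drifted_synthetic = [200.0, 210.0, 190.0, 205.0, 195.0]

close_report = fidelity_report(real_amounts, close_synthetic)
drifted_report = fidelity_report(real_amounts, drifted_synthetic)
```

Here the first synthetic sample stays within tolerance on both statistics, while the second is flagged on both. A failed check is a signal to revisit the generator before any model is trained on its output.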

Initiating work with synthetic data requires a foundation in high-quality real data or substantial domain knowledge, Zhu warns.

An Opportunity

Synthetic data offers an opportunity to study new methodologies and infuse creativity into various approaches to AI without putting humans or sensitive data at risk. Synthetic data should be used to exemplify human populations, expedite research opportunities, and remove bias whenever possible. “All generalized assumptions should be vetted to ensure as much truth as possible is included in the data, not just what was conveniently gathered,” Pavon-Harr states.

While synthetic data is immensely useful, it’s important to be wary of over-reliance. “There’s always a risk of missing out on subtle real-world nuances,” Zhu explains. “Ensuring accuracy in simulation and being mindful of ethical considerations in data representation and usage are key.”

About the Author(s)

John Edwards

Technology Journalist & Author

John Edwards is a veteran business technology journalist. His work has appeared in The New York Times, The Washington Post, and numerous business and technology publications, including Computerworld, CFO Magazine, IBM Data Management Magazine, RFID Journal, and Electronic Design. He has also written columns for The Economist's Business Intelligence Unit and PricewaterhouseCoopers' Communications Direct. John has authored several books on business technology topics. His work began appearing online as early as 1983. Throughout the 1980s and 90s, he wrote daily news and feature articles for both the CompuServe and Prodigy online services. His "Behind the Screens" commentaries made him the world's first known professional blogger.
