Solving the data quality problem in generative AI

By Alex Watson

The potential of generative AI has captivated businesses and consumers alike, but growing concerns around issues like privacy, accuracy, and bias have prompted a burning question: What are we feeding these models?

The current supply of public data has been adequate to produce high-quality, general-purpose models, but it is not enough to fuel the specialized models enterprises need. Meanwhile, emerging AI regulations are making it harder to safely handle and process raw sensitive data within the private domain. Developers need richer, more sustainable data sources, which is why many leading tech companies are turning to synthetic data.

Earlier this year, major AI companies like Google and Anthropic started to tap into synthetic data to train models like Gemma and Claude. More recently, Meta’s Llama 3 and Microsoft’s Phi-3 were released, both trained partly on synthetic data, with both companies attributing strong performance gains to it.

On the heels of these gains, it has become abundantly clear that synthetic data is essential for scaling AI innovation. At the same time, there’s understandably a lot of skepticism and trepidation surrounding the quality of synthetic data. But in reality, synthetic data has a lot of promise for addressing the broader data quality challenges that developers are grappling with. Here’s why.

Data quality in the AI era

Traditionally, industries leveraging the “big data” necessary for training powerful AI models have defined data quality by the “three Vs” (volume, velocity, variety). This framework addresses some of the most common challenges enterprises face: “dirty data” (data that is outdated, insecure, incomplete, or inaccurate) and insufficient training data. But in the context of modern AI training, there are two additional dimensions to consider: veracity (the data’s accuracy and utility) and privacy (assurances that the original data is not compromised). Absent any of these five elements, data quality bottlenecks that hamper model performance and business value are bound to occur. Even more problematic, enterprises risk noncompliance, heavy fines, and loss of trust among customers and partners.

Mark Zuckerberg and Dario Amodei have also pointed out the importance of retraining models with fresh, high-quality data to build and scale the next generation of AI systems. Doing so will require sophisticated data generation engines, privacy-enhancing technologies, and validation mechanisms baked into the AI training life cycle. This comprehensive approach is necessary to safely leverage real-time, real-world “seed data,” which often contains personally identifiable information (PII), to produce truly novel insights, and it ensures that AI models continuously learn and adapt to dynamic, real-world events. To do this safely and at scale, however, the privacy problem must be solved first. This is where privacy-preserving synthetic data generation comes into play.

Many of today’s LLMs are trained entirely on public data, a practice that creates a critical bottleneck to innovation with AI. For privacy and compliance reasons, valuable data that businesses collect, such as patient medical records, call center transcripts, and even doctors’ notes, often cannot be used to train models. This can be solved by a privacy-preserving approach called differential privacy, which makes it possible to generate synthetic data with mathematical privacy guarantees.
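To make the idea concrete, here is a minimal sketch of differential privacy’s core mechanism, applied to a single aggregate statistic rather than to model training. The scenario and parameter values (epsilon, delta, the simulated patient flags) are illustrative assumptions, not any particular vendor’s implementation; production synthetic data systems typically apply differential privacy during generative model training, but the noise-calibration principle is the same.

```python
import numpy as np

def gaussian_mechanism(true_value: float, sensitivity: float,
                       epsilon: float, delta: float) -> float:
    """Release a statistic with (epsilon, delta)-differential privacy by
    adding calibrated Gaussian noise. The sigma formula is the standard
    analytic bound, valid for epsilon < 1."""
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return true_value + np.random.normal(0, sigma)

# Illustrative example: privately release how many patients in a cohort
# have a condition. One patient joining or leaving changes the count by
# at most 1, so the query's sensitivity is 1.
records = np.random.binomial(1, 0.3, size=10_000)  # stand-in patient flags
private_count = gaussian_mechanism(records.sum(), sensitivity=1.0,
                                   epsilon=0.5, delta=1e-5)
print(f"true count: {records.sum()}, private count: {private_count:.0f}")
```

The key property is that the noise scale depends only on epsilon, delta, and the query’s sensitivity, never on the underlying records, which is what makes the privacy guarantee mathematical rather than heuristic.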

The next major advance in AI will be built on data that is not public today. The organizations that manage to safely train models on sensitive, regulated data will emerge as leaders in the AI era.

What qualifies as high-quality synthetic data?

First, let’s define synthetic data. “Synthetic data” has long been a loose term referring to any AI-generated data, but this broad definition ignores variation in how the data is generated and to what end. It’s one thing to create software test data; it’s another to train a generative AI model on one million synthetic patient medical records.

There has been substantial progress in synthetic data generation since it first emerged. Today, the standards for synthetic data are much higher, particularly when we are talking about training commercial AI models. For enterprise-grade AI training, synthetic data processes must include the following:

- Privacy guarantees, such as differential privacy, that mathematically ensure the original sensitive records cannot be recovered from the generated data
- Validation mechanisms that measure the generated data’s accuracy and utility (its veracity) against the real-world data it stands in for
- Human oversight in the feedback loop to evaluate outputs and catch quality problems before they reach training

When synthetic data meets the above criteria, it is just as effective as, or better than, real-world data at improving AI performance. It has the power not only to protect private information, but also to balance or boost existing records and to simulate novel, diverse samples that fill critical gaps in training data. It can also dramatically reduce the amount of training data developers need, significantly accelerating experimentation, evaluation, and deployment cycles.
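As a sketch of the “balance or boost existing records” idea, the toy example below oversamples a minority class with generated records. The generate_synthetic function is a hypothetical placeholder (a per-feature Gaussian fit to the minority class), standing in for a real trained generative model with privacy guarantees:

```python
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced training set: 950 majority samples, 50 minority samples.
X_major = rng.normal(loc=0.0, scale=1.0, size=(950, 4))
X_minor = rng.normal(loc=3.0, scale=1.0, size=(50, 4))

def generate_synthetic(real: np.ndarray, n: int) -> np.ndarray:
    """Placeholder generator: fits a per-feature Gaussian to the real
    minority samples and draws new ones. A production system would use
    a trained generative model with privacy guarantees instead."""
    mu, sigma = real.mean(axis=0), real.std(axis=0)
    return rng.normal(mu, sigma, size=(n, real.shape[1]))

# Boost the minority class with synthetic records until the classes balance.
X_synth = generate_synthetic(X_minor, n=len(X_major) - len(X_minor))
X_balanced = np.vstack([X_major, X_minor, X_synth])
y_balanced = np.concatenate([np.zeros(len(X_major)),
                             np.ones(len(X_minor) + len(X_synth))])
print(f"minority share after boosting: {y_balanced.mean():.2f}")  # ~0.50
```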

But what about model collapse?

One of the biggest misconceptions surrounding synthetic data is that it inevitably leads to model collapse. In fact, model collapse stems from research that isn’t really about synthetic data at all. It is about feedback loops in AI and machine learning systems, and the need for better data governance.

For instance, the main issue raised in the paper “The Curse of Recursion: Training on Generated Data Makes Models Forget” is that future generations of large language models may degrade because their training data contains output created by older generations of LLMs. The most important takeaway from this research is that to remain performant and sustainable, models need a steady flow of high-quality, task-specific training data. For most high-value AI applications, this means fresh, real-time data that is grounded in the reality these models must operate in. Because this often includes sensitive data, it also requires infrastructure to anonymize, generate, and evaluate vast amounts of data, with humans involved in the feedback loop.
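A toy simulation makes the feedback-loop failure concrete. The sketch below repeatedly fits a simple model (a Gaussian) to data, samples from it, and refits on those samples, with no fresh real data ever entering the loop. It illustrates the recursion effect the paper describes, not its actual LLM experiments:

```python
import numpy as np

rng = np.random.default_rng(7)

# Generation 0: "real" data from a standard normal distribution.
real_mu, real_sigma = 0.0, 1.0
data = rng.normal(real_mu, real_sigma, size=100)

for generation in range(1, 101):
    # "Train" a model on the current data: estimate its mean and std.
    mu, sigma = data.mean(), data.std()
    # The next generation trains only on samples from the previous
    # model; no fresh real-world data re-enters the loop.
    data = rng.normal(mu, sigma, size=100)
    if generation % 25 == 0:
        print(f"gen {generation:3d}: mu = {mu:+.2f}, sigma = {sigma:.2f}")

# Each generation inherits the previous one's sampling error, so the fit
# drifts further from the real distribution (mu=0, sigma=1): the model
# gradually "forgets" the data it was originally meant to learn.
```

Breaking the loop requires exactly what the paragraph above describes: a steady injection of fresh, high-quality data at each generation, which is what privacy-preserving synthetic data pipelines are designed to supply.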

Without the ability to leverage sensitive data in a secure, timely, and ongoing manner, AI developers will continue to struggle with model hallucinations and model collapse. This is why high-quality, privacy-preserving synthetic data is a solution to model collapse, not the cause. It provides a secure interface to real-time sensitive data, allowing developers to safely build more accurate, timely, and specialized models.

The highest quality data is synthetic

As high-quality data in the public domain is exhausted, AI developers are under intense pressure to leverage proprietary data sources. Synthetic data generation is the most reliable and effective way to tap those sources without sacrificing performance or privacy.

To stay competitive in today’s fast-paced AI landscape, developers cannot afford to overlook synthetic data.

Alex Watson is co-founder and chief product officer at Gretel.

Generative AI Insights provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.

© InfoWorld