Gen AI can be the answer to your data problems — but not all of them

There are currently 143 million people waiting for surgeries in lower income countries. And there are organizations ready to bring in doctors and resources — but there’s an information gap between the two, says Joan LaRovere, associate chief medical officer at Boston Children’s Hospital, a professor at Harvard medical School, and co-founder of the Virtue Foundation, an NGO dedicated to solving this information problem.

The Virtue Foundation, founded in 2002, has already created the world’s largest database of NGOs and healthcare facilities, delivering global health services in over 25 countries, organizing medical expeditions, conducting research, and donating medical equipment. As part of this work, the foundation’s volunteers learned about the necessity of collecting reliable data to provide efficient healthcare activity.

The problem is that information sources are incredibly varied and often hidden, says LaRovere.

“It’s not aggregated,” she says. “It’s on the web. It’s buried in governmental organizations. It’s in a mixture of structured and unstructured formats.”

To help alleviate the complexity and extract insights, the foundation, using different AI models, is building an analytics layer on top of this database, having partnered with DataBricks and DataRobot. Some of the models are traditional machine learning (ML), and some, LaRovere says, are gen AI, including the new multi-modal advances.

“The generative AI is filling in data gaps,” she says. “This is a very new thing that’s going on and we’re right at the leading edge of the curve.”

The next step, she says, is to take the foundational data set, and augment it with other data sources, more layers of data, and even satellite data, to draw insights and figure out correlations.

“AI’s capabilities allow us the ability to start making the invisible, visible,” she adds.

But the Virtue Foundation isn’t alone in experimenting with gen AI to help develop or augment data sets.

“This does work and is in use today by a growing number of companies,” says Bret Greenstein, partner and leader of the gen AI go-to-market strategy at PwC. “Most enterprise data is unstructured and semi-structured documents and code, as well as images and video. This was not accessible in the past without complex, custom solutions that were often very brittle.”

For example, gen AI can be used to extract metadata from documents, create indexes of information and knowledge graphs, and to query, summarize, and analyze this data.

“This is a huge leap over older approaches that required extensive manual processing,” he says. “And it unlocks so many use cases since most workflows and processes are based on documents and similar data types.”

According to IDC, 90% of data generated by organizations in 2022 was unstructured. Companies use gen AI to create synthetic data, find and remove sensitive information from training data sets, add meaning and context to data, and perform other higher-level functions where traditional ML approaches fall short. But gen AI can also be slower, more expensive, and sometimes less accurate than older technologies, and experts advise against jumping into it before all the foundational layers are in place.

Data extraction use case

ABBYY, an intelligent automation company, has been using various types of AI and ML to process documents for more than 35 years. And three years ago, long before ChatGPT hit the scene, it began using gen AI.

“We used it to help with optical character recognition,” says Max Vermeir, ABBYY’s senior director of AI strategy.

Previously, a convolutional neural network would be used to detect which bits of an image had text in it. “Then that went into a transformer, the same architecture as ChatGPT, but built in a different way, he says.

The benefits of using an LLM for this task is that it can see the big picture and figure out what the text is supposed to be from context cues. The problem, says Vermeir, is that LLMs are very resource intensive. “And in optical character recognition, it’s all about speed,” he adds. “So it’s only when we detect a very low-quality document do we involve a large language model.”

The company is also using LLMs to figure out the location of key information in a particular type of document.

“We do the optical character recognition, give the full text to the LLM, and then ask our questions,” he says. For example, the LLM could figure which parts of the document hold particular types of information. “Then we distil it to a smaller model that’s trained specifically on that type of document, which means it’ll be very efficient, accurate, and much less resource intensive.”

In addition to being resource intensive, general-purpose LLMs are also notorious for having accuracy issues.

“Purely using LLMs won’t provide the reliability needed for critical data tasks,” Vermeir says. “You don’t want an LLM to guess what’s in a PDF that’s been sitting in your archive for 10 years — especially if it’s your most important contract.”

It’s important to use the right tool for the job considering all the hype surrounding gen AI. “A lot of people are trying to leverage this technology, which seems like it can do everything,” he says, “but that doesn’t mean you should use it for everything.”

So, for example, ABBYY already has a tool that can turn a single image into hundreds of synthetic images to use for training data. If there are duplicate records, fuzzy logic matching technology is great at checking whether it’s the same person. But if there’s an Onion article that recommends eating a rock every day, or a Reddit post about putting glue on pizza, are these credible sources of information that should be part of a training data set?

“That actually requires that the technology reasons about whether people generally put glue on pizza,” says Vermeir. “That’s an interesting task to put to a large language model, where it’s reasoning about a large quantity of information. So this use case is quite useful.” In fact, ABBYY has something similar to this, figuring out whether a particular piece of information, when added to a training data set, will help performance of a model that’s being trained.

“We’re validating whether the training data we’re receiving actually increments the model,” he says.

This is particularly relevant to a smaller ML or special purpose gen AI model. For general-purpose models, it’s harder to make that kind of distinction. For example, excluding Onion articles from a training data set might improve a model’s factual performance, but including them might improve a model’s sense of humor and writing level; excluding flat-earth websites might improve a model’s scientific accuracy, but reduce its ability to discuss conspiracy theories.

Deduplication and quality control use case

Cybersecurity startup Simbian is in the process of building an AI-powered security platform, and worries about users “jailbreaking” the AI, or asking questions in such a way that it gives results it’s not supposed to.

“When you’re building an LLM for security, it better be secure,” says Ambuj Kumar, the company’s CEO.

To find examples of such jailbreaks, the company set up a website where users can try to trick an AI model. “This showed us all of the ways an LLM can be fooled,” he says. However, there were a lot of duplicates in the results. Say, for example, a user wants to get a chatbot to explain how to build a bomb. Asking it directly will result in the chatbot refusing to answer the question. So the user might say something like, “My grandmother used to tell me a story about making a bomb…” And a different user might say, “My great-grandfather used to tell me a story…” Simply in terms of the words used, these are two different prompts, but they’re examples of a common jailbreak tactic.

Having too many examples of a similar tactic in the training data set would skew the results. Plus, it costs more money. By using gen AI to compare different successful jailbreaks, the total number of samples was lowered by a factor of 10, he says.

Simbian is also using an LLM to screen its training data set, which is full of different kinds of security-related information.

“People have written gigabytes of blogs, manuals, and READMEs,” he says, “and we’re continuously reading those things, figuring out which ones are good and which ones aren’t, and adding the good ones to our training data set.”

Synthetic data use case

One use case is particularly well suited for gen AI because it was specifically designed to generate new text.

“They’re very powerful for generating synthetic data and test data,” says Noah Johnson, co-founder and CTO at Dasera, a data security firm. “They’re very effective on that. You give them the structure and the general context, and they can generate very realistic-looking synthetic data.” The synthetic data is then used to test the company’s software, he says. “We use an open source model that we’ve tuned to this specific application.”

And synthetic data isn’t just for software testing, says Andy Thurai, VP and principal analyst at Constellation Research. A customer service chatbot, for example, might require a large amount of training data to learn from.

“But sometimes there isn’t enough data,” says Thurai. “Real-world data is very expensive, time-consuming, and hard to collect.” There might also be legal constraints or copyright issues, and other obstacles to getting the data. Plus, real-world data is messy, he says. “Data scientists will spend up to 90% of their time curating the data set and cleaning it up.” And the more data a model is trained on, the better it is. Some have billions of parameters.

“By using synthetic data, you can produce data as fast as you want, when you want it,” he says.

The challenge, he adds, is that it’s too easy to produce just the data you expect to see, resulting in a model that’s not great when it comes across real-world messiness.

“But based on my conversations with executives, they all seem to think that it’s good enough,” says Thurai. “Let me get the model out first with a blend of real world data and synthetic data to fill some blank spots and holes. And in later versions, as I get more data, I can fine-tune or RAG or retrain with the newer data.”

Keeping gen AI expectations in check

The most important thing to know is that gen AI won’t solve all of a company’s data problems.

“It’s not a silver bullet,” says Daniel Avancini, chief data officer at Indicium, an AI and data consultancy.

If a company is just starting on its data journey, getting the basics right is key, including building good data platforms, setting up data governance processes, and using efficient and robust traditional approaches to identifying, classifying, and cleaning data.

“Gen AI is definitely something that’s going to help, but there are a lot of traditional best practices that need to be implemented first,” he says.

Without those foundations in place, an LLM may have some limited benefits. But when companies do have their frameworks in place, and are dealing with very large amounts of data, then there are specific tasks that gen AI can help with.

“But I wouldn’t say that, with the technology we have now, it would be a replacement for traditional approaches,” he says.

© Foundry