Is it possible for an AI to be trained only on data generated by another AI? It may sound like a far-fetched idea. But it has been around for a while, and as genuinely new, real-world data becomes harder to acquire, it is becoming increasingly important.
Anthropic used some synthetic data to train one of its flagship models, Claude 3.5 Sonnet. Meta fine-tuned its Llama 3.1 models using AI-generated data. And OpenAI is expected to source synthetic training data from o1, its "reasoning" model, for the upcoming Orion.
But why does AI need data at all, and what kind of data does it need? And can that data really be replaced with synthetic data?
The Importance of Annotations
AI systems are statistical machines. Trained on many examples, they learn the patterns in those examples to make predictions, such as that "to whom" in an email typically precedes "it may concern."
Annotations, usually text labeling the meaning or parts of the data these systems ingest, are a key element in these examples. They serve as guideposts, "teaching" a model to distinguish among things, places and concepts.
Consider a photo classification model that is shown many images of kitchens labeled with the word "kitchen." As it trains, the model begins to associate "kitchen" with general characteristics of kitchens (e.g., that they contain fridges and countertops). After training, given a photo of a kitchen that was not among the original examples, the model should be able to identify it as such. (Of course, if the pictures of kitchens were labeled "cow," it would identify them as cows, which underscores the importance of good annotation.)
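To make that concrete, here is a minimal, hypothetical sketch of what such supervised training looks like in code. The folder layout, model choice and training settings are illustrative assumptions, not any particular vendor's pipeline; the point is simply that the labels (here, folder names) are what the model learns to associate with image features.

```python
# Minimal sketch: each image carries a label ("kitchen", "cow", ...), and the
# model learns to associate that label with visual features.
# "labeled_photos/train" is a hypothetical directory of labeled images.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Each subfolder name ("kitchen", "living_room", ...) becomes a class label.
train_set = datasets.ImageFolder("labeled_photos/train", transform=transform)
loader = DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet18(weights=None, num_classes=len(train_set.classes))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    for images, labels in loader:              # labels come from the annotations
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)  # penalize wrong label predictions
        loss.backward()
        optimizer.step()

# If the kitchen photos had been mislabeled "cow", the model would learn exactly
# that association, which is why annotation quality matters so much.
```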
The appetite for AI, and the need to provide labeled data for its development, has inflated the market for annotation services. Dimension Market Research estimates that it is worth $838.2 million today and will be worth $10.34 billion within the next 10 years. While there are no precise estimates of how many people engage in labeling work, a 2022 paper puts the number in the "millions."
Companies large and small rely on workers employed by data annotation firms to create labels for AI training sets. Some of these jobs pay reasonably well, particularly if the labeling requires specialized knowledge (e.g., math expertise). Others can be grueling. Annotators in developing countries are paid, on average, just a few dollars per hour, without any benefits or guarantees of future work.
A drying well of data
So there are humanistic reasons to seek out alternatives to human-generated labels. (Uber, for example, is expanding its fleet of gig workers to work on AI annotation and data labeling.) But there are also practical ones.
Humans can only label so fast. Annotators also have biases that can manifest in their annotations, and subsequently in any models trained on them. Annotators make mistakes, or stumble over labeling instructions. And paying people to do things is expensive.
Data in general is expensive, for that matter. Shutterstock charges AI vendors tens of millions of dollars for access to its archives, while Reddit has made hundreds of millions by licensing data to Google, OpenAI and others.
Finally, it’s becoming increasingly difficult to acquire data.
Most models are trained on massive collections of public data, data that owners are increasingly withholding out of fear that it will be plagiarized or that they won't receive credit or attribution for it. More than 35% of the world's top 1,000 websites now block OpenAI's web scraper. And roughly 25% of data from "high-quality" sources has recently been restricted from the major datasets used to train models, one study found.
Should the current trend of blocking access continue, the research group Epoch AI projects that developers will run out of data to train generative AI models between 2026 and 2032. That, combined with fears of copyright lawsuits and objectionable material making its way into open datasets, has forced a reckoning for AI vendors.
Synthetic alternatives
At first glance, synthetic data would appear to be the solution to all these problems. Need annotations? Generate them. More example data? No problem. The sky is the limit.
And to a certain extent that's true.
"If 'data is the new oil,' synthetic data pitches itself as a biofuel that can be produced without the negative externalities of the real thing," Os Keyes, a graduate student at the University of Washington who studies the ethical implications of new technologies, told TechCrunch. "You can take a small starting set of data and simulate and extrapolate new entries from it."
The AI industry has taken the concept and run with it.
This month, Writer, an enterprise-focused generative AI company, debuted a model, Palmyra X 004, trained almost entirely on synthetic data. Writer claims developing it cost just $700,000, compared with estimates of $4.6 million for a comparably sized OpenAI model.
Microsoft's open Phi models were trained in part on synthetic data, as were Google's Gemma models. Nvidia this summer unveiled a family of models designed to generate synthetic training data, and AI startup Hugging Face recently released what it claims is the largest AI training dataset of synthetic text.
Synthetic data generation has become a business in its own right, one that could be worth $2.34 billion by 2030. Gartner predicts that 60% of the data used for AI and analytics projects this year will be synthetically generated.
Luca Soldaini, a senior research scientist at the Allen Institute for AI, noted that synthetic data techniques can be used to generate training data in a format that isn't easily obtained through scraping (or even through content licensing). For example, in training its video generator Movie Gen, Meta used Llama 3 to create captions for footage in the training data, which humans then refined to add more detail, such as descriptions of lighting.
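As a rough illustration (and not Meta's actual Movie Gen pipeline), the pattern looks something like the following: an instruction-tuned LLM drafts a caption from sparse clip metadata, and a human annotator then refines it. The model ID and metadata fields below are assumptions for the sketch.

```python
# Rough sketch: use an instruction-tuned LLM to draft synthetic captions from
# sparse clip metadata, which human annotators later refine.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model choice
)

clip_metadata = {
    "objects": "two people, a dog, a frisbee",
    "setting": "park at sunset",
}

prompt = (
    "Write a one-sentence training caption for a video clip.\n"
    f"Objects: {clip_metadata['objects']}\n"
    f"Setting: {clip_metadata['setting']}\n"
    "Caption:"
)

out = generator(prompt, max_new_tokens=60)
draft = out[0]["generated_text"][len(prompt):].strip()  # drop the echoed prompt
print(draft)  # a human would then refine this, e.g. adding lighting details
```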
Along similar lines, OpenAI says it fine-tuned GPT-4o using synthetic data to build the sketchpad-like Canvas feature for ChatGPT. And Amazon has said that it generates synthetic data to supplement the real-world data it uses to train speech recognition models for Alexa.
"Synthetic data models can be used to quickly expand on human intuition about which data is needed to achieve a specific model behavior," Soldaini said.
Synthetic risks
Synthetic data is no panacea, however. It suffers from the same "garbage in, garbage out" problem as all AI. Models create synthetic data, and if the data used to train those models contains biases and limitations, their outputs will be similarly tainted. For instance, groups that are poorly represented in the base data will be just as poorly represented in the synthetic data.
"The problem is, there's only so much you can do," Keyes said. "Say you only have 30 Black people in a dataset. Extrapolating out might help, but if those 30 people are all middle class, or all light-skinned, that's what the 'representative' data will all look like."
To that point, a 2023 study by researchers at Rice University and Stanford found that over-reliance on synthetic data during training can create models whose "quality or diversity progressively decrease." Sampling bias, a poor representation of the real world, causes a model's diversity to worsen after a few generations of training, according to the researchers (although they also found that mixing in some real-world data helps mitigate this).
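A deliberately simplified toy simulation shows the dynamic: a "model" (here, just a Gaussian fit) trained repeatedly on its own samples tends to lose spread over generations, while mixing some real data back in slows the decay. This is an illustration of the idea, not a reproduction of the study.

```python
# Toy simulation of recursive training on synthetic data: each "generation"
# fits a Gaussian to data sampled from the previous generation's fit.
import numpy as np

rng = np.random.default_rng(0)
real_data = rng.normal(loc=0.0, scale=1.0, size=50)   # stand-in for real-world data

def run(generations, real_fraction=0.0):
    data = real_data.copy()
    for _ in range(generations):
        mu, sigma = data.mean(), data.std()            # "train" the model
        synthetic = rng.normal(mu, sigma, size=50)     # generate synthetic data
        n_real = int(real_fraction * 50)
        data = np.concatenate([synthetic[: 50 - n_real],
                               rng.choice(real_data, n_real)])
    return data.std()

print("spread after 100 generations, pure synthetic:", round(run(100), 3))
print("spread after 100 generations, 20% real mixed in:", round(run(100, 0.2), 3))
# The spread (diversity) tends to shrink when the model only ever sees its own
# outputs; mixing real data back in mitigates the collapse.
```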
Keyes sees additional risks in complex models such as OpenAI's o1, which he thinks could produce harder-to-spot hallucinations in their synthetic data. These, in turn, could reduce the accuracy of models trained on that data, especially if the sources of the hallucinations aren't easy to identify.
"Complex models hallucinate; data produced by complex models contains hallucinations," Keyes added. "And with a model like o1, the developers themselves can't necessarily explain why artifacts appear."
Compounding hallucinations can lead to models that spew gibberish. A study published in the journal Nature shows how models trained on error-ridden data generate yet more error-ridden data, and how this feedback loop degrades future generations of models. The researchers found that, over generations, models lose their grasp of more esoteric knowledge, becoming more generic and often producing answers irrelevant to the questions they're asked.
A follow-up study shows that other types of models, such as image generators, aren't immune to this kind of collapse.
Soldaini agrees that "raw" synthetic data isn't to be trusted, at least if the goal is to avoid training forgetful chatbots and homogeneous image generators. Using it "safely," he says, requires thoroughly reviewing, curating and filtering it, and ideally pairing it with fresh, real data, just as you would with any other dataset.
Failing to do so could eventually lead to model collapse, where a model becomes less "creative" and more biased in its outputs, eventually seriously compromising its functionality. Although this process could be caught and stopped before it gets serious, it is a risk.
"Researchers need to examine the generated data, iterate on the generation process, and identify safeguards to remove low-quality data points," Soldaini said. "Synthetic data pipelines are not a self-improving machine; their output must be carefully inspected and improved before being used for training."
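In practice, that kind of safeguard might look something like the sketch below: dedupe, drop degenerate samples, apply a quality threshold and cap the synthetic share of the final mix. The scoring field and thresholds are invented placeholders; a real pipeline would rely on far more substantial checks (near-duplicate detection, toxicity filters, human review).

```python
# Minimal sketch of a synthetic-data curation step: dedupe, drop degenerate
# samples, and keep only records above a quality threshold before mixing them
# with real data. The quality score here is a placeholder (e.g., a reward
# model's rating or a human spot-check).
from dataclasses import dataclass

@dataclass
class Record:
    text: str
    quality: float

def curate(synthetic, real, quality_threshold=0.7, max_synthetic_ratio=0.5):
    seen, kept = set(), []
    for rec in synthetic:
        if rec.text in seen:                 # exact-duplicate filter
            continue
        if len(rec.text.split()) < 5:        # drop degenerate/too-short samples
            continue
        if rec.quality < quality_threshold:  # drop low-quality generations
            continue
        seen.add(rec.text)
        kept.append(rec)

    # Cap synthetic records so they make up at most max_synthetic_ratio of the mix.
    max_synth = int(len(real) * max_synthetic_ratio / (1 - max_synthetic_ratio))
    return real + kept[:max_synth]

training_mix = curate(
    synthetic=[Record("a generated caption about a sunlit kitchen scene", 0.9)],
    real=[Record("a human-written caption", 1.0)],
)
print(len(training_mix))
```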
OpenAI CEO Sam Altman once argued that AI will someday produce synthetic data good enough to train itself effectively. But, assuming that is even feasible, the technology doesn't exist yet. No major AI lab has released a model trained on synthetic data alone.
At least for the foreseeable future, it seems we will need humans in the loop to make sure a model's training doesn't go awry.