What you need to know about tech companies using AI to train their own AI

OpenAI, Google and other tech companies are training their chatbots with vast amounts of data culled from books, Wikipedia articles, news and other sources on the internet. But in the future, they hope to use something called synthetic data.

That’s because tech companies may be running out of the high-quality text the internet has to offer for AI development. They have also faced copyright lawsuits from authors, news organizations and computer programmers for using their works without permission. (In one such case, The New York Times is suing OpenAI and Microsoft.)

They believe synthetic data will help reduce copyright issues and increase the supply of training material needed for AI. Here’s what you need to know about it.

What is synthetic data? It is data generated by artificial intelligence itself.

Does that mean companies would use AI to train AI? Yes. Instead of training AI models with text written by humans, tech companies like Google, OpenAI and Anthropic hope to train their technologies with data generated by other AI models.

Is synthetic data reliable? Not exactly. AI models make mistakes and make things up. They have also been shown to pick up the biases that appear in the internet data they were trained on. So if companies use AI to train AI, they may end up reinforcing their own flaws.

Is it widely used today? No. Tech companies are experimenting with it, but because of the potential drawbacks of synthetic data, it is not a big part of how AI systems are built today.

Companies believe they can refine the way synthetic data is created. OpenAI and others have explored a technique in which two different AI models work together to generate synthetic data that is more useful and reliable.

An AI model generates the data. A second model then evaluates that data much as a human would, deciding whether it is good or bad, accurate or not. AI models are actually better at judging text than at writing it.
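Below is a minimal sketch of that generate-and-judge loop in Python. The function names and the crude length-based judging rule are hypothetical stand-ins, not any company's actual pipeline; a real system would call two live models and use a far more sophisticated comparison.

```python
import random

# Hypothetical stand-in for the first model, which drafts candidate text.
# In a real pipeline this would be a call to a generative model.
def generator_model(prompt: str) -> str:
    return f"Candidate answer to '{prompt}' (draft #{random.randint(1, 1000)})"

# Hypothetical stand-in for the second model, which compares two
# candidates and returns the one it judges better. A real judge model
# would weigh accuracy, clarity and safety, not just length.
def judge_model(candidate_a: str, candidate_b: str) -> str:
    return candidate_a if len(candidate_a) >= len(candidate_b) else candidate_b

def make_synthetic_example(prompt: str) -> dict:
    # Step 1: the generator drafts two candidate responses.
    a = generator_model(prompt)
    b = generator_model(prompt)
    # Step 2: the judge picks the better one -- models are generally
    # better at comparing two texts than at writing one from scratch.
    best = judge_model(a, b)
    # Step 3: the winning pair becomes one synthetic training example.
    return {"prompt": prompt, "response": best}

if __name__ == "__main__":
    print(make_synthetic_example("Explain photosynthesis in one sentence."))
```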

“If you give technology two things, it’s pretty good at picking which looks best,” said Nathan Lyle, CEO of AI startup SynthLabs.

The idea is that this will provide the high-quality data needed to train an even better chatbot.

Does that work? Sort of. It all comes down to that second AI model: How good is it at judging text?

Anthropic has been the most vocal about its efforts to make this work. It refined its second AI model using a “constitution” prepared by the company’s researchers. This teaches the model to choose text that supports certain principles, such as liberty, equality and fraternity, or life, liberty and personal security. Anthropic’s method is known as “Constitutional AI.”

Here’s how two AI models work in tandem to produce synthetic data using a process like Anthropic’s:
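The Python sketch below illustrates that loop under stated assumptions: the constitution is abbreviated to the two principle sets mentioned above, and the model calls are hypothetical stand-ins rather than Anthropic’s actual API or pipeline.

```python
# Abbreviated, hypothetical constitution. Anthropic's real constitution
# is much longer and drawn from a range of sources.
CONSTITUTION = [
    "support liberty, equality, and fraternity",
    "support life, liberty, and personal security",
]

# Stand-in for the first model, which drafts an initial response.
def generator_model(prompt: str) -> str:
    return f"Initial draft answering: {prompt}"

# Stand-in for the second model, which critiques a draft against one
# constitutional principle and returns a revised version.
def critique_and_revise(draft: str, principle: str) -> str:
    return f"{draft} [revised to better {principle}]"

def constitutional_example(prompt: str) -> dict:
    # Step 1: the first model generates a draft response.
    draft = generator_model(prompt)
    # Step 2: the second model critiques and revises the draft,
    # one constitutional principle at a time.
    for principle in CONSTITUTION:
        draft = critique_and_revise(draft, principle)
    # Step 3: the fully revised text becomes a synthetic training pair.
    return {"prompt": prompt, "response": draft}

if __name__ == "__main__":
    print(constitutional_example("How should neighbors settle a dispute?"))
```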

However, humans are needed to ensure that the second AI model stays on track. This limits how much synthetic data this process can generate. And researchers disagree on whether a method like Anthropic’s will continue to improve AI systems.

The AI models that generate the synthetic data were themselves trained on human-created data, much of which was copyrighted. So copyright holders can still claim that companies like OpenAI and Anthropic have used copyrighted text, images and video without permission.

Jeff Clune, a professor of computer science at the University of British Columbia who previously worked as a researcher at OpenAI, said AI models could eventually become more powerful than the human brain in some ways. But if they do, it will be because they learned from the human brain.

“To borrow from Newton: AI sees further by standing on the shoulders of gigantic human data sets,” he said.
