AI art generators can simply copy existing images

The image on the right was generated by taking the caption attached to the left image in the training data, "Life in the Light with Ann Graham Lotz," and feeding it to Stable Diffusion as a prompt.
Image: Cornell University/"Extracting Training Data from Diffusion Models"

One of the main defenses offered by those optimistic about AI art generators is that although the models are trained on existing images, everything they create is new. AI evangelists often compare these systems to human artists: creative people are inspired by everyone who came before them, so why can't an AI be similarly inspired by previous work?

A new study could undermine that argument, and could even become a major obstacle in the numerous ongoing lawsuits over AI-generated content and copyright. Researchers in both industry and academia have found that the most popular new AI image generators can "memorize" images from the data they are trained on. Instead of creating something completely new, certain prompts will cause the AI to simply reproduce an image, and some of those reproduced images may be copyrighted. Even worse, modern generative AI models can memorize and reproduce sensitive information that was scraped up for use in an AI training set.

The study was conducted by researchers both in the tech industry, particularly Google and DeepMind, and at universities such as Berkeley and Princeton. The same team worked on an earlier study that identified a similar problem in AI language models, specifically GPT-2, the predecessor of OpenAI's enormously popular ChatGPT. Reuniting the group, the researchers, led by Google Brain researcher Nicholas Carlini, found that both Google's Imagen and the popular open-source Stable Diffusion are capable of reproducing images, some of which have obvious copyright or licensing implications.

The first image in this tweet was generated using the caption listed in Stable Diffusion's dataset, the multi-terabyte database of scraped images known as LAION. The team fed the caption into the Stable Diffusion prompt, and the exact same image came out, albeit slightly distorted by digital noise. The process for finding these duplicated images was relatively simple: the team ran the same prompt multiple times, and after receiving the same resulting image each time, the researchers manually checked whether that image appeared in the training set.
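That procedure can be caricatured in a few lines: sample the same caption repeatedly, and if the generations collapse onto one image, suspect memorization. A minimal sketch is below; the `looks_memorized` helper, its thresholds, and the toy data are illustrative assumptions, not the authors' actual code.

```python
import numpy as np

def looks_memorized(samples, pixel_threshold=0.1):
    """Heuristic in the spirit of the study: if many independent samples
    for one caption are near-identical, the model may be regurgitating
    a training image rather than generating something new.
    `samples` is a list of images as float arrays in [0, 1]."""
    close_pairs = 0
    total_pairs = 0
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            # mean absolute pixel distance between two generations
            dist = np.mean(np.abs(samples[i] - samples[j]))
            close_pairs += dist < pixel_threshold
            total_pairs += 1
    # flag the caption if most pairs of samples nearly coincide
    return bool(close_pairs / total_pairs > 0.5)

# toy demo: five "generations" that are one image plus tiny noise,
# standing in for a model that keeps emitting the same training image
rng = np.random.default_rng(0)
base = rng.random((8, 8, 3))
near_dupes = [np.clip(base + rng.normal(0, 0.01, base.shape), 0, 1)
              for _ in range(5)]
print(looks_memorized(near_dupes))  # prints True
```

In the actual study a flagged image still had to be manually checked against the training set; this sketch only covers the cheap first filter.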


The bottom images were generated by the AI and trace back to the top images, which were taken directly from its training data. All of these images may have licenses or copyrights attached to them.
Image: Cornell University/"Extracting Training Data from Diffusion Models"

Two of the paper's researchers, Eric Wallace, a PhD student at the University of California, Berkeley, and Vikash Sehwag, a PhD candidate at Princeton University, told Gizmodo in a Zoom interview that duplicated images are rare. Their team tried about 300,000 different captions and found a memorization rate of just 0.03%. Duplicated images were even rarer for models like Stable Diffusion, whose makers worked to deduplicate the training set, although ultimately all diffusion models exhibit the same problem to a greater or lesser degree. The researchers found that Imagen was perfectly capable of memorizing images that existed only once in the dataset.

"The caveat here is that the model is supposed to generalize, it's supposed to generate novel images rather than spitting out a memorized version," Sehwag said.

Their research also showed that as AI systems grow larger and more complex, they become more likely to generate copycat material. A smaller model like Stable Diffusion simply doesn't have the capacity to store much of its training data. That could change a lot in the next few years.

“Maybe next year, whatever new model comes out that’s much bigger and much more powerful, then potentially those recall risks will be much higher than they are now,” Wallace said.

Diffusion-based machine learning models produce data, in this case images, similar to what they were trained on through a complex process: the training data is progressively destroyed with noise, and the model learns to reverse that distortion. Diffusion models are an evolution of generative adversarial networks, or GAN-based machine learning.
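The noise-then-denoise idea can be illustrated with a toy DDPM-style forward step; the 1-D "image," the schedule values, and the variable names below are assumptions for illustration only. The punchline relevant to this article: a model whose noise prediction is perfect for a particular training example can invert the process and reproduce that example exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 1-D "image" standing in for a training example
x0 = np.linspace(-1.0, 1.0, 16)

# linear noise schedule, DDPM-style (assumed illustrative values)
T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def forward_noise(x, t):
    """Forward process: blend the data with Gaussian noise so that at
    step t the sample is sqrt(a_bar_t)*x + sqrt(1 - a_bar_t)*eps."""
    eps = rng.normal(size=x.shape)
    xt = np.sqrt(alphas_bar[t]) * x + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps

# a trained model predicts eps from the noisy sample; if that prediction
# is perfect for a memorized example, inverting the blend recovers it
xt, eps = forward_noise(x0, T - 1)
x0_recovered = (xt - np.sqrt(1.0 - alphas_bar[-1]) * eps) / np.sqrt(alphas_bar[-1])
print(np.allclose(x0_recovered, x0))  # prints True
```

Real diffusion models only approximate the noise, which is why generalization is possible at all; memorization is the failure mode where the approximation is too good for specific training images.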

The researchers found that GAN-based models don't suffer from the same image-memorization problem, but large companies are unlikely to move away from diffusion unless an even more sophisticated machine learning model emerges, one that produces even more realistic, higher-quality images.

Florian Tramèr, a computer science professor at ETH Zurich who participated in the research, noted how many AI companies grant users, both free and paid, a license to share or even monetize AI-generated content. The AI companies themselves also retain some rights to these images. That becomes a problem if the AI generates an image that exactly matches an existing copyrighted work.

With a memorization rate of just 0.03%, AI developers might look at this study and conclude there isn't much risk. Companies could work to deduplicate images in the training data, making memorization less likely. They could even develop AI systems that detect whether a generated image is a direct replica of a training image and flag it for deletion. But that would mask the full privacy risk posed by generative AI. Carlini and Tramèr also contributed to another recent paper arguing that even attempts at data filtering do not prevent training data from leaking out through the model.
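Deduplication of the kind described above is commonly done with perceptual hashes rather than exact byte comparison, so that re-encoded or lightly edited copies still match. Below is a minimal average-hash sketch; the function names, sizes, and thresholds are illustrative assumptions, not what any AI vendor actually ships.

```python
import numpy as np

def average_hash(img, size=8):
    """Tiny perceptual-hash sketch: block-average a grayscale image down
    to size x size, then threshold each cell against the mean. Near-
    duplicate images yield nearly identical bit patterns."""
    h, w = img.shape
    img = img[: h - h % size, : w - w % size]  # crop to a multiple of size
    blocks = img.reshape(size, img.shape[0] // size,
                         size, img.shape[1] // size)
    small = blocks.mean(axis=(1, 3))           # size x size block means
    return (small > small.mean()).flatten()    # 64-bit boolean hash

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return int(np.sum(a != b))

rng = np.random.default_rng(1)
original = rng.random((64, 64))
near_dupe = np.clip(original + rng.normal(0, 0.02, original.shape), 0, 1)
unrelated = rng.random((64, 64))

print(hamming(average_hash(original), average_hash(near_dupe)))   # small
print(hamming(average_hash(original), average_hash(unrelated)))  # around half the bits
```

As the paper's follow-up work suggests, filtering like this reduces but does not eliminate leakage: a model can still memorize an image that appears only once, which no deduplication pass would catch.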

And of course, there's the bigger risk that images no one would ever want copied end up on users' screens. Wallace asked us to imagine, for example, that a researcher wanted to generate a whole set of synthetic medical data based on people's X-rays. What happens if a diffusion-based AI instead memorizes and duplicates a person's actual medical records?

"It's pretty rare, so you might not notice it happening at first, and then you might actually put that dataset out there on the web," said the UC Berkeley student. "The purpose of this work is to get ahead of those kinds of possible mistakes."
