AI art generators can be tricked into creating NSFW images

Nonsense words can trick popular text-to-image AIs like DALL-E 2 and Midjourney into producing pornographic, violent, and other questionable images. A new algorithm generates these commands to bypass the AIs’ safety filters, in an effort to find ways to strengthen those safeguards in the future. The group that developed the algorithm, which includes researchers from Johns Hopkins University in Baltimore and Duke University in Durham, North Carolina, will detail its findings in May 2024 at the IEEE Symposium on Security and Privacy in San Francisco.

AI art generators often rely on large language models, the same kind of systems powering AI chatbots such as ChatGPT. Large language models are essentially supercharged versions of the auto-complete feature that smartphones have used for years to predict the rest of a word a person is typing.

Most online art generators are designed with safety filters that reject requests for pornographic, violent, and other questionable images. The Johns Hopkins and Duke researchers have developed what they say is the first automated attack framework to probe the safety filters of text-to-image generative AIs.

“Our group is usually interested in breaking things. Breaking things is part of making them stronger,” said senior study author Yinzhi Cao, a cybersecurity researcher at Johns Hopkins. “We’ve found vulnerabilities in thousands of websites in the past, and now we’re turning to AI models for their vulnerabilities.”

The researchers developed a new algorithm called SneakyPrompt. In their experiments, they started with prompts that safety filters would block, such as “naked man riding a bicycle.” SneakyPrompt then probed DALL-E 2 and Stable Diffusion with alternatives for the filtered words in these prompts. The algorithm examines the responses from the generative AIs and incrementally adjusts these alternatives until it finds commands that bypass the safety filters and produce images.
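The article does not spell out the paper’s exact search procedure, but the loop it describes (substitute a candidate token for the filtered word, query the model, check the result, adjust) can be sketched roughly as follows. This is a minimal illustration, not the authors’ code: random_token, is_blocked, generate_image, and matches_target are hypothetical placeholders for a real text-to-image API, its safety filter, and an image scorer, and the real SneakyPrompt uses a guided search rather than simple random substitution.

```python
import random
import string


def random_token(length=5):
    """Make a candidate nonsense token to stand in for a filtered word."""
    return "".join(random.choices(string.ascii_lowercase, k=length))


def search_bypass_prompt(prompt, filtered_word, is_blocked, generate_image,
                         matches_target, max_queries=60):
    """Repeatedly swap the filtered word for a candidate token until one
    slips past the safety filter while the generated image still depicts
    the blocked concept. The three callables wrap whatever text-to-image
    API and scoring model the caller has access to."""
    for _ in range(max_queries):
        candidate = random_token()
        trial = prompt.replace(filtered_word, candidate)
        if is_blocked(trial):           # safety filter rejected the prompt; try another token
            continue
        image = generate_image(trial)   # prompt got through; generate an image
        if matches_target(image):       # image still shows the intended concept
            return trial                # adversarial prompt found
    return None                         # no bypass found within the query budget
```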

Safety filters don’t just check prompts against a list of banned terms such as “naked.” They also look for terms whose meanings are closely related to those banned words.
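The article does not describe the filters’ internals, but the two-part check it alludes to (a literal keyword list plus a semantic-proximity test) might look something like the sketch below. It assumes a caller-supplied embed() function mapping a word to a vector from some text-embedding model; the banned list and threshold here are illustrative, not taken from any real system.

```python
import numpy as np

BANNED_TERMS = ["naked"]  # illustrative only; real filters cover far more terms


def is_flagged(prompt_terms, embed, banned=BANNED_TERMS, threshold=0.8):
    """Flag a prompt if any term matches a banned word outright, or sits
    close to one in embedding space. `embed` maps a word to a vector."""
    banned_vecs = [np.asarray(embed(b), dtype=float) for b in banned]
    for term in prompt_terms:
        if term.lower() in banned:      # literal keyword match
            return True
        v = np.asarray(embed(term), dtype=float)
        for bv in banned_vecs:          # cosine similarity to each banned term
            cos = float(np.dot(v, bv) /
                        (np.linalg.norm(v) * np.linalg.norm(bv) + 1e-9))
            if cos >= threshold:        # semantically too close to a banned word
                return True
    return False
```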

The researchers found that nonsense words could prompt these generative AIs to create innocent pictures. For example, they found that DALL-E 2 would read the words “thwif” and “mowwly” as “cat,” and “lcgrfy” and “butnip fwngho” as “dog.”

DALL-E 2 sometimes confuses words like “glucose” with “cat.” The researchers suspect that the AI infers the intended word from context. Johns Hopkins University/Duke University

The scientists aren’t sure why the generative AIs would mistake these nonsense words for commands. Cao notes that these systems are trained on data beyond English, and some syllables or combinations of syllables that resemble, say, “thwif” may be related to words like “cat” in other languages.

“Large language models see things differently than human beings,” Cao says.

The researchers also found that nonsense words can lead generative AIs to produce images that are not safe for work (NSFW). Apparently, the safety filters don’t judge these prompts as strongly enough associated with banned terms to block them, yet the AI systems nonetheless read the words as commands to create questionable content.

Beyond nonsense words, the scientists found that generative AIs can mistake ordinary words for other common words. For example, DALL-E 2 could mistake “glucose” or “Gregory Wright’s face” for a cat, and “support” or “dangerous thinks Walt” for a dog. In these cases, the explanation may lie in the context in which the words appear. When given the prompt “Dangerous thinks Walt growled menacingly at the stranger who approached his owner,” the systems deduced from the rest of the sentence that “dangerous thinks Walt” meant a dog.

“If ‘glucose’ is used in another context, it may not mean cat,” Cao says.

Previous manual attempts to circumvent these security filters were limited to specific generative AIs, such as Stable Diffusion, and could not be generalized to other text-to-image systems. The researchers found that SneakyPrompt can work on both DALL-E 2 and Stable Diffusion.

Additionally, previous manual attempts to bypass the Stable Diffusion safety filter showed a low success rate of roughly 33 percent, Cao and his colleagues calculated. In contrast, SneakyPrompt had an average bypass rate of about 96 percent when facing Stable Diffusion and roughly 57 percent with DALL-E 2.

These findings reveal how generative AI could be used to create disturbing content. For example, Cao says, generative AIs could create images of real people engaged in wrongdoing they never actually committed.

“We hope the attack will help people understand how vulnerable such text-to-image models could be,” Cao says.

The scientists now aim to explore ways to make generative AIs more resilient to adversaries. “The point of [our] attack work is about making the world a safer place,” Cao says. “First you need to understand the weaknesses of the AI models, and then make them resistant to attacks.”
