Home AI News Unleashing Objectionable Content: New Attack Method for Language Models

Unleashing Objectionable Content: New Attack Method for Language Models

Unleashing Objectionable Content: New Attack Method for Language Models

Title: The Impact of Large Language Models on AI-generated Content


Large language models (LLMs) are advanced deep learning models that are specifically trained to understand and generate human-like text. These models utilize vast amounts of data from various sources such as books, articles, websites, and the internet in general. Their capabilities include language translation, text summarization, question answering, and more.

Concerns and Research on Objectionable Content:

Recently, there has been growing concern about the ability of LLMs to generate objectionable content. To address this, researchers from Carnegie Mellon University’s School of Computer Science (SCS), the CyLab Security and Privacy Institute, and the Center for AI Safety in San Francisco conducted a study on objectionable behaviors generated by language models.

The Attack Method:

The researchers proposed a new attack method involving the addition of a suffix to different queries. This resulted in a significant increase in the likelihood that both open-source and closed-source language models would generate affirmative responses to typically refused questions.

Effects on Language Models:

The researchers successfully applied the attack suffix to various language models, including public interfaces like ChatGPT, Bard, and Claude, and open-source models like LLaMA-2-Chat, Pythia, Falcon, among others. This attack method effectively induced objectionable content in the outputs of these language models.

Success Rates:

The attack suffix achieved a high success rate, generating objectionable content in 99 out of 100 instances on Vicuna. It also produced 88 out of 100 exact matches with a target harmful string in Vicuna’s output. When tested against other models like GPT-3.5, GPT-4, and PaLM-2, the attack method achieved success rates of up to 84% and 66%, respectively.

Implications and Future Concerns:

While the immediate harm caused by prompting a chatbot to produce objectionable content may not be severe, the researchers expressed concerns about the potential impact of these models in autonomous systems without human supervision. They emphasized the need for reliable methods to prevent hijacking of autonomous systems through attacks like these.

The Importance of Fixing Models:

The researchers noted that their intention was not to attack proprietary language models and chatbots. However, their research demonstrated that even large closed-source models can be vulnerable to attacks when smaller, open-source models are targeted and studied. As a next step, efforts should be made to develop solutions to fix these models and prevent adversarial attacks.


The study revealed the broad applicability of the attack method across various language models, including those with public interfaces and open-source implementations. Addressing adversarial attacks on these models is crucial to ensure their safe and responsible use in the future.

To learn more about this research, you can read the paper and the blog article. Stay updated with the latest AI research news, cool projects, and more by joining our ML SubReddit, Facebook Community, Discord Channel, and subscribing to our Email Newsletter.

Source link


Please enter your comment!
Please enter your name here