In recent years, Artificial Intelligence (AI) systems have made significant progress thanks to the development of Large Language Models (LLMs). Models such as OpenAI’s ChatGPT, Google’s Bard, and Meta’s Llama-2 have shown impressive capabilities, from tool use to simulating human behavior. However, as these LLMs become more widely deployed, ensuring the safety and reliability of their responses has become a major challenge.
The Significance of Safety Alignment in Non-Natural Language
A team of researchers has studied how LLMs handle non-natural languages, specifically ciphers. Their work argues that safety alignment methods need to keep pace with LLM capabilities in this setting. While LLMs excel at understanding and producing human languages, they have also shown unexpected proficiency in comprehending non-natural languages. Safety alignment therefore needs to cover not only natural language but also non-traditional forms of communication such as ciphers.
The Introduction of CipherChat
To evaluate how well safety alignment methods carry over to non-natural languages, the team has created CipherChat, a framework that lets humans interact with LLMs through cipher-based prompts and enciphered demonstrations. The framework allows for a thorough examination of how well LLMs understand ciphers, whether they can take part in conversations conducted in them, and how they respond to inappropriate content.
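As a rough illustration of what a cipher-based prompt with enciphered demonstrations might look like, here is a minimal Python sketch. It is not the team’s implementation: the choice of a Caesar cipher, the shift of 3, and the helper names caesar_shift and build_cipher_prompt are all assumptions made for this example.

```python
import string


def caesar_shift(text: str, shift: int) -> str:
    """Shift ASCII letters by `shift` positions; leave all other characters unchanged."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


def build_cipher_prompt(instruction: str, demonstrations: list[str],
                        query: str, shift: int = 3) -> str:
    """Assemble a prompt: plain-text instruction, then enciphered demos and query.

    Hypothetical layout for illustration only; the real framework's prompt
    format is not described in this article.
    """
    lines = [instruction]
    lines.extend(caesar_shift(demo, shift) for demo in demonstrations)
    lines.append(caesar_shift(query, shift))
    return "\n".join(lines)


prompt = build_cipher_prompt(
    instruction="You are an expert in the Caesar cipher. Reply only in cipher.",
    demonstrations=["How is the weather today?"],
    query="Hello, World!",
)
```

A reply written in the same cipher could then be decoded by shifting in the opposite direction, for example caesar_shift(reply, -3).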
The Need for Customized Safety Alignment Mechanisms
The team’s experiments with modern LLMs, including ChatGPT and GPT-4, using various realistic human ciphers have revealed a concerning pattern. Certain ciphers were able to circumvent GPT-4’s safety alignment with success rates close to 100% in some safety domains. This empirical result highlights the urgent need for safety alignment mechanisms tailored to non-natural languages like ciphers. Such mechanisms are crucial to keeping LLMs’ responses robust and reliable across different linguistic scenarios.
Additionally, the research points to what appears to be a “secret cipher” within LLMs. Drawing parallels to secret languages observed in other language models, the team hypothesizes that LLMs have a latent ability to decipher certain encoded inputs. Building on this observation, the team has introduced SelfCipher, a framework that aims to activate this hidden ability so that LLMs can better decipher encoded inputs and generate meaningful responses.
Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning. She is a Data Science enthusiast with good analytical and critical thinking skills, and a keen interest in acquiring new skills, leading groups, and managing work efficiently.