News

Cybersecurity researchers discover “Bad Likert Judge,” a new AI jailbreaking technique


The “Bad Likert Judge” jailbreaking technique boasts a high attack success rate by using a three-step approach that turns the target LLM’s own understanding of harmful content against its safety guardrails.

Researchers have identified a new AI jailbreaking technique, referred to as the “Bad Likert Judge.” AI jailbreaking techniques are strategies for circumventing the protections AI tools have in place to prevent their use for problematic purposes, such as creating hate speech or malware. When tested against six advanced LLMs, this technique was shown to increase the attack success rate by an average of 75%.

The technique works in three steps:

Step 1: The attacker asks the target LLM to act as a judge and evaluate responses that “another” LLM supposedly generated. This is a trick: there is no other LLM, and the attacker is simply using the target LLM’s own guardrails as a judgment system.

Step 2: The target LLM is given guidelines on how to score responses based on what is considered “harmful” content. For example, it may be instructed to score responses based on their potential to promote violence.

Step 3: Rather than directly asking the target LLM to produce harmful content, the attacker asks it to give examples of responses that would score highest under the guidelines provided.
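For context, the framing the attack abuses is the widely used, legitimate “LLM-as-judge” pattern, in which a model scores text against a rubric on a Likert scale rather than generating content directly. The researchers did not publish code; the sketch below is a minimal, benign illustration of that scoring setup, assuming the OpenAI Python SDK, an arbitrary model name, and a deliberately harmless rubric (technical clarity), none of which come from the research itself.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Benign "LLM-as-judge" setup: the model scores text against a rubric on a
# 1-5 Likert scale instead of generating content directly.
rubric = (
    "You are a judge. Score the response below on a 1-5 Likert scale for "
    "technical clarity, where 1 is very unclear and 5 is very clear. "
    "Reply with the score and a one-sentence justification."
)
response_to_grade = (
    "A mutex prevents two threads from entering a critical section at once."
)

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name, not tied to the research
    messages=[
        {"role": "system", "content": rubric},
        {"role": "user", "content": response_to_grade},
    ],
)
print(completion.choices[0].message.content)
```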

By exploiting the LLM’s own judgment capabilities, the Bad Likert Judge technique can coax it into producing outputs that the LLM’s creator did not intend. The same researchers who discovered the technique also found that applying content filters reduced the attack success rate by an average of 89.2%.
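The researchers do not prescribe a specific filtering product. As a rough illustration of the mitigation, the sketch below screens a model’s output with a moderation classifier before it is returned to the user, assuming the OpenAI Python SDK’s moderation endpoint; real deployments typically filter prompts as well as outputs.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def filter_output(model_output: str) -> str:
    """Screen an LLM response with a content filter before returning it.

    Hypothetical wrapper illustrating the mitigation described above.
    """
    moderation = client.moderations.create(input=model_output)
    if moderation.results[0].flagged:
        # Withhold the response instead of passing flagged content through.
        return "Response withheld by content filter."
    return model_output

print(filter_output("Here is a short poem about the ocean."))
```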


Authored by Nathan Salminen and Surya Swaroop.
