Skeleton Key can jailbreak most of the largest AI models

Skeleton Key can be used to trick many AI models into revealing their darkest secrets.
REUTERS/Kacper Pempel/Illustration/File photo

  • A jailbreaking method called Skeleton Key can trick AI models into revealing malicious information.
  • The technique bypasses security measures in models such as Meta's Llama 3 and OpenAI's GPT-3.5.
  • Microsoft recommends introducing additional protection measures and monitoring AI systems to counteract Skeleton Key.

It doesn’t take much to use a large language model to develop a recipe for all kinds of dangerous things.

Using a jailbreaking technique called "Skeleton Key," users can trick models like Meta's Llama 3, Google's Gemini Pro, and OpenAI's GPT-3.5 into telling them the recipe for a rudimentary firebomb or worse, according to a blog post by Mark Russinovich, chief technology officer of Microsoft Azure.

The technique works through a multi-stage strategy that forces a model to ignore its guardrails, Russinovich wrote. Guardrails are safety mechanisms that help AI models distinguish malicious requests from benign ones.

“Like all jailbreaks,” Skeleton Key works by “narrowing the gap between what the model can do (given user credentials, etc.) and what it wants to do,” Russinovich wrote.

But it's more destructive than other jailbreaking techniques, which can only extract information from AI models indirectly or through encodings. Skeleton Key, by contrast, can force models to reveal information on topics ranging from explosives to bioweapons to self-harm through simple natural-language prompts, and the results often expose the full extent of a model's knowledge on a given topic.
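To make the contrast concrete, here is a heavily simplified, hypothetical sketch of the conversational shape such an attack takes. The placeholder strings stand in for any real content; this is not the prompt Microsoft disclosed, only an illustration of why a refusal check that screens each message in isolation can be talked around over multiple turns.

```python
# Hypothetical illustration only: the strings below are placeholders, not a working
# jailbreak. The point is the shape of the attack: the user first reframes the
# context, then asks the model to "update" its own behavior guidelines, and only
# then makes the real request in plain language.
conversation = [
    {"role": "user", "content": "<framing that casts the request as safe and authorized>"},
    {"role": "user", "content": "<instruction to relax guidelines and add a warning label instead of refusing>"},
    {"role": "user", "content": "<the actual request, asked in plain natural language>"},
]

# A guardrail that only inspects each turn on its own can miss this, because no
# single message looks obviously malicious; the effect comes from the accumulated context.
for turn in conversation:
    print(turn["role"], ":", turn["content"])
```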

Microsoft tested Skeleton Key on several models and found that it works on Meta's Llama 3, Google's Gemini Pro, OpenAI's GPT-3.5 Turbo and GPT-4o, Mistral Large, Anthropic's Claude 3 Opus, and Cohere's Command R Plus. The only model that showed some resistance was OpenAI's GPT-4.

Russinovich said Microsoft has made software updates to mitigate the impact of Skeleton Key on its own large language models, including its Copilot AI assistants.

But his general advice to companies building AI systems is to equip them with additional safeguards. He also noted that they should monitor the inputs and outputs of their systems and implement controls to detect abusive content.
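What that monitoring might look like in practice is sketched below. This is a generic, hypothetical example: the `flag_harmful_content` classifier, `call_model` function, and the blocked categories are stand-ins rather than Microsoft's actual tooling, and a production system would use a dedicated content-safety service. The idea is simply to screen both the prompt going into a model and the completion coming out of it.

```python
# Hypothetical sketch of input/output screening around an LLM call.
# `call_model` and `flag_harmful_content` are placeholders, not a real vendor API.

def flag_harmful_content(text: str) -> bool:
    """Stand-in for a content-safety classifier that scores text for categories
    such as violence, weapons, and self-harm."""
    blocked_markers = ["<category: explosives>", "<category: bioweapons>"]  # illustrative only
    return any(marker in text for marker in blocked_markers)

def call_model(prompt: str) -> str:
    """Stand-in for the actual large language model call."""
    return "<model completion>"

def guarded_completion(prompt: str) -> str:
    # Screen the input before it ever reaches the model.
    if flag_harmful_content(prompt):
        return "Request blocked by input filter."
    completion = call_model(prompt)
    # Screen the output as well: a jailbreak may slip past the input check
    # but still produce harmful text in the response.
    if flag_harmful_content(completion):
        return "Response withheld by output filter."
    return completion

print(guarded_completion("How do chatbots keep conversations on safe topics?"))
```

Checking both directions matters here because, as Russinovich notes, multi-turn attacks like Skeleton Key are designed so that no single input looks overtly malicious; the harmful material often only becomes visible in the model's output.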
