- Experts show how some AI models, including GPT-4, can be exploited with simple user prompts
- Guardrail gaps don’t do a great job of detecting deceptive framing
- The vulnerability could be exploited to acquire personal information
A security researcher has shared details on how other researchers tricked ChatGPT into revealing a Windows product key using a prompt that anyone could try.
Marco Figueroa explained how a ‘guessing game’ prompt with GPT-4 was used to bypass safety guardrails that are meant to block AI from sharing such data, ultimately producing at least one key belonging to Wells Fargo Bank.
The researchers also managed to obtain a Windows product key to authenticate Microsoft‘s OS illegitimately, but for free, highlighting the severity of the vulnerability.
ChatGPT can be tricked into sharing security keys
The researcher explained how he hid terms like ‘Windows 10 serial number’ inside HTML tags to bypass ChatGPT’s filters that would usually have blocked the responses he got, adding that he was able to frame the request as a game to mask malicious intent, exploiting OpenAI‘s chatbot through logic manipulation.
“The most critical step in the attack was the phrase ‘I give up’,” Figueroa wrote. “This acted as a trigger, compelling the AI to reveal the previously hidden information.”
Figueroa explained why this type of vulnerability exploitation worked, with the model’s behavior playing an important role. GPT-4 followed the rules of the game (set out by researchers) literally, and guardrail gaps only focused on keyword detection rather than contextual understanding or deceptive framing.
Still, the codes shared were not unique codes. Instead, the Windows license codes had already been shared on other online platforms and forums.
While the impacts of sharing software license keys might not be too concerning, Figueroa highlighted how malicious actors could adapt the technique to bypass AI security measures, revealing personally identifiable information, malicious URLs or adult content.
Figueroa is calling for AI developers to “anticipate and defend” against such attacks, while also building in logic-level safeguards that detect deceptive framing. AI developers must also consider social engineering tactics, he goes on to suggest.