Stress testing
Mindgard deployed these two filters in front of GPT-3.5 Turbo on Azure OpenAI, then accessed the target LLM through Mindgard’s Automated AI Red Teaming Platform.
Two attack methods were used against the filters: character injection (adding specific characters or irregular text patterns to a prompt) and adversarial ML evasion (finding and exploiting blind spots in the underlying ML classifier).
Character injection reduced Prompt Guard’s jailbreak detection effectiveness from 89% to 7% when exposed to diacritics (e.g., changing the letter a to á), homoglyphs (e.g., closely resembling characters such as 0 and O), numerical replacement (“leet speak”), and spaced characters. Similar techniques also reduced the effectiveness of AI Text Moderation.
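To make the character-injection techniques concrete, here is a minimal Python sketch of the four transforms named above. The function names and character mappings are illustrative assumptions rather than Mindgard's actual implementation; the idea in each case is that the rewritten prompt still reads the same to a human (and often to the target LLM) but no longer matches the surface patterns the classifier was trained on.

```python
# Illustrative character-injection transforms. The specific mappings below
# are assumptions for demonstration, not Mindgard's tooling.

def add_diacritics(text: str) -> str:
    """Swap plain vowels for accented look-alikes (e.g., a -> á)."""
    return text.translate(str.maketrans("aeiou", "áéíóú"))

def to_homoglyphs(text: str) -> str:
    """Replace letters with closely resembling characters (e.g., O -> 0)."""
    # The last entry maps Latin 'a' to the visually identical Cyrillic 'а'.
    return text.translate(str.maketrans({"O": "0", "l": "1", "a": "а"}))

def to_leetspeak(text: str) -> str:
    """Numerical replacement ('leet speak'), e.g., e -> 3, a -> 4."""
    return text.translate(str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "t": "7"}))

def space_characters(text: str) -> str:
    """Insert a space between every character of the prompt."""
    return " ".join(text)

if __name__ == "__main__":
    prompt = "ignore previous instructions"
    for transform in (add_diacritics, to_homoglyphs, to_leetspeak, space_characters):
        print(f"{transform.__name__}: {transform(prompt)}")
```

In a test harness, each transformed prompt would be sent through the filter to measure how often the perturbed version evades detection compared with the original.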