A group of researchers from Intel, Idaho State University, and the University of Illinois has unveiled a new method for bypassing security filters in large language models (LLMs) such as ChatGPT and Gemini, 404 Media reports.
The study found that chatbots can be manipulated into providing restricted information when prompts are phrased in complex or ambiguous ways or cite non-existent sources. The researchers call this approach "information overload."
The researchers used a tool called InfoFlood, which automates this "overloading" of models with information. Confronted with such prompts, the systems become confused and may output prohibited or dangerous content that built-in security filters would normally block.
The vulnerability stems from the models' focus on the surface structure of the text, which keeps them from recognizing the dangerous intent hidden within it. This gives malicious actors an opening to evade restrictions and obtain harmful information.
As part of a responsible disclosure process, the study's authors will share their findings with companies that develop major LLMs so they can improve their security systems. The researchers will also pass along the mitigations they identified during their investigation.
"LLM models primarily rely on protective mechanisms during data input and output to detect harmful content. InfoFlood can be used to train these protective mechanisms—it allows the extraction of relevant information from potentially dangerous queries, making models more resilient to such attacks," the research states.