As the use of LLM agents becomes increasingly prevalent in everyday life, ensuring their security is becoming more critical than ever. Currently, several approaches can be employed to detect hidden prompt injection attacks.
One common method is keyword filtering, where a blacklist of suspicious keywords is defined. If an input contains any of these keywords, it is flagged as potentially unsafe. Another approach is to restrict an agent’s tool-calling capabilities, allowing it to interact only with trusted and authenticated tools. A third option is to deploy a more powerful LLM as a guard model that analyzes incoming inputs and determines whether they contain suspicious or malicious instructions before they reach the target agent.
However, each of these approaches has significant limitations. Keyword-based filtering often suffers from high false-positive rates, causing benign content to be incorrectly flagged as malicious. Restricting tool usage can reduce the flexibility and usefulness of the agent, limiting its ability to perform complex tasks. Using an additional LLM as a guard model is computationally expensive and incurs substantial financial and energy costs.
To address these challenges, we aim to develop an efficient and low-cost defense mechanism. Our current approach combines token scoring with causality checking. Instead of relying on manually defined keyword blacklists, the token scoring component employs a machine learning model to identify potentially suspicious content more accurately. Once suspicious content has been detected, the causality checking component evaluates the relationship between the current input and the intended target objective by measuring their semantic distance. At present, this distance metric is provided by an auxiliary LLM. However, we are actively working toward a fully LLM-free distance estimation framework by leveraging token-level representations, embedding similarity, and causal attribution techniques. This would enable scalable and cost-effective deployment while maintaining robust protection against prompt injection attacks.
Our goal is to develop a defense mechanism that can mitigate a wide range of prompt injection attacks, including the distributed prompt injection attacks introduced in our previous work. At present, the proposed approach remains under active development and evaluation. While preliminary results are promising, additional experiments are needed to assess its effectiveness and scalability in real-world deployment scenarios.
Author: Ching-Yu Kao from Fraunhofer IAIS
Header © rawpixel.com – Freepik Company S.L.U. / edited Fraunhofer IIS
