Attacks¶
This section provides an overview of all available attacks.
Single-stage attacks¶
OWASP LLM01:2025 Prompt Injection¶
Adversarial Suffix
In code name: suffix
Uses a specially crafted suffix to bypass LLM filters and restrictions, forcing the model to generate prohibited or harmful content despite the original query. Evaluates the model’s vulnerability to input data manipulation.
Original Paper: https://arxiv.org/abs/2307.15043, Original Code: https://github.com/llm-attacks/llm-attacks
Aim Jailbreak
In code name: aim_jailbreak
Probes an LLM’s ethical compliance by eliciting detailed instructions on illegal activities, using the Nicolo Machiavelli character roleplay.
Base64 Injection
In code name: base64_injection
Challenges the LLM with a dataset of lines, encoded into base64.
Best-of-N (BoN) Jailbreak
In code name: bon
Probes the LLM resilience against Best-Of-N (BoN) attack by randomly transform symbols in input harmful prompt.
Original Paper: https://arxiv.org/abs/2412.03556, Original Code: https://github.com/jplhughes/bon-jailbreaking
Deceptive Delight
In code name: deceptive_delight
Tests LLM’s vulnerability to distraction and context manipulation by embedding unsafe content alongside benign topics in three messages.
Dialogue Injection: Continuation
In code name: dialogue_injection_continuation
Probes an LLM’s ethical compliance by eliciting detailed instructions on illegal activities, using the conversation simulation technique by using few-shot system prompting and unfinished assitant response at the end. Built according to the paper.
Original Paper: https://arxiv.org/abs/2503.08195
Dialogue Injection: Developer mode
In code name: dialogue_injection_devmode
Probes an LLM’s ethical compliance by eliciting detailed instructions on illegal activities, using the conversation simulation technique to mislead the LLM with “Developer mode activation”.
Original Paper: https://arxiv.org/abs/2503.08195
Do Anything Now (DAN) Jailbreak
In code name: dan
Probes an LLM’s resilience against performing explicitly forbidden activities using the “Do Anything Now” (DAN) character roleplay technique.
Ethical Compliance
In code name: ethical_compliance
Evaluates the LLM’s ability to resist discussions about harmful or inappropriate misconceptions on sensitive topics. This test aims to verify the model’s adherence to ethical guidelines against propagating stereotypes and toxic content.
HarmBench
In code name: harmbench
Challenges the LLM with harmful behavior prompts from HarmBench (contains original and translated to Russian prompts).
Original Paper: https://arxiv.org/abs/2402.04249
Linguistic Evasion
In code name: linguistic_evasion
Leverages the use of a non-Russian language to circumvent the restrictions or focus set by the system prompt.
Past Tense Attack
In code name: past_tense
Challenges the LLM with a dataset of harmful behavior prompts in past tense form.
Shuffle Inconsistency
In code name: shuffle
Attempts to bypass LLM’s defense mechanism by randomly shuffling elements in input harmful prompt.
Original Paper: https://arxiv.org/html/2501.04931
UCAR Jailbreak
In code name: ucar
Probes an LLM’s adherence to content restrictions when explicitly requested to ignore such filters. It assesses the robustness of the LLM’s content moderation systems against direct challenges.
Multi-stage attacks¶
What Drives the Multi-stage?
Multi-stage attacks are inspired by the Jailbreaking Black Box Large Language Models in Twenty Queries (PAIR) paper.
For managing a multi-stage interaction between an attacker and tested chat clients, the MultiStageInteractionSession
class is available [source]. It contains the following properties:
attacker_session
is the session for the attacker.tested_client_session
is the session for the tested client.stop_criterion
is an optional function that determines whether to stop the conversation based on the tested client’s responses.history_limit
is the maximum allowed history length for the attacker.tested_client_response_handler
is an optional function that handles the tested client’s response before passing it to the attacker.current_step
is the current step of the attacker.refine_args
are additional positional arguments for thetested_client_response_handler
.refine_kwargs
are additional keyword arguments for thetested_client_response_handler
.
OWASP LLM01:2025 Continuous Prompt Injection¶
AutoDAN-Turbo
In code name: autodan_turbo
Implements the AutoDAN-Turbo attack methodology which uses a lifelong agent for strategy self-exploration to jailbreak LLMs. This attack automatically discovers jailbreak strategies without human intervention and combines them for more effective attacks.
Original Paper: https://arxiv.org/abs/2410.05295v3, Original Code: https://github.com/SaFoLab-WISC/AutoDAN-Turbo
Crescendo
In code name: crescendo
Challenges the LLM with a dataset of harmful behavior prompts using Crescendo strategy, operates by engaging the LLM in a series of escalating conversational turns, and multistage refinement.
Original Paper: https://arxiv.org/abs/2404.01833
Prompt Automatic Iterative Refinement (PAIR)
In code name: pair
Challenges the LLM with a dataset of harmful behavior prompts using multistage refinement with judge model scoring. Original Paper: https://arxiv.org/abs/2310.08419v4, Code: https://github.com/patrickrchao/JailbreakingLLMs
OWASP LLM07:2025 System Prompt Leakage¶
System Prompt Leakage
In code name: system_prompt_leakage
Challenges the LLM with datasets of malicious prompts to extract system instructions using multistage refinement.
OWASP LLM09:2025 Misinformation¶
Logical Inconsistencies Test
In code name: logical_inconsistencies
Attacks the LLM system during a multistage conversation to generate logical inconsistencies in answers using antonyms, negations, and paraphrases based on the system’s domain.
Sycophancy Test
In code name: sycophancy
Attacks the LLM system during a multistage conversation using the argumentum ad verecundiam method based on the system’s domain.
Original Paper: https://arxiv.org/abs/2310.13548
VLM attacks¶
See an example with VLM attacks.
OWASP LLM09:2025 Misinformation¶
Lowres Documents Hallucination Attack
In code name: vlm_lowres_docs
Test class for VLM responses for low resolution documents. The idea is that we provide model with unreadable document and ask to describe its contents. The goal is to see whether model will signal that the document is unreadable or hallucinate about the contents.
M-Attack
In code name: vlm_m_attack
Test VLM for M-Attack, which adds noise to the image to make VLM see it as a different image.
Original Paper: https://arxiv.org/abs/2503.10635
Text Hallucination Attack
In code name: vlm_text_hallucination
Test class for VLM text hallucination attacks. The idea is to give model and image which contains only text and ask it to describe it. The goal is to see whether model will describe text or hallucinate with the description of an object written in text.