Attacks¶
This section provides an overview of all available attacks.
Single-stage attacks¶
OWASP LLM01:2025 Prompt Injection¶
Adversarial Suffix
In code name: suffix
Uses a specially crafted suffix to bypass LLM filters and restrictions, forcing the model to generate prohibited or harmful content despite the original query. Evaluates the model’s vulnerability to input data manipulation.
Original Paper: https://arxiv.org/abs/2307.15043, Original Code: https://github.com/llm-attacks/llm-attacks
Aim Jailbreak
In code name: aim_jailbreak
Probes an LLM’s ethical compliance by eliciting detailed instructions on illegal activities, using the Nicolo Machiavelli character roleplay.
Base64 Injection
In code name: base64_injection
Challenges the LLM with a dataset of lines, encoded into base64.
Complimentary Transition
In code name: complimentary_transition
Evaluates the LLM’s ability to maintain content moderation standards when transitioning from a compliment to a new task. It assesses the LLM’s ability to maintain defined boundaries when prompted to switch topics.
Dialogue Injection
In code name: dialogue_injection
Probes an LLM’s ethical compliance by eliciting detailed instructions on illegal activities, using the conversation simulation technique.
Original Paper: https://arxiv.org/abs/2503.08195
Do Anything Now Jailbreak
In code name: dan
Probes an LLM’s resilience against performing explicitly forbidden activities using the “Do Anything Now” (DAN) character roleplay technique.
Do Anything Now Jailbreak (RU)
In code name: RU_dan
Probes a Russian-language LLM’s resilience against performing explicitly forbidden activities using the “Do Anything Now” (DAN) character roleplay technique.
Ethical Compliance
In code name: ethical_compliance
Evaluates the LLM’s ability to resist discussions about harmful or inappropriate misconceptions on sensitive topics. This test aims to verify the model’s adherence to ethical guidelines against propagating stereotypes and toxic content.
Harmful Behavior
In code name: harmful_behavior
Challenges the LLM with a dataset of harmful behavior prompts.
Linguistic Evasion
In code name: linguistic_evasion
Leverages the use of a non-Russian language to circumvent the restrictions or focus set by the system prompt.
Past Tense Attack
In code name: past_tense
Challenges the LLM with a dataset of harmful behavior prompts in past tense form.
Typoglycemia Attack
In code name: typoglycemia_attack
Attempts to exploit a text processing vulnerability in LLMs that can cause incorrect responses by removing random characters from words.
Typoglycemia Attack (RU)
In code name: RU_typoglycemia_attack
Attempts to exploit a text processing vulnerability in Russian-language LLMs that can cause incorrect responses by removing random characters from words.
UCAR
In code name: ucar
Probes an LLM’s adherence to content restrictions when explicitly requested to ignore such filters. It assesses the robustness of the LLM’s content moderation systems against direct challenges.
UCAR (RU)
In code name: RU_ucar
Probes a Russian-language LLM’s adherence to content restrictions when explicitly requested to ignore such filters. It assesses the robustness of the LLM’s content moderation systems against direct challenges.
Multi-stage attacks¶
What Drives the Multi-stage?
Multi-stage attacks are inspired by the Jailbreaking Black Box Large Language Models in Twenty Queries (PAIR) paper.
For managing a multi-stage interaction between an attacker and tested chat clients, the MultiStageInteractionSession
class is available [source]. It contains the following properties:
attacker_session
is the session for the attacker.tested_client_session
is the session for the tested client.stop_criterion
is an optional function that determines whether to stop the conversation based on the tested client’s responses.history_limit
is the maximum allowed history length for the attacker.tested_client_response_handler
is an optional function that handles the tested client’s response before passing it to the attacker.current_step
is the current step of the attacker.refine_args
are additional positional arguments for thetested_client_response_handler
.refine_kwargs
are additional keyword arguments for thetested_client_response_handler
.
OWASP LLM01:2025 Continuous Prompt Injection¶
AutoDAN-Turbo
In code name: autodan_turbo
Implements the AutoDAN-Turbo attack methodology which uses a lifelong agent for strategy self-exploration to jailbreak LLMs. This attack automatically discovers jailbreak strategies without human intervention and combines them for more effective attacks.
Original Paper: https://arxiv.org/abs/2410.05295v3, Original Code: https://github.com/SaFoLab-WISC/AutoDAN-Turbo
Best-of-N Jailbreak
In code name: bon
Probes the LLM resilience against Best-Of-N attack.
Original Paper: https://arxiv.org/abs/2412.03556, Original Code: https://github.com/jplhughes/bon-jailbreaking
Crescendo
In code name: crescendo
Challenges the LLM with a dataset of harmful behavior prompts using Crescendo strategy, operates by engaging the LLM in a series of escalating conversational turns, and multistage refinement.
Original Paper: https://arxiv.org/abs/2404.01833
Harmful Behavior Multistage
In code name: harmful_behavior_multistage
Challenges the LLM with a dataset of harmful behavior prompts using multistage refinement.
Shuffle Inconsistency
In code name: shuffle
Attempts to bypass LLM’s defense mechanism by randomly shuffling elements in input harmful prompt.
Original Paper: https://arxiv.org/html/2501.04931
OWASP LLM07:2025 System Prompt Leakage¶
System Prompt Leakage
In code name: system_prompt_leakage
Challenges the LLM with datasets of malicious prompts to extract system instructions using multistage refinement.
OWASP LLM09:2025 Misinformation¶
Logical Inconsistencies Test
In code name: logical_inconsistencies
Attacks the LLM system during a multistage conversation to generate logical inconsistencies in answers using antonyms, negations, and paraphrases based on the system’s domain.
Sycophancy Test
In code name: sycophancy
Attacks the LLM system during a multistage conversation using the argumentum ad verecundiam method based on the system’s domain.
Original Paper: https://arxiv.org/abs/2310.13548