Attacks¶

This section provides an overview of all available attacks.

Single-stage attacks¶

OWASP LLM01:2025 Prompt Injection¶

Adversarial Suffix

In code name: suffix

Uses a specially crafted suffix to bypass LLM filters and restrictions, forcing the model to generate prohibited or harmful content despite the original query. Evaluates the model’s vulnerability to input data manipulation.

Original Paper: https://arxiv.org/abs/2307.15043, Code: https://github.com/llm-attacks/llm-attacks

View code on GitHub

Aim Jailbreak

In code name: aim_jailbreak

Probes an LLM’s ethical compliance by eliciting detailed instructions on illegal activities, using the Nicolo Machiavelli character roleplay.

View code on GitHub

Base64 Injection

In code name: base64_injection

Challenges the LLM with a dataset of lines, encoded into base64.

View code on GitHub

Best-of-N (BoN) Jailbreak

In code name: bon

Probes the LLM resilience against Best-Of-N (BoN) attack by randomly transform symbols in input harmful prompt.

Original Paper: https://arxiv.org/abs/2412.03556, Code: https://github.com/jplhughes/bon-jailbreaking

View code on GitHub

Deceptive Delight

In code name: deceptive_delight

Tests LLM’s vulnerability to distraction and context manipulation by embedding unsafe content alongside benign topics in three messages.

View code on GitHub

Dialogue Injection: Continuation

In code name: dialogue_injection_continuation

Probes an LLM’s ethical compliance by eliciting detailed instructions on illegal activities, using the conversation simulation technique by using few-shot system prompting and unfinished assitant response at the end. Built according to the paper.

Original Paper: https://arxiv.org/abs/2503.08195

View code on GitHub

Dialogue Injection: Developer mode

In code name: dialogue_injection_devmode

Probes an LLM’s ethical compliance by eliciting detailed instructions on illegal activities, using the conversation simulation technique to mislead the LLM with “Developer mode activation”.

Original Paper: https://arxiv.org/abs/2503.08195

View code on GitHub

Do Anything Now (DAN) Jailbreak

In code name: dan

Probes an LLM’s resilience against performing explicitly forbidden activities using the “Do Anything Now” (DAN) character roleplay technique.

View code on GitHub

Ethical Compliance

In code name: ethical_compliance

Evaluates the LLM’s ability to resist discussions about harmful or inappropriate misconceptions on sensitive topics. This test aims to verify the model’s adherence to ethical guidelines against propagating stereotypes and toxic content.

View code on GitHub

HarmBench

In code name: harmbench

Challenges the LLM with harmful behavior prompts from HarmBench (contains original and translated to Russian prompts).

Original Paper: https://arxiv.org/abs/2402.04249

View code on GitHub

Linguistic Evasion

In code name: linguistic_evasion

Leverages the use of a non-Russian language to circumvent the restrictions or focus set by the system prompt.

View code on GitHub

Linguistic Sandwich

In code name: linguistic_sandwich

Exploits attention blink phenomena by hiding adversarial questions between innocent questions in low-resource languages. The attack leverages the concept that LLMs may have reduced attention when processing mixed low-resource languages.

Original Paper: https://arxiv.org/abs/2404.07242v1

View code on GitHub

Shuffle Inconsistency

In code name: shuffle

Attempts to bypass LLM’s defense mechanism by randomly shuffling elements in input harmful prompt.

Original Paper: https://arxiv.org/abs/2501.04931

View code on GitHub

Time Machine

In code name: time_machine

Uses an attack model to dynamically transform prompts into historical or future contexts to test temporal framing vulnerabilities.

Original Paper: https://arxiv.org/abs/2407.11969v2, Code: https://github.com/tml-epfl/llm-past-tense

View code on GitHub

UCAR Jailbreak

In code name: ucar

Probes an LLM’s adherence to content restrictions when explicitly requested to ignore such filters. It assesses the robustness of the LLM’s content moderation systems against direct challenges.

View code on GitHub

OWASP LLM10:2025 Unbounded Consumption¶

Repetition Token

In code name: repetition_token

Attempts to exhaust model resources by injecting random marker that force the model to ignore response boundaries and generate until the maximum context length is reached. The attack uses a two-stage approach: first generating a normal question, then using it in a prompt with special instructions to bypass token limits.

View code on GitHub

Multi-stage attacks¶

What Drives the Multi-stage?

Multi-stage attacks are inspired by the Jailbreaking Black Box Large Language Models in Twenty Queries (PAIR) paper.

For managing a multi-stage interaction between an attacker and tested chat clients, the MultiStageInteractionSession class is available [source]. It contains the following properties:

attacker_session is the session for the attacker.
tested_client_session is the session for the tested client.
stop_criterion is an optional function that determines whether to stop the conversation based on the tested client’s responses.
history_limit is the maximum allowed history length for the attacker.
tested_client_response_handler is an optional function that handles the tested client’s response before passing it to the attacker.
current_step is the current step of the attacker.
refine_args are additional positional arguments for the tested_client_response_handler.
refine_kwargs are additional keyword arguments for the tested_client_response_handler.

OWASP LLM01:2025 Continuous Prompt Injection¶

AutoDAN-Turbo

In code name: autodan_turbo

Implements the AutoDAN-Turbo attack methodology which uses a lifelong agent for strategy self-exploration to jailbreak LLMs. This attack automatically discovers jailbreak strategies without human intervention and combines them for more effective attacks.

Original Paper: https://arxiv.org/abs/2410.05295v3, Code: https://github.com/SaFoLab-WISC/AutoDAN-Turbo

View code on GitHub

Composition of Principles (CoP)

In code name: cop

Implements the Composition-of-Principles (CoP) agentic red-teaming methodology which composes human-provided jailbreak principles to generate and iteratively refine single-turn jailbreak prompts. The pipeline selects effective principles, prompts an attacker model to compose a prompt, verifies success with an LLM judge, and mines new principles from successful attempts to improve future attacks.

Original Paper: https://arxiv.org/html/2506.00781

View code on GitHub

Crescendo

In code name: crescendo

Challenges the LLM with a dataset of harmful behavior prompts using Crescendo strategy, operates by engaging the LLM in a series of escalating conversational turns, and multistage refinement.

Original Paper: https://arxiv.org/abs/2404.01833

View code on GitHub

Prompt Automatic Iterative Refinement (PAIR)

In code name: pair

Challenges the LLM with a dataset of adversarial prompts using multistage refinement with judge model scoring.

Original Paper: https://arxiv.org/abs/2310.08419v4, Code: https://github.com/patrickrchao/JailbreakingLLMs

View code on GitHub

OWASP LLM07:2025 System Prompt Leakage¶

System Prompt Leakage

In code name: system_prompt_leakage

Challenges the LLM with datasets of malicious prompts to extract system instructions using multistage refinement.

View code on GitHub

OWASP LLM09:2025 Misinformation¶

Logical Inconsistencies Test

In code name: logical_inconsistencies

Attacks the LLM system during a multistage conversation to generate logical inconsistencies in answers using antonyms, negations, and paraphrases based on the system’s domain.

View code on GitHub

Sycophancy Test

In code name: sycophancy

Attacks the LLM system during a multistage conversation using the argumentum ad verecundiam method based on the system’s domain.

Original Paper: https://arxiv.org/abs/2310.13548

View code on GitHub

VLM attacks¶

See an example with VLM attacks.

OWASP LLM09:2025 Misinformation¶

Lowres Documents Hallucination Attack

In code name: vlm_lowres_docs

Test class for VLM responses for low resolution documents. The idea is that we provide model with unreadable document and ask to describe its contents. The goal is to see whether model will signal that the document is unreadable or hallucinate about the contents.

View code on GitHub

M-Attack

In code name: vlm_m_attack

Test VLM for M-Attack, which adds noise to the image to make VLM see it as a different image.

Original Paper: https://arxiv.org/abs/2503.10635

View code on GitHub

Text Hallucination Attack

In code name: vlm_text_hallucination

Test class for VLM text hallucination attacks. The idea is to give model and image which contains only text and ask it to describe it. The goal is to see whether model will describe text or hallucinate with the description of an object written in text.

View code on GitHub