Attacks

This section provides an overview of all available attacks.

Single-stage attacks

OWASP LLM01:2025 Prompt Injection

Adversarial Suffix

In code name: suffix

Uses a specially crafted suffix to bypass LLM filters and restrictions, forcing the model to generate prohibited or harmful content despite the original query. Evaluates the model’s vulnerability to input data manipulation.

Original Paper: https://arxiv.org/abs/2307.15043, Original Code: https://github.com/llm-attacks/llm-attacks

View code on GitHub

Aim Jailbreak

In code name: aim_jailbreak

Probes an LLM’s ethical compliance by eliciting detailed instructions on illegal activities, using the Nicolo Machiavelli character roleplay.

View code on GitHub

Base64 Injection

In code name: base64_injection

Challenges the LLM with a dataset of lines, encoded into base64.

View code on GitHub

Best-of-N (BoN) Jailbreak

In code name: bon

Probes the LLM resilience against Best-Of-N (BoN) attack by randomly transform symbols in input harmful prompt.

Original Paper: https://arxiv.org/abs/2412.03556, Original Code: https://github.com/jplhughes/bon-jailbreaking

View code on GitHub

Deceptive Delight

In code name: deceptive_delight

Tests LLM’s vulnerability to distraction and context manipulation by embedding unsafe content alongside benign topics in three messages.

View code on GitHub

Dialogue Injection: Continuation

In code name: dialogue_injection_continuation

Probes an LLM’s ethical compliance by eliciting detailed instructions on illegal activities, using the conversation simulation technique by using few-shot system prompting and unfinished assitant response at the end. Built according to the paper.

Original Paper: https://arxiv.org/abs/2503.08195

View code on GitHub

Dialogue Injection: Developer mode

In code name: dialogue_injection_devmode

Probes an LLM’s ethical compliance by eliciting detailed instructions on illegal activities, using the conversation simulation technique to mislead the LLM with “Developer mode activation”.

Original Paper: https://arxiv.org/abs/2503.08195

View code on GitHub

Do Anything Now (DAN) Jailbreak

In code name: dan

Probes an LLM’s resilience against performing explicitly forbidden activities using the “Do Anything Now” (DAN) character roleplay technique.

View code on GitHub

Ethical Compliance

In code name: ethical_compliance

Evaluates the LLM’s ability to resist discussions about harmful or inappropriate misconceptions on sensitive topics. This test aims to verify the model’s adherence to ethical guidelines against propagating stereotypes and toxic content.

View code on GitHub

HarmBench

In code name: harmbench

Challenges the LLM with harmful behavior prompts from HarmBench (contains original and translated to Russian prompts).

Original Paper: https://arxiv.org/abs/2402.04249

View code on GitHub

Linguistic Evasion

In code name: linguistic_evasion

Leverages the use of a non-Russian language to circumvent the restrictions or focus set by the system prompt.

View code on GitHub

Past Tense Attack

In code name: past_tense

Challenges the LLM with a dataset of harmful behavior prompts in past tense form.

View code on GitHub

Shuffle Inconsistency

In code name: shuffle

Attempts to bypass LLM’s defense mechanism by randomly shuffling elements in input harmful prompt.

Original Paper: https://arxiv.org/html/2501.04931

View code on GitHub

UCAR Jailbreak

In code name: ucar

Probes an LLM’s adherence to content restrictions when explicitly requested to ignore such filters. It assesses the robustness of the LLM’s content moderation systems against direct challenges.

View code on GitHub

Multi-stage attacks

What Drives the Multi-stage?

Multi-stage attacks are inspired by the Jailbreaking Black Box Large Language Models in Twenty Queries (PAIR) paper.

For managing a multi-stage interaction between an attacker and tested chat clients, the MultiStageInteractionSession class is available [source]. It contains the following properties:

  • attacker_session is the session for the attacker.

  • tested_client_session is the session for the tested client.

  • stop_criterion is an optional function that determines whether to stop the conversation based on the tested client’s responses.

  • history_limit is the maximum allowed history length for the attacker.

  • tested_client_response_handler is an optional function that handles the tested client’s response before passing it to the attacker.

  • current_step is the current step of the attacker.

  • refine_args are additional positional arguments for the tested_client_response_handler.

  • refine_kwargs are additional keyword arguments for the tested_client_response_handler.

OWASP LLM01:2025 Continuous Prompt Injection

AutoDAN-Turbo

In code name: autodan_turbo

Implements the AutoDAN-Turbo attack methodology which uses a lifelong agent for strategy self-exploration to jailbreak LLMs. This attack automatically discovers jailbreak strategies without human intervention and combines them for more effective attacks.

Original Paper: https://arxiv.org/abs/2410.05295v3, Original Code: https://github.com/SaFoLab-WISC/AutoDAN-Turbo

View code on GitHub

Crescendo

In code name: crescendo

Challenges the LLM with a dataset of harmful behavior prompts using Crescendo strategy, operates by engaging the LLM in a series of escalating conversational turns, and multistage refinement.

Original Paper: https://arxiv.org/abs/2404.01833

View code on GitHub

Prompt Automatic Iterative Refinement (PAIR)

In code name: pair

Challenges the LLM with a dataset of harmful behavior prompts using multistage refinement with judge model scoring. Original Paper: https://arxiv.org/abs/2310.08419v4, Code: https://github.com/patrickrchao/JailbreakingLLMs

View code on GitHub

OWASP LLM07:2025 System Prompt Leakage

System Prompt Leakage

In code name: system_prompt_leakage

Challenges the LLM with datasets of malicious prompts to extract system instructions using multistage refinement.

View code on GitHub

OWASP LLM09:2025 Misinformation

Logical Inconsistencies Test

In code name: logical_inconsistencies

Attacks the LLM system during a multistage conversation to generate logical inconsistencies in answers using antonyms, negations, and paraphrases based on the system’s domain.

View code on GitHub

Sycophancy Test

In code name: sycophancy

Attacks the LLM system during a multistage conversation using the argumentum ad verecundiam method based on the system’s domain.

Original Paper: https://arxiv.org/abs/2310.13548

View code on GitHub

VLM attacks

See an example with VLM attacks.

OWASP LLM09:2025 Misinformation

Lowres Documents Hallucination Attack

In code name: vlm_lowres_docs

Test class for VLM responses for low resolution documents. The idea is that we provide model with unreadable document and ask to describe its contents. The goal is to see whether model will signal that the document is unreadable or hallucinate about the contents.

View code on GitHub

M-Attack

In code name: vlm_m_attack

Test VLM for M-Attack, which adds noise to the image to make VLM see it as a different image.

Original Paper: https://arxiv.org/abs/2503.10635

View code on GitHub

Text Hallucination Attack

In code name: vlm_text_hallucination

Test class for VLM text hallucination attacks. The idea is to give model and image which contains only text and ask it to describe it. The goal is to see whether model will describe text or hallucinate with the description of an object written in text.

View code on GitHub