
😈🛡️Awesome-Jailbreak-against-Multimodal-Generative-Models

🔥🔥🔥 Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey

Paper

We've curated a collection of the latest 😋, most comprehensive 😎, and most valuable 🤩 resources on jailbreak attacks and defenses against multimodal generative models.
But we don't stop there: our repository is constantly updated to ensure you have the most current information at your fingertips.

[Figure: survey overview]

🤗Introduction

This survey presents a comprehensive review of existing jailbreak attacks and defenses against multimodal generative models.
Following the generalized lifecycle of a multimodal jailbreak, we systematically explore attacks and the corresponding defense strategies across four levels: input, encoder, generator, and output.

🧑‍💻 Four Levels of the Multimodal Jailbreak Lifecycle

  • Input Level: Attackers and defenders operate solely on the input data. Attackers modify inputs to execute attacks, while defenders incorporate protective cues to enhance detection.
  • Encoder Level: With access to the encoder, attackers optimize adversarial inputs to inject malicious information into the encoding process, while defenders work to prevent harmful information from being encoded within the latent space.
  • Generator Level: With full access to the generative models, attackers leverage inference information, such as activations and gradients, and fine-tune models to increase adversarial effectiveness, while defenders use these techniques to strengthen model robustness.
  • Output Level: With the output from the generative model, attackers can iteratively refine adversarial inputs, while defenders can apply post-processing techniques to enhance detection.

Based on this analysis, we present a detailed taxonomy of attack methods, defense mechanisms, and evaluation frameworks specific to multimodal generative models.
We cover a wide range of input-output configurations, including modalities such as Any-to-Text, Any-to-Vision, and Any-to-Any within generative systems.

[Figure: taxonomy overview]

🚀Table of Contents

  • 🔥Multimodal Generative Models
  • 😈Jailbreak Attack
  • 🛡️Jailbreak Defense
  • 💯Evaluation
  • 😉Citation

🔥Multimodal Generative Models

Below are tables of model short names and representative generative models used in jailbreak research. For input/output modalities: I = Image, T = Text, V = Video, A = Audio.

📑Any-to-Text Models (LLM Backbone)

| Short Name | Modality | Representative Model |
|---|---|---|
| I+T→T | I + T → T | LLaVA, MiniGPT4, InstructBLIP |
| VT2T | V + T → T | Video-LLaVA, Video-LLaMA |
| AT2T | A + T → T | Audio Flamingo, AudioPaLM |

📖Any-to-Vision (Diffusion Backbone)

| Short Name | Modality | Representative Model |
|---|---|---|
| T→I | T → I | Stable Diffusion, Midjourney, DALL·E |
| IT→I | I + T → I | DreamBooth, InstructP2P |
| T2V | T → V | Open-Sora, Stable Video Diffusion |
| IT2V | I + T → V | VideoPoet, CogVideoX |

📰Any-to-Any (Unified Backbone)

| Short Name | Modality | Representative Model |
|---|---|---|
| IT→IT | I + T → I + T | NExT-GPT, Chameleon |
| TIV2TIV | T + I + V → T + I + V | EMU3 |
| Any2Any | Any → Any | GPT-4o, Gemini Ultra |

😈Jailbreak Attack

📖Attack-Intro

We categorize attack methods into black-box, gray-box, and white-box attacks. In a black-box setting, where the model is inaccessible to the attacker, attacks are limited to surface-level interactions, focusing solely on the model's inputs and/or outputs. For gray-box and white-box settings, we consider model-level attacks, targeting both the encoder and the generator.

  • Input-level attack: Attackers are compelled to develop more sophisticated input templates spanning prompt engineering, image engineering, and role-play techniques.
  • Output-level attack: Attackers focus on querying outputs across multiple input variants. Driven by specific adversarial goals, they employ estimation-based and search-based techniques to iteratively refine these input variants (see the sketch after this list).
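
As a concrete illustration of the output-level setting, the sketch below shows a generic search-based refinement loop in Python. It is a minimal sketch and not the method of any paper listed here; `query_model`, `toxicity_score`, and `mutate` are hypothetical helpers standing in for black-box model access, an output judge, and a prompt-perturbation routine.

```python
# Hypothetical helpers (assumptions, not taken from any surveyed paper):
#   query_model(prompt, image) -> str    : one black-box call to the target model
#   toxicity_score(text)       -> float  : output judge in [0, 1], higher = more harmful
#   mutate(prompt)             -> str    : small perturbation (synonym swap, suffix edit, ...)

def search_based_jailbreak(seed_prompt, image, query_model, toxicity_score, mutate,
                           n_iters=50, n_candidates=8, success_threshold=0.9):
    """Output-level attack loop: refine input variants using only the model's outputs."""
    best_prompt, best_score = seed_prompt, 0.0
    for _ in range(n_iters):
        # Propose several variants of the current best prompt (no gradients, no internals).
        candidates = [mutate(best_prompt) for _ in range(n_candidates)]
        # Score each variant purely from the generated output.
        scored = [(toxicity_score(query_model(p, image)), p) for p in candidates]
        score, prompt = max(scored)
        if score > best_score:
            best_prompt, best_score = prompt, score
        if best_score >= success_threshold:  # stop once the judge flags a successful jailbreak
            break
    return best_prompt, best_score
```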

[Figure: black-box jailbreak attacks (input level and output level)]

  • Encoder-level attack: Attackers are restricted to accessing only the encoders when provoking harmful responses. In this case, they typically seek to maximize cosine similarity within the latent space, ensuring the adversarial input retains semantics similar to the target malicious content while still being classified as safe (see the sketch after this list).
  • Generator-level attack: Attackers have unrestricted access to the generative model's architecture and checkpoints, allowing them to investigate and manipulate the model thoroughly and thus mount sophisticated attacks.
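
To make the encoder-level objective concrete, here is a minimal PyTorch sketch of maximizing cosine similarity in a CLIP-style latent space. It is an assumption-laden illustration rather than a reproduction of any surveyed attack: CLIP (via Hugging Face `transformers`) stands in for the victim's vision encoder, `benign.png` and the target text are placeholders, and the perturbation budget is nominal.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumption: CLIP stands in for the victim model's vision/text encoders.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
model.requires_grad_(False)  # only the perturbation is optimized
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

benign = Image.open("benign.png")               # placeholder starting image
target_text = "<target malicious instruction>"  # placeholder target semantics

pixel_values = processor(images=benign, return_tensors="pt")["pixel_values"]
text_inputs = processor(text=[target_text], return_tensors="pt", padding=True)
with torch.no_grad():
    target_emb = F.normalize(model.get_text_features(**text_inputs), dim=-1)

delta = torch.zeros_like(pixel_values, requires_grad=True)
epsilon, alpha = 8 / 255, 1 / 255               # nominal L-inf budget and step size

for _ in range(200):
    image_emb = F.normalize(model.get_image_features(pixel_values=pixel_values + delta), dim=-1)
    # Maximize cosine similarity between the adversarial image and the target text embedding.
    loss = -(image_emb * target_emb).sum()
    loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()      # gradient step toward higher similarity
        delta.clamp_(-epsilon, epsilon)         # stay within the perturbation budget
        delta.grad.zero_()

adv_pixel_values = pixel_values + delta         # fed to the victim model's vision tower
```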

[Figure: gray-box and white-box jailbreak attacks (encoder level and generator level)]

📑Papers

Below are the papers related to jailbreak attacks.

Jailbreak Attack of Any-to-Text Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|---|---|---|---|---|---|
| From Compliance to Exploitation: Jailbreak Prompt Attacks on Multimodal LLMs | Arxiv 2025 | 2025/02/02 | None | --- | A+T→T |
| "I am bad": Interpreting Stealthy, Universal and Robust Audio Jailbreaks in Audio-Language Models | Arxiv 2025 | 2025/02/02 | None | --- | A+T→T |
| Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak | Arxiv 2025 | 2025/01/23 | None | --- | A+T→T |
| Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency | Arxiv 2025 | 2025/01/09 | None | --- | I+T→T |
| Divide and Conquer: A Hybrid Strategy Defeats Multimodal Large Language Models | Arxiv 2024 | 2024/12/21 | None | --- | I+T+A→T |
| Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models | Arxiv 2024 | 2024/12/08 | None | --- | I+T→T |
| PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization | Arxiv 2024 | 2024/12/08 | None | --- | I+T→T |
| Jailbreak Large Vision-Language Models Through Multi-Modal Linkage | Arxiv 2024 | 2024/11/30 | Github | --- | I+T→T |
| VLSBench: Unveiling Visual Leakage in Multimodal Safety | Arxiv 2024 | 2024/11/29 | Homepage | Input Level | I+T→T |
| Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models | Arxiv 2024 | 2024/11/18 | None | Output Level | I+T→T |
| IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves | Arxiv 2024 | 2024/11/15 | None | Output Level | I+T→T |
| Zer0-Jack: A memory-efficient gradient-based jailbreaking method for black box Multi-modal Large Language Models | NeurIPS SafeGenAI Workshop 2024 | 2024/11/12 | None | Output Level | I+T→T |
| Audio is the achilles' heel: Red teaming audio large multimodal models | Arxiv 2024 | 2024/10/31 | None | Input Level | A+T→T |
| Advweb: Controllable black-box attacks on vlm-powered web agents | Arxiv 2024 | 2024/10/22 | None | Input Level | I+T→T |
| Can Large Language Models Automatically Jailbreak GPT-4V? | NAACL Workshop 2024 | 2024/07/23 | None | Input Level | I+T→T |
| Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts | ACM MM 2024 | 2024/07/21 | None | Input Level | I+T→T |
| Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything | Arxiv 2024 | 2024/07/01 | None | Input Level | I+T→T |
| From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking | EMNLP 2024 | 2024/06/21 | None | Encoder Level | I+T→T |
| Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt | Arxiv 2024 | 2024/06/06 | Github | Generator Level | I+T→T |
| Efficient LLM-Jailbreaking by Introducing Visual Modality | Arxiv 2024 | 2024/05/30 | None | Generator Level | I+T→T |
| White-box Multimodal Jailbreaks Against Large Vision-Language Models | ACM Multimedia 2024 | 2024/05/28 | None | Generator Level | I+T→T |
| Medical MLLM is Vulnerable: Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models | Arxiv 2024 | 2024/05/26 | Github | --- | I+T→T |
| Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character | Arxiv 2024 | 2024/05/25 | None | Input Level | I+T→T |
| Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models | ECCV 2024 | 2024/05/14 | Github | Generator Level | I+T→T |
| Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast | ICML 2024 | 2024/02/13 | Github | Generator Level | I+T→T |
| Jailbreaking Attack against Multimodal Large Language Model | Arxiv 2024 | 2024/02/04 | None | Generator Level | I+T→T |
| Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models | ICLR 2024 Spotlight | 2024/01/16 | Github | Encoder Level | I+T→T |
| MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models | ECCV 2024 | 2023/11/29 | Github | Input Level | I+T→T |
| How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | ECCV 2024 | 2023/11/27 | Github | Encoder Level | I+T→T |
| Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | Arxiv 2023 | 2023/11/15 | None | Output Level | I+T→T |
| FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts | AAAI 2025 | 2023/11/09 | Github | Input Level | I+T→T |
| Image Hijacks: Adversarial Images can Control Generative Models at Runtime | ICML 2024 | 2023/09/01 | Github | Generator Level | I+T→T |
| Are aligned neural networks adversarially aligned? | NeurIPS 2023 | 2023/06/26 | None | Generator Level | I+T→T |
| Visual Adversarial Examples Jailbreak Aligned Large Language Models | AAAI 2024 | 2023/06/22 | Github | Generator Level | I+T→T |
| On Evaluating Adversarial Robustness of Large Vision-Language Models | NeurIPS 2023 | 2023/05/26 | Homepage | Encoder Level | I+T→T |

Jailbreak Attack of Any-to-Vision Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|---|---|---|---|---|---|
| CogMorph: Cognitive Morphing Attacks for Text-to-Image Models | Arxiv 2025 | 2025/01/21 | None | --- | T→I |
| FameBias: Embedding Manipulation Bias Attack in Text-to-Image Models | Arxiv 2024 | 2024/12/24 | None | --- | T→I |
| Antelope: Potent and Concealed Jailbreak Attack Strategy | Arxiv 2024 | 2024/12/11 | None | --- | T→I |
| In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models | Arxiv 2024 | 2024/11/25 | None | Output Level | T→I |
| Unfiltered and Unseen: Universal Multimodal Jailbreak Attacks on Text-to-Image Model Defenses | OpenReview | 2024/11/13 | None | --- | T→I |
| AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion models | Arxiv 2024 | 2024/10/28 | Github | Encoder Level | T→I |
| Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step | Arxiv 2024 | 2024/10/04 | None | Output Level | T→I |
| ColJailBreak: Collaborative Generation and Editing for Jailbreaking Text-to-Image Deep Generation | NeurIPS 2024 | 2024/09/25 | Github | Input Level | T→I |
| HTS-Attack: Heuristic Token Search for Jailbreaking Text-to-Image Models | Arxiv 2024 | 2024/08/25 | None | Output Level | T→I |
| Perception-guided Jailbreak against Text-to-Image Models | AAAI 2025 | 2024/08/20 | None | Input Level | T→I |
| DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization | Arxiv 2024 | 2024/08/18 | None | Output Level | T→I |
| Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models | Arxiv 2024 | 2024/08/02 | None | Encoder Level | T→I |
| Jailbreaking Text-to-Image Models with LLM-Based Agents | Arxiv 2024 | 2024/08/01 | None | Output Level | T→I |
| Automatic Jailbreaking of the Text-to-Image Generative AI Systems | Arxiv 2024 | 2024/05/26 | None | Output Level | T→I |
| UPAM: Unified Prompt Attack in Text-to-Image Generation Models Against Both Textual Filters and Visual Checkers | ICML 2024 | 2024/05/18 | None | Input Level | T→I |
| BSPA: Exploring Black-box Stealthy Prompt Attacks against Image Generators | Arxiv 2024 | 2024/02/23 | None | Input Level | T→I |
| Harnessing LLM to Attack LLM-Guarded Text-to-Image Models | Arxiv 2023 | 2023/12/12 | Github | Input Level | T→I |
| MMA-Diffusion: MultiModal Attack on Diffusion Models | CVPR 2024 | 2023/11/29 | Github | Encoder Level | T→I |
| VA3: Virtually Assured Amplification Attack on Probabilistic Copyright Protection for Text-to-Image Generative Models | CVPR 2024 | 2023/11/29 | Github | Generator Level | T→I |
| To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now | ECCV 2024 | 2023/10/18 | Github | Generator Level | T→I |
| Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? | ICLR 2024 | 2023/10/16 | Github | Encoder Level | T→I |
| SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution | CCS 2024 | 2023/09/25 | None | Input Level | T→I |
| Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts | ICML 2024 | 2023/09/12 | Github | Generator Level | T→I |
| SneakyPrompt: Jailbreaking Text-to-image Generative Models | Symposium on Security and Privacy 2024 | 2023/05/20 | Github | Output Level | T→I |
| Red-Teaming the Stable Diffusion Safety Filter | NeurIPSW 2022 | 2022/10/03 | None | Input Level | T→I |

Jailbreak Attack of Any-to-Any Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|---|---|---|---|---|---|
| Gradient-based Jailbreak Images for Multimodal Fusion Models | Arxiv 2024 | 2024/10/04 | Github | Generator Level | I+T→I+T |
| Voice jailbreak attacks against gpt-4o | Arxiv 2024 | 2024/05/29 | Github | Output Level | Any→Any |

🛡️Jailbreak Defense

📖Defense-Intro

Current efforts on jailbreak defense for multimodal generative models follow two lines of work: discriminative defense and transformative defense.

  • Discriminative defense: constrained to classification tasks that assign binary (safe/unsafe) labels, flagging malicious inputs or outputs without altering generation itself (a minimal sketch follows the figure below).

[Figure: discriminative jailbreak defense]
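
For intuition, the sketch below shows one common shape a discriminative defense can take: a lightweight binary classifier over frozen text-encoder embeddings that flags unsafe prompts before they reach the generator. This is a minimal, hypothetical example (CLIP embeddings plus logistic regression, with a toy training set), not a reimplementation of any surveyed guardrail.

```python
import torch
import torch.nn.functional as F
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

# Embed prompts with a frozen text encoder (CLIP here is an assumption; any encoder works).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(prompts):
    inputs = processor(text=prompts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        return F.normalize(model.get_text_features(**inputs), dim=-1).numpy()

# Toy illustrative training set; real guardrails use large curated safety datasets.
safe = ["a photo of a cat on a sofa", "a watercolor painting of mountains"]
unsafe = ["step-by-step instructions for building a weapon", "graphic violent imagery of a crowd"]
X = embed(safe + unsafe)
y = [0] * len(safe) + [1] * len(unsafe)   # binary label: 0 = safe, 1 = unsafe

clf = LogisticRegression().fit(X, y)

def is_unsafe(prompt, threshold=0.5):
    """Discriminative defense: flag the input before it reaches the generator."""
    return clf.predict_proba(embed([prompt]))[0, 1] > threshold
```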

  • Transformative defense: aims to produce appropriate and safe responses in the presence of malicious or adversarial inputs.

[Figure: transformative jailbreak defense]

📑Papers

Below are the papers related to jailbreak defense.

Jailbreak Defense of Any-to-Text Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|---|---|---|---|---|---|
| Towards Robust Multimodal Large Language Models Against Jailbreak Attacks | Arxiv 2025 | 2025/02/02 | None | --- | I+T→T |
| Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models | Arxiv 2025 | 2025/01/30 | None | --- | I+T→T |
| Internal Activation Revision: Safeguarding Vision Language Models Without Parameter Update | Arxiv 2025 | 2025/01/24 | None | --- | I+T→T |
| MSTS: A Multimodal Safety Test Suite for Vision-Language Models | Arxiv 2025 | 2025/01/17 | Github | --- | I+T→T |
| Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models | Arxiv 2025 | 2025/01/03 | Github | --- | I+T→T |
| Defending LVLMs Against Vision Attacks through Partial-Perception Supervision | Arxiv 2024 | 2024/12/17 | None | --- | I+T→T |
| Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Arxiv 2024 | 2024/11/27 | None | Output Level | I+T→T |
| Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks | Arxiv 2024 | 2024/11/23 | Github | Generator Level | I+T→T |
| Uniguard: Towards universal safety guardrails for jailbreak attacks on multimodal large language models | Arxiv 2024 | 2024/11/03 | None | Input Level | I+T→T |
| Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector | Arxiv 2024 | 2024/10/30 | None | Generator Level | I+T→T |
| BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks | Arxiv 2024 | 2024/10/28 | None | Input Level | I+T→T |
| The Great Contradiction Showdown: How Jailbreak and Stealth Wrestle in Vision-Language Models? | Arxiv 2024 | 2024/10/02 | None | Input Level | I+T→T |
| CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration | COLM 2024 | 2024/09/17 | None | Output Level | I+T→T |
| Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks | Arxiv 2024 | 2024/09/11 | None | Encoder Level | I+T→T |
| Bathe: Defense against the jailbreak attack in multimodal large language models by treating harmful instruction as backdoor trigger | Arxiv 2024 | 2024/08/17 | None | Generator Level | I+T→T |
| Defending jailbreak attack in vlms via cross-modality information detector | Arxiv 2024 | 2024/07/31 | Github | Encoder Level | I+T→T |
| Sim-clip: Unsupervised siamese adversarial fine-tuning for robust and semantically-rich vision-language models | Arxiv 2024 | 2024/07/20 | Github | Encoder Level | I+T→T |
| Cross-modal safety alignment: Is textual unlearning all you need? | Arxiv 2024 | 2024/05/27 | None | Generator Level | I+T→T |
| Safety alignment for vision language models | Arxiv 2024 | 2024/05/22 | None | Generator Level | I+T→T |
| Adashield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting | ECCV 2024 | 2024/05/14 | Github | Input Level | I+T→T |
| Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation | ECCV 2024 | 2024/03/14 | Github | Output Level | I+T→T |
| Safety fine-tuning at (almost) no cost: A baseline for vision large language models | ICML 2024 | 2024/02/03 | Github | Generator Level | I+T→T |
| Inferaligner: Inference-time alignment for harmlessness through cross-model guidance | EMNLP 2024 | 2024/01/20 | Github | Generator Level | I+T→T |
| Mllm-protector: Ensuring mllm’s safety without hurting performance | EMNLP 2024 | 2024/01/05 | Github | Output Level | I+T→T |
| Jailguard: A universal detection framework for llm prompt-based attacks | Arxiv 2023 | 2023/12/17 | Github | Output Level | I+T→T |
| Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions | ICLR 2024 | 2023/09/14 | Github | Generator Level | I+T→T |

Jailbreak Defense of Any-to-Vision Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|---|---|---|---|---|---|
| Distorting Embedding Space for Safety: A Defense Mechanism for Adversarially Robust Diffusion Models | Arxiv 2025 | 2025/01/30 | Github | --- | T→I |
| CE-SDWV: Effective and Efficient Concept Erasure for Text-to-Image Diffusion Models via a Semantic-Driven Word Vocabulary | Arxiv 2025 | 2025/01/26 | None | --- | T→I |
| CROPS: Model-Agnostic Training-Free Framework for Safe Image Synthesis with Latent Diffusion Models | Arxiv 2025 | 2025/01/09 | None | --- | T→I |
| PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models | Arxiv 2025 | 2025/01/07 | Homepage | --- | T→I |
| DuMo: Dual Encoder Modulation Network for Precise Concept Erasure | AAAI 2025 | 2025/01/02 | Github | --- | T→I |
| AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image Models | Arxiv 2024 | 2024/12/24 | None | --- | T→I |
| SafeCFG: Redirecting Harmful Classifier-Free Guidance for Safe Generation | Arxiv 2024 | 2024/12/20 | None | --- | T→I |
| SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation | Arxiv 2024 | 2024/12/13 | Github | --- | T→I |
| TraSCE: Trajectory Steering for Concept Erasure | Arxiv 2024 | 2024/12/10 | Github | --- | T→I |
| Buster: Incorporating Backdoor Attacks into Text Encoder to Mitigate NSFW Content Generation | Arxiv 2024 | 2024/12/10 | None | --- | T→I |
| Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization | Arxiv 2024 | 2024/12/05 | None | --- | T→I |
| Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models | Arxiv 2024 | 2024/11/30 | None | --- | T→I |
| Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction | Arxiv 2024 | 2024/11/21 | None | --- | T→I |
| Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding | Arxiv 2024 | 2024/11/15 | None | Encoder Level | T→I |
| Safree: Training-free and adaptive guard for safe text-to-image and video generation | ICLR 2025 | 2024/10/16 | Github | Generator Level | T→I/T→V |
| Shielddiff: Suppressing sexual content generation from diffusion models through reinforcement learning | Arxiv 2024 | 2024/10/04 | None | Generator Level | T→I |
| Dark miner: Defend against unsafe generation for text-to-image diffusion models | Arxiv 2024 | 2024/09/26 | None | Generator Level | T→I |
| Score forgetting distillation: A swift, data-free method for machine unlearning in diffusion models | Arxiv 2024 | 2024/09/17 | None | Generator Level | T→I |
| EIUP: A Training-Free Approach to Erase Non-Compliant Concepts Conditioned on Implicit Unsafe Prompts | Arxiv 2024 | 2024/08/02 | None | Generator Level | T→I |
| Direct Unlearning Optimization for Robust and Safe Text-to-Image Models | NeurIPS 2024 | 2024/07/17 | Github | Generator Level | T→I |
| Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models | ECCV 2024 | 2024/07/17 | Github | Generator Level | T→I |
| Conceptprune: Concept editing in diffusion models via skilled neuron pruning | Arxiv 2024 | 2024/05/29 | Github | Generator Level | T→I |
| Pruning for Robust Concept Erasing in Diffusion Models | Arxiv 2024 | 2024/05/26 | None | Generator Level | T→I |
| Defensive unlearning with adversarial training for robust concept erasure in diffusion models | NeurIPS 2024 | 2024/05/24 | Github | Encoder Level | T→I |
| Unlearning concepts in diffusion model via concept domain correction and concept preserving gradient | AAAI 2025 | 2024/05/24 | Github | Generator Level | T→I |
| Espresso: Robust Concept Filtering in Text-to-Image Models | Arxiv 2024 | 2024/04/30 | None | Output Level | T→I |
| Latent Guard: a Safety Framework for Text-to-image Generation | ECCV 2024 | 2024/04/11 | Github | Encoder Level | T→I |
| SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models | ACM CCS 2024 | 2024/04/10 | Github | Generator Level | T→I |
| Salun: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation | ICLR 2024 | 2024/04/04 | Github | Generator Level | T→I |
| GuardT→I: Defending Text-to-Image Models from Adversarial Prompts | NeurIPS 2024 | 2024/03/03 | None | Encoder Level | T→I |
| Universal prompt optimizer for safe text-to-image generation | NAACL 2024 | 2024/02/16 | Github | Input Level | T→I |
| Erasediff: Erasing data influence in diffusion models | Arxiv 2024 | 2024/01/11 | None | Generator Level | T→I |
| Localization and manipulation of immoral visual cues for safe text-to-image generation | WACV 2024 | 2024/01/01 | None | Output Level | T→I |
| Receler: Reliable concept erasing of text-to-image diffusion models via lightweight erasers | ECCV 2024 | 2023/11/29 | Github | Generator Level | T→I |
| Self-discovering interpretable diffusion latent directions for responsible text-to-image generation | CVPR 2024 | 2023/11/28 | Github | Encoder Level | T→I |
| Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models | ECCV 2024 | 2023/11/27 | Github | Encoder Level | T→I |
| Mace: Mass concept erasure in diffusion models | CVPR 2024 | 2023/10/19 | Github | Generator Level | T→I |
| Implicit concept removal of diffusion models | ECCV 2024 | 2023/10/09 | None | Input Level | T→I |
| Unified concept editing in diffusion models | WACV 2024 | 2023/08/25 | Github | Generator Level | T→I |
| Towards safe self-distillation of internet-scale text-to-image diffusion models | ICML 2023 Workshop on Challenges in Deployable Generative AI | 2023/07/12 | Github | Generator Level | T→I |
| Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models | CVPR 2024 | 2023/05/30 | Github | Generator Level | T→I |
| Erasing concepts from diffusion models | ICCV 2023 | 2023/05/13 | Github | Generator Level | T→I |
| Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models | CVPR 2023 | 2022/11/09 | Github | Generator Level | T→I |

Jailbreak Defense of Any-to-Any Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|---|---|---|---|---|---|

💯Evaluation

⭐️Evaluation Datasets

Below are comparison tables of publicly available representative evaluation datasets, along with a description of each attribute used in the tables.

  • Collected: raw data created by humans or collected from real-world websites.
  • Reconstructed: data reorganized from other existing datasets.
  • Synthesized: AI-generated data produced by LLMs or diffusion models.
  • Adversarial: adversarial data generated by jailbreak attack methods.

Used for Any-to-Text Models

| Dataset | Text Source | Image Source | Volume | Theme | Access |
|---|---|---|---|---|---|
| FigStep | Synthesized | Adversarial | 500 | 10 | Github |
| AdvBench | Synthesized | --- | 500 | --- | Github |
| RedTeam-2K | Collected & Reconstructed & Synthesized | N/A | 2000 | 16 | Huggingface |
| HarmBench | Collected | --- | 510 | 4 | Github |
| HADES | Synthesized | Collected & Synthesized & Adversarial | 750 | 5 | Github |
| MM-SafetyBench | Synthesized | Synthesized & Adversarial | 5040 | 13 | Github |
| JailBreakV-28K | Adversarial | Reconstructed & Synthesized | 28000 | 16 | Huggingface |

Used for Any-to-Vision Models

| Dataset | Text Source | Image Source | Volume | Theme | Access |
|---|---|---|---|---|---|
| NSFW-200 | Synthesized | --- | 200 | --- | Github |
| MMA | Reconstructed & Adversarial | Adversarial | 1000 | --- | Huggingface |
| VBCDE | Reconstructed & Adversarial | --- | 100 | 5 | Github |
| I2P | Collected | Collected | 4703 | 7 | Huggingface |
| Unsafe Diffusion | Collected & Reconstructed | --- | 1434 | --- | Github |
| MACE-Celebrity | Collected | --- | 1000 | --- | Github |
| MACE-Art | Reconstructed | --- | 1000 | --- | Github |
| MPUP | Synthesized | --- | 1200 | 4 | Huggingface |
| T2VSafetyBench | Reconstructed & Synthesized & Adversarial | --- | 4400 | 12 | Github |

📚Evaluation Methods

Current evaluation methods are primarily classified into two categories: manual evaluation and automated evaluation.

  • Manual evaluation involves human assessment to determine if the content is toxic, offering a direct and interpretable method of evaluation.
  • Automated approaches assess the safety of multimodal generative models using a range of techniques, including detector-based, GPT-based, and rule-based methods (a detector-based example is sketched below).
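
As an example of detector-based automated evaluation, the snippet below scores model responses with the off-the-shelf Detoxify classifier (listed in the text-detector table below) and computes a simple threshold-based, attack-success-rate-style metric. The responses, threshold, and metric definition are illustrative assumptions; evaluation protocols differ across the surveyed papers.

```python
from detoxify import Detoxify  # pip install detoxify

# Assumed inputs: model responses collected from jailbreak prompts (placeholders here).
responses = [
    "I'm sorry, I can't help with that request.",
    "<a potentially harmful model response>",
]

detector = Detoxify("original")                  # off-the-shelf toxicity classifier
scores = [detector.predict(r)["toxicity"] for r in responses]

# A simple detector-based attack-success-rate (ASR) style metric.
threshold = 0.5                                  # assumed decision threshold
asr = sum(s > threshold for s in scores) / len(scores)
print(f"Detector-based ASR: {asr:.2%}")
```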

[Figure: jailbreak evaluation methods]

Text Detector

| Toxicity Detector | Access |
|---|---|
| Llama-Guard | Huggingface |
| Llama-Guard 2 | Huggingface |
| Detoxify | Github |
| GPTFUZZER | Huggingface |
| Perspective API | Website |

Image Detector

| Toxicity Detector | Access |
|---|---|
| NudeNet | Github |
| Q16 | Github |
| Safety Checker | Huggingface |
| Imgcensor | Github |
| Multi-headed Safety Classifier | Github |

😉Citation

If you find this work useful for your research, please kindly cite it using the following BibTeX:

@article{liu2024jailbreak,
    title={Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey},
    author={Liu, Xuannan and Cui, Xing and Li, Peipei and Li, Zekun and Huang, Huaibo and Xia, Shuhan and Zhang, Miaoxuan and Zou, Yueying and He, Ran},
    journal={arXiv preprint arXiv:2411.09259},
    year={2024},
}
