
😈🛡️Awesome-Jailbreak-against-Multimodal-Generative-Models

🔥🔥🔥 Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey

Paper

We've curated a collection of the latest 😋, most comprehensive 😎, and most valuable 🤩 resources on jailbreak attacks and defenses against multimodal generative models.
But we don't stop there: our repository is constantly updated to ensure you have the most current information at your fingertips.

[Figure: survey overview]

🤗Introduction

This survey presents a comprehensive review of existing jailbreak attacks and defenses against multimodal generative models.
Following the generalized lifecycle of a multimodal jailbreak, we systematically explore attacks and the corresponding defense strategies across four levels: input, encoder, generator, and output.

🧑‍💻 Four Levels of the Multimodal Jailbreak Lifecycle

  • Input Level: Attackers and defenders operate solely on the input data. Attackers modify inputs to execute attacks, while defenders incorporate protective cues to enhance detection.
  • Encoder Level: With access to the encoder, attackers optimize adversarial inputs to inject malicious information into the encoding process, while defenders work to prevent harmful information from being encoded within the latent space.
  • Generator Level: With full access to the generative models, attackers leverage inference information, such as activations and gradients, and fine-tune models to increase adversarial effectiveness, while defenders use these techniques to strengthen model robustness.
  • Output Level: With the output from the generative model, attackers can iteratively refine adversarial inputs, while defenders can apply post-processing techniques to enhance detection.

Based on this analysis, we present a detailed taxonomy of attack methods, defense mechanisms, and evaluation frameworks specific to multimodal generative models.
We cover a wide range of input-output configurations, including modalities such as Any-to-Text, Any-to-Vision, and Any-to-Any within generative systems.

[Figure: taxonomy overview]

🚀Table of Contents

  • 🔥Multimodal Generative Models
  • 😈Jailbreak Attack
  • 🛡️Jailbreak Defense
  • 💯Evaluation
  • 😉Citation

🔥Multimodal Generative Models

Below are tables of model short names and representative generative models used in jailbreak research. For input/output modalities: I = Image, T = Text, V = Video, A = Audio.

📑Any-to-Text Models (LLM Backbone)

| Short Name | Modality | Representative Model |
|---|---|---|
| I+T→T | I + T → T | LLaVA, MiniGPT4, InstructBLIP |
| VT2T | V + T → T | Video-LLaVA, Video-LLaMA |
| AT2T | A + T → T | Audio Flamingo, AudioPaLM |

📖Any-to-Vision (Diffusion Backbone)

| Short Name | Modality | Representative Model |
|---|---|---|
| T→I | T → I | Stable Diffusion, Midjourney, DALL·E |
| IT→I | I + T → I | DreamBooth, InstructP2P |
| T2V | T → V | Open-Sora, Stable Video Diffusion |
| IT2V | I + T → V | VideoPoet, CogVideoX |

📰Any-to-Any (Unified Backbone)

| Short Name | Modality | Representative Model |
|---|---|---|
| IT→IT | I + T → I + T | NExT-GPT, Chameleon |
| TIV2TIV | T + I + V → T + I + V | EMU3 |
| Any2Any | Any → Any | GPT-4o, Gemini Ultra |

😈Jailbreak Attack

📖Attack-Intro

We categorize attack methods into black-box, gray-box, and white-box attacks. In a black-box setting, where the model is inaccessible to the attacker, attacks are limited to surface-level interactions, focusing solely on the model's inputs and/or outputs. For gray-box and white-box settings, we consider model-level attacks, targeting both the encoder and the generator.

  • Input-level attack: Attackers are compelled to develop more sophisticated input templates spanning prompt engineering, image engineering, and role-play techniques.
  • Output-level attack: Attackers focus on querying outputs across multiple input variants. Driven by specific adversarial goals, they employ estimation-based and search-based techniques to iteratively refine these input variants (see the sketch after this list).
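
As a concrete illustration of the output-level setting, the sketch below shows a generic search-based refinement loop in Python. It is a minimal sketch and not the method of any paper listed here; `query_model`, `toxicity_score`, and `mutate` are hypothetical helpers standing in for black-box model access, an output judge, and a prompt-perturbation routine.

```python
# Hypothetical helpers (assumptions, not taken from any surveyed paper):
#   query_model(prompt, image) -> str    : one black-box call to the target model
#   toxicity_score(text)       -> float  : output judge in [0, 1], higher = more harmful
#   mutate(prompt)             -> str    : small perturbation (synonym swap, suffix edit, ...)

def search_based_jailbreak(seed_prompt, image, query_model, toxicity_score, mutate,
                           n_iters=50, n_candidates=8, success_threshold=0.9):
    """Output-level attack loop: refine input variants using only the model's outputs."""
    best_prompt, best_score = seed_prompt, 0.0
    for _ in range(n_iters):
        # Propose several variants of the current best prompt (no gradients, no internals).
        candidates = [mutate(best_prompt) for _ in range(n_candidates)]
        # Score each variant purely from the generated output.
        scored = [(toxicity_score(query_model(p, image)), p) for p in candidates]
        score, prompt = max(scored)
        if score > best_score:
            best_prompt, best_score = prompt, score
        if best_score >= success_threshold:  # stop once the judge flags a successful jailbreak
            break
    return best_prompt, best_score
```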

[Figure: black-box jailbreak attacks (input level and output level)]

  • Encoder-level attack: Attackers are restricted to accessing only the encoders when provoking harmful responses. In this case, they typically seek to maximize cosine similarity within the latent space, ensuring the adversarial input retains semantics similar to the target malicious content while still being classified as safe (see the sketch after this list).
  • Generator-level attack: Attackers have unrestricted access to the generative model's architecture and checkpoints, allowing them to investigate and manipulate the model thoroughly and thus mount sophisticated attacks.
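
To make the encoder-level objective concrete, here is a minimal PyTorch sketch of maximizing cosine similarity in a CLIP-style latent space. It is an assumption-laden illustration rather than a reproduction of any surveyed attack: CLIP (via Hugging Face `transformers`) stands in for the victim's vision encoder, `benign.png` and the target text are placeholders, and the perturbation budget is nominal.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumption: CLIP stands in for the victim model's vision/text encoders.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
model.requires_grad_(False)  # only the perturbation is optimized
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

benign = Image.open("benign.png")               # placeholder starting image
target_text = "<target malicious instruction>"  # placeholder target semantics

pixel_values = processor(images=benign, return_tensors="pt")["pixel_values"]
text_inputs = processor(text=[target_text], return_tensors="pt", padding=True)
with torch.no_grad():
    target_emb = F.normalize(model.get_text_features(**text_inputs), dim=-1)

delta = torch.zeros_like(pixel_values, requires_grad=True)
epsilon, alpha = 8 / 255, 1 / 255               # nominal L-inf budget and step size

for _ in range(200):
    image_emb = F.normalize(model.get_image_features(pixel_values=pixel_values + delta), dim=-1)
    # Maximize cosine similarity between the adversarial image and the target text embedding.
    loss = -(image_emb * target_emb).sum()
    loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()      # gradient step toward higher similarity
        delta.clamp_(-epsilon, epsilon)         # stay within the perturbation budget
        delta.grad.zero_()

adv_pixel_values = pixel_values + delta         # fed to the victim model's vision tower
```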

[Figure: gray-box and white-box jailbreak attacks (encoder level and generator level)]

📑Papers

Below are the papers related to jailbreak attacks.

Jailbreak Attack of Any-to-Text Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|---|---|---|---|---|---|
| From Compliance to Exploitation: Jailbreak Prompt Attacks on Multimodal LLMs | Arxiv 2025 | 2025/02/02 | None | --- | A+T→T |
| "I am bad": Interpreting Stealthy, Universal and Robust Audio Jailbreaks in Audio-Language Models | Arxiv 2025 | 2025/02/02 | None | --- | A+T→T |
| Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak | Arxiv 2025 | 2025/01/23 | None | --- | A+T→T |
| Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency | Arxiv 2025 | 2025/01/09 | None | --- | I+T→T |
| Divide and Conquer: A Hybrid Strategy Defeats Multimodal Large Language Models | Arxiv 2024 | 2024/12/21 | None | --- | I+T+A→T |
| Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models | Arxiv 2024 | 2024/12/08 | None | --- | I+T→T |
| PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization | Arxiv 2024 | 2024/12/08 | None | --- | I+T→T |
| Jailbreak Large Vision-Language Models Through Multi-Modal Linkage | Arxiv 2024 | 2024/11/30 | Github | --- | I+T→T |
| VLSBench: Unveiling Visual Leakage in Multimodal Safety | Arxiv 2024 | 2024/11/29 | Homepage | Input Level | I+T→T |
| Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models | Arxiv 2024 | 2024/11/18 | None | Output Level | I+T→T |
| IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves | Arxiv 2024 | 2024/11/15 | None | Output Level | I+T→T |
| Zer0-Jack: A memory-efficient gradient-based jailbreaking method for black box Multi-modal Large Language Models | NeurIPS SafeGenAI Workshop 2024 | 2024/11/12 | None | Output Level | I+T→T |
| Audio is the achilles' heel: Red teaming audio large multimodal models | Arxiv 2024 | 2024/10/31 | None | Input Level | A+T→T |
| Advweb: Controllable black-box attacks on vlm-powered web agents | Arxiv 2024 | 2024/10/22 | None | Input Level | I+T→T |
| Can Large Language Models Automatically Jailbreak GPT-4V? | NAACL Workshop 2024 | 2024/07/23 | None | Input Level | I+T→T |
| Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts | ACM MM 2024 | 2024/07/21 | None | Input Level | I+T→T |
| Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything | Arxiv 2024 | 2024/07/01 | None | Input Level | I+T→T |
| From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking | EMNLP 2024 | 2024/06/21 | None | Encoder Level | I+T→T |
| Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt | Arxiv 2024 | 2024/06/06 | Github | Generator Level | I+T→T |
| Efficient LLM-Jailbreaking by Introducing Visual Modality | Arxiv 2024 | 2024/05/30 | None | Generator Level | I+T→T |
| White-box Multimodal Jailbreaks Against Large Vision-Language Models | ACM Multimedia 2024 | 2024/05/28 | None | Generator Level | I+T→T |
| Medical MLLM is Vulnerable: Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models | Arxiv 2024 | 2024/05/26 | Github | --- | I+T→T |
| Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character | Arxiv 2024 | 2024/05/25 | None | Input Level | I+T→T |
| Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models | ECCV 2024 | 2024/05/14 | Github | Generator Level | I+T→T |
| Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast | ICML 2024 | 2024/02/13 | Github | Generator Level | I+T→T |
| Jailbreaking Attack against Multimodal Large Language Model | Arxiv 2024 | 2024/02/04 | None | Generator Level | I+T→T |
| Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models | ICLR 2024 Spotlight | 2024/01/16 | Github | Encoder Level | I+T→T |
| MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models | ECCV 2024 | 2023/11/29 | Github | Input Level | I+T→T |
| How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | ECCV 2024 | 2023/11/27 | Github | Encoder Level | I+T→T |
| Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | Arxiv 2023 | 2023/11/15 | None | Output Level | I+T→T |
| FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts | AAAI 2025 | 2023/11/09 | Github | Input Level | I+T→T |
| Image Hijacks: Adversarial Images can Control Generative Models at Runtime | ICML 2024 | 2023/09/01 | Github | Generator Level | I+T→T |
| Are aligned neural networks adversarially aligned? | NeurIPS 2023 | 2023/06/26 | None | Generator Level | I+T→T |
| Visual Adversarial Examples Jailbreak Aligned Large Language Models | AAAI 2024 | 2023/06/22 | Github | Generator Level | I+T→T |
| On Evaluating Adversarial Robustness of Large Vision-Language Models | NeurIPS 2023 | 2023/05/26 | Homepage | Encoder Level | I+T→T |

Jailbreak Attack of Any-to-Vision Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|---|---|---|---|---|---|
| CogMorph: Cognitive Morphing Attacks for Text-to-Image Models | Arxiv 2025 | 2025/01/21 | None | --- | T→I |
| FameBias: Embedding Manipulation Bias Attack in Text-to-Image Models | Arxiv 2024 | 2024/12/24 | None | --- | T→I |
| Antelope: Potent and Concealed Jailbreak Attack Strategy | Arxiv 2024 | 2024/12/11 | None | --- | T→I |
| In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models | Arxiv 2024 | 2024/11/25 | None | Output Level | T→I |
| Unfiltered and Unseen: Universal Multimodal Jailbreak Attacks on Text-to-Image Model Defenses | OpenReview | 2024/11/13 | None | --- | T→I |
| AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion models | Arxiv 2024 | 2024/10/28 | Github | Encoder Level | T→I |
| Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step | Arxiv 2024 | 2024/10/04 | None | Output Level | T→I |
| ColJailBreak: Collaborative Generation and Editing for Jailbreaking Text-to-Image Deep Generation | NeurIPS 2024 | 2024/09/25 | Github | Input Level | T→I |
| HTS-Attack: Heuristic Token Search for Jailbreaking Text-to-Image Models | Arxiv 2024 | 2024/08/25 | None | Output Level | T→I |
| Perception-guided Jailbreak against Text-to-Image Models | AAAI 2025 | 2024/08/20 | None | Input Level | T→I |
| DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization | Arxiv 2024 | 2024/08/18 | None | Output Level | T→I |
| Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models | Arxiv 2024 | 2024/08/02 | None | Encoder Level | T→I |
| Jailbreaking Text-to-Image Models with LLM-Based Agents | Arxiv 2024 | 2024/08/01 | None | Output Level | T→I |
| Automatic Jailbreaking of the Text-to-Image Generative AI Systems | Arxiv 2024 | 2024/05/26 | None | Output Level | T→I |
| UPAM: Unified Prompt Attack in Text-to-Image Generation Models Against Both Textual Filters and Visual Checkers | ICML 2024 | 2024/05/18 | None | Input Level | T→I |
| BSPA: Exploring Black-box Stealthy Prompt Attacks against Image Generators | Arxiv 2024 | 2024/02/23 | None | Input Level | T→I |
| Harnessing LLM to Attack LLM-Guarded Text-to-Image Models | Arxiv 2023 | 2023/12/12 | Github | Input Level | T→I |
| MMA-Diffusion: MultiModal Attack on Diffusion Models | CVPR 2024 | 2023/11/29 | Github | Encoder Level | T→I |
| VA3: Virtually Assured Amplification Attack on Probabilistic Copyright Protection for Text-to-Image Generative Models | CVPR 2024 | 2023/11/29 | Github | Generator Level | T→I |
| To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now | ECCV 2024 | 2023/10/18 | Github | Generator Level | T→I |
| Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? | ICLR 2024 | 2023/10/16 | Github | Encoder Level | T→I |
| SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution | CCS 2024 | 2023/09/25 | None | Input Level | T→I |
| Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts | ICML 2024 | 2023/09/12 | Github | Generator Level | T→I |
| SneakyPrompt: Jailbreaking Text-to-image Generative Models | Symposium on Security and Privacy 2024 | 2023/05/20 | Github | Output Level | T→I |
| Red-Teaming the Stable Diffusion Safety Filter | NeurIPSW 2022 | 2022/10/03 | None | Input Level | T→I |

Jailbreak Attack of Any-to-Any Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|---|---|---|---|---|---|
| Gradient-based Jailbreak Images for Multimodal Fusion Models | Arxiv 2024 | 2024/10/04 | Github | Generator Level | I+T→I+T |
| Voice jailbreak attacks against gpt-4o | Arxiv 2024 | 2024/05/29 | Github | Output Level | Any→Any |

🛡️Jailbreak Defense

📖Defense-Intro

Current efforts on jailbreak defense for multimodal generative models follow two lines of work: discriminative defense and transformative defense.

  • Discriminative defense: constrained to classification tasks that assign binary (safe/unsafe) labels, flagging malicious inputs or outputs without altering generation itself (a minimal sketch follows the figure below).

[Figure: discriminative jailbreak defense]
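
For intuition, the sketch below shows one common shape a discriminative defense can take: a lightweight binary classifier over frozen text-encoder embeddings that flags unsafe prompts before they reach the generator. This is a minimal, hypothetical example (CLIP embeddings plus logistic regression, with a toy training set), not a reimplementation of any surveyed guardrail.

```python
import torch
import torch.nn.functional as F
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

# Embed prompts with a frozen text encoder (CLIP here is an assumption; any encoder works).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(prompts):
    inputs = processor(text=prompts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        return F.normalize(model.get_text_features(**inputs), dim=-1).numpy()

# Toy illustrative training set; real guardrails use large curated safety datasets.
safe = ["a photo of a cat on a sofa", "a watercolor painting of mountains"]
unsafe = ["step-by-step instructions for building a weapon", "graphic violent imagery of a crowd"]
X = embed(safe + unsafe)
y = [0] * len(safe) + [1] * len(unsafe)   # binary label: 0 = safe, 1 = unsafe

clf = LogisticRegression().fit(X, y)

def is_unsafe(prompt, threshold=0.5):
    """Discriminative defense: flag the input before it reaches the generator."""
    return clf.predict_proba(embed([prompt]))[0, 1] > threshold
```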

  • Transformative defense: aims to produce appropriate and safe responses in the presence of malicious or adversarial inputs.

[Figure: transformative jailbreak defense]

📑Papers

Below are the papers related to jailbreak defense.

Jailbreak Defense of Any-to-Text Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|---|---|---|---|---|---|
| Towards Robust Multimodal Large Language Models Against Jailbreak Attacks | Arxiv 2025 | 2025/02/02 | None | --- | I+T→T |
| Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models | Arxiv 2025 | 2025/01/30 | None | --- | I+T→T |
| Internal Activation Revision: Safeguarding Vision Language Models Without Parameter Update | Arxiv 2025 | 2025/01/24 | None | --- | I+T→T |
| MSTS: A Multimodal Safety Test Suite for Vision-Language Models | Arxiv 2025 | 2025/01/17 | Github | --- | I+T→T |
| Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models | Arxiv 2025 | 2025/01/03 | Github | --- | I+T→T |
| Defending LVLMs Against Vision Attacks through Partial-Perception Supervision | Arxiv 2024 | 2024/12/17 | None | --- | I+T→T |
| Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Arxiv 2024 | 2024/11/27 | None | Output Level | I+T→T |
| Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks | Arxiv 2024 | 2024/11/23 | Github | Generator Level | I+T→T |
| Uniguard: Towards universal safety guardrails for jailbreak attacks on multimodal large language models | Arxiv 2024 | 2024/11/03 | None | Input Level | I+T→T |
| Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector | Arxiv 2024 | 2024/10/30 | None | Generator Level | I+T→T |
| BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks | Arxiv 2024 | 2024/10/28 | None | Input Level | I+T→T |
| The Great Contradiction Showdown: How Jailbreak and Stealth Wrestle in Vision-Language Models? | Arxiv 2024 | 2024/10/02 | None | Input Level | I+T→T |
| CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration | COLM 2024 | 2024/09/17 | None | Output Level | I+T→T |
| Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks | Arxiv 2024 | 2024/09/11 | None | Encoder Level | I+T→T |
| Bathe: Defense against the jailbreak attack in multimodal large language models by treating harmful instruction as backdoor trigger | Arxiv 2024 | 2024/08/17 | None | Generator Level | I+T→T |
| Defending jailbreak attack in vlms via cross-modality information detector | Arxiv 2024 | 2024/07/31 | Github | Encoder Level | I+T→T |
| Sim-clip: Unsupervised siamese adversarial fine-tuning for robust and semantically-rich vision-language models | Arxiv 2024 | 2024/07/20 | Github | Encoder Level | I+T→T |
| Cross-modal safety alignment: Is textual unlearning all you need? | Arxiv 2024 | 2024/05/27 | None | Generator Level | I+T→T |
| Safety alignment for vision language models | Arxiv 2024 | 2024/05/22 | None | Generator Level | I+T→T |
| Adashield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting | ECCV 2024 | 2024/05/14 | Github | Input Level | I+T→T |
| Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation | ECCV 2024 | 2024/03/14 | Github | Output Level | I+T→T |
| Safety fine-tuning at (almost) no cost: A baseline for vision large language models | ICML 2024 | 2024/02/03 | Github | Generator Level | I+T→T |
| Inferaligner: Inference-time alignment for harmlessness through cross-model guidance | EMNLP 2024 | 2024/01/20 | Github | Generator Level | I+T→T |
| Mllm-protector: Ensuring mllm’s safety without hurting performance | EMNLP 2024 | 2024/01/05 | Github | Output Level | I+T→T |
| Jailguard: A universal detection framework for llm prompt-based attacks | Arxiv 2023 | 2023/12/17 | Github | Output Level | I+T→T |
| Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions | ICLR 2024 | 2023/09/14 | Github | Generator Level | I+T→T |

Jailbreak Defense of Any-to-Vision Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|---|---|---|---|---|---|
| Distorting Embedding Space for Safety: A Defense Mechanism for Adversarially Robust Diffusion Models | Arxiv 2025 | 2025/01/30 | Github | --- | T→I |
| CE-SDWV: Effective and Efficient Concept Erasure for Text-to-Image Diffusion Models via a Semantic-Driven Word Vocabulary | Arxiv 2025 | 2025/01/26 | None | --- | T→I |
| CROPS: Model-Agnostic Training-Free Framework for Safe Image Synthesis with Latent Diffusion Models | Arxiv 2025 | 2025/01/09 | None | --- | T→I |
| PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models | Arxiv 2025 | 2025/01/07 | Homepage | --- | T→I |
| DuMo: Dual Encoder Modulation Network for Precise Concept Erasure | AAAI 2025 | 2025/01/02 | Github | --- | T→I |
| AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image Models | Arxiv 2024 | 2024/12/24 | None | --- | T→I |
| SafeCFG: Redirecting Harmful Classifier-Free Guidance for Safe Generation | Arxiv 2024 | 2024/12/20 | None | --- | T→I |
| SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation | Arxiv 2024 | 2024/12/13 | Github | --- | T→I |
| TraSCE: Trajectory Steering for Concept Erasure | Arxiv 2024 | 2024/12/10 | Github | --- | T→I |
| Buster: Incorporating Backdoor Attacks into Text Encoder to Mitigate NSFW Content Generation | Arxiv 2024 | 2024/12/10 | None | --- | T→I |
| Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization | Arxiv 2024 | 2024/12/05 | None | --- | T→I |
| Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models | Arxiv 2024 | 2024/11/30 | None | --- | T→I |
| Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction | Arxiv 2024 | 2024/11/21 | None | --- | T→I |
| Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding | Arxiv 2024 | 2024/11/15 | None | Encoder Level | T→I |
| Safree: Training-free and adaptive guard for safe text-to-image and video generation | ICLR 2025 | 2024/10/16 | Github | Generator Level | T→I/T→V |
| Shielddiff: Suppressing sexual content generation from diffusion models through reinforcement learning | Arxiv 2024 | 2024/10/04 | None | Generator Level | T→I |
| Dark miner: Defend against unsafe generation for text-to-image diffusion models | Arxiv 2024 | 2024/09/26 | None | Generator Level | T→I |
| Score forgetting distillation: A swift, data-free method for machine unlearning in diffusion models | Arxiv 2024 | 2024/09/17 | None | Generator Level | T→I |
| EIUP: A Training-Free Approach to Erase Non-Compliant Concepts Conditioned on Implicit Unsafe Prompts | Arxiv 2024 | 2024/08/02 | None | Generator Level | T→I |
| Direct Unlearning Optimization for Robust and Safe Text-to-Image Models | NeurIPS 2024 | 2024/07/17 | Github | Generator Level | T→I |
| Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models | ECCV 2024 | 2024/07/17 | Github | Generator Level | T→I |
| Conceptprune: Concept editing in diffusion models via skilled neuron pruning | Arxiv 2024 | 2024/05/29 | Github | Generator Level | T→I |
| Pruning for Robust Concept Erasing in Diffusion Models | Arxiv 2024 | 2024/05/26 | None | Generator Level | T→I |
| Defensive unlearning with adversarial training for robust concept erasure in diffusion models | NeurIPS 2024 | 2024/05/24 | Github | Encoder Level | T→I |
| Unlearning concepts in diffusion model via concept domain correction and concept preserving gradient | AAAI 2025 | 2024/05/24 | Github | Generator Level | T→I |
| Espresso: Robust Concept Filtering in Text-to-Image Models | Arxiv 2024 | 2024/04/30 | None | Output Level | T→I |
| Latent Guard: a Safety Framework for Text-to-image Generation | ECCV 2024 | 2024/04/11 | Github | Encoder Level | T→I |
| SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models | ACM CCS 2024 | 2024/04/10 | Github | Generator Level | T→I |
| Salun: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation | ICLR 2024 | 2024/04/04 | Github | Generator Level | T→I |
| GuardT→I: Defending Text-to-Image Models from Adversarial Prompts | NeurIPS 2024 | 2024/03/03 | None | Encoder Level | T→I |
| Universal prompt optimizer for safe text-to-image generation | NAACL 2024 | 2024/02/16 | Github | Input Level | T→I |
| Erasediff: Erasing data influence in diffusion models | Arxiv 2024 | 2024/01/11 | None | Generator Level | T→I |
| Localization and manipulation of immoral visual cues for safe text-to-image generation | WACV 2024 | 2024/01/01 | None | Output Level | T→I |
| Receler: Reliable concept erasing of text-to-image diffusion models via lightweight erasers | ECCV 2024 | 2023/11/29 | Github | Generator Level | T→I |
| Self-discovering interpretable diffusion latent directions for responsible text-to-image generation | CVPR 2024 | 2023/11/28 | Github | Encoder Level | T→I |
| Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models | ECCV 2024 | 2023/11/27 | Github | Encoder Level | T→I |
| Mace: Mass concept erasure in diffusion models | CVPR 2024 | 2023/10/19 | Github | Generator Level | T→I |
| Implicit concept removal of diffusion models | ECCV 2024 | 2023/10/09 | None | Input Level | T→I |
| Unified concept editing in diffusion models | WACV 2024 | 2023/08/25 | Github | Generator Level | T→I |
| Towards safe self-distillation of internet-scale text-to-image diffusion models | ICML 2023 Workshop on Challenges in Deployable Generative AI | 2023/07/12 | Github | Generator Level | T→I |
| Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models | CVPR 2024 | 2023/05/30 | Github | Generator Level | T→I |
| Erasing concepts from diffusion models | ICCV 2023 | 2023/05/13 | Github | Generator Level | T→I |
| Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models | CVPR 2023 | 2022/11/09 | Github | Generator Level | T→I |

Jailbreak Defense of Any-to-Any Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|---|---|---|---|---|---|

💯Evaluation

⭐️Evaluation Datasets

Below are comparison tables of publicly available representative evaluation datasets, along with a description of each attribute used in the tables.

  • Collected: raw data created by humans or collected from real-world websites.
  • Reconstructed: data reorganized from other existing datasets.
  • Synthesized: AI-generated data produced by LLMs or diffusion models.
  • Adversarial: adversarial data generated by jailbreak attack methods.

Used for Any-to-Text Models

| Dataset | Text Source | Image Source | Volume | Theme | Access |
|---|---|---|---|---|---|
| FigStep | Synthesized | Adversarial | 500 | 10 | Github |
| AdvBench | Synthesized | --- | 500 | --- | Github |
| RedTeam-2K | Collected & Reconstructed & Synthesized | N/A | 2000 | 16 | Huggingface |
| HarmBench | Collected | --- | 510 | 4 | Github |
| HADES | Synthesized | Collected & Synthesized & Adversarial | 750 | 5 | Github |
| MM-SafetyBench | Synthesized | Synthesized & Adversarial | 5040 | 13 | Github |
| JailBreakV-28K | Adversarial | Reconstructed & Synthesized | 28000 | 16 | Huggingface |

Used for Any-to-Vision Models

| Dataset | Text Source | Image Source | Volume | Theme | Access |
|---|---|---|---|---|---|
| NSFW-200 | Synthesized | --- | 200 | --- | Github |
| MMA | Reconstructed & Adversarial | Adversarial | 1000 | --- | Huggingface |
| VBCDE | Reconstructed & Adversarial | --- | 100 | 5 | Github |
| I2P | Collected | Collected | 4703 | 7 | Huggingface |
| Unsafe Diffusion | Collected & Reconstructed | --- | 1434 | --- | Github |
| MACE-Celebrity | Collected | --- | 1000 | --- | Github |
| MACE-Art | Reconstructed | --- | 1000 | --- | Github |
| MPUP | Synthesized | --- | 1200 | 4 | Huggingface |
| T2VSafetyBench | Reconstructed & Synthesized & Adversarial | --- | 4400 | 12 | Github |

📚Evaluation Methods

Current evaluation methods are primarily classified into two categories: manual evaluation and automated evaluation.

  • Manual evaluation involves human assessment to determine if the content is toxic, offering a direct and interpretable method of evaluation.
  • Automated approaches assess the safety of multimodal generative models using a range of techniques, including detector-based, GPT-based, and rule-based methods (a detector-based example is sketched below).
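
As an example of detector-based automated evaluation, the snippet below scores model responses with the off-the-shelf Detoxify classifier (listed in the text-detector table below) and computes a simple threshold-based, attack-success-rate-style metric. The responses, threshold, and metric definition are illustrative assumptions; evaluation protocols differ across the surveyed papers.

```python
from detoxify import Detoxify  # pip install detoxify

# Assumed inputs: model responses collected from jailbreak prompts (placeholders here).
responses = [
    "I'm sorry, I can't help with that request.",
    "<a potentially harmful model response>",
]

detector = Detoxify("original")                  # off-the-shelf toxicity classifier
scores = [detector.predict(r)["toxicity"] for r in responses]

# A simple detector-based attack-success-rate (ASR) style metric.
threshold = 0.5                                  # assumed decision threshold
asr = sum(s > threshold for s in scores) / len(scores)
print(f"Detector-based ASR: {asr:.2%}")
```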

[Figure: jailbreak evaluation methods]

Text Detector

| Toxicity Detector | Access |
|---|---|
| Llama-Guard | Huggingface |
| Llama-Guard 2 | Huggingface |
| Detoxify | Github |
| GPTFUZZER | Huggingface |
| Perspective API | Website |

Image Detector

| Toxicity Detector | Access |
|---|---|
| NudeNet | Github |
| Q16 | Github |
| Safety Checker | Huggingface |
| Imgcensor | Github |
| Multi-headed Safety Classifier | Github |

😉Citation

If you find this work useful for your research, please kindly cite it using the following BibTeX:

@article{liu2024jailbreak,
    title={Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey},
    author={Liu, Xuannan and Cui, Xing and Li, Peipei and Li, Zekun and Huang, Huaibo and Xia, Shuhan and Zhang, Miaoxuan and Zou, Yueying and He, Ran},
    journal={arXiv preprint arXiv:2411.09259},
    year={2024},
}
