This is the official repository of SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation (arXiv).
The code and checkpoints will be released soon.
🔥🔥🔥 The checkpoint Safe-StableDiffusionV2.1 has been released on Hugging Face! Feel free to download it!
🔥🔥🔥 The checkpoint Safe-StableDiffusionXL has been released on Hugging Face! Feel free to download it!
🔥🔥🔥 The checkpoint Safe-StableDiffusionV1.5 has been released on Hugging Face! Feel free to download it! The testing and inference code have also been released.
🔥🔥🔥 The dataset CoProV2 for Stable Diffusion v1.5 has been released!
Runtao Liu1*, I Chieh Chen1*, Jindong Gu2, Jipeng Zhang1, Renjie Pi1,
Qifeng Chen1, Philip Torr2, Ashkan Khakzar2, Fabio Pizzati2,3
1Hong Kong University of Science and Technology, 2University of Oxford
3MBZUAI
* Equal Contribution
Safety alignment for T2I. T2I models released without safety alignment risk being misused (top). We propose SafetyDPO, a scalable safety alignment framework for T2I models that supports the mass removal of harmful concepts (middle). We enable scalability by training safety experts focused on separate categories such as “Hate”, “Sexual”, “Violence”, etc. We then merge the experts with a novel strategy. By doing so, we obtain safety-aligned models that mitigate unsafe content generation (bottom).
@article{liu2024safetydpo,
title={SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation},
author={Liu, Runtao and Chieh, Chen I and Gu, Jindong and Zhang, Jipeng and Pi, Renjie and Chen, Qifeng and Torr, Philip and Khakzar, Ashkan and Pizzati, Fabio},
journal={arXiv preprint arXiv:2412.10493},
year={2024}
}
[2025/01]: 🔥🔥🔥 The checkpoint Safe-StableDiffusionV2.1 has been released on Hugging Face!
[2025/01]: 🔥🔥🔥 The checkpoint Safe-StableDiffusionXL has been released on Hugging Face!
[2025/01]: 🔥🔥🔥 The checkpoint Safe-StableDiffusionV1.5 has been released on Hugging Face! The testing and inference code have also been released.
[2025/01]: 🔥🔥🔥 The dataset CoProV2 (for SD v1.5) has been released.
[2024/12]: The arXiv preprint has been released.
Our dataset CoProV2 for Stable Diffusion v1.5 has been released here.
Please download the dataset from the link and unzip it into the `datasets` folder. The category of each prompt is included in `data/CoProv2_train.csv`.
To set up the conda environment, run the following command:
conda env create -f environment.yaml
After installation, activate the environment with:
conda activate SafetyDPO
To run inference, execute the following command:
python inference.py --model_path MODEL_PATH --prompts_path PROMPT_FILE --save_path SAVE_PATH
- `--model_path`: Specifies the path to the trained model.
- `--prompts_path`: Specifies the path to the CSV prompt file used for image generation. Please make sure the CSV file contains the columns `prompt` and `image` (see the example below).
- `--save_path`: Specifies the folder path where the generated images are saved.
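For reference, a minimal prompt file could be built as follows. This is only a sketch with hypothetical file names; the only requirement from `inference.py` is that the CSV contains the `prompt` and `image` columns.

```python
import pandas as pd

# Hypothetical example of a prompt file for inference.py.
# "prompt" is the text to generate from; "image" is the file name used for the output image.
prompts = pd.DataFrame(
    {
        "prompt": [
            "a watercolor painting of a quiet harbor at dawn",
            "a cozy reading nook with warm lighting",
        ],
        "image": ["000001.png", "000002.png"],
    }
)
prompts.to_csv("data/my_prompts.csv", index=False)
# Then, for example:
#   python inference.py --model_path MODEL_PATH --prompts_path data/my_prompts.csv --save_path outputs/
```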
To run testing, execute the following command:
python test.py --metrics METRIC --target_folder TARGET_FOLDER --reference REFERENCE_FOLDER_OR_FILE --device DEVICE
- `--metrics`: Specifies the metric to be evaluated; we support `IP`, `FID`, and `CLIP`.
- `--target_folder`: Specifies the folder that contains the images to be evaluated.
- `--reference`: Specifies the reference folder or file used for evaluation. To evaluate `IP`, please provide the `inappropriate_images.csv` file generated by Q16. To evaluate `FID`, please provide the path to the reference images. To evaluate `CLIP`, please provide the path to the CSV file containing the columns `image` and `prompt`, e.g. `data/CoProv2_test.csv`.
- `--device`: Specifies the GPU to use; defaults to `cuda:0`.
Step 1. Please follow Q16 to generate its results into a designated path `Q16_PATH`.
Important
For `./main/clip_classifier/classify/inference_images.py` of Q16, please make the following modifications or you may encounter errors (see the sketch after this list):

- Please set `only_inappropriate` to `False` in line 19.
- Please specify your GPUs in the format `gpu=[0]` in line 21.
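For reference, the two edits correspond to something like the following in Q16's script (line numbers refer to that file):

```python
# In ./main/clip_classifier/classify/inference_images.py of the Q16 repository:
only_inappropriate = False  # line 19: keep results for all images, not only the inappropriate ones
gpu = [0]                   # line 21: list the GPU indices you want Q16 to use
```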
Step 2. Run the following command with your designated `IMAGE_PATH` and `Q16_PATH`.
python test.py \
--metrics 'inpro' \
--target_folder IMAGE_PATH \
--reference /Q16_PATH/inappropriate/Clip_ViT-L/sim_prompt_tuneddata/inappropriate_images.csv \
--device 'cuda:0'
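Intuitively, IP is the fraction of generated images that Q16 flags as inappropriate. The sketch below illustrates that computation; it assumes a hypothetical layout where the first column of `inappropriate_images.csv` holds the flagged image paths, and it is not the exact logic of `test.py`.

```python
import os
import pandas as pd

# Illustrative IP computation (assumed CSV layout, not the repository's exact code):
# IP = (# images flagged as inappropriate by Q16) / (# generated images evaluated).
flagged = pd.read_csv("Q16_PATH/inappropriate/Clip_ViT-L/sim_prompt_tuneddata/inappropriate_images.csv")
flagged_names = {os.path.basename(str(p)) for p in flagged.iloc[:, 0]}

generated = [f for f in os.listdir("IMAGE_PATH") if f.lower().endswith((".png", ".jpg", ".jpeg"))]
ip = sum(name in flagged_names for name in generated) / max(len(generated), 1)
print(f"Inappropriate Probability (IP): {ip:.3f}")
```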
Step 1. Run the following command with your designated `IMAGE_PATH` and `REFERENCE_IMAGE_PATH`.
python test.py \
--metrics 'fid' \
--target_folder IMAGE_PATH \
--reference REFERENCE_IMAGE_PATH \
--device 'cuda:0'
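FID compares the Inception feature statistics of the generated images with those of the reference images. The following is only an illustration of that computation with `torchmetrics`, not necessarily how `test.py` implements it.

```python
import os
import torch
from PIL import Image
from torchvision import transforms
from torchmetrics.image.fid import FrechetInceptionDistance

# Illustrative FID between two image folders (uint8 tensors, as torchmetrics expects by default).
to_tensor = transforms.Compose([transforms.Resize((299, 299)), transforms.PILToTensor()])

def load_images(folder):
    files = [f for f in os.listdir(folder) if f.lower().endswith((".png", ".jpg", ".jpeg"))]
    return torch.stack([to_tensor(Image.open(os.path.join(folder, f)).convert("RGB")) for f in files])

fid = FrechetInceptionDistance(feature=2048)
fid.update(load_images("REFERENCE_IMAGE_PATH"), real=True)
fid.update(load_images("IMAGE_PATH"), real=False)
print(f"FID: {fid.compute().item():.2f}")
```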
Step 1. Run the following command with your designated `IMAGE_PATH` and `PROMPT_PATH`.
Note
`PROMPT_PATH` should be a CSV file containing the columns `image` and `prompt`.
python test.py \
--metrics 'clip' \
--target_folder IMAGE_PATH \
--reference PROMPT_PATH \
--device 'cuda:0'
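CLIPScore measures image-text alignment as the cosine similarity between CLIP image and text embeddings. The sketch below uses Hugging Face `transformers` for illustration; the CLIP checkpoint and any scaling factor may differ from what `test.py` uses.

```python
import os
import pandas as pd
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative CLIPScore: average cosine similarity between each image and its prompt.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

df = pd.read_csv("PROMPT_PATH")  # must contain "image" and "prompt" columns
scores = []
for _, row in df.iterrows():
    image = Image.open(os.path.join("IMAGE_PATH", row["image"])).convert("RGB")
    inputs = processor(text=[row["prompt"]], images=image, return_tensors="pt",
                       padding=True, truncation=True).to(device)
    with torch.no_grad():
        out = model(**inputs)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    scores.append((img_emb * txt_emb).sum().item())
print(f"Average CLIP image-text similarity: {sum(scores) / len(scores):.4f}")
```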
Text-to-image (T2I) models have become widespread, but their limited safety guardrails expose end users to harmful content and potentially allow for model misuse. Current safety measures are typically limited to text-based filtering or concept removal strategies, able to remove just a few concepts from the model's generative capabilities. In this work, we introduce SafetyDPO, a method for safety alignment of T2I models through Direct Preference Optimization (DPO). We enable the application of DPO for safety purposes in T2I models by synthetically generating a dataset of harmful and safe image-text pairs, which we call CoProV2. Using a custom DPO strategy and this dataset, we train safety experts, in the form of low-rank adaptation (LoRA) matrices, able to guide the generation process away from specific safety-related concepts. Then, we merge the experts into a single LoRA using a novel merging strategy for optimal scaling performance. This expert-based approach enables scalability, allowing us to remove 7 times more harmful concepts from T2I models compared to baselines. SafetyDPO consistently outperforms the state-of-the-art on many benchmarks and establishes new practices for safety alignment in T2I networks.
For each unsafe concept in different categories, we generate corresponding prompts using an LLM. We generate paired safe prompts using an LLM, minimizing semantic differences. Then, we use the T2I model we intend to align to generate corresponding images for both prompts.
Expert Training and Merging. First, we use the previously generated prompts and images to train LoRA experts on specific safety categories (left), exploiting our DPO-based losses. Then, we merge all the safety experts with Co-Merge (right). This allows us to achieve general safety experts that produce safe outputs for a generic unsafe input prompt in any category.
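For background, the experts are trained with a DPO-style preference objective. In its standard form (which the paper adapts to diffusion models), DPO minimizes

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],$$

where, roughly, $x$ is the prompt, $y_w$ and $y_l$ are the preferred (safe) and dispreferred (harmful) images, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ controls the strength of the preference. Please refer to the paper for the exact losses used to train the LoRA experts.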
Merging Experts with Co-Merge. (Left) Assuming LoRA experts with the same architecture, we analyze which expert has the highest activation for each weight across all inputs. (Right) Then, we obtain the merged weights from multiple experts by merging only the most active weights per expert.
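To make the Co-Merge intuition concrete, here is a minimal sketch. It assumes each expert provides a same-shaped weight tensor (e.g., a LoRA delta) together with a per-element activation score; the function and variable names are hypothetical, and the paper's actual implementation may differ.

```python
import torch

def co_merge(expert_weights, expert_activations):
    """Merge same-shaped expert weights by keeping, for each element,
    the weight of the expert that is most active there (illustrative sketch)."""
    weights = torch.stack(expert_weights)          # (num_experts, ...)
    activations = torch.stack(expert_activations)  # (num_experts, ...), measured over inputs
    winner = activations.abs().argmax(dim=0, keepdim=True)  # most active expert per element
    return torch.gather(weights, 0, winner).squeeze(0)

# Toy usage with two hypothetical experts:
merged = co_merge(
    [torch.randn(4, 4), torch.randn(4, 4)],   # expert weights
    [torch.rand(4, 4), torch.rand(4, 4)],     # per-element activation scores
)
```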
Datasets Comparison. Our LLM-generated dataset, CoProV2, achieves comparable Inappropriate Probability (IP) to human-crafted datasets (UD [44], I2P [51]) and offers a similar scale to CoPro [33]. COCO [32], exhibiting a low IP, is used as a benchmark for image generation with safe prompts as input.
Benchmark. SafetyDPO achieves the best performance in both generated image alignment (IP) and image quality (FID, CLIPScore) with two T2I models and against 3 methods for SD v1.5. Note that we use CoProV2 only for training; hence, I2P and UD are out-of-distribution. Yet, SafetyDPO achieves robust safety alignment.
Best results are bold, and second-best results are underlined.
Qualitative Comparison. Compared to non-aligned baseline models, SafetyDPO allows the synthesis of safe images for unsafe input prompts. Please note the layout similarity between the unsafe and safe outputs: thanks to our training, only the harmful image traits are removed from the generated images. Concepts are shown in ⟨brackets⟩. Prompts are shortened; for full ones, see the supplementary material.
Effectiveness of Merging. When training a single safety expert across all data (All-single), IP performance is lower than or comparable to that of the single per-category experts (previous rows). Instead, by merging safety experts (All-ours), we considerably improve results.
Resistance to Adversarial Attacks. We evaluate the performance of SafetyDPO and the best baseline, ESD-u, in terms of IP using 4 adversarial attack methods. For a wide range of attacks, we are able to outperform the baselines, advocating for the effectiveness of our scalable concept removal strategy.
Ablation Studies. We check the effects of alternative strategies for DPO, proving that our approach is the best (a). Co-Merge is also the best merging strategy compared to baselines (b). Finally, we verify that scaling data improves our performance (c).