
Synthetic Customization Data (SynCD)

*This is a reimplementation of the paper in the diffusers framework, carried out after the end of the internship.


We propose a pipeline for generating synthetic training data that consists of multiple images of the same object under different lighting, poses, and backgrounds, using either explicit 3D object assets or, more implicitly, masked shared attention across the different views. Given this training data, we train a new encoder-based model for customization/personalization. During inference, our method can successfully generate new compositions of a reference object from text prompts.

Generating Multi-Image Synthetic Data for Text-to-Image Customization
(arXiv 2025)
Nupur Kumari, Xi Yin, Jun-Yan Zhu, Ishan Misra, Samaneh Azadi

NEWS!!

  • Demo based on FLUX.1-dev model fine-tuning. Training code will be released soon.

Synthetic Customization Dataset (SynCD)

[Video: dataset_overview1.mp4 (dataset overview)]

Note: Our dataset is available to download here
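Once downloaded, the dataset can be loaded with standard tooling. Below is a minimal sketch using the Hugging Face datasets library; the repo id and field names are placeholders, not the actual dataset layout — substitute the location from the download link above.

from datasets import load_dataset

# Placeholder repo id: replace with the actual dataset location.
ds = load_dataset("ORG_NAME/syncd", split="train")
print(ds[0].keys())  # field names (images, prompts, masks, ...) are assumptions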

Results

Qualitative Comparison

  • With a single reference image as input:

  • With three reference images as input:

SynCD Overview

Our dataset generation pipeline is tailored to (a) deformable object categories, where we use descriptive prompts and Masked Shared Attention (MSA) among the foreground object regions of the images to promote visual consistency, and (b) rigid object categories, where we additionally employ depth conditioning and cross-view warping with existing Objaverse assets to ensure 3D multiview consistency. We then use DINOv2 and aesthetic scores to filter out low-quality images and form the final training dataset.
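As a rough illustration of the filtering step (not the repository's actual script), the sketch below scores a group of generated views by the mean pairwise cosine similarity of their DINOv2 embeddings and keeps the group only above a threshold; the model id is real, but the threshold and the choice of the CLS token are assumptions.

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()

@torch.no_grad()
def group_similarity(image_paths):
    """Mean pairwise cosine similarity of DINOv2 CLS embeddings for a group of views."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model(**inputs).last_hidden_state[:, 0]        # one CLS embedding per image
    feats = torch.nn.functional.normalize(feats, dim=-1)
    sim = feats @ feats.T                                  # (N, N) cosine-similarity matrix
    n = sim.shape[0]
    return (sim.sum() - n) / (n * (n - 1))                 # mean of off-diagonal entries

# Keep a group only if the views agree on object identity (threshold is an assumption).
if group_similarity(["view0.png", "view1.png", "view2.png"]) > 0.6:
    print("keep group")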

Model Overview

We finetune a pre-trained IP-Adapter-based model (global feature injection) on our generated dataset (SynCD). During training, we additionally employ Masked Shared Attention (MSA) between the target and reference image features (fine-grained feature injection). This helps the model incorporate fine-grained features from multiple reference images during inference.
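As a rough sketch of Masked Shared Attention (not the exact implementation in this repo), the target queries below attend jointly to their own keys/values and to the reference image's keys/values, with a boolean mask restricting reference attention to foreground-object tokens; tensor shapes and argument names are assumptions.

import torch
import torch.nn.functional as F

def masked_shared_attention(q_tgt, k_tgt, v_tgt, k_ref, v_ref, ref_fg_mask):
    """
    q_tgt, k_tgt, v_tgt: (B, H, Nt, D) target queries/keys/values
    k_ref, v_ref:        (B, H, Nr, D) reference keys/values
    ref_fg_mask:         (B, Nr) bool, True where a reference token is foreground
    """
    k = torch.cat([k_tgt, k_ref], dim=2)    # target keys followed by reference keys
    v = torch.cat([v_tgt, v_ref], dim=2)

    B, _, Nt, _ = q_tgt.shape
    # Full attention over target tokens; masked attention over reference tokens.
    allow = torch.cat(
        [torch.ones(B, Nt, dtype=torch.bool, device=q_tgt.device), ref_fg_mask],
        dim=1,
    )
    attn_mask = allow[:, None, None, :]     # broadcast over heads and query positions
    return F.scaled_dot_product_attention(q_tgt, k, v, attn_mask=attn_mask)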

Getting Started

git clone https://github.com/nupurkmr9/syncd.git
cd syncd
conda create -n syncd python=3.10
conda activate syncd
pip3 install torch torchvision torchaudio  # or an appropriate torch>2.0 build from https://pytorch.org/get-started/locally/
pip install -r assets/requirements.txt
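Optionally, verify the install with a quick check (not part of the repo; it assumes diffusers is pulled in by the requirements file):

python -c "import torch, diffusers; print(torch.__version__, diffusers.__version__, torch.cuda.is_available())"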

Dataset Generation: Please refer here for the dataset generation code.

Model Training: Please refer here for the dataset filtering and SDXL model training code.

Todo:

  • Release the synthetic dataset (SynCD): Available to download here.
  • SDXL fine-tuning with deepspeed.
  • Flux fine-tuning on our dataset.

Acknowledgements

We are grateful to the following works for their code, data, and models; our code is built upon them.

BibTeX

@article{kumari2025syncd,
  title={Generating Multi-Image Synthetic Data for Text-to-Image Customization},
  author={Kumari, Nupur and Yin, Xi and Zhu, Jun-Yan and Misra, Ishan and Azadi, Samaneh},
  journal={arXiv},
  year={2025}
}
