This is a ChatGPT-4 English adaptation of the original document by kohya-ss (train_ti_README-ja.md)
This document explains how to train Textual Inversion (https://textual-inversion.github.io/).
Please also refer to the common training documentation.
The implementation was greatly inspired by https://github.com/huggingface/diffusers/tree/main/examples/textual_inversion.
The trained model can be used directly in the Web UI.
Please refer to this repository's README beforehand and set up the environment.
Refer to Preparing Training Data for more information.
Use train_textual_inversion.py. The following is an example command line (DreamBooth method).
accelerate launch --num_cpu_threads_per_process 1 train_textual_inversion.py \
    --pretrained_model_name_or_path=<.ckpt, .safetensors, or Diffusers model directory> \
    --dataset_config=<.toml file created during data preparation> \
    --output_dir=<output folder for the trained model> \
    --output_name=<file name for the trained model output without extension> \
    --save_model_as=safetensors \
    --prior_loss_weight=1.0 \
    --max_train_steps=1600 \
    --learning_rate=1e-6 \
    --optimizer_type="AdamW8bit" \
    --xformers \
    --mixed_precision="fp16" \
    --cache_latents \
    --gradient_checkpointing \
    --token_string=mychar4 --init_word=cute --num_vectors_per_token=4
Specify the token string used during training with --token_string. Make sure your training prompt includes this string (e.g., if the token_string is mychar4, use "mychar4 1girl"). This part of the prompt is replaced with the new Textual Inversion tokens, and those tokens are what gets trained. For DreamBooth and class+identifier-style datasets, it is easiest and most reliable to use the token string as the identifier.
You can check whether the token string is included in the prompt by using --debug_dataset. The token ids after replacement are displayed, so you can check whether ids of 49408 or higher are present, as shown below.
input ids: tensor([[49406, 49408, 49409, 49410, 49411, 49412, 49413, 49414, 49415, 49407,
49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
49407, 49407, 49407, 49407, 49407, 49407, 49407]])
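The same check can be scripted. The following is a minimal sketch, assuming the ids have been copied from the --debug_dataset output into a plain Python list. The helper name count_new_token_ids is hypothetical and not part of the training scripts; the threshold 49408 is the one given in the text above.

```python
# Hypothetical helper for illustration; not part of the training scripts.
BASE_VOCAB_SIZE = 49408  # ids at or above this value are newly added tokens


def count_new_token_ids(input_ids, base_vocab_size=BASE_VOCAB_SIZE):
    """Return how many ids in one tokenized prompt belong to newly added tokens."""
    return sum(1 for token_id in input_ids if token_id >= base_vocab_size)


# First ten ids from the sample output above (the remainder is 49407 padding).
sample = [49406, 49408, 49409, 49410, 49411, 49412, 49413, 49414, 49415, 49407]

new_tokens = count_new_token_ids(sample)
if new_tokens == 0:
    print("The token string was not found in the prompt.")
else:
    print(f"Found {new_tokens} Textual Inversion token ids.")
```

If the count is zero, the prompt most likely does not contain the string given to --token_string.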
You cannot use a word that is already in the tokenizer's vocabulary (i.e., a common word) as the token string.
Specify the source token string used to initialize the embeddings with --init_word. It is better to choose something close to the concept you want to learn. You cannot specify a string that consists of two or more tokens.
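The two restrictions above (the token string must be new to the tokenizer, and the init word must be a single token) can be checked before starting a run. The following is only a sketch under stated assumptions: ToyTokenizer and check_token_args are hypothetical names, and the toy class merely mimics the add_tokens / encode(..., add_special_tokens=False) call shape of Hugging Face transformers tokenizers. A real subword tokenizer splits text differently, so treat this as an illustration of the rule rather than of the script's actual checks.

```python
class ToyTokenizer:
    """Whitespace tokenizer standing in for a real CLIP tokenizer (illustration only)."""

    def __init__(self, words):
        self.vocab = {word: idx for idx, word in enumerate(words)}

    def add_tokens(self, new_tokens):
        # Returns how many tokens were actually added (already-known ones are skipped).
        added = 0
        for token in new_tokens:
            if token not in self.vocab:
                self.vocab[token] = len(self.vocab)
                added += 1
        return added

    def encode(self, text, add_special_tokens=False):
        ids = []
        for word in text.split():
            if word in self.vocab:
                ids.append(self.vocab[word])
            else:
                # Unknown words break into one id per character,
                # a crude stand-in for subword splitting.
                ids.extend(10_000 + ord(ch) for ch in word)
        return ids


def check_token_args(tokenizer, token_string, init_word):
    """Return a list of problems with token_string / init_word (empty if none)."""
    problems = []
    if len(tokenizer.encode(init_word, add_special_tokens=False)) != 1:
        problems.append(f"init_word {init_word!r} is not a single token")
    if tokenizer.add_tokens([token_string]) != 1:
        problems.append(f"token_string {token_string!r} is already in the vocabulary")
    return problems


tokenizer = ToyTokenizer(["cute", "girl", "photo"])
print(check_token_args(tokenizer, "mychar4", "cute"))   # valid combination
print(check_token_args(tokenizer, "girl", "cutegirl"))  # both arguments rejected
```

Note that add_tokens modifies the tokenizer, so the check should run on a throwaway instance.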
Specify how many tokens to use in this training with --num_vectors_per_token. The more tokens you use, the more expressive the result will be, but the more of the prompt's token budget is consumed. For example, if num_vectors_per_token=8, the specified token string will consume 8 tokens (out of the general prompt's 77-token limit).
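The token cost can be worked out with simple arithmetic. The sketch below assumes that 2 of the 77 positions are taken by the begin-of-text and end-of-text markers, leaving 75 for the prompt itself; the function name remaining_prompt_tokens and its occurrences parameter are hypothetical.

```python
MAX_TOKEN_LENGTH = 77  # prompt limit quoted in the text
SPECIAL_TOKENS = 2     # assumption: begin-of-text and end-of-text markers


def remaining_prompt_tokens(num_vectors_per_token, occurrences=1):
    """Tokens left for the rest of the prompt after the token string is expanded."""
    usable = MAX_TOKEN_LENGTH - SPECIAL_TOKENS
    used = num_vectors_per_token * occurrences
    if used > usable:
        raise ValueError("the token string alone exceeds the prompt limit")
    return usable - used


for n in (1, 4, 8):
    print(f"num_vectors_per_token={n}: {remaining_prompt_tokens(n)} tokens left")
```

With num_vectors_per_token=8 and the token string used once, 67 positions remain for the rest of the prompt under these assumptions.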
These are the main options for Textual Inversion. The rest is similar to other training scripts.
Usually, it is better to specify 1 for num_cpu_threads_per_process.
Specify the base model for additional training with pretrained_model_name_or_path. You can specify a Stable Diffusion checkpoint file (.ckpt or .safetensors), a Diffusers model directory on your local disk, or a Diffusers model ID (e.g., "stabilityai/stable-diffusion-2").
Specify the folder in which to save the trained model with output_dir. Specify the model's filename without the extension with output_name. Use save_model_as to save the model in safetensors format.
Specify the .toml file in dataset_config. Set the batch size in the file to 1 initially to keep memory consumption low.
Set the number of training steps with max_train_steps (1600 in the example above). Set the learning rate with learning_rate (1e-6 in the example above).
To save memory, specify mixed_precision="fp16" (for RTX 30 series and later, you can also specify bf16; match the setting you made in accelerate when setting up the environment). Also, specify gradient_checkpointing.
To use the 8-bit AdamW optimizer, which consumes less memory, specify optimizer_type="AdamW8bit".
Specify the xformers option to use xformers' CrossAttention. If you have not installed xformers or if it causes errors (depending on the environment, such as when mixed_precision="no"), you can alternatively specify the mem_eff_attn option to use the memory-efficient CrossAttention (although it will be slower).
If you have enough memory, edit the .toml file to increase the batch size to, for example, 8 (this may speed up training and potentially improve accuracy).
Please refer to the documentation on options in the following cases:
- Training a Stable Diffusion 2.x or derived model
- Training a model with a clip skip of 2 or more
- Training with captions exceeding 75 tokens
Compared to DreamBooth and fine-tuning, which train the entire model, Textual Inversion uses less memory, so you can set a larger batch size.
Please refer to another document for all options.
--weights
- Load pre-trained embeddings before training and continue training from them.
--use_object_template
- Train with a default object template string (e.g., "a photo of a {}") instead of captions. This matches the official implementation. Captions will be ignored.
--use_style_template
- Train with a default style template string (e.g., "a painting in the style of {}") instead of captions. This matches the official implementation. Captions will be ignored.
Specify the trained embeddings file with the --textual_inversion_embeddings option in gen_img_diffusers.py (multiple files allowed). Use the filename (without the extension) of the embeddings file in the prompt, and the embeddings will be applied.
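As a small illustration of the filename rule, the sketch below derives the keyword to put in the prompt from an embeddings file path. The path output/mychar4.safetensors and the helper names embedding_keyword and build_prompt are hypothetical, chosen only for this example.

```python
from pathlib import Path


def embedding_keyword(embeddings_file):
    """Keyword that triggers the embeddings: the filename without its extension."""
    return Path(embeddings_file).stem


def build_prompt(embeddings_file, rest_of_prompt):
    """Put the embeddings keyword in front of the rest of the prompt."""
    return f"{embedding_keyword(embeddings_file)} {rest_of_prompt}".strip()


# Hypothetical output file from a training run with --output_name=mychar4.
print(build_prompt("output/mychar4.safetensors", "1girl"))
```

Renaming the embeddings file therefore changes the word you need to write in the prompt.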