Thanks for your awesome work, first of all. Recently I studied the code and found that you use multiple LabelEmbedder instances in the U-DiT model. However, I am not sure this approach is right, because for classifier-free guidance each LabelEmbedder applies its own class_dropout_prob independently. Since three LabelEmbedders are invoked in one forward pass, the probability that all of them drop the label is only 0.1**3 = 0.001, which means the label is very likely to leak into the model at some latent resolution even for nominally "unconditional" samples. I'm afraid this will damage conditional generation quality. In fact, I tried the U-DiT-L model at 1000k steps and found the visual quality at cfg=1.5 seems worse than DiT-XL/2 at 7M steps (I haven't measured FID because generating 50k samples requires a lot of compute).
Do you have any idea about this? Thanks for your attention!
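To make the probability argument concrete, here is a minimal sketch (plain PyTorch, not U-DiT's actual code; `p` and `n_embedders` are illustrative names matching the numbers above):

```python
import torch

# Each LabelEmbedder drops the class label independently with p = 0.1.
# With three independent embedders, a training sample is fully
# unconditional only when ALL three drop the label: p**3 = 0.001.
p = 0.1
n_embedders = 3
p_fully_unconditional = p ** n_embedders
print(f"P(all drop) = {p_fully_unconditional:.4f}")  # -> 0.0010

# Monte Carlo check: fraction of samples where every embedder dropped.
batch = 1_000_000
drop_masks = torch.rand(n_embedders, batch) < p  # independent drop decisions
print(drop_masks.all(dim=0).float().mean().item())  # ~0.001
```

So in roughly 99.9% of the samples that were "meant" to be unconditional at some stage, at least one resolution still receives the true label.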
Thanks for your comments! I have inspected the code and I completely agree with your opinion. The inconsistency of label embedding across resolutions could hamper the performance of conditional generation. I will fix this architectural bug and report corrected numbers as soon as possible.
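For reference, one possible fix (a sketch, not necessarily the authors' eventual solution) is to sample the CFG drop decision once per batch and share it across all embedders. This assumes the embedders follow the original DiT LabelEmbedder interface, `forward(labels, train, force_drop_ids=None)`, where `force_drop_ids == 1` forces a drop; U-DiT's actual signature may differ:

```python
import torch

def shared_label_drop(y, embedders, dropout_prob, training):
    """Sample the label-drop mask once and reuse it for every embedder,
    so each sample is conditional or unconditional at ALL resolutions."""
    if training and dropout_prob > 0:
        # 1 = drop the label, 0 = keep it (DiT's force_drop_ids convention)
        force_drop_ids = (torch.rand(y.shape[0], device=y.device)
                          < dropout_prob).long()
    else:
        force_drop_ids = None
    # Every embedder sees the same drop decision instead of re-rolling it.
    return [emb(y, train=training, force_drop_ids=force_drop_ids)
            for emb in embedders]
```

With this change, the fraction of fully unconditional samples returns to `dropout_prob` (0.1) rather than `dropout_prob ** 3`.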
Thank you! I'm looking forward to your update. We've recently been trying to build more efficient diffusion backbones, and your research has inspired us a lot!