Nice work! And several questions about the paper #3
Hi @StarCycle, thanks for your interest in our work! I am happy to hear that you like it.
I only use the current state image observation as input to the model and predict the next 10 actions, which I always roll out to the end before the model gets the next state and predicts 10 actions again. This makes the model train several times faster than GR-1 or any other competitive policy on CALVIN, since a longer history is expensive and my model is small. Similar work such as Octo uses a history of 2 and reported that a longer history did not improve performance. I tried 2 as well but did not notice any benefit on CALVIN.
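Roughly, the rollout scheme looks like this (a minimal sketch with placeholder names such as `policy.predict` and a simplified env interface, not the actual repo API):

```python
import torch

CHUNK_SIZE = 10  # number of future actions predicted per forward pass

@torch.no_grad()
def rollout_open_loop(policy, env, lang_goal, max_steps=360):
    """Encode only the current image observation, predict a chunk of
    CHUNK_SIZE actions, execute the whole chunk, then re-plan."""
    obs = env.reset()
    for _ in range(max_steps // CHUNK_SIZE):
        # single-frame conditioning: no observation history is kept
        action_chunk = policy.predict(obs["rgb_static"], lang_goal)  # (CHUNK_SIZE, act_dim)
        for action in action_chunk:
            obs, done = env.step(action)  # simplified env interface for illustration
            if done:
                return True
    return False
```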
I tried it on CALVIN D and the performance drops to 3.4 compared to 3.72 for MDT, although the labeled dataset is only 1/4 the size of the full dataset. The labeled dataset has the advantage that it is segmented into different tasks, while the unlabeled part does not contain information about when one task ends and the next one starts, so the benefit from the unlabeled part is limited. (Or at least that is my assumption for why adding 3 times more training data has limited benefit.) One of the key takeaways from this work is that other policies such as GR-1 can also easily leverage the non-labeled part by using either a trainable ResNet or CLIP to learn from unlabeled demos. But stay tuned for a framework that deals with this issue of unlabeled and unsegmented robot trajectories; it will hopefully be released in the next two months. Preliminary results are shown here: https://openreview.net/pdf?id=Q70XYvUk52 :)
I tried it once during the earlier stages of training, but in my experience the performance is worse than diffusion.
Yes, good catch. MDT uses two ResNet-18s trained from scratch to encode images, and in this setting CLA has a lower impact. However, the same policy setup benefits a lot more from CLA in the LIBERO settings, where the MGF loss has only a small impact on the same environments. For MDT-V, where I use a frozen pretrained Voltron model with a Perceiver Resampler to extract 3 latent tokens, CLA worked well on CALVIN. I ran a lot of ablations and tests with other contrastive objectives such as SigLIP loss or NCE loss, but I did not find the reason for the difference. Let me know if you have more questions!
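For reference, the general shape of such a contrastive alignment objective is something like this (a CLIP/InfoNCE-style sketch with illustrative names, not the exact CLA implementation from the repo):

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(lang_emb, img_goal_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss that pulls the latent embedding of a
    language goal towards the embedding of the matching image goal from
    the same trajectory and pushes apart mismatched pairs in the batch."""
    lang = F.normalize(lang_emb, dim=-1)      # (B, D)
    img = F.normalize(img_goal_emb, dim=-1)   # (B, D)
    logits = lang @ img.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(lang.size(0), device=lang.device)
    # score both matching directions, CLIP-style
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```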
@StarCycle I just saw that you already tried to combine diffusion with GR-1 in your GR-1 reproduction codebase. Would you be interested in discussing your ideas in more detail? I am interested in what you tried.
Hi @mbreuss, thank you for the nice response! I think there are some interesting findings! I was debugging my code, so I replied a little late...
I fully agree that a smaller history length leads to quicker training. In my own tests on Robomimic/LanguageTable/real robot arms, just using the current observation is enough. ACT/Diffusion Policy show the same phenomenon. Using a longer history may harm performance because of causal confusion. But some other papers arrive at different conclusions, including GR-1 and HiveFormer. They are trained on CALVIN and RLBench respectively, so I guess more training data may resolve the causal confusion. Yes, Octo is an exception that is trained with massive data and only uses the current observation, but I guess they only compared the effects of different history lengths on a few tasks (with limited data per task). In their Appendix E:
It's another interesting difference between MDT and GR-1. My version of "GR-Chunk" also predicts the next 10 actions but only executes the 1st one. When it tries to execute the first 2 actions in the "action chunk", the success rate becomes lower. By contrast, MDT executes the whole action chunk and that's fine. I also trained a diffusion policy (not GR-Diffusion but another backbone) and found that rolling out the whole chunk to the end is fine there too. Could this be a difference between diffusion and non-diffusion policies?
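For comparison with the open-loop chunk rollout sketched above, the receding-horizon scheme I describe would look roughly like this (same placeholder interface, not the GR1-Training code):

```python
import torch

@torch.no_grad()
def rollout_receding_horizon(policy, env, lang_goal, max_steps=360, execute_k=1):
    """Re-plan frequently: predict a full action chunk, but execute only
    the first `execute_k` actions before predicting a new chunk."""
    obs = env.reset()
    steps = 0
    while steps < max_steps:
        action_chunk = policy.predict(obs["rgb_static"], lang_goal)
        for action in action_chunk[:execute_k]:
            obs, done = env.step(action)  # simplified env interface for illustration
            steps += 1
            if done:
                return True
    return False
```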
Nice work! And it will be the final solution to unlabelled data! I guess the reason that MDT can leverage the unlabelled part is the CLA loss: the language goals are not labelled, but the image goals are. Training a policy that accepts 2 types of goals and aligns them explicitly is a nice design. I shall think about how to apply the CLA loss in my design!
I modified the GR-1 transformer backbone to be similar to your GPT-Diffusion decoder. I tried to predict actions with diffusion, or to predict both actions and videos with diffusion (DDIM & DDPM). Compared with GR-Chunk, the success rate of GR-Diffusion is lower... A possible reason is that the "timestep" input is overwhelmed by the other inputs to the GPT-2 transformer. By contrast, the FiLM conditioning layers in your DiT may avoid that (I guess so). To overcome this problem, I may modify GR-1 in another way, i.e., adding a lightweight MLP after the GPT-2 transformer and only denoising with it.
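To illustrate what I mean by FiLM conditioning on the diffusion timestep, a minimal sketch (a hypothetical module, not taken from either codebase) could be:

```python
import torch.nn as nn

class TimestepFiLM(nn.Module):
    """FiLM-style conditioning: map the diffusion-timestep embedding to a
    per-channel scale and shift, so the timestep modulates every block
    instead of being just one more token competing for attention."""
    def __init__(self, hidden_dim: int, cond_dim: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, hidden, t_emb):
        # hidden: (B, T, hidden_dim), t_emb: (B, cond_dim)
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        return hidden * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```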
If I understand it correctly, MDT-V with the MGF loss only performs better than MDT-V with the CLA loss only on CALVIN ABCD->D (4.48 and 4.38 respectively, according to Table IX). But Fig. 4 shows a different conclusion? It's bad news for me that the MGF loss is not that useful... One reason may be the unmasked tokens, i.e., fully masking the future image may force the model to learn to predict the future. And predicting a video (instead of a single future image) may make the policy learn an implicit world model inside (I hope so...). I have recommended your implementation in my repo and am glad to have a nice discussion with you!
So one guess is that diffusion has higher expressiveness and can predict multiple actions with higher accuracy, while just predicting actions with MSE is a bit worse, and regenerating actions every timestep helps to mitigate that issue. But no idea if that's true.
Yes, I also find it fascinating that GR-1 works so well with such a long history. My prior work on diffusion policies also used GPT-style action prediction, but performance was a lot worse, especially with a longer history. There are no good studies on why this is the case. But I need to save compute, so I like single- or two-timestep models a lot more.
I think you need a bigger model for that. If you look at any recent image-generation diffusion model, bigger is always better in terms of performance. I also have some work coming up that scales a GPT-style diffusion policy to large models of up to 500 million parameters on the language-only data in CALVIN. We noticed that smaller models in this setting are not good on ABC, while the same model with more parameters performs a lot better. Our best diffusion policy learned from scratch achieved 2.8 on ABC. While GR-1 and 3DDA are both stronger, they either need large-scale pre-training or depth + camera, so for 2D policies that's quite good. Given that experience and current methods for diffusion video generation, I believe GR-1 is big enough for pixel generation, but not big enough for both image diffusion + action diffusion. My best bet would be to use a pretrained DiT (Diffusion Transformer, for images or OpenSora-style videos) and add action diffusion. But these models are usually several times bigger and require a lot more compute. In my experience, having a small diffusion head is not optimal, as diffusion requires more parameters to do well.
So for CALVIN, MGF is very strong for MDT with ResNets, while for MDT-V with Voltron it is not that big of a performance increase, since the base model is a lot better. The idea of only predicting mask tokens while giving some context is so the model does not overfit on the background. In my ablations, predicting the full image does hurt performance in some cases. Regarding the video: that is actually something I tried but did not include in the paper anymore. Predicting several masked future images works quite well too, but I did not explore it further to make a good comparison. Overall, I am optimistic about future work that combines a low-level policy with some form of visual behavior generation, similar to MDT or GR-1. Policies that understand their actions in both domains seem to me the best bet for better generalization.
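As a rough illustration of the masked-token idea, scoring only the masked patches of the future frame could look like this (a simplified sketch with illustrative names, not the exact MGF loss from the paper):

```python
import torch
import torch.nn.functional as F

def masked_future_image_loss(pred_patches, target_patches, mask):
    """Reconstruct the future frame as patches, but only score the patches
    that were masked out of the input, so the model cannot just copy the
    (mostly static) background context.
    pred_patches, target_patches: (B, N, patch_dim); mask: (B, N) bool,
    True where a patch was masked."""
    per_patch = F.mse_loss(pred_patches, target_patches, reduction="none").mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)
```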
Also, please let me know if CLA works well for your models! :)
@mbreuss Thanks! I will let you know when I try it!
Hi @omeryagmurlu @mbreuss,
It's indeed nice work! Many thanks for making it open-source so quickly! I am the author of a similar repository for the CALVIN benchmark: GR1-Training. However, it takes more training time...
I have several questions about your paper:
Again, it's very nice work! I starred it and would like to recommend it to others. Good luck!
Zhuoheng