Nice work! And several questions about the paper #3
Hi @StarCycle, thanks for your interest in our work! I am happy to hear that you like it.
I only use the current state image observation as input to the model and predict the next 10 actions, which I always roll out to the end before the model gets the next state and predicts 10 actions again. This makes the model train several times faster than GR-1 or any other competitive policy on CALVIN, since a longer history is expensive and my model is small. Similar work such as Octo uses a history of 2 and reported that a longer history did not improve performance. I tried 2 as well but did not notice any benefit on CALVIN.
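Roughly, the rollout scheme looks like this (a minimal sketch with placeholder names such as `policy.predict` and a simplified env interface, not the actual repo API):

```python
import torch

CHUNK_SIZE = 10  # number of future actions predicted per forward pass

@torch.no_grad()
def rollout_open_loop(policy, env, lang_goal, max_steps=360):
    """Encode only the current image observation, predict a chunk of
    CHUNK_SIZE actions, execute the whole chunk, then re-plan."""
    obs = env.reset()
    for _ in range(max_steps // CHUNK_SIZE):
        # single-frame conditioning: no observation history is kept
        action_chunk = policy.predict(obs["rgb_static"], lang_goal)  # (CHUNK_SIZE, act_dim)
        for action in action_chunk:
            obs, done = env.step(action)  # simplified env interface for illustration
            if done:
                return True
    return False
```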
I tried it on CALVIN D and the performance drops to 3.4 compared to 3.72 for MDT, although the labeled dataset is only 1/4 the size of the full dataset. The labeled dataset has the advantage that it is segmented into different tasks, while the unlabeled part does not contain information about when one task ends and the next one starts, so the benefit from the unlabeled part is limited. (Or at least that is my assumption for why adding 3 times more training data has limited benefit.) One of the key takeaways from this work is that other policies such as GR-1 can also easily leverage the non-labeled part by using either a trainable ResNet or CLIP to learn from unlabeled demos. But stay tuned for a framework that deals with this issue of unlabeled and unsegmented robot trajectories; it will hopefully be released in the next two months. Preliminary results are shown here: https://openreview.net/pdf?id=Q70XYvUk52 :)
I tried it once during the earlier stages of training, but in my experience the performance is worse than diffusion.
Yes, good catch. MDT uses two ResNet-18s trained from scratch to encode images, and in this setting CLA has a lower impact. However, the same policy setup benefits a lot more from CLA in the LIBERO settings, where the MGF loss has only a small impact on the same environments. For MDT-V, where I use a frozen pretrained Voltron model with a Perceiver Resampler to extract 3 latent tokens, CLA worked well on CALVIN. I ran a lot of ablations and tests with other contrastive objectives such as SigLIP loss or NCE loss, but I did not find the reason for the difference. Let me know if you have more questions!
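For reference, the general shape of such a contrastive alignment objective is something like this (a CLIP/InfoNCE-style sketch with illustrative names, not the exact CLA implementation from the repo):

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(lang_emb, img_goal_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss that pulls the latent embedding of a
    language goal towards the embedding of the matching image goal from
    the same trajectory and pushes apart mismatched pairs in the batch."""
    lang = F.normalize(lang_emb, dim=-1)      # (B, D)
    img = F.normalize(img_goal_emb, dim=-1)   # (B, D)
    logits = lang @ img.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(lang.size(0), device=lang.device)
    # score both matching directions, CLIP-style
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```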
@StarCycle I just saw that you already tried to combine diffusion with GR-1 in your GR-1 reproduction codebase. Would you be interested in discussing your ideas in more detail? I am interested in what you tried.
Hi @mbreuss, thank you for the nice response! I think there are some interesting findings! I was debugging my code, so I replied a little late...
I fully agree that a smaller history length leads to quicker training. In my own tests on Robomimic/LanguageTable/real robot arms, just using the current observation is enough. ACT/Diffusion Policy show the same phenomenon. Using a longer history may harm performance because of causal confusion. But some other papers arrive at different conclusions, including GR-1 and HiveFormer. They are trained on CALVIN and RLBench respectively, so I guess more training data may resolve the causal confusion. Yes, Octo is an exception that is trained with massive data and only uses the current observation, but I guess they only compared the effects of different history lengths on a few tasks (with limited data per task). In their Appendix E:
It's another interesting difference between MDT and GR-1. My version of "GR-Chunk" also predicts the next 10 actions but only executes the 1st one. When it tries to execute the first 2 actions in the "action chunk", the success rate becomes lower. By contrast, MDT executes the whole action chunk and that's fine. I also trained a diffusion policy (not GR-Diffusion but another backbone) and found that rolling out the whole chunk to the end is fine there too. Could this be a difference between diffusion and non-diffusion policies?
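For comparison with the open-loop chunk rollout sketched above, the receding-horizon scheme I describe would look roughly like this (same placeholder interface, not the GR1-Training code):

```python
import torch

@torch.no_grad()
def rollout_receding_horizon(policy, env, lang_goal, max_steps=360, execute_k=1):
    """Re-plan frequently: predict a full action chunk, but execute only
    the first `execute_k` actions before predicting a new chunk."""
    obs = env.reset()
    steps = 0
    while steps < max_steps:
        action_chunk = policy.predict(obs["rgb_static"], lang_goal)
        for action in action_chunk[:execute_k]:
            obs, done = env.step(action)  # simplified env interface for illustration
            steps += 1
            if done:
                return True
    return False
```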
Nice work! And it will be the final solution to unlabelled data! I guess the reason that MDT can leverage the unlabelled part is the CLA loss: the language goals are not labelled, but the image goals are. Training a policy that accepts 2 types of goals and aligns them explicitly is a nice design. I shall think about how to apply the CLA loss in my design!
I modified the GR-1 transformer backbone to be similar to your GPT-Diffusion decoder. I tried to predict actions with diffusion, or to predict both actions and videos with diffusion (DDIM & DDPM). Compared with GR-Chunk, the success rate of GR-Diffusion is lower... A possible reason is that the "timestep" input is overwhelmed by the other inputs to the GPT-2 transformer. By contrast, the FiLM conditioning layers in your DiT may avoid that (I guess so). To overcome this problem, I may modify GR-1 in another way, i.e., adding a lightweight MLP after the GPT-2 transformer and only denoising with it.
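To illustrate what I mean by FiLM conditioning on the diffusion timestep, a minimal sketch (a hypothetical module, not taken from either codebase) could be:

```python
import torch.nn as nn

class TimestepFiLM(nn.Module):
    """FiLM-style conditioning: map the diffusion-timestep embedding to a
    per-channel scale and shift, so the timestep modulates every block
    instead of being just one more token competing for attention."""
    def __init__(self, hidden_dim: int, cond_dim: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, hidden, t_emb):
        # hidden: (B, T, hidden_dim), t_emb: (B, cond_dim)
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        return hidden * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```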
If I understand it correctly, MDT-V with the MGF loss only performs better than MDT-V with the CLA loss only on CALVIN ABCD->D (4.48 and 4.38 respectively, according to Table IX). But Fig. 4 shows a different conclusion? It's bad news for me that the MGF loss is not that useful... One reason may be the unmasked tokens, i.e., fully masking the future image may force the model to learn to predict the future. And predicting a video (instead of a single future image) may make the policy learn an implicit world model inside (I hope so...). I have recommended your implementation in my repo and am glad to have a nice discussion with you!
So one guess is that diffusion has higher expressiveness and can predict multiple actions with higher accuracy, while just predicting actions with MSE is a bit worse, and regenerating actions every timestep helps to mitigate that issue. But no idea if that's true.
Yes, I also find it fascinating that GR-1 works so well with such a long history. My prior work on diffusion policies also used GPT-style action prediction, but performance was a lot worse, especially with a longer history. There are no good studies on why this is the case. But I need to save compute, so I like single- or two-timestep models a lot more.
I think you need a bigger model for that. If you look at any recent image-generation diffusion model, bigger is always better in terms of performance. I also have some work coming up that scales a GPT-style diffusion policy to large models of up to 500 million parameters on the language-only data in CALVIN. We noticed that smaller models in this setting are not good on ABC, while the same model with more parameters performs a lot better. Our best diffusion policy learned from scratch achieved 2.8 on ABC. While GR-1 and 3DDA are both stronger, they either need large-scale pre-training or depth + camera, so for 2D policies that's quite good. Given that experience and current methods for diffusion video generation, I believe GR-1 is big enough for pixel generation, but not big enough for both image diffusion + action diffusion. My best bet would be to use a pretrained DiT (Diffusion Transformer, for images or OpenSora-style videos) and add action diffusion. But these models are usually several times bigger and require a lot more compute. In my experience, having a small diffusion head is not optimal, as diffusion requires more parameters to do well.
So for CALVIN, MGF is very strong for MDT with ResNets, while for MDT-V with Voltron it is not that big of a performance increase, since the base model is a lot better. The idea of only predicting mask tokens while giving some context is so the model does not overfit on the background. In my ablations, predicting the full image does hurt performance in some cases. Regarding the video: that is actually something I tried but did not include in the paper anymore. Predicting several masked future images works quite well too, but I did not explore it further to make a good comparison. Overall, I am optimistic about future work that combines a low-level policy with some form of visual behavior generation, similar to MDT or GR-1. Policies that understand their actions in both domains seem to me the best bet for better generalization.
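As a rough illustration of the masked-token idea, scoring only the masked patches of the future frame could look like this (a simplified sketch with illustrative names, not the exact MGF loss from the paper):

```python
import torch
import torch.nn.functional as F

def masked_future_image_loss(pred_patches, target_patches, mask):
    """Reconstruct the future frame as patches, but only score the patches
    that were masked out of the input, so the model cannot just copy the
    (mostly static) background context.
    pred_patches, target_patches: (B, N, patch_dim); mask: (B, N) bool,
    True where a patch was masked."""
    per_patch = F.mse_loss(pred_patches, target_patches, reduction="none").mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)
```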
Also, please let me know if CLA works well for your models! :)
@mbreuss Thanks! I will let you know when I try it!
Hi @omeryagmurlu @mbreuss,
It's indeed nice work! Many thanks for making it open-source so quickly! I am the author of a similar repository for the CALVIN benchmark: GR1-Training. However, it takes more training time...
I have several questions about your paper:
Again, it's very nice work! I starred it and would like to recommend it to others. Good luck!
Zhuoheng