Could you explain why pipeline parallelism is not compatible with ZeRO-2 and ZeRO-3? Are there any design tradeoffs?

As far as I know, it is quite common to train large models with data parallelism and pipeline parallelism together, and under the constraint above, the offload mechanism cannot be enabled either, since it depends on ZeRO-2/3.

Also, Megatron-DeepSpeed's `pretrain_gpt.py` uses `GPTModelPipe`, a subclass of `PipelineModule`, as the model module passed to `deepspeed.initialize()`, so it is impossible to enable ZeRO-2/3 in the config JSON. Are there any examples that run with ZeRO-2/3?
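For reference, here is a minimal sketch of the combination I am asking about, with illustrative layer sizes and config values (this is not the actual Megatron-DeepSpeed code): a `PipelineModule` passed to `deepspeed.initialize()` together with a config that requests ZeRO stage 2.

```python
# Minimal sketch (illustrative, not the actual Megatron-DeepSpeed code).
# Launch with the `deepspeed` launcher across >= 2 GPUs.
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

# A toy stack of layers split into two pipeline stages.
layers = [nn.Linear(1024, 1024) for _ in range(8)]
model = PipelineModule(layers=layers, num_stages=2)

ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    # Requesting stage 2 (or 3) here is rejected when the model is a
    # PipelineModule; as far as I can tell, only stages 0/1 are accepted.
    "zero_optimization": {"stage": 2},
}

engine, _, _, _ = deepspeed.initialize(
    model=model, config=ds_config, model_parameters=model.parameters())
```

With `stage` set to 2 or 3 this fails at initialization time, which is the constraint I am asking about.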
For context, the DeepSpeed documentation describes ZeRO Stage 3 as follows:

> ZeRO Stage 3: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and partition them during the forward and backward passes.
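If I understand that correctly, the ZeRO-3 path is only exercised when a plain `nn.Module` (not a `PipelineModule`) is handed to `deepspeed.initialize()`. Below is a minimal sketch of such a run, again with illustrative values, including the offload settings that depend on ZeRO-2/3:

```python
# Sketch: ZeRO-3 with a plain nn.Module, so parameter partitioning and
# offload can be enabled. All sizes and config values are illustrative.
import torch.nn as nn
import deepspeed

model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)])

ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,
        # The offload mechanism mentioned above, which depends on
        # ZeRO-2/3 (offload_param additionally requires stage 3).
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, config=ds_config, model_parameters=model.parameters())

# During forward/backward, ZeRO-3 gathers each layer's partitioned 16-bit
# parameters on demand and re-partitions them afterwards, as described above.
```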