Could you explain why pipeline parallelism is not compatible with ZeRO-2 and ZeRO-3? Are there any design tradeoffs?

As far as I know, it is quite common to train large models with data parallelism and pipeline parallelism together, and under the constraint above, the offload mechanism cannot be enabled either, since it depends on ZeRO-2/3.

Also, Megatron-DeepSpeed's `pretrain_gpt.py` uses `GPTModelPipe`, a subclass of `PipelineModule`, as the model module passed to `deepspeed.initialize()`, so it is impossible to enable ZeRO-2/3 in the config JSON. Are there any examples that run with ZeRO-2/3?
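For reference, here is a minimal sketch of the combination I am asking about, with illustrative layer sizes and config values (this is not the actual Megatron-DeepSpeed code): a `PipelineModule` passed to `deepspeed.initialize()` together with a config that requests ZeRO stage 2.

```python
# Minimal sketch (illustrative, not the actual Megatron-DeepSpeed code).
# Launch with the `deepspeed` launcher across >= 2 GPUs.
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

# A toy stack of layers split into two pipeline stages.
layers = [nn.Linear(1024, 1024) for _ in range(8)]
model = PipelineModule(layers=layers, num_stages=2)

ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    # Requesting stage 2 (or 3) here is rejected when the model is a
    # PipelineModule; as far as I can tell, only stages 0/1 are accepted.
    "zero_optimization": {"stage": 2},
}

engine, _, _, _ = deepspeed.initialize(
    model=model, config=ds_config, model_parameters=model.parameters())
```

With `stage` set to 2 or 3 this fails at initialization time, which is the constraint I am asking about.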
For context, the DeepSpeed documentation describes ZeRO Stage 3 as follows:

> ZeRO Stage 3: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and partition them during the forward and backward passes.
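If I understand that correctly, the ZeRO-3 path is only exercised when a plain `nn.Module` (not a `PipelineModule`) is handed to `deepspeed.initialize()`. Below is a minimal sketch of such a run, again with illustrative values, including the offload settings that depend on ZeRO-2/3:

```python
# Sketch: ZeRO-3 with a plain nn.Module, so parameter partitioning and
# offload can be enabled. All sizes and config values are illustrative.
import torch.nn as nn
import deepspeed

model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)])

ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,
        # The offload mechanism mentioned above, which depends on
        # ZeRO-2/3 (offload_param additionally requires stage 3).
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, config=ds_config, model_parameters=model.parameters())

# During forward/backward, ZeRO-3 gathers each layer's partitioned 16-bit
# parameters on demand and re-partitions them afterwards, as described above.
```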