Support multinode training #1103

Merged on Apr 9, 2024 (8 commits)

r4victor commented on Apr 9, 2024

Closes #1084.

This PR adds multi-node task support for the AWS, Azure, and GCP backends.

A task with two jobs that measures the connectivity speed between the nodes:

type: task
nodes: 2
commands:
  - apt-get update && apt-get install -y iperf
  - if [ "$DSTACK_NODE_RANK" -eq 0 ]; then iperf -s; fi
  - iperf -t 30 -c $DSTACK_MASTER_NODE_IP -P 16

An example of a multi-node training task:

type: task
nodes: 2
resources:
  gpu: 1
commands:
  - git clone https://github.com/r4victor/pytorch-distributed-resnet.git
  - cd pytorch-distributed-resnet
  - mkdir -p data && mkdir -p saved_models && cd data && wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz && tar -xvzf cifar-10-python.tar.gz
  - cd ..
  - pip3 install -r requirements.txt torch
  - time torchrun --nproc_per_node=${DSTACK_GPUS_PER_NODE} --node_rank=${DSTACK_NODE_RANK} --nnodes=${DSTACK_NODES_NUM} --master_addr=${DSTACK_MASTER_NODE_IP} --master_port=8008 resnet_ddp.py --num_epochs 20
  - echo Done
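
To make the mapping between the `DSTACK_*` variables and the distributed layout explicit, here is a minimal Python sketch. It assumes every node runs the same number of processes (one per GPU) and that ranks are assigned statically from `--node_rank`, as in the `torchrun` command above. The helper names `world_size` and `global_rank` are hypothetical and not part of dstack or PyTorch.

```python
import os


def world_size(env=os.environ):
    """Total number of training processes across all nodes.

    Assumes one process per GPU and the same GPU count on every node.
    """
    return int(env["DSTACK_NODES_NUM"]) * int(env["DSTACK_GPUS_PER_NODE"])


def global_rank(local_rank, env=os.environ):
    """Global rank of the process with the given local rank on this node.

    Assumes static rank assignment: node_rank * processes_per_node + local_rank.
    """
    node_rank = int(env["DSTACK_NODE_RANK"])
    gpus_per_node = int(env["DSTACK_GPUS_PER_NODE"])
    return node_rank * gpus_per_node + local_rank


if __name__ == "__main__":
    # Illustrative values only; on a real run dstack sets these variables.
    example_env = {
        "DSTACK_NODES_NUM": "2",
        "DSTACK_GPUS_PER_NODE": "1",
        "DSTACK_NODE_RANK": "1",
    }
    print(world_size(example_env), global_rank(0, example_env))
```

With the two-node, one-GPU task above, the second node's single process would get global rank 1 out of a world size of 2.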

TODO:

  • Support multi-node for remote SSH instances.
  • Improve provisioning/retry logic. Currently, the master job runs on the top offer, and the other jobs try to run in the same backend/region as the master job. There is no smart retry that would backtrack to other backends/regions if that fails.
  • Ensure better connectivity between nodes within a backend, e.g. by using placement groups in AWS and Azure.
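
The current provisioning behavior described in the TODO list can be sketched roughly as follows. This is an illustration only, not the code in this PR: the `Offer` class and both functions are hypothetical, and "top offer" is assumed here to mean the cheapest one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    # Hypothetical, simplified offer record used only for this sketch.
    backend: str
    region: str
    price: float


def pick_master_offer(offers):
    """Pick the offer for the master job (assumption: top offer = cheapest)."""
    return min(offers, key=lambda o: o.price)


def offers_for_other_jobs(offers, master):
    """Restrict the remaining jobs to the master job's backend and region.

    There is no backtracking: if this list is empty or provisioning fails,
    nothing here falls back to another backend/region.
    """
    return [
        o for o in offers
        if o.backend == master.backend and o.region == master.region
    ]


if __name__ == "__main__":
    offers = [
        Offer("aws", "us-east-1", 1.20),
        Offer("gcp", "us-central1", 0.90),
        Offer("gcp", "us-central1", 1.10),
        Offer("azure", "westeurope", 1.00),
    ]
    master = pick_master_offer(offers)
    print(master, offers_for_other_jobs(offers, master))
```

A smarter retry would wrap these two steps in a loop that drops the failed backend/region and picks the next-best master offer.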

r4victor merged commit daa35b7 into master on Apr 9, 2024 (15 checks passed).
peterschmidt85 mentioned this pull request on Apr 11, 2024.