
I would like to run a training job on an ml.p4d.24xlarge instance on AWS SageMaker. I ran into an issue similar to the one described here, with significant slowdowns in training time. I understand now that I should launch the job with torchrun. My constraint is that I don't want to use the HuggingFace or PyTorch estimators from SageMaker (for customizability, and to properly understand the stack).

Currently, the entrypoint to my container is set as follows in my Dockerfile:

ENTRYPOINT ["python3", "/opt/program/entrypoint.py"]

How should I change it so that it uses torchrun instead? Is it just a matter of setting:

ENTRYPOINT ["torchrun --nproc_per_node 8", "/opt/program/entrypoint.py"]

1 Answer


The SageMaker Training Toolkit contains the implementation that builds and invokes the torchrun command from within the SageMaker Python SDK classes.

You can refer to "TorchDistributedRunner._create_command()" in the toolkit source to see how it constructs the torchrun command and its arguments.

Please also refer to the PyTorch documentation on how to use the torchrun command: https://pytorch.org/docs/stable/elastic/run.html
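
For the single-node case, your proposed ENTRYPOINT is close, with one caveat: in Docker's exec form, each argument must be its own array element, so the single-string version in the question would make Docker look for an executable literally named "torchrun --nproc_per_node 8". A minimal sketch, assuming torchrun is on the PATH inside your image (8 processes matches the 8 GPUs on an ml.p4d.24xlarge):

ENTRYPOINT ["torchrun", "--nproc_per_node", "8", "/opt/program/entrypoint.py"]

This is enough for single-node data parallelism; torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process it spawns.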


1 Comment

Do you have any examples of a custom container with torchrun and sagemaker-pytorch-training-toolkit? I'm trying to do something similar with a BYOC, but looking through the estimator code, DDP is only set up if you use the PyTorch estimator, and I've been using the generic Estimator (which doesn't accept a distribution dictionary).
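
Not a complete toolkit-based example, but here is a minimal sketch of the BYOC pattern without any toolkit: SageMaker writes the cluster layout to /opt/ml/input/config/resourceconfig.json inside every training container, so a small wrapper can read it and exec torchrun with the rendezvous arguments. The file name launcher.py and the port 29500 are illustrative choices, not anything SageMaker mandates:

#!/usr/bin/env python3
# launcher.py (hypothetical name) -- a sketch, not the toolkit's implementation.
import json
import os

# SageMaker writes the cluster layout here in every training container.
RESOURCE_CONFIG = "/opt/ml/input/config/resourceconfig.json"

def main():
    with open(RESOURCE_CONFIG) as f:
        cfg = json.load(f)

    hosts = cfg["hosts"]                # same list in every container, e.g. ["algo-1", "algo-2"]
    current_host = cfg["current_host"]  # e.g. "algo-1"

    cmd = [
        "torchrun",
        "--nnodes", str(len(hosts)),
        "--node_rank", str(hosts.index(current_host)),
        "--nproc_per_node", "8",        # ml.p4d.24xlarge has 8 GPUs
        "--master_addr", hosts[0],      # first host acts as the rendezvous master
        "--master_port", "29500",       # any port that is free on the master
        "/opt/program/entrypoint.py",
    ]
    # Replace this process with torchrun so signals from SageMaker propagate cleanly.
    os.execvp(cmd[0], cmd)

if __name__ == "__main__":
    main()

The Dockerfile then points at the wrapper instead of the training script:

ENTRYPOINT ["python3", "/opt/program/launcher.py"]

With this arrangement the distribution setup happens entirely inside the container, so the generic Estimator needs no distribution dictionary.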
