
I would like to run a training job on an ml.p4d.24xlarge instance on AWS SageMaker. I ran into an issue similar to the one described here, with significant slowdowns in training time. I understand now that I should launch the job with torchrun. My constraint is that I don't want to use the HuggingFace or PyTorch estimators from SageMaker (for customizability, and to properly understand the stack).

Currently, the entrypoint to my container is set as follows in my Dockerfile:

ENTRYPOINT ["python3", "/opt/program/entrypoint.py"]

How should I change it so that it uses torchrun instead? Is it just a matter of setting:

ENTRYPOINT ["torchrun --nproc_per_node 8", "/opt/program/entrypoint.py"]

1 Answer


The SageMaker Training Toolkit contains the implementation that builds and invokes the torchrun command from within the SageMaker Python SDK classes.

You can refer to "TorchDistributedRunner._create_command()" in the toolkit source to see how it constructs the torchrun command and its arguments.

Please also refer to the PyTorch documentation on how to use the torchrun command: https://pytorch.org/docs/stable/elastic/run.html
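
For the single-node case, your proposed ENTRYPOINT is close, with one caveat: in Docker's exec form, each argument must be its own array element, so the single-string version in the question would make Docker look for an executable literally named "torchrun --nproc_per_node 8". A minimal sketch, assuming torchrun is on the PATH inside your image (8 processes matches the 8 GPUs on an ml.p4d.24xlarge):

ENTRYPOINT ["torchrun", "--nproc_per_node", "8", "/opt/program/entrypoint.py"]

This is enough for single-node data parallelism; torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process it spawns.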


1 Comment

Do you have any examples of a custom container with torchrun and sagemaker-pytorch-training-toolkit? I'm trying to do something similar with a BYOC, but looking through the estimator code, DDP is only set up if you use the PyTorch estimator, and I've been using the generic Estimator (which doesn't accept a distribution dictionary).
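
Not a complete toolkit-based example, but here is a minimal sketch of the BYOC pattern without any toolkit: SageMaker writes the cluster layout to /opt/ml/input/config/resourceconfig.json inside every training container, so a small wrapper can read it and exec torchrun with the rendezvous arguments. The file name launcher.py and the port 29500 are illustrative choices, not anything SageMaker mandates:

#!/usr/bin/env python3
# launcher.py (hypothetical name) -- a sketch, not the toolkit's implementation.
import json
import os

# SageMaker writes the cluster layout here in every training container.
RESOURCE_CONFIG = "/opt/ml/input/config/resourceconfig.json"

def main():
    with open(RESOURCE_CONFIG) as f:
        cfg = json.load(f)

    hosts = cfg["hosts"]                # same list in every container, e.g. ["algo-1", "algo-2"]
    current_host = cfg["current_host"]  # e.g. "algo-1"

    cmd = [
        "torchrun",
        "--nnodes", str(len(hosts)),
        "--node_rank", str(hosts.index(current_host)),
        "--nproc_per_node", "8",        # ml.p4d.24xlarge has 8 GPUs
        "--master_addr", hosts[0],      # first host acts as the rendezvous master
        "--master_port", "29500",       # any port that is free on the master
        "/opt/program/entrypoint.py",
    ]
    # Replace this process with torchrun so signals from SageMaker propagate cleanly.
    os.execvp(cmd[0], cmd)

if __name__ == "__main__":
    main()

The Dockerfile then points at the wrapper instead of the training script:

ENTRYPOINT ["python3", "/opt/program/launcher.py"]

With this arrangement the distribution setup happens entirely inside the container, so the generic Estimator needs no distribution dictionary.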
