Slurm (Soperator) Distributed Training Tutorials #5
Conversation
Very good examples overall. I left some comments regarding quality-of-life improvements and streamlining the user experience.
Please feel free to reach out if you have any questions.
`.DS_Store` (outdated)
macOS artifact; delete it.
    @@ -0,0 +1,33 @@
    #!/bin/bash
Add `: "${HF_TOKEN:?provide your HF_TOKEN}"` to force providing `HF_TOKEN` before launching the script.
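As a minimal sketch of how this guard behaves (demonstrated in subshells so the example runs regardless of the caller's environment):

```shell
#!/bin/bash
# The ${VAR:?message} expansion aborts a non-interactive shell with the
# given message when VAR is unset or empty, so the setup script cannot
# proceed without an exported HF_TOKEN.

# With the token set, the guard is a no-op:
HF_TOKEN=dummy bash -c ': "${HF_TOKEN:?provide your HF_TOKEN}"; echo token ok'

# Without it, the guard aborts with a non-zero exit status:
env -u HF_TOKEN bash -c ': "${HF_TOKEN:?provide your HF_TOKEN}"' 2>/dev/null \
    || echo "guard fired"
```

In the real script the guard is just the single `: "${HF_TOKEN:?...}"` line near the top.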
    # pip install -e .

    # Flux specific setup
    python ./torchtitan/experiments/flux/scripts/download_autoencoder.py --repo_id black-forest-labs/FLUX.1-dev --ae_path ae.safetensors --hf_token <YOUR HF TOKEN>
Replace `<YOUR HF TOKEN>` with `$HF_TOKEN` from the environment variable.
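A sketch of the corrected invocation, reading the token from the environment rather than a hard-coded placeholder. The command is printed rather than executed here so the sketch runs without torchtitan installed; drop the `echo` in the real setup script:

```shell
#!/bin/bash
# Demo default only; the real script should require the token instead
# (e.g. via the ${HF_TOKEN:?...} guard suggested above in this review).
HF_TOKEN=${HF_TOKEN:-dummy-token-for-demo}

echo python ./torchtitan/experiments/flux/scripts/download_autoencoder.py \
    --repo_id black-forest-labs/FLUX.1-dev \
    --ae_path ae.safetensors \
    --hf_token "$HF_TOKEN"
```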
    pip install -r requirements-flux.txt

    # 5) Make required directories
    mkdir -p output slurm_out
Replace

    mkdir -p output slurm_out
    cd ../../../..

with

    cd $ROOT_DIR
    mkdir -p outputs slurm_out

since the directories should be created in `$ROOT_DIR`.
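A sketch of the corrected ordering, assuming `setup.sh` captures its starting directory in `ROOT_DIR` before any `cd` (the variable name follows the review comment; everything else is illustrative):

```shell
#!/bin/bash
# Capture the repo root at the top of the script, before any cd.
ROOT_DIR=$(pwd)

# ... setup steps that cd around the repo would go here ...

# Create output directories relative to the repo root, not wherever
# the script happens to be after the last cd:
cd "$ROOT_DIR"
mkdir -p outputs slurm_out
```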
    nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
    nodes_array=($nodes)
    head_node=${nodes_array[0]}
    head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
Replace

    nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
    nodes_array=($nodes)
    head_node=${nodes_array[0]}
    head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

with

    head_node_ip=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)

In general, avoid using `srun` to get the node IP, since it launches a Slurm job step; use a single `scontrol` command instead.
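The one-command form can be sketched as below. `scontrol show hostnames` expands the compressed node list locally on the submitting node, so unlike `srun ... hostname` it launches no extra job step:

```shell
#!/bin/bash
# Expand e.g. node-[001-002] into one hostname per line and take the
# first entry as the head node. Note this yields a hostname rather than
# a numeric IP; for torchrun rendezvous a resolvable hostname is
# usually sufficient (an assumption worth verifying on your cluster).
head_node_ip=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
echo "head node: $head_node_ip"
```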
For running this workload, you will need to SSH to the login node of the Soperator cluster and clone this repository to the shared filesystem (by default, Soperator has `/` mounted as a shared filesystem).

### 🔧 Setup the environment
Add an instruction about providing `HF_TOKEN` before launching `setup_flux.sh`.
    @@ -0,0 +1,24 @@
    #!/bin/bash
Add a check for the `HF_TOKEN` env var:

    set -e
    : "${HF_TOKEN:?provide your HF_TOKEN}"
    #
    # This config is only tested on 2 nodes w/ 8 H100 machines.

    root_dir: /root/ml-cookbook/slurm-recipes/torchtune/ # <-- Update this to your ml-cookbook/slurm-recipes root dir
    root_dir: /root/ml-cookbook/slurm-recipes/torchtune/ # <-- Update this to your ml-cookbook/slurm-recipes root dir

Use YAML template (`.tpl`) rendering in `setup.sh` with the current `$(pwd)`.
    # This config needs 8 GPUs to run
    # tune run --nproc_per_node 8 lora_finetune_distributed --config llama3_3/70B_lora

    root_dir: /root/ml-cookbook/slurm-recipes/torchtune/ # <-- Update this to your ml-cookbook/slurm-recipes root dir
    root_dir: /root/ml-cookbook/slurm-recipes/torchtune/ # <-- Update this to your ml-cookbook/slurm-recipes root dir

Use YAML template (`.tpl`) rendering in `setup.sh` with the current `$(pwd)`.
    # Download LLaMA-3 model to shared FS --> Syncs to all Soperator nodes
    tune download meta-llama/Llama-3.3-70B-Instruct \
        --ignore-patterns "original/consolidated*.pth" \
        --output-dir ./models/Llama-3.3-70B-Instruct
Maybe add a section that inserts `$(pwd)` as the `root_dir` variable in the YAMLs; for this you need to use `.tpl` files to generate the final YAML configs for torchtune.
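The `.tpl` rendering step could be sketched in `setup.sh` roughly as follows. The template directory, file names, and the literal `${ROOT_DIR}` placeholder are all illustrative assumptions, not the repository's actual layout:

```shell
#!/bin/bash
# Render torchtune YAML templates, substituting the current repo root
# for a literal ${ROOT_DIR} placeholder inside each *.yaml.tpl file.
ROOT_DIR=$(pwd)

render() {
    # `|` as the sed delimiter keeps slashes in paths from breaking
    # the substitution; output drops the .tpl suffix.
    sed "s|\${ROOT_DIR}|$ROOT_DIR|g" "$1" > "${1%.tpl}"
}

for tpl in ./configs/*.yaml.tpl; do
    [ -e "$tpl" ] || continue   # skip if the glob matched nothing
    render "$tpl"
done
```

This keeps the committed templates machine-independent while the rendered YAMLs always point at the cloned location.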
Beginner-friendly Slurm recipes with torchrun, tested on 2 H100s.

Recipes:
Each example contains its own README.md and a setup script to set up distributed training.