
Conversation

HoomanRamezani (Collaborator)

Beginner-friendly Slurm recipes with torchrun, tested on 2 H100s

Recipes:

  • Full-parameter finetuning of Llama3.3-70B on 16 H100 Soperator nodes via TorchTune
  • LoRA finetuning of Llama3.3-70B on 16 H100 Soperator nodes via TorchTune
  • Continued pretraining of FLUX.1 Schnell (12B) on 2 nodes via TorchTitan

Each example contains its own README.md and setup script to set up distributed training.

@cyril-k (Collaborator) left a comment


Very good examples overall. I left some comments regarding quality-of-life improvements and streamlining the user experience.

Please feel free to reach out if you have any questions.

.DS_Store (outdated)


macOS artifact, to delete.

@@ -0,0 +1,33 @@
#!/bin/bash



add : "${HF_TOKEN:?provide your HF_TOKEN}" to force providing HF_TOKEN before launching script

# pip install -e .

# Flux specific setup
python ./torchtitan/experiments/flux/scripts/download_autoencoder.py --repo_id black-forest-labs/FLUX.1-dev --ae_path ae.safetensors --hf_token <YOUR HF TOKEN>


Replace <YOUR HF TOKEN> with $HF_TOKEN from the env var.
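
With the variable exported, the download line above would simply become (same flags, only the token placeholder substituted):

python ./torchtitan/experiments/flux/scripts/download_autoencoder.py --repo_id black-forest-labs/FLUX.1-dev --ae_path ae.safetensors --hf_token $HF_TOKEN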

pip install -r requirements-flux.txt

# 5) Make required directories
mkdir -p output slurm_out


Replace

mkdir -p output slurm_out
cd ../../../..

with

cd $ROOT_DIR
mkdir -p outputs slurm_out

since the directories should be created in $ROOT_DIR.

nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)


Replace

nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

with

head_node_ip=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)

In general, avoid using srun to get the node IP, since it launches a Slurm job step; use a single scontrol command instead.
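
For context, a minimal sketch of how that address then typically feeds torchrun's rendezvous arguments; the training script name, port, and per-node GPU count below are assumptions, not taken from this PR:

head_node_ip=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)

# One torchrun launcher per node; all ranks rendezvous at the head node
srun torchrun \
  --nnodes $SLURM_NNODES \
  --nproc_per_node 8 \
  --rdzv_backend c10d \
  --rdzv_endpoint $head_node_ip:29500 \
  train.py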

For running this workload, you will need to SSH to the login node of the Soperator cluster and clone this repository to the shared filesystem (by default, Soperator has `/` mounted as a shared filesystem).

### 🔧 Setup the environment



Add an instruction about providing HF_TOKEN before launching setup_flux.sh.
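
For example, the README section could gain a note along these lines (exact wording is an assumption):

# Export your Hugging Face token before running the setup script
export HF_TOKEN=<your HF token>
./setup_flux.sh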

@@ -0,0 +1,24 @@
#!/bin/bash



Add a check for the HF_TOKEN env var.

set -e

: "${HF_TOKEN:?provide your HF_TOKEN}"

#
# This config is only tested on 2 nodes w/ 8 H100 machines.

root_dir: /root/ml-cookbook/slurm-recipes/torchtune/ # <-- Update this to your ml-cookbook/slurm-recipes root dir


root_dir: /root/ml-cookbook/slurm-recipes/torchtune/ # <-- Update this to your ml-cookbook/slurm-recipes root dir

Use YAML template .tpl rendering in setup.sh with the current $(pwd)

# This config needs 8 GPUs to run
# tune run --nproc_per_node 8 lora_finetune_distributed --config llama3_3/70B_lora

root_dir: /root/ml-cookbook/slurm-recipes/torchtune/ # <-- Update this to your ml-cookbook/slurm-recipes root dir


root_dir: /root/ml-cookbook/slurm-recipes/torchtune/ # <-- Update this to your ml-cookbook/slurm-recipes root dir

Use YAML template .tpl rendering in setup.sh with the current $(pwd)

# Download LLaMA-3 model to shared FS --> Syncs to all Soperator nodes
tune download meta-llama/Llama-3.3-70B-Instruct \
--ignore-patterns "original/consolidated*.pth" \
--output-dir ./models/Llama-3.3-70B-Instruct


Maybe add a section which inserts $(pwd) as the root_dir var in the YAMLs; for this you need to use .tpl files to generate the final YAML configs for torchtune.
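
A minimal sketch of what that rendering step in setup.sh could look like, assuming the torchtune configs are kept as *.yaml.tpl templates with a {{ROOT_DIR}} placeholder (the template location and placeholder syntax are assumptions):

# Render final torchtune YAML configs from .tpl templates,
# substituting the current checkout path for root_dir
ROOT_DIR=$(pwd)
for tpl in torchtune/configs/*.yaml.tpl; do
  sed "s|{{ROOT_DIR}}|${ROOT_DIR}|g" "$tpl" > "${tpl%.tpl}"
done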

