Slurm (Soperator) Distributed Training Tutorials #5
Conversation
Very good examples overall. I left some comments regarding quality-of-life improvements and streamlining the user experience.
Please feel free to reach out if you have any questions.
`.DS_Store` (outdated)
macOS artifact; delete it.
    @@ -0,0 +1,33 @@
    #!/bin/bash
Add `: "${HF_TOKEN:?provide your HF_TOKEN}"` to force providing `HF_TOKEN` before launching the script.
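As a minimal sketch of how this guard behaves (demonstrated in subshells so the example runs regardless of the caller's environment):

```shell
#!/bin/bash
# The ${VAR:?message} expansion aborts a non-interactive shell with the
# given message when VAR is unset or empty, so the setup script cannot
# proceed without an exported HF_TOKEN.

# With the token set, the guard is a no-op:
HF_TOKEN=dummy bash -c ': "${HF_TOKEN:?provide your HF_TOKEN}"; echo token ok'

# Without it, the guard aborts with a non-zero exit status:
env -u HF_TOKEN bash -c ': "${HF_TOKEN:?provide your HF_TOKEN}"' 2>/dev/null \
    || echo "guard fired"
```

In the real script the guard is just the single `: "${HF_TOKEN:?...}"` line near the top.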
    # pip install -e .

    # Flux specific setup
    python ./torchtitan/experiments/flux/scripts/download_autoencoder.py --repo_id black-forest-labs/FLUX.1-dev --ae_path ae.safetensors --hf_token <YOUR HF TOKEN>
Replace `<YOUR HF TOKEN>` with `$HF_TOKEN` from the environment variable.
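A sketch of the corrected invocation, reading the token from the environment rather than a hard-coded placeholder. The command is printed rather than executed here so the sketch runs without torchtitan installed; drop the `echo` in the real setup script:

```shell
#!/bin/bash
# Demo default only; the real script should require the token instead
# (e.g. via the ${HF_TOKEN:?...} guard suggested above in this review).
HF_TOKEN=${HF_TOKEN:-dummy-token-for-demo}

echo python ./torchtitan/experiments/flux/scripts/download_autoencoder.py \
    --repo_id black-forest-labs/FLUX.1-dev \
    --ae_path ae.safetensors \
    --hf_token "$HF_TOKEN"
```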
    pip install -r requirements-flux.txt

    # 5) Make required directories
    mkdir -p output slurm_out
Replace

    mkdir -p output slurm_out
    cd ../../../..

with

    cd $ROOT_DIR
    mkdir -p outputs slurm_out

since the directories should be created in `$ROOT_DIR`.
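A sketch of the corrected ordering, assuming `setup.sh` captures its starting directory in `ROOT_DIR` before any `cd` (the variable name follows the review comment; everything else is illustrative):

```shell
#!/bin/bash
# Capture the repo root at the top of the script, before any cd.
ROOT_DIR=$(pwd)

# ... setup steps that cd around the repo would go here ...

# Create output directories relative to the repo root, not wherever
# the script happens to be after the last cd:
cd "$ROOT_DIR"
mkdir -p outputs slurm_out
```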
    nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
    nodes_array=($nodes)
    head_node=${nodes_array[0]}
    head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
Replace

    nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
    nodes_array=($nodes)
    head_node=${nodes_array[0]}
    head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

with

    head_node_ip=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)

In general, avoid using `srun` to get the node IP, since it launches a Slurm job step; use a single `scontrol` command instead.
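The one-command form can be sketched as below. `scontrol show hostnames` expands the compressed node list locally on the submitting node, so unlike `srun ... hostname` it launches no extra job step:

```shell
#!/bin/bash
# Expand e.g. node-[001-002] into one hostname per line and take the
# first entry as the head node. Note this yields a hostname rather than
# a numeric IP; for torchrun rendezvous a resolvable hostname is
# usually sufficient (an assumption worth verifying on your cluster).
head_node_ip=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
echo "head node: $head_node_ip"
```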
For running this workload, you will need to SSH to the login node of the Soperator cluster and clone this repository to the shared filesystem (by default, Soperator has `/` mounted as a shared filesystem).

### 🔧 Setup the environment
Add an instruction about providing `HF_TOKEN` before launching `setup_flux.sh`.
    @@ -0,0 +1,24 @@
    #!/bin/bash
Add a check for the `HF_TOKEN` env var:

    set -e
    : "${HF_TOKEN:?provide your HF_TOKEN}"
    #
    # This config is only tested on 2 nodes w/ 8 H100 machines.

    root_dir: /root/ml-cookbook/slurm-recipes/torchtune/ # <-- Update this to your ml-cookbook/slurm-recipes root dir
    root_dir: /root/ml-cookbook/slurm-recipes/torchtune/ # <-- Update this to your ml-cookbook/slurm-recipes root dir

Use YAML template (`.tpl`) rendering in `setup.sh` with the current `$(pwd)`.
    # This config needs 8 GPUs to run
    # tune run --nproc_per_node 8 lora_finetune_distributed --config llama3_3/70B_lora

    root_dir: /root/ml-cookbook/slurm-recipes/torchtune/ # <-- Update this to your ml-cookbook/slurm-recipes root dir
    root_dir: /root/ml-cookbook/slurm-recipes/torchtune/ # <-- Update this to your ml-cookbook/slurm-recipes root dir

Use YAML template (`.tpl`) rendering in `setup.sh` with the current `$(pwd)`.
    # Download LLaMA-3 model to shared FS --> Syncs to all Soperator nodes
    tune download meta-llama/Llama-3.3-70B-Instruct \
        --ignore-patterns "original/consolidated*.pth" \
        --output-dir ./models/Llama-3.3-70B-Instruct
Maybe add a section that inserts `$(pwd)` as the `root_dir` variable in the YAMLs; for this you need to use `.tpl` files to generate the final YAML configs for torchtune.
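The `.tpl` rendering step could be sketched in `setup.sh` roughly as follows. The template directory, file names, and the literal `${ROOT_DIR}` placeholder are all illustrative assumptions, not the repository's actual layout:

```shell
#!/bin/bash
# Render torchtune YAML templates, substituting the current repo root
# for a literal ${ROOT_DIR} placeholder inside each *.yaml.tpl file.
ROOT_DIR=$(pwd)

render() {
    # `|` as the sed delimiter keeps slashes in paths from breaking
    # the substitution; output drops the .tpl suffix.
    sed "s|\${ROOT_DIR}|$ROOT_DIR|g" "$1" > "${1%.tpl}"
}

for tpl in ./configs/*.yaml.tpl; do
    [ -e "$tpl" ] || continue   # skip if the glob matched nothing
    render "$tpl"
done
```

This keeps the committed templates machine-independent while the rendered YAMLs always point at the cloned location.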
Beginner-friendly Slurm recipes with torchrun, tested on 2 H100s.

Recipes:
Each example contains its own README.md and a setup script to set up distributed training.