Commit 32b40c4

Merge pull request #52 from aws/adtian2-patch-1: Update README.md

2 parents a9ba8bf + 8db0d92

1 file changed: README.md (71 additions, 18 deletions)
Additionally, you will need to install Kubectl and Helm on your local machine.
Refer to the following documentation for installation of [Kubectl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)
and [Helm](https://helm.sh/docs/intro/install/).

Using the recipes involves updating `k8s.yaml` and `config.yaml`, and then running the launch script.

- In `k8s.yaml`, update `persistent_volume_claims`. This mounts the Amazon FSx claim to the `/data` directory of each compute pod:

  ```
  persistent_volume_claims:
    - claimName: fsx-claim
      mountPath: data
  ```

- (Optional) In `config.yaml`, update `repo_url_or_path` under `git`:

  ```
  git:
    repo_url_or_path: <training_adapter_repo>
    branch: null
    commit: null
    entry_script: null
    token: null
  ```

- Update `launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh`:

  - `your_container`: a Deep Learning Container. To find the most recent release of the SMP container, see the release notes for the SageMaker model parallelism library.

  - (Optional) If you need pre-trained weights from HuggingFace, you can provide a HuggingFace token by setting the following key-value pair:

    ```
    recipes.model.hf_access_token=<your_hf_token>
    ```
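
Before launching, it can help to confirm that every angle-bracket placeholder in the edited launcher script has been replaced with a real value. A minimal sketch of such a check (the `unfilled_placeholders` helper is hypothetical, not part of the repo):

```python
import re

def unfilled_placeholders(text: str) -> list:
    """Return angle-bracket placeholders (e.g. <region>) still present in a script."""
    # Matches <lowercase_identifier>-style tokens, the placeholder style used above.
    return re.findall(r"<[a-z][a-z0-9_-]*>", text)

# A script line that still carries a placeholder, and one that is filled in:
print(unfilled_placeholders('EXP_DIR="<your_exp_dir>"'))  # -> ['<your_exp_dir>']
print(unfilled_placeholders('VAL_DIR="/fsx/val"'))        # -> []
```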

```
#!/bin/bash
# Users should set up their cluster type in /recipes_collection/config.yaml.
REGION="<region>"
IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
EXP_DIR="<your_exp_dir>"             # Location to save experiment info, including logging and checkpoints
TRAIN_DIR="<your_training_data_dir>" # Location of the training dataset
VAL_DIR="<your_val_data_dir>"        # Location of the validation dataset

HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    recipes=training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    recipes.run.name="hf-llama3" \
    recipes.exp_manager.exp_dir="$EXP_DIR" \
    cluster=k8s \
    cluster_type=k8s \
    container="${IMAGE}" \
    recipes.model.data.train_dir="$TRAIN_DIR" \
    recipes.model.data.val_dir="$VAL_DIR"
```
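
Each `key=value` argument passed to `main.py` is a Hydra-style override of a nested field in the merged recipe config (so `recipes.model.data.train_dir=$TRAIN_DIR` sets `train_dir` under `recipes.model.data`). Conceptually, the merge works like this simplified sketch (not the actual Hydra implementation):

```python
def apply_overrides(config: dict, overrides: dict) -> dict:
    """Apply dotted-key overrides (e.g. 'recipes.run.name') to a nested dict."""
    for dotted, value in overrides.items():
        node = config
        *parents, leaf = dotted.split(".")
        for key in parents:
            # Walk down, creating intermediate dicts as needed.
            node = node.setdefault(key, {})
        node[leaf] = value
    return config

config = {"recipes": {"run": {"name": "default"}}}
apply_overrides(config, {
    "recipes.run.name": "hf-llama3",
    "recipes.model.data.train_dir": "/data/train",
})
print(config["recipes"]["run"]["name"])                 # -> hf-llama3
print(config["recipes"]["model"]["data"]["train_dir"])  # -> /data/train
```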

- Launch the training job:

  ```
  bash launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh
  ```

After you've submitted the training job, you can use the following command to verify that it was submitted successfully:

```
kubectl get pods
```

```
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```
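
If you want to check the pod state from a script rather than by eye, the tabular output above can be parsed; a small sketch (for real automation, `kubectl get pods -o json` is more robust than parsing the table):

```python
def pod_statuses(kubectl_output: str) -> dict:
    """Map pod NAME -> STATUS from plain `kubectl get pods` output."""
    lines = kubectl_output.strip().splitlines()
    header = lines[0].split()
    # Locate the NAME and STATUS columns by header position.
    name_i, status_i = header.index("NAME"), header.index("STATUS")
    return {row.split()[name_i]: row.split()[status_i] for row in lines[1:]}

sample = (
    "NAME                              READY   STATUS    RESTARTS   AGE\n"
    "hf-llama3-alice-worker-0          0/1     Running   0          36s\n"
)
print(pod_statuses(sample))  # -> {'hf-llama3-alice-worker-0': 'Running'}
```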

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details:

```
kubectl describe pod <name-of-pod>
```

After the job `STATUS` changes to `Running`, you can examine the logs with the following command:

```
kubectl logs <name-of-pod>
```

When the job finishes, `kubectl get pods` shows the `STATUS` as `Completed`.

For more information about the k8s cluster configuration, see [Running a training job on HyperPod k8s](https://docs.aws.amazon.com/sagemaker/latest/dg/cluster-specific-configurations-run-training-job-hyperpod-k8s.html).

To run an Amazon Nova recipe on SageMaker HyperPod clusters orchestrated by Amazon EKS, you need to create a Restricted Instance Group in your cluster. Refer to the following documentation to [learn more](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-hp-cluster.html).

### Running a recipe on SageMaker training jobs
