Commit 32b40c4

Merge pull request #52 from aws/adtian2-patch-1: Update README.md

2 parents a9ba8bf + 8db0d92

1 file changed: README.md (71 additions, 18 deletions)
Additionally, you will need to install Kubectl and Helm on your local machine.
Refer to the following documentation for installation of [Kubectl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)
and [Helm](https://helm.sh/docs/intro/install/).

Using the recipes involves updating `k8s.yaml` and `config.yaml`, and then running the launch script.

- In `k8s.yaml`, update `persistent_volume_claims`. This mounts the Amazon FSx claim to the `/data` directory of each compute pod:

  ```
  persistent_volume_claims:
    - claimName: fsx-claim
      mountPath: data
  ```

- (Optional) In `config.yaml`, update `repo_url_or_path` under `git`:

  ```
  git:
    repo_url_or_path: <training_adapter_repo>
    branch: null
    commit: null
    entry_script: null
    token: null
  ```

- Update `launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh`:

  - `your_container`: a Deep Learning Container. To find the most recent release of the SMP container, see the release notes for the SageMaker model parallelism library.

  - (Optional) If you need pre-trained weights from HuggingFace, you can provide a HuggingFace token by setting the following key-value pair:

    ```
    recipes.model.hf_access_token=<your_hf_token>
    ```
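
Before launching, it can help to confirm that every angle-bracket placeholder in the edited launcher script has been replaced with a real value. A minimal sketch of such a check (the `unfilled_placeholders` helper is hypothetical, not part of the repo):

```python
import re

def unfilled_placeholders(text: str) -> list:
    """Return angle-bracket placeholders (e.g. <region>) still present in a script."""
    # Matches <lowercase_identifier>-style tokens, the placeholder style used above.
    return re.findall(r"<[a-z][a-z0-9_-]*>", text)

# A script line that still carries a placeholder, and one that is filled in:
print(unfilled_placeholders('EXP_DIR="<your_exp_dir>"'))  # -> ['<your_exp_dir>']
print(unfilled_placeholders('VAL_DIR="/fsx/val"'))        # -> []
```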

```
#!/bin/bash
# Users should set up their cluster type in /recipes_collection/config.yaml.
REGION="<region>"
IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
EXP_DIR="<your_exp_dir>"             # Location to save experiment info, including logging and checkpoints
TRAIN_DIR="<your_training_data_dir>" # Location of the training dataset
VAL_DIR="<your_val_data_dir>"        # Location of the validation dataset

HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    recipes=training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    recipes.run.name="hf-llama3" \
    recipes.exp_manager.exp_dir="$EXP_DIR" \
    cluster=k8s \
    cluster_type=k8s \
    container="${IMAGE}" \
    recipes.model.data.train_dir="$TRAIN_DIR" \
    recipes.model.data.val_dir="$VAL_DIR"
```
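
Each `key=value` argument passed to `main.py` is a Hydra-style override of a nested field in the merged recipe config (so `recipes.model.data.train_dir=$TRAIN_DIR` sets `train_dir` under `recipes.model.data`). Conceptually, the merge works like this simplified sketch (not the actual Hydra implementation):

```python
def apply_overrides(config: dict, overrides: dict) -> dict:
    """Apply dotted-key overrides (e.g. 'recipes.run.name') to a nested dict."""
    for dotted, value in overrides.items():
        node = config
        *parents, leaf = dotted.split(".")
        for key in parents:
            # Walk down, creating intermediate dicts as needed.
            node = node.setdefault(key, {})
        node[leaf] = value
    return config

config = {"recipes": {"run": {"name": "default"}}}
apply_overrides(config, {
    "recipes.run.name": "hf-llama3",
    "recipes.model.data.train_dir": "/data/train",
})
print(config["recipes"]["run"]["name"])                 # -> hf-llama3
print(config["recipes"]["model"]["data"]["train_dir"])  # -> /data/train
```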

- Launch the training job:

  ```
  bash launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh
  ```

After you've submitted the training job, you can use the following command to verify that it was submitted successfully:

```
kubectl get pods
```

```
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```
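
If you want to check the pod state from a script rather than by eye, the tabular output above can be parsed; a small sketch (for real automation, `kubectl get pods -o json` is more robust than parsing the table):

```python
def pod_statuses(kubectl_output: str) -> dict:
    """Map pod NAME -> STATUS from plain `kubectl get pods` output."""
    lines = kubectl_output.strip().splitlines()
    header = lines[0].split()
    # Locate the NAME and STATUS columns by header position.
    name_i, status_i = header.index("NAME"), header.index("STATUS")
    return {row.split()[name_i]: row.split()[status_i] for row in lines[1:]}

sample = (
    "NAME                              READY   STATUS    RESTARTS   AGE\n"
    "hf-llama3-alice-worker-0          0/1     Running   0          36s\n"
)
print(pod_statuses(sample))  # -> {'hf-llama3-alice-worker-0': 'Running'}
```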

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details:

```
kubectl describe pod <name-of-pod>
```

After the job `STATUS` changes to `Running`, you can examine the logs with the following command:

```
kubectl logs <name-of-pod>
```

When the job finishes, `kubectl get pods` shows the `STATUS` as `Completed`.

For more information about the k8s cluster configuration, see [Running a training job on HyperPod k8s](https://docs.aws.amazon.com/sagemaker/latest/dg/cluster-specific-configurations-run-training-job-hyperpod-k8s.html).

To run an Amazon Nova recipe on SageMaker HyperPod clusters orchestrated by Amazon EKS, you need to create a Restricted Instance Group in your cluster. Refer to the following documentation to [learn more](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-hp-cluster.html).

### Running a recipe on SageMaker training jobs
