Changes from all commits (61 commits)
223af40
Update telemetry status to be Integer for parity (#130)
Aditi2424 Jul 18, 2025
cf77296
Release new version for Health Monitoring Agent (1.0.643.0_1.0.192.0)…
maheshxb Jul 18, 2025
0342f60
Release new version for Health Monitoring Agent (1.0.674.0_1.0.199.0)…
jiayelamazon Jul 18, 2025
631ddf9
update inference CLI describe command print for better visualization …
mollyheamazon Jul 21, 2025
dc440c3
Update inference integ test to add dependency to improve telemetry ex…
mollyheamazon Jul 22, 2025
cc08405
Manual release v3.0.1 (#143)
mollyheamazon Jul 22, 2025
079fafd
change security-monitoring metrics data destination to us-east-2 for …
mollyheamazon Jul 22, 2025
29a16c5
feat: Add region detection to install Health Monitoring Agent and use…
haardm Jul 22, 2025
66232ed
Add unique time string to integ test (#150)
zhaoqizqwang Jul 23, 2025
9fbec4a
update example notebook for inference CLI (#151)
mollyheamazon Jul 23, 2025
8034a24
Training: Main documentation update (#153)
rsareddy0329 Jul 23, 2025
0bcee6d
Update inferenece SDK examples (#155)
zhaoqizqwang Jul 23, 2025
d2130e9
update help text to avoid truncation (#158)
mollyheamazon Jul 24, 2025
e3fafe0
Enable telemetry for cli (#165)
rsareddy0329 Jul 29, 2025
293f9b9
Add an option to disable the deployment of KubeFlow TrainingOperator …
DaniilGlazkoTR Jul 29, 2025
9f534b4
Remove unused param from documentation (#170)
nargokul Jul 30, 2025
ec8800d
Update volume flag to support hostPath and pvc (#171)
mollyheamazon Jul 31, 2025
95e073e
Restructure list-cluster output (#173)
pintaoz-aws Jul 31, 2025
a8a2baf
Update inference config and integ tests (#167)
zhaoqizqwang Jul 31, 2025
2908a62
Update readme for volume flag (#176)
mollyheamazon Jul 31, 2025
9b7220c
Manual release v3.0.2 (#177)
pintaoz-aws Jul 31, 2025
36fac66
Add schema pattern check to pytorch-job template (#178)
mollyheamazon Aug 1, 2025
0de2138
Add version comptability check between server K8s and Client python K…
papriwal Aug 1, 2025
dcbc8fb
Fix training test (#184)
zhaoqizqwang Aug 5, 2025
28424e4
Update logging information for submitting and deleting training job (…
pintaoz-aws Aug 5, 2025
17cfdbd
Merge Documentation changes to main for Launch (#196)
rsareddy0329 Aug 6, 2025
6553766
Added new column 'deploymeny configs' to the itable that allows user'…
mohamedzeidan2021 Aug 6, 2025
63ff3b4
Add instance type support for ml.p6e-gb200.36xlarge (#204)
zhaoqizqwang Aug 8, 2025
e3f697a
changed endpoint name from value user has to manually insert to place…
mohamedzeidan2021 Aug 12, 2025
d16d1b3
Enable PR checks on feature branches (#207)
rsareddy0329 Aug 12, 2025
0fd2bef
Release tg (#209)
jam-jee Aug 14, 2025
9560a48
Update generate_click_command inject logic to not expose unwanted fla…
mollyheamazon Aug 15, 2025
96c5b2b
update CHANGELOG.md (#175)
jam-jee Aug 15, 2025
7fda684
Minor update on README, example notebooks and documentation (#216)
mollyheamazon Aug 18, 2025
f747815
Add metadata_name argument to js and custom endpoint to match with SD…
mollyheamazon Aug 19, 2025
a4f0465
Add cert mgr installation which is required by HPTO (#180)
emeraldbay Aug 19, 2025
9c07154
Implementing hyp version command (#223)
jam-jee Aug 19, 2025
21d7ca2
FIX README DOCUMENTATION ISSUES (#221)
papriwal Aug 19, 2025
73a41b3
Update description for scheduler type (#222)
zhaoqizqwang Aug 19, 2025
743bd4d
fix: Set cert mgr installation disable by default (#224)
emeraldbay Aug 20, 2025
99121e7
Release new version for Health Monitoring Agent (1.0.742.0_1.0.241.0)…
992X Aug 20, 2025
853dfa8
feat: add get_operator_logs to pytorch job (#218)
rsareddy0329 Aug 20, 2025
d2bd3c2
Change default container name in pytorch template (#220)
mollyheamazon Aug 20, 2025
cc9eec6
Enhanced Error Handling for all hyp commands
mohamedzeidan2021 Aug 21, 2025
f571859
update v1.1 pytorch job template to match parity with v1.0 change in …
mollyheamazon Aug 22, 2025
935a4d9
Update list_pods to only display pods of corresponding endpoint type …
pintaoz-aws Aug 22, 2025
84aabcf
Implementing Task Gov. feature for SDK flow (#230)
jam-jee Aug 25, 2025
da607d2
Update warning message string for k8s version compatibility check (#229)
papriwal Aug 25, 2025
6f452bf
Implemented parallel processing for list-cluster operation to improve…
jam-jee Aug 25, 2025
91504e9
Add enpoint_name argument for list_pods() (#232)
pintaoz-aws Aug 25, 2025
e3cfe1d
Adding thread sleep before deleting resources in integ test (#236)
jam-jee Aug 26, 2025
5cff2a7
Release Cluster Management (#233)
nargokul Aug 26, 2025
3ad70ec
Create README.md (#237)
nargokul Aug 26, 2025
12730ca
Fix list_pods and AZ_ID error message (#238)
zhaoqizqwang Aug 26, 2025
16b48dd
Update setup.py to enable cluster creation template (#243)
nargokul Aug 27, 2025
e1ac050
Update docs for Cluster Management (#240)
papriwal Aug 27, 2025
0bf0782
Update CHANGELOG.md for 3.2.1 (#245)
rsareddy0329 Aug 27, 2025
1590894
Bug fix for cluster creation integ test, fixed cfn cleanup, wait for …
aviruthen Aug 27, 2025
0d7c810
update jumpstart and pytorch template for release (#248)
mollyheamazon Aug 27, 2025
4e73b0e
Update CHANGELOG.md for training and inference templates (#247)
rsareddy0329 Aug 27, 2025
5a346e8
Update pyproject.toml for inference templates (#249)
rsareddy0329 Aug 28, 2025
3 changes: 1 addition & 2 deletions .github/workflows/codebuild-ci.yml
@@ -2,8 +2,7 @@ name: PR Checks
on:
pull_request_target:
branches:
- "master*"
- "main*"
- "*"

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.head_ref }}
2 changes: 1 addition & 1 deletion .github/workflows/security-monitoring.yml
@@ -73,7 +73,7 @@ jobs:
uses: aws-actions/configure-aws-credentials@12e3392609eaaceb7ae6191b3f54bbcb85b5002b
with:
role-to-assume: ${{ secrets.MONITORING_ROLE_ARN }}
aws-region: us-west-2
aws-region: us-east-2
- name: Put Dependabot Alert Metric Data
run: |
if [ "${{ needs.check-dependabot-alerts.outputs.dependabot_alert_status }}" == "1" ]; then
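The `run` step above branches on a job output rendered as a string. A self-contained sketch of the same guard (the helper function and the inline status values are illustrative stand-ins for the `needs.check-dependabot-alerts.outputs.dependabot_alert_status` expression that GitHub Actions substitutes at run time):

```shell
# Map a Dependabot alert status string to the metric value the
# workflow would publish: "1" means open alerts exist.
metric_for_status() {
  if [ "$1" = "1" ]; then
    echo 1    # open Dependabot alerts -> emit metric value 1
  else
    echo 0    # no open alerts -> emit 0
  fi
}

metric_for_status 1
```

Note that the workflow compares against the literal string `"1"`, not an integer, since all step and job outputs in GitHub Actions are strings.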
2 changes: 2 additions & 0 deletions .gitignore
@@ -16,11 +16,13 @@ __pycache__/
/.mypy_cache

/doc/_apidoc/
doc/_build/
/build

/sagemaker-hyperpod/build
/sagemaker-hyperpod/.coverage
/sagemaker-hyperpod/.coverage.*
/hyperpod-cluster-stack-template/build

# Ignore all contents of result and results directories
/result/
20 changes: 20 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,20 @@
version: 2

build:
os: ubuntu-22.04
tools:
python: "3.9"

python:
install:
- method: pip
path: .
- requirements: doc/requirements.txt

sphinx:
configuration: doc/conf.py
fail_on_warning: false

formats:
- pdf
- epub
47 changes: 41 additions & 6 deletions CHANGELOG.md
@@ -1,23 +1,58 @@
# Changelog

## v2.0.0 (2024-12-04)
## v3.2.1 (2025-08-27)

### Features

- feature: The HyperPod CLI now support ([Hyperpod recipes](https://github.com/aws/sagemaker-hyperpod-recipes.git)). The HyperPod recipes enable customers to get started training and fine-tuning popular publicly-available foundation models like Llama 3.1 405B in minutes. Learn more ([here](https://github.com/aws/sagemaker-hyperpod-recipes.git)).
* Cluster management
* Bug Fixes with cluster creation
* Enable cluster template to be installed with hyperpod CLI .

## v1.0.0 (2024-09-09)
## v3.2.0 (2025-08-25)

### Features

- feature: Add support for SageMaker HyperPod CLI
* Cluster management
* Creation of cluster stack
* Describing and listing a cluster stack
* Updating a cluster
* Init Experience
* Init, Validate, Create with local configurations


## v3.1.0 (2025-08-13)

### Features
* Task Governance feature for training jobs.


## v1.0.0] ([2025]-[07]-[10])
## v3.0.2 (2025-07-31)

### Features

* Update volume flag to support hostPath and PVC
* Add an option to disable the deployment of KubeFlow TrainingOperator
* Enable telemetry for CLI

## v3.0.0 (2025-07-10)

### Features

* Training Job - Create, List , Get
* Inference Jumpstart - Create , List, Get, Invoke
* Inference Custom - Create , List, Get, Invoke
* Observability changes
* Observability changes

## v2.0.0 (2024-12-04)

### Features

- feature: The HyperPod CLI now support ([Hyperpod recipes](https://github.com/aws/sagemaker-hyperpod-recipes.git)). The HyperPod recipes enable customers to get started training and fine-tuning popular publicly-available foundation models like Llama 3.1 405B in minutes. Learn more ([here](https://github.com/aws/sagemaker-hyperpod-recipes.git)).

## v1.0.0 (2024-09-09)

### Features

- feature: Add support for SageMaker HyperPod CLI


150 changes: 67 additions & 83 deletions README.md
@@ -54,24 +54,13 @@ SageMaker HyperPod CLI currently supports start training job with:

1. Make sure that your local python version is 3.8, 3.9, 3.10 or 3.11.

1. Install ```helm```.

The SageMaker Hyperpod CLI uses Helm to start training jobs. See also the [Helm installation guide](https://helm.sh/docs/intro/install/).

```
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
rm -f ./get_helm.sh
```

1. Clone and install the sagemaker-hyperpod-cli package.
2. Install the sagemaker-hyperpod-cli package.

```
pip install sagemaker-hyperpod
```

1. Verify if the installation succeeded by running the following command.
3. Verify if the installation succeeded by running the following command.

```
hyp --help
@@ -158,8 +147,8 @@ hyp create hyp-pytorch-job \
--version 1.0 \
--job-name test-pytorch-job \
--image pytorch/pytorch:latest \
--command '["python", "train.py"]' \
--args '["--epochs", "10", "--batch-size", "32"]' \
--command '[python, train.py]' \
--args '[--epochs=10, --batch-size=32]' \
--environment '{"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:32"}' \
--pull-policy "IfNotPresent" \
--instance-type ml.p4d.24xlarge \
@@ -170,9 +159,15 @@ hyp create hyp-pytorch-job \
--queue-name "training-queue" \
--priority "high" \
--max-retry 3 \
--volumes '["data-vol", "model-vol", "checkpoint-vol"]' \
--persistent-volume-claims '["shared-data-pvc", "model-registry-pvc"]' \
--output-s3-uri s3://my-bucket/model-artifacts
--accelerators 8 \
--vcpu 96.0 \
--memory 1152.0 \
--accelerators-limit 8 \
--vcpu-limit 96.0 \
--memory-limit 1152.0 \
--preferred-topology "topology.kubernetes.io/zone=us-west-2a" \
--volume name=model-data,type=hostPath,mount_path=/data,path=/data \
--volume name=training-output,type=pvc,mount_path=/data2,claim_name=my-pvc,read_only=false
```
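The two `--volume` arguments above pack several `key=value` fields into a single comma-separated string. A minimal sketch of how such a spec can be parsed into a mapping (illustrative only — `parse_volume_spec` is a hypothetical helper, not part of the CLI):

```python
def parse_volume_spec(spec: str) -> dict:
    """Split a spec like "name=model-data,type=hostPath,mount_path=/data"
    into a dict of its key/value fields."""
    fields = {}
    for pair in spec.split(","):
        # partition keeps everything after the first "=", so values
        # containing "=" are not truncated.
        key, _, value = pair.partition("=")
        fields[key.strip()] = value.strip()
    return fields

print(parse_volume_spec("name=model-data,type=hostPath,mount_path=/data,path=/data"))
```

Values such as `read_only=false` come out as strings under this sketch; a real parser would also coerce booleans and validate that the `type` field is one of `hostPath` or `pvc`.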

Key required parameters explained:
@@ -181,8 +176,6 @@ Key required parameters explained:

--image: Docker image containing your training environment

This command starts a training job named test-pytorch-job. The --output-s3-uri specifies where the trained model artifacts will be stored, for example, s3://my-bucket/model-artifacts. Note this location, as you’ll need it for deploying the custom model.

### Inference

#### Creating a JumpstartModel Endpoint
@@ -195,7 +188,6 @@ hyp create hyp-jumpstart-endpoint \
--model-id jumpstart-model-id\
--instance-type ml.g5.8xlarge \
--endpoint-name endpoint-jumpstart \
--tls-output-s3-uri s3://sample-bucket
```


@@ -211,7 +203,7 @@ hyp invoke hyp-jumpstart-endpoint \

```
hyp list hyp-jumpstart-endpoint
hyp get hyp-jumpstart-endpoint --name endpoint-jumpstart
hyp describe hyp-jumpstart-endpoint --name endpoint-jumpstart
```

#### Creating a Custom Inference Endpoint
@@ -222,7 +214,8 @@ hyp create hyp-custom-endpoint \
--endpoint-name my-custom-endpoint \
--model-name my-pytorch-model \
--model-source-type s3 \
--model-location my-pytorch-training/model.tar.gz \
--model-location my-pytorch-training \
--model-volume-mount-name test-volume \
--s3-bucket-name your-bucket \
--s3-region us-east-1 \
--instance-type ml.g5.8xlarge \
@@ -257,9 +250,10 @@ Along with the CLI, we also have SDKs available that can perform the training an

```

from sagemaker.hyperpod import HyperPodPytorchJob
from sagemaker.hyperpod.job
import ReplicaSpec, Template, Spec, Container, Resources, RunPolicy, Metadata
from sagemaker.hyperpod.training import HyperPodPytorchJob
from sagemaker.hyperpod.training
import ReplicaSpec, Template, Spec, Containers, Resources, RunPolicy
from sagemaker.hyperpod.common.config import Metadata

# Define job specifications
nproc_per_node = "1" # Number of processes per node
@@ -274,7 +268,7 @@ replica_specs =
(
containers =
[
Container
Containers
(
# Container name
name="container-name",
@@ -315,8 +309,6 @@ pytorch_job = HyperPodPytorchJob
replica_specs = replica_specs,
# Run policy
run_policy = run_policy,
# S3 location for artifacts
output_s3_uri="s3://my-bucket/model-artifacts"
)
# Launch the job
pytorch_job.create()
@@ -336,24 +328,18 @@ Pre-trained Jumpstart models can be gotten from https://sagemaker.readthedocs.io
from sagemaker.hyperpod.inference.config.hp_jumpstart_endpoint_config import Model, Server, SageMakerEndpoint, TlsConfig
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint

model = Model(
model_id="deepseek-llm-r1-distill-qwen-1-5b",
model_version="2.0.4"
model=Model(
model_id='deepseek-llm-r1-distill-qwen-1-5b'
)

server = Server(
instance_type="ml.g5.8xlarge"
server=Server(
instance_type='ml.g5.8xlarge',
)
endpoint_name=SageMakerEndpoint(name='<my-endpoint-name>')

endpoint_name = SageMakerEndpoint(name="endpoint-jumpstart")

tls_config = TlsConfig(tls_certificate_output_s3_uri="s3://sample-bucket")

js_endpoint = HPJumpStartEndpoint(
js_endpoint=HPJumpStartEndpoint(
model=model,
server=server,
sage_maker_endpoint=endpoint_name,
tls_config=tls_config
sage_maker_endpoint=endpoint_name
)

js_endpoint.create()
@@ -369,51 +355,51 @@ print(response)
```
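The JumpStart snippet above follows a build-config-objects-then-`create()` pattern: nested configuration objects are assembled first, and a single `create()` call submits them. A dependency-free sketch of that calling shape (all class names here are stand-ins, not the real `sagemaker.hyperpod` API):

```python
from dataclasses import dataclass

@dataclass
class Server:
    instance_type: str

@dataclass
class Endpoint:
    name: str
    server: Server
    created: bool = False

    def create(self):
        # In the real SDK this submits the endpoint spec to the cluster;
        # here we only flip a flag to illustrate the call sequence.
        self.created = True
        return self

ep = Endpoint(name="endpoint-jumpstart", server=Server("ml.g5.8xlarge")).create()
print(ep.created)  # True
```

The same shape recurs in the custom-endpoint example below: validation happens when the config objects are constructed, and the cluster interaction is deferred to `create()`.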


#### Creating a Custom Inference Endpoint
#### Creating a Custom Inference Endpoint (with S3)

```
from sagemaker.hyperpod.inference.config.hp_custom_endpoint_config import Model, Server, SageMakerEndpoint, TlsConfig, EnvironmentVariables
from sagemaker.hyperpod.inference.hp_custom_endpoint import HPCustomEndpoint
from sagemaker.hyperpod.inference.config.hp_endpoint_config import CloudWatchTrigger, Dimensions, AutoScalingSpec, Metrics, S3Storage, ModelSourceConfig, TlsConfig, EnvironmentVariables, ModelInvocationPort, ModelVolumeMount, Resources, Worker
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

model = Model(
model_source_type="s3",
model_location="test-pytorch-job/model.tar.gz",
s3_bucket_name="my-bucket",
s3_region="us-east-2",
prefetch_enabled=True
model_source_config = ModelSourceConfig(
model_source_type='s3',
model_location="<my-model-folder-in-s3>",
s3_storage=S3Storage(
bucket_name='<my-model-artifacts-bucket>',
region='us-east-2',
),
)

server = Server(
instance_type="ml.g5.8xlarge",
image_uri="763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04-v2.0",
container_port=8080,
model_volume_mount_name="model-weights"
)
environment_variables = [
EnvironmentVariables(name="HF_MODEL_ID", value="/opt/ml/model"),
EnvironmentVariables(name="SAGEMAKER_PROGRAM", value="inference.py"),
EnvironmentVariables(name="SAGEMAKER_SUBMIT_DIRECTORY", value="/opt/ml/model/code"),
EnvironmentVariables(name="MODEL_CACHE_ROOT", value="/opt/ml/model"),
EnvironmentVariables(name="SAGEMAKER_ENV", value="1"),
]

resources = {
"requests": {"cpu": "30000m", "nvidia.com/gpu": 1, "memory": "100Gi"},
"limits": {"nvidia.com/gpu": 1}
}

env = EnvironmentVariables(
HF_MODEL_ID="/opt/ml/model",
SAGEMAKER_PROGRAM="inference.py",
SAGEMAKER_SUBMIT_DIRECTORY="/opt/ml/model/code",
MODEL_CACHE_ROOT="/opt/ml/model",
SAGEMAKER_ENV="1"
worker = Worker(
image='763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04-v2.0',
model_volume_mount=ModelVolumeMount(
name='model-weights',
),
model_invocation_port=ModelInvocationPort(container_port=8080),
resources=Resources(
requests={"cpu": "30000m", "nvidia.com/gpu": 1, "memory": "100Gi"},
limits={"nvidia.com/gpu": 1}
),
environment_variables=environment_variables,
)

endpoint_name = SageMakerEndpoint(name="endpoint-custom-pytorch")
tls_config=TlsConfig(tls_certificate_output_s3_uri='s3://<my-tls-bucket-name>')

tls_config = TlsConfig(tls_certificate_output_s3_uri="s3://sample-bucket")

custom_endpoint = HPCustomEndpoint(
model=model,
server=server,
resources=resources,
environment=env,
sage_maker_endpoint=endpoint_name,
custom_endpoint = HPEndpoint(
endpoint_name='<my-endpoint-name>',
instance_type='ml.g5.8xlarge',
model_name='deepseek15b-test-model-name',
tls_config=tls_config,
model_source_config=model_source_config,
worker=worker,
)

custom_endpoint.create()
@@ -430,19 +416,17 @@ print(response)
#### Managing an Endpoint

```
endpoint_iterator = HPJumpStartEndpoint.list()
for endpoint in endpoint_iterator:
print(endpoint.name, endpoint.status)
endpoint_list = HPEndpoint.list()
print(endpoint_list[0])

logs = js_endpoint.get_logs()
print(logs)
print(custom_endpoint.get_operator_logs(since_hours=0.5))

```

#### Deleting an Endpoint

```
js_endpoint.delete()
custom_endpoint.delete()

```
