Commit 95096e8

nargokul, mbnavali, shantanutrip, zhaoqizqwang, and rvasahu-amazon authored
Release v3 (aws#109)
* Introduce helm charts for hyperpod inference operator
* Introduce helm charts for hyperpod inference operator
* Introduce helm charts for hyperpod inference operator
* Update Helm charts for inference operator, clean up to remove bedrock references.
* Changes to 1. update image tag 2. Remove IAM policies for execution role 3. Rename to hyperpod-inference-operator prefix instead of deploymentoperator prefix
* Removed binary from the code base.
* Nit: Update the app name labels for sample yaml files.
* Merge pull request aws#29 from mbnavali/main Introduce helm charts for hyperpod inference operator
* Add crds, service account and region (aws#32)
* Add CRDs and setup for region
* Change annotation for SA
* Remove default region
* Add hyperpod inference classes **Description** Support jumpstart and custom model endpoints **Testing Done** Tested manually, will add unit tests in next few PRs
* Refactor create inference function **Description** Refactor ModelEndpoint classes to let create happen in separate method instead of constructor **Testing Done** Manually tested in demo notebook
* Add List, Delete, Describe endpoint features Tested manually in demo jupyter notebook
* Add unit test and update class names **Description** **Testing Done** Unit test passes
* Add end and setup.py
* Update gitignore
* Add setup.cfg
* Fix HPEndpoint class and add optional values
* remove utils.py
* Make function classmethod and update unit tests
* Fix bugs for inference endpoint
* Small fixes
* build: add mountpoint s3 csi driver, keda + cert-manager controllers as dependencies feat: add pv and pvc creation as part of helm
* chore: add inference operator as dependency for HP Helm Chart, default disabled
* feat: add support for jumpstart gated models
* fix: remove stray symbol
* fix: rename inference operator chart to match name in parent
* change: sync charts with latest version of operator
* doc: update readme.md identifying the inference operator as a subchart
* Add HyperpodPytorchJob class (aws#39)
* Add HyperpodPytorchJob
* update to class methods
* update to class methods
* Address feedback
* Fix bug
---------
Co-authored-by: pintaoz <[email protected]>
* Add tlsConfig to quick create
* Revert "Add tlsConfig to quick create" This reverts commit 574351e.
* Add tls config
* Update CRD configs and minor updates
* Add model_location to HPEndpoint
* Adding observability command to fetch details of grafana, prometheus and list of enabled metrics.
* Training CLI implementation: create
* Adding observability SDK experience and updating CLI command signature
* Rename CLI commands to be consistent with SDK
* Training CLI for Launch
* Training CLI for Launch
* Training CLI for Launch
* Training CLI for Launch
* Update JumpStartModel interface (aws#51)
* Update JumpStartModel interface Tested in Jupyter notebook that endpoint can be successfully invoked
* Add refresh method
* remove debugging print
* Update HPEndpoint classes Tested using example notebooks
* Add example notebooks These notebooks haven't been cleaned up and they are for internal review only. Commands are supposed to change later
* Add metadata class
* Get Cluster Context
* Update to HyperPodManager call
* Cleanup import
* Training CLI for Launch
* Training CLI for Launch
* Training CLI for Launch
* Update HyperPodPytorchJob (aws#52)
* Add HyperpodPytorchJob
* update to class methods
* update to class methods
* Address feedback
* Fix bug
* Update HyperPodPytorchJob
* Fix dependency
* Add status
* Add list_pods and get_logs_from_pod
* Add error handling and metadata
* Add example notebook
* Fix bug
---------
Co-authored-by: pintaoz <[email protected]>
* E2E testing done for inference CLI
* delete build
* Revert accidental submodule pointer change
* Update inference example notebook and fix bugs
* Reformat code with black
* Add get_logs function for inference
* Update HyperPodPytorchJob to not use _HyperPodPytorchJob object (aws#63)
* Add HyperpodPytorchJob
* update to class methods
* update to class methods
* Address feedback
* Fix bug
* Update HyperPodPytorchJob
* Fix dependency
* Add status
* Add list_pods and get_logs_from_pod
* Add error handling and metadata
* Add example notebook
* Fix bug
* Hide _HyperPodPytorchJob from user
* Fix merge conflicts
---------
Co-authored-by: pintaoz <[email protected]>
* Update get_logs function to accept since_hour Tested in notebook
* Separate get_logs and get_operator_logs methods
* Update get_logs to class method
* Add container name to get_logs function
* Add container in get_logs_from_pod (aws#66) Co-authored-by: pintaoz <[email protected]>
* change inference CLI directory, add inference CLI notebook, add get-logs and get-operator-logs
* delete build
* Training CLI for Launch - Changes per SDK HyperPodPytorchJob constructor (aws#64)
* Training CLI for Launch
* Training CLI for Launch
---------
Co-authored-by: Roja Reddy Sareddy <[email protected]>
* | * d2453d6 (rig-dev) Add notes about HMA patching
* add cloudwatchtrigger and autoscalingspec to model.py and schema.json
* Add exception handling and update example notebooks (aws#71)
* Add exception handling and update example notebooks
* Update HPEndpoint get status
* Add unit tests for training sdk
* Update util tests
* Add training cli example notebook (aws#72) Co-authored-by: Roja Reddy Sareddy <[email protected]>
* Address comments
* fix tls flag issue, fsx endpoint successfully created with cli notebook
* clear notebook outputs
* minor update in notebook
* minor change to notebook
* Move Metadata model to common (aws#75) Co-authored-by: pintaoz <[email protected]>
* Restructure HPCLI
* Fix training cli unit tests
* Fix list jobs test
* Fixed logger Logger sometimes does not function properly. Tested in example notebook
* Updates from Testing
* Update import path
* Revert lines from readme (should not have been updated)
* unit test for inference CLI done
* resolve merge conflicts
* rebase with master
* clean up
* clean up recipes
* Merging hyp and hyperpod commands in a common entry point as hyp
* Removing not relevant directories and updating setup and pyproject (aws#87)
* Add unit test and fix HyperPod Manager (aws#84)
* Add unit test and fix HyperPod Manager 1. Default namespace can be set by HyperpodManager.set_context() 2. Added unit tests for inference
* Remove debug print
* update print for inference CLI for list and describe, bug fix for since-hours flag to support float, minor update to notebook (aws#85)
* Append uuid to endpoint name (aws#90)
* Add unit test and fix HyperPod Manager 1. Default namespace can be set by HyperpodManager.set_context() 2. Added unit tests for inference
* Remove debug print
* Append uuid to model name and endpoint name
* minor fix in create method
* Fix set_context in HyperPodManager (aws#91)
* Add unit test and fix HyperPod Manager 1. Default namespace can be set by HyperpodManager.set_context() 2. Added unit tests for inference
* Remove debug print
* Append uuid to model name and endpoint name
* minor fix in create method
* Fix set_context in HyperPodManager
* Add logging info for delete()
* Remove Self from type hint (aws#92)
* Add unit test and fix HyperPod Manager 1. Default namespace can be set by HyperpodManager.set_context() 2. Added unit tests for inference
* Remove debug print
* Append uuid to model name and endpoint name
* minor fix in create method
* Fix set_context in HyperPodManager
* Add logging info for delete()
* Remove Self in type hint This only supports python version 3.11+
* Minor documentation fixes for RIG Helm (aws#93)
* Bug fix: Fixed create command job error (aws#94) Co-authored-by: Roja Reddy Sareddy <[email protected]>
* [HyperPod Inference] Update RBAC with perms for KEDA, allow direct provision of operator image repository (aws#44)
* change: add rbac perms for KEDA scaledobject
* change: allow image.repository to be set directly via flag
* change: consistently use namePrefix for app name and resources
* fix: remove empty string as default value
* fix: reference correct value for tls cert bucket URI fix: override empty image.repository values from domain map change: use shorter prefix for namespace change: do not require sageMakerEndpoint
* Adding dynamic flag for dependencies installation (aws#95)
* Add utils unit tests for training cli (aws#97)
* Bug fix: Fixed create command job error
* Add utils unit tests for training cli
---------
Co-authored-by: Roja Reddy Sareddy <[email protected]>
* Add instance type validation for JS model (aws#98)
* Adding observability notebook (aws#96)
* Inference dogfood notebook update (aws#99)
* update print for inference CLI for list and describe, bug fix for since-hours flag to support float, minor update to notebook
* change hyperpod to hyp in inference cli notebook
* update inference CLI notebook to reflect uuid change
* Unique job name: Append uuid to training job name (aws#101)
* Bug fix: Fixed create command job error
* Add utils unit tests for training cli
* Unique job name: Append uuid to training job name
* Unique job name: Append uuid to training job name
---------
Co-authored-by: Roja Reddy Sareddy <[email protected]>
* Inference CLI update after dogfood (aws#102)
* update print for inference CLI for list and describe, bug fix for since-hours flag to support float, minor update to notebook
* change hyperpod to hyp in inference cli notebook
* update inference CLI notebook to reflect uuid change
* update list and describe after dogfood callout, remove get_logs for inference CLI, update help text for CLI
* Lookup standard Helm release name for RIG Helm installation (1ff9c) (aws#104)
* Minor negative case update for Helm release name lookup during RIG Helm installation (aws#105)
* Add JumpStart PublicHub model visualization utilities. (aws#106)
* Add JumpStart PublicHub model visualization utilities.
* Add JumpStart PublicHub model visualization utilities.
* Update cli command noun to hyp-*, logging, list_jobs bug fix (aws#107)
* Bug fix: Fixed create command job error
* Add utils unit tests for training cli
* Unique job name: Append uuid to training job name
* Unique job name: Append uuid to training job name
* Update command verb name to hyp, logging, list_jobs bug fix
* Update command verb name to hyp, logging, list_jobs bug fix
---------
Co-authored-by: Roja Reddy Sareddy <[email protected]>
* Make metadata name same as endpoint name; Updated instance type validation (aws#110) Unit test passes and verified in jupyter notebook
* Add integ test for training CLI and SDK (aws#100)
* Add integ test for training cli
* Add integ test for training sdk
* relax pydantic version
* fix pydantic version
* return latest cluster and fix set cluster context test
---------
Co-authored-by: adishaa <[email protected]>
* baseline inference integration test for CLI and SDK, minor bug fixes (aws#111)
* baseline inference integration test for CLI and SDK, minor bug fix for inference cli, clear inference sdk notebook output
* clean up merge header
* Remove UUID from training and Inference (aws#108)
* Remove UUID from training and Inference
* Fixes and PR comments
* Fix
* Fix logging
* Fix
* Update inference logging setup similar to training (aws#113)
* Bug fix: Fixed create command job error
* Add utils unit tests for training cli
* Unique job name: Append uuid to training job name
* Unique job name: Append uuid to training job name
* Update command verb name to hyp, logging, list_jobs bug fix
* Update command verb name to hyp, logging, list_jobs bug fix
* Update inference logging setup similar to training
---------
Co-authored-by: Roja Reddy Sareddy <[email protected]>
* Change hp-pytorch-job to hyp-pytorch-job (aws#115) Co-authored-by: adishaa <[email protected]>
* Add methods for list pods and namespaces (aws#114) Added unit test and tested in notebook
* Minor change in training cli notebook: UUID removed (aws#117)
* Bug fix: Fixed create command job error
* Add utils unit tests for training cli
* Unique job name: Append uuid to training job name
* Unique job name: Append uuid to training job name
* Update command verb name to hyp, logging, list_jobs bug fix
* Update command verb name to hyp, logging, list_jobs bug fix
* Update inference logging setup similar to training
* Minor change in training cli notebook: UUID removed
---------
Co-authored-by: Roja Reddy Sareddy <[email protected]>
* Cleaner error messaging for Endpoint invoke (aws#112)
* Invoke Validation check
* Fix
* Bumping kubernetes python client version and updating observability command (aws#116)
* change: add prefix to convert bucket name to s3 URI (aws#109)
* Added type check on commands before invoking subprocess run (aws#118)
* Bring HyperPodManager class util functions (aws#119)
* Bring HyperPodManager class util functions Unit tests pass and verified in notebook
* Update init
* Add list_pods and get_logs for CLI (Update notebook, integ test, unit test) (aws#120)
* baseline inference integration test for CLI and SDK, minor bug fix for inference cli, clear inference sdk notebook output
* update integ test in progress
* update cli code, notebook, integ and unit test to add list_pods and get_logs
* clean up merge header
* Update inference and training to only check kubeconfig on the first time (aws#122) Updated unit tests and verified in notebook
* Update Readme to include Inference and Training (aws#121)
* Update Readme to include Inference and Training
* Update readme command
* Documentation updates
* Doc Updates
* Move observability utils and constants; Rename set_context/get_context (aws#125)
* Update inference and training to only check kubeconfig on the first time Updated unit tests and verified in notebook
* Remove old unit tests
* Revert "Remove old unit tests" This reverts commit e728e9864c853635f724e9a377fbe870f0f2e2a4.
* Move observability utils and constants; Rename set_context/get_context
* Updating template packages name and structure (aws#126)
* Changelog updates (aws#128)
* Changelog updates
* Rebase and update
* Fix
* Readme update (aws#129)
* Update Readme to include Inference and Training
* Update readme command
* Documentation updates
* Doc Updates
* Readme updates
* Fix README.md
* Remove Orchestrator from List Cluster
* Changes to README.md
* Fix the link
* Remove orchestrator from README.md
* Unit test fix (aws#127)
* use unique basename for test file modules
* fix unit tests
* remove append_uuid test
* fix failing test_invoke tests
---------
Co-authored-by: adishaa <[email protected]>
* Fix get_cluster_context runtime error (aws#130)
* Remove Py38 Tests (aws#131)
* Fix get_cluster_context runtime error
* Remove Py38 from tests
* Unit test fixes (aws#132)
* Fix get_cluster_context runtime error
* Remove Py38 from tests
* Fix
* Unit test fixes
* Inference integ tests all passed in Chait's account (aws#135)
* baseline inference integration test for CLI and SDK, minor bug fix for inference cli, clear inference sdk notebook output
* update integ test in progress
* update cli code, notebook, integ and unit test to add list_pods and get_logs
* clean up merge header
* inference integ tests all passing in chait's account
* Update operator namespace string (aws#137)
* Inference integ test passed on beta account (aws#140)
* baseline inference integration test for CLI and SDK, minor bug fix for inference cli, clear inference sdk notebook output
* update integ test in progress
* update cli code, notebook, integ and unit test to add list_pods and get_logs
* clean up merge header
* inference integ tests all passing in chait's account
* integ test passing on beta account
* is_kubeconfig_loaded Fix (aws#139)
* Test PR
* Fix is_kubeconfig_loaded Class attribute bug
* Include main branch in pull request target
---------
Co-authored-by: Mahadeva N <[email protected]>
Co-authored-by: Shantanu Tripathi <[email protected]>
Co-authored-by: Zhaoqi <[email protected]>
Co-authored-by: jzhaoqwa <[email protected]>
Co-authored-by: Rahul Sahu <[email protected]>
Co-authored-by: rvasahu-amazon <[email protected]>
Co-authored-by: pintaoz-aws <[email protected]>
Co-authored-by: pintaoz <[email protected]>
Co-authored-by: Molly He <[email protected]>
Co-authored-by: Amarjeet LNU <[email protected]>
Co-authored-by: Roja Reddy Sareddy <[email protected]>
Co-authored-by: rsareddy0329 <[email protected]>
Co-authored-by: Chris Chan <[email protected]>
Co-authored-by: adishaa <[email protected]>
Co-authored-by: Aditi Sharma <[email protected]>
Co-authored-by: chnnmz <[email protected]>
1 parent 1ffc962 commit 95096e8

File tree

201 files changed: +21510 -4494 lines changed

.github/workflows/codebuild-ci.yml

Lines changed: 3 additions & 2 deletions
@@ -2,6 +2,7 @@ name: PR Checks
 on:
   pull_request_target:
     branches:
+      - "master*"
       - "main*"
 
 concurrency:
@@ -47,7 +48,7 @@ jobs:
     needs: [wait-for-approval]
     strategy:
       matrix:
-        python-version: ["38", "39", "310", "311"]
+        python-version: ["39", "310", "311"]
     steps:
       - name: Configure AWS Credentials
         uses: aws-actions/configure-aws-credentials@v3
@@ -66,7 +67,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: ["38", "39", "310", "311"]
+        python-version: ["39", "310", "311"]
     steps:
       - name: Configure AWS Credentials
         uses: aws-actions/configure-aws-credentials@v3
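
The workflow now triggers on pull requests targeting branches that match either `"master*"` or `"main*"`. As a rough illustration of which branch names those two patterns cover, the sketch below uses Python's `fnmatch` as a stand-in. This is an approximation only: GitHub's filter syntax is not identical to `fnmatch` (for example, a single `*` in a GitHub branch filter does not match `/`), and the helper name `targets_branch` is hypothetical.

```python
from fnmatch import fnmatchcase

# Branch patterns taken from the workflow diff above.
BRANCH_PATTERNS = ["master*", "main*"]


def targets_branch(branch, patterns=BRANCH_PATTERNS):
    """Return True if `branch` matches any pattern (fnmatch approximation)."""
    return any(fnmatchcase(branch, pattern) for pattern in patterns)


for name in ["main", "master", "main-v2", "develop"]:
    print(name, targets_branch(name))
```

A branch such as `develop` falls outside both patterns, so a pull request targeting it would not start these checks.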

.gitignore

Lines changed: 7 additions & 1 deletion
@@ -18,6 +18,12 @@ __pycache__/
 /doc/_apidoc/
 /build
 
+/sagemaker-hyperpod/build
+/sagemaker-hyperpod/.coverage
+/sagemaker-hyperpod/.coverage.*
+
 # Ignore all contents of result and results directories
 /result/
-/results/
+/results/
+
+.idea/

.gitmodules

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-[submodule "src/hyperpod_cli/sagemaker_hyperpod_recipes"]
-path = src/hyperpod_cli/sagemaker_hyperpod_recipes
+[submodule "src/sagemaker/hyperpod/cli/sagemaker_hyperpod_recipes"]
+path = src/sagemaker/hyperpod/cli/sagemaker_hyperpod_recipes
 url = https://github.com/aws/sagemaker-hyperpod-recipes.git
 branch = release-1.3.3
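
The submodule keeps the same URL and branch; only its name and checkout path move under `src/sagemaker/hyperpod/cli/`. Because `.gitmodules` uses an INI-like layout, one way to sanity-check the new path is to read it with Python's standard `configparser`. This is a minimal sketch that parses the post-change content from an inline string; in a real checkout one would read the `.gitmodules` file instead.

```python
import configparser

# Post-change .gitmodules content, copied from the diff above.
GITMODULES_TEXT = """\
[submodule "src/sagemaker/hyperpod/cli/sagemaker_hyperpod_recipes"]
path = src/sagemaker/hyperpod/cli/sagemaker_hyperpod_recipes
url = https://github.com/aws/sagemaker-hyperpod-recipes.git
branch = release-1.3.3
"""

# Interpolation is disabled so '%' characters in values are never reinterpreted.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(GITMODULES_TEXT)

section = parser.sections()[0]
submodule = dict(parser[section])
print(section)
print(submodule["path"], submodule["branch"])
```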

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
@@ -11,3 +11,13 @@
 ### Features
 
 - feature: Add support for SageMaker HyperPod CLI
+
+
+## v1.0.0 (2025-07-10)
+
+### Features
+
+* Training Job - Create, List, Get
+* Inference Jumpstart - Create, List, Get, Invoke
+* Inference Custom - Create, List, Get, Invoke
+* Observability changes

README.md

Lines changed: 318 additions & 122 deletions
Large diffs are not rendered by default.

__init__.py

Whitespace-only changes.

examples/basic-job-example-config.yaml

Lines changed: 0 additions & 116 deletions
This file was deleted.
Lines changed: 162 additions & 0 deletions
@@ -0,0 +1,162 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "2d55c8b9",
+   "metadata": {},
+   "source": [
+    "## Inference Operator CLI E2E Experience (S3 custom model)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6e281ddd",
+   "metadata": {},
+   "source": [
+    "Make sure you have installed packages:\n",
+    "- sagemaker-hyperpod\n",
+    "- hyperpod-custom-inference-template"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "da015cdb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!hyp list-cluster --output table"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e9e1ce47",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!hyp set-cluster-context --cluster-name hp-cluster-for-inf-Beta2try1"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "dfc2f047",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!hyp create hyp-custom-endpoint \\\n",
+    " --version 1.0 \\\n",
+    " --env \\\n",
+    " '{\"HF_MODEL_ID\":\"/opt/ml/model\", \\\n",
+    " \"SAGEMAKER_PROGRAM\":\"inference.py\", \\\n",
+    " \"SAGEMAKER_SUBMIT_DIRECTORY\":\"/opt/ml/model/code\", \\\n",
+    " \"MODEL_CACHE_ROOT\":\"/opt/ml/model\", \\\n",
+    " \"SAGEMAKER_ENV\":\"1\"}' \\\n",
+    " --model-source-type fsx \\\n",
+    " --model-location deepseek-1-5b \\\n",
+    " --fsx-file-system-id fs-0e6a92495c35a81f2 \\\n",
+    " --image-uri 763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04-v2.0 \\\n",
+    " --model-volume-mount-name model-weights \\\n",
+    " --container-port 8080 \\\n",
+    " --resources-requests '{\"cpu\": \"4\", \"nvidia.com/gpu\": 1, \"memory\": \"32Gi\"}' \\\n",
+    " --resources-limits '{\"nvidia.com/gpu\": 1}' \\\n",
+    " --tls-certificate-output-s3-uri s3://tls-bucket-inf1-beta2 \\\n",
+    " --instance-type ml.g5.8xlarge \\\n",
+    " --endpoint-name endpoint-fsx-test-cli \\\n",
+    " --model-name deepseek15b-fsx-test-cli"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "47a338fd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!hyp list hyp-custom-endpoint"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2929171e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!hyp describe hyp-custom-endpoint --name endpoint-fsx-test-cli"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "74157664",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!hyp invoke hyp-custom-endpoint --endpoint-name endpoint-fsx-test-cli --body '{\"inputs\":\"What is the capital of USA?\"}'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "52bfcde6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!hyp delete hyp-custom-endpoint --name endpoint-fsx-test-cli"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "60fea9e8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!hyp get-operator-logs hyp-custom-endpoint --since-hours 0.5"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "30a5cd60",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!hyp list-pods hyp-custom-endpoint"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1a7a0583",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!hyp get-logs hyp-custom-endpoint --pod-name <pod-name>"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
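
The notebook passes several JSON-valued options (`--env`, `--resources-requests`, `--resources-limits`) as hand-quoted shell strings. A possible alternative, sketched below, is to assemble the argument list in Python and let `json.dumps` produce those values, which sidesteps manual quote escaping. The helper `build_hyp_command` is hypothetical and not part of the `hyp` CLI or SDK; the flags and values are copied from the notebook cell above, and the sketch only builds and prints the command rather than running it.

```python
import json
import shlex


def build_hyp_command(verb, resource, options):
    """Build an argv list of the form: hyp <verb> <resource> --flag value ...

    dict and list values are JSON-encoded so each arrives as a single argument.
    """
    argv = ["hyp", verb, resource]
    for flag, value in options.items():
        if isinstance(value, (dict, list)):
            value = json.dumps(value)
        argv += [f"--{flag}", str(value)]
    return argv


create_argv = build_hyp_command(
    "create",
    "hyp-custom-endpoint",
    {
        "version": "1.0",
        "env": {
            "HF_MODEL_ID": "/opt/ml/model",
            "SAGEMAKER_PROGRAM": "inference.py",
            "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code",
            "MODEL_CACHE_ROOT": "/opt/ml/model",
            "SAGEMAKER_ENV": "1",
        },
        "model-source-type": "fsx",
        "model-location": "deepseek-1-5b",
        "container-port": 8080,
        "resources-requests": {"cpu": "4", "nvidia.com/gpu": 1, "memory": "32Gi"},
        "resources-limits": {"nvidia.com/gpu": 1},
        "instance-type": "ml.g5.8xlarge",
        "endpoint-name": "endpoint-fsx-test-cli",
        "model-name": "deepseek15b-fsx-test-cli",
    },
)

# Print a copy-pasteable, shell-quoted form of the command.
print(shlex.join(create_argv))
```

The resulting list could be handed to `subprocess.run(create_argv, check=True)` on a machine where `hyp` is installed and a cluster context is set. The remaining notebook options (`--fsx-file-system-id`, `--image-uri`, `--model-volume-mount-name`, `--tls-certificate-output-s3-uri`) are omitted here for brevity and would be added as further dictionary entries.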
