Release v3 #109
Merged
186 commits
0139e19
Introduce helm charts for hyperpod inference operator
mbnavali 89d9f29
Introduce helm charts for hyperpod inference operator
mbnavali d4fffa0
Introduce helm charts for hyperpod inference operator
mbnavali f6d3949
Update Helm charts for inference operator, clean up to remove bedrock…
mbnavali 3cf3f8b
Changes to
mbnavali 716382f
Removed binary from the code base.
mbnavali c0acb09
Nit: Update the app name labels for sample yaml files.
mbnavali 6868542
Merge pull request #29 from mbnavali/main
mbnavali 6a9ddb4
Add crds, service account and region (#32)
shantanutrip a5dec3e
Merge pull request #33 from zhaoqizqwang/update-sync-branch
zhaoqizqwang 15242fe
Add hyperpod inference classes
zhaoqizqwang 984c820
Refactor create inference function
zhaoqizqwang 1af95ad
Add List, Delete, Describe endpoint features
zhaoqizqwang de55508
Add unit test and update class names
zhaoqizqwang 0ab576f
Add end and setup.py
zhaoqizqwang f104770
Update gitignore
zhaoqizqwang 05c374c
Add setup.cfg
zhaoqizqwang bbf9570
Fix HPEndpoint class and add optional values
zhaoqizqwang 11faa90
remove utils.py
zhaoqizqwang 5430f3f
Merge pull request #34 from zhaoqizqwang/add-inference-classes
zhaoqizqwang b22e4ad
Make function classmethod and update unit tests
zhaoqizqwang c3a8a16
Merge pull request #40 from zhaoqizqwang/add-inference-classes
zhaoqizqwang b13c487
Fix bugs for inference endpoint
zhaoqizqwang d13acdd
Small fixes
zhaoqizqwang 6acb6f6
Merge pull request #41 from zhaoqizqwang/add-inference-classes
zhaoqizqwang 1ab67ee
build: add mountpoint s3 csi driver, keda + cert-manager controllers …
rvasahu-amazon 68a9944
chore: add inference operator as dependency for HP Helm Chart, defaul…
rvasahu-amazon d87dc59
feat: add support for jumpstart gated models
rvasahu-amazon 349f2ff
Merge pull request #42 from rvasahu-amazon/main-2
rvasahu-amazon 66df206
fix: remove stray symbol
rvasahu-amazon 31bac53
fix: rename inference operator chart to match name in parent
rvasahu-amazon b538a4e
change: sync charts with latest version of operator
rvasahu-amazon e732254
doc: update readme.md identifying the inference operator as a subchart
rvasahu-amazon 2bf4d52
Merge pull request #43 from rvasahu-amazon/main-2
zhaoqizqwang d3db0af
Add HyperpodPytorchJob class (#39)
pintaoz-aws 574351e
Add tlsConfig to quick create
zhaoqizqwang f9a5d90
Revert "Add tlsConfig to quick create"
zhaoqizqwang bffec6c
Add tls config
zhaoqizqwang 7981fea
Merge pull request #45 from zhaoqizqwang/add-inference-classes
mollyheamazon ffa945f
Update CRD configs and minor updates
zhaoqizqwang f066b55
Merge branch 'aws:master' into add-inference-classes
zhaoqizqwang 5671b26
Merge pull request #47 from zhaoqizqwang/add-inference-classes
mollyheamazon 26f3ecb
Add model_location to HPEndpont
zhaoqizqwang 075fcf1
Merge branch 'add-inference-classes' of https://github.com/zhaoqizqwa…
zhaoqizqwang 0348875
Merge pull request #48 from zhaoqizqwang/add-inference-classes
mollyheamazon 8f7f835
Adding observability command to fetch details of grafana, prometheus …
jam-jee 2c46993
Merge pull request #50 from jam-jee/master
zhaoqizqwang d00cec7
Training CLI implementation: create
97f02c8
Adding observability SDK experience and updating CLI command signature
jam-jee 027c3f1
Rename CLI commands to be consistent with SDK
nargokul 6b2a6a5
Merge pull request #56 from jam-jee/master
jam-jee 3a310fb
Merge branch 'master' into master
nargokul 50af9a6
Merge branch 'aws:master' into master-hyperpod-test-630
rsareddy0329 116fb5b
Training CLI for Launch
f6f852d
Training CLI for Launch
bde684b
Training CLI for Launch
c8c185f
Training CLI for Launch
d90666a
Update JumpStartModel interface (#51)
zhaoqizqwang edb5c3c
Merge pull request #57 from nargokul/master
nargokul 7e980d7
Get Cluster Context
nargokul f1c9f4d
Merge remote-tracking branch 'origin/master'
nargokul bf22bb1
Merge branch 'aws:master' into master
nargokul 15aeffc
Update to HyperPodManager call
nargokul 22a6df4
Merge remote-tracking branch 'origin/master'
nargokul 16a68a5
Merge branch 'aws:master' into master-hyperpod-test-630
rsareddy0329 39d2f41
Cleanup import
nargokul cebe4c2
Merge pull request #58 from nargokul/master
nargokul 3b8713e
Merge branch 'aws:master' into master-hyperpod-test-630
rsareddy0329 6d4b211
Training CLI for Launch
47c9c48
Training CLI for Launch
d36ddb0
Training CLI for Launch
48d0239
Update HyperPodPytorchJob (#52)
pintaoz-aws 8e995a2
Merge branch 'aws:master' into master-hyperpod-test-630
rsareddy0329 d2150f7
E2E testing done for inference CLI
mollyheamazon d6074d6
delete build
mollyheamazon 0ae5c32
Revert accidental submodule pointer change
mollyheamazon 8d24dda
Merge pull request #54 from rsareddy0329/master-hyperpod-test-630
rsareddy0329 2cf9336
Update inference example notebook and fix bugs
zhaoqizqwang f1e7234
Reformat code with black
zhaoqizqwang f4a918a
Merge pull request #61 from zhaoqizqwang/add-inference-classes
zhaoqizqwang 5886bec
Add get_logs function for inference
zhaoqizqwang 6f38a31
Update HyperPodPytorchJob to not use _HyperPodPytorchJob object (#63)
pintaoz-aws 8358856
Update get_logs function to accept since_hour
zhaoqizqwang 903b6f6
Separate get_logs and get_operator_logs methods
zhaoqizqwang c36ec03
Update get_logs to class method
zhaoqizqwang cb915f5
Merge pull request #62 from zhaoqizqwang/add-inference-classes
zhaoqizqwang f17d1b3
Merge pull request #59 from mollyheamazon/inference-pentest
mollyheamazon 4fbf3c5
Add container name to get_logs function
zhaoqizqwang a6a4f81
Merge branch 'aws:master' into add-inference-classes
zhaoqizqwang e165fa7
Merge pull request #65 from zhaoqizqwang/add-inference-classes
zhaoqizqwang 60e56f4
Add container in get_logs_from_pod (#66)
pintaoz-aws be796b4
change inference CLI directory, add inference CLI notebook, add get-l…
mollyheamazon a0cf0f0
delete build
mollyheamazon fc3cb62
Training CLI for Launch - Changes per SDK HyperPodPytorchJob construc…
rsareddy0329 503408c
Merge pull request #67 from mollyheamazon/inference-pentest
mollyheamazon 16aa13c
| * d2453d6 (rig-dev) Add notes about HMA patching
chnnmz 6cf1eb2
add cloudwatchtrigger and autoscalingspec to model.py and schema.json
mollyheamazon d0ebb1f
Merge pull request #69 from mollyheamazon/inference-pentest
mollyheamazon 93cf35d
Add exception handling and update example notebooks (#71)
zhaoqizqwang 00c22f0
Add unit tests for training sdk
829b1cf
Merge branch 'master' into training_unit_test
ab6be47
Update util tests
5719b76
Add training cli example notebook (#72)
rsareddy0329 85735ff
Address comments
4dfa302
fix tls flag issue, fsx endpoint successfully created with cli notebook
mollyheamazon 21de3d6
clear notebook outputs
mollyheamazon d4a5c1c
minor update in notebook
mollyheamazon d70d98c
Merge pull request #73 from pintaoz-aws/training_unit_test
zhaoqizqwang cb627c3
Merge branch 'aws:master' into inference-pentest
mollyheamazon cb4d878
Merge pull request #74 from mollyheamazon/inference-pentest
zhaoqizqwang ee19040
minor change to notebook
mollyheamazon 6f3219b
Move Metadata model to common (#75)
pintaoz-aws 240e523
Minor change to notebook
mollyheamazon 2f8030b
REstructure HPCLI
nargokul dd383a2
Fix training cli unit tests
adishaa 00d6c34
Fix list jobs test
adishaa 28461d7
Merge Changes
nargokul 71260e2
Merge pull request #78 from Aditi2424/integration-tests
Aditi2424 cb09fc4
Fixed logger
zhaoqizqwang b2525d8
Merge pull request #79 from zhaoqizqwang/add-inference-classes
zhaoqizqwang 5ce85fc
Updates from Testing
nargokul f59742b
Merge remote-tracking branch 'origin/master' into restructre
nargokul 3e6fb63
Update import path
a91b11e
Merge pull request #77 from nargokul/restructre
nargokul 2efb36f
Revert lines from readme (should not have been updated)
chnnmz 2bc6801
Merge branch 'master' into import_path
3466c5d
unit test for inference CLI done
mollyheamazon 896509b
resolve merge conflicts
15dfea9
rebase with master
mollyheamazon 0f3e0d4
clean up
mollyheamazon 4371779
clean up recipes
mollyheamazon d365e6b
Merge pull request #68 from chnnmz/rig-rebsae
zhaoqizqwang 6f902ca
Merging hyp and hyperpod commands in a common entry point as hyp
jam-jee 99d2b27
Merge pull request #82 from jam-jee/master
jam-jee 0752bfa
Merge pull request #81 from mollyheamazon/inference-pentest
mollyheamazon 7467c20
Merge pull request #80 from pintaoz-aws/import_path
zhaoqizqwang 7c30bbc
Removing not relevant directories and updating setup and pyproject (…
jam-jee 389d6ae
Add unit test and fix HyperPod Manager (#84)
zhaoqizqwang 68d87de
update print for inference CLI for list and describe, bug fix for sin…
mollyheamazon 22e176b
Append uuid to endpoint name (#90)
zhaoqizqwang d89bcf4
Fix set_context in HyperPodManager (#91)
zhaoqizqwang 60d46c6
Remove Self from type hint (#92)
zhaoqizqwang 877edd4
Minor documentation fixes for RIG Helm (#93)
chnnmz 3e4c2ff
Bug fix: Fixed create command job error (#94)
rsareddy0329 ab77a3f
[HyperPod Inference] Update RBAC with perms for KEDA, allow direct pr…
rvasahu-amazon 5a80519
Adding dynamic flag for dependencies installation (#95)
jam-jee dceee17
Add utils unit tests for training cli (#97)
rsareddy0329 8855224
Add instance type validation for JS model (#98)
zhaoqizqwang 466baca
Adding observability notebook (#96)
jam-jee 2384409
Inference dogfood notebook update (#99)
mollyheamazon eb44e6f
Unique job name: Append uuid to training job name (#101)
rsareddy0329 70576ef
Inference CLI update after dogfood (#102)
mollyheamazon c85d399
Lookup standard Helm release name for RIG Helm installation (1ff9c) (…
chnnmz 63b0481
Minor negative case update for Helm release name lookup during RIG He…
chnnmz 387816a
Add JumpStart PublicHub model visualization utilities. (#106)
mbnavali 5eeed51
Update cli command noun to hyp-*, logging, list_jobs bug fix (#107)
rsareddy0329 f27a3cb
Make metadata name same as endpoint name; Updated instance type valid…
zhaoqizqwang 281e6d6
Add integ test for training CLI and SDK (#100)
Aditi2424 cd36154
baseline inference integration test for CLI and SDK, minor bug fixes …
mollyheamazon 47c8990
Remove UUID from training and Inference (#108)
nargokul 3fb4054
Update inference logging setup similar to training (#113)
rsareddy0329 dc052d2
Change hp-pytorch-job to hyp-pytorch-job (#115)
Aditi2424 f636aab
Add methods for list pods and namespaces (#114)
zhaoqizqwang c37fc38
Minor change in training cli notebook: UUID removed (#117)
rsareddy0329 872853c
Cleaner error messading for Endpoint invoke (#112)
nargokul b64e7ea
Bumping kubernetes python client version and updating observability c…
jam-jee d749271
change: add prefix to convert bucket name to s3 URI (#109)
rvasahu-amazon bce58b3
Added type check on commands before invoking subprocess run (#118)
jam-jee e9e749c
Bring HyperPodManager class util functions (#119)
zhaoqizqwang e60a9f2
Add list_pods and get_logs for CLI (Update notebook, integ test, unit…
mollyheamazon 1c9209a
Update inference and training to only check kubeconfig on the first t…
zhaoqizqwang 6dc7013
Update Readme to include Inference and Training (#121)
nargokul b53d301
Move observability utils and constants; Rename set_context/get_contex…
zhaoqizqwang 3463e98
Updating template packages name and structure (#126)
jam-jee 78d8c37
Changelog updates (#128)
nargokul 05fd213
Readme update (#129)
nargokul 9ee0c92
Unit test fix (#127)
Aditi2424 8d11346
Fix get_cluster_context runtime error (#130)
nargokul 475b9f2
Remove Py38 Tests (#131)
nargokul 6d7975f
UNit test fixes (#132)
nargokul 2a5c0f4
Inference integ tests all passed in Chait's account (#135)
mollyheamazon d0c6f14
Update operator namespace string (#137)
zhaoqizqwang 2683932
Inference integ test passed on beta account (#140)
mollyheamazon a9e8d59
is_kubeconfig_loaded Fix (#139)
nargokul 966d4b5
Merge remote-tracking branch 'origin/main' into release_v3
nargokul 96b469b
Include main branch in pull request target
nargokul
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
- [submodule "src/hyperpod_cli/sagemaker_hyperpod_recipes"] | ||
- path = src/hyperpod_cli/sagemaker_hyperpod_recipes | ||
+ [submodule "src/sagemaker/hyperpod/cli/sagemaker_hyperpod_recipes"] | ||
+ path = src/sagemaker/hyperpod/cli/sagemaker_hyperpod_recipes | ||
url = https://github.com/aws/sagemaker-hyperpod-recipes.git | ||
branch = release-1.3.3 |
162 changes: 162 additions & 0 deletions
examples/inference/CLI/inference-fsx-model-e2e-cli.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,162 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "2d55c8b9", | ||
"metadata": {}, | ||
"source": [ | ||
"## Inference Operator CLI E2E Experience (S3 custom model)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "6e281ddd", | ||
"metadata": {}, | ||
"source": [ | ||
"Make sure you have installed the following packages:\n", | ||
"- sagemaker-hyperpod\n", | ||
"- hyperpod-custom-inference-template" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "da015cdb", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!hyp list-cluster --output table" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "e9e1ce47", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!hyp set-cluster-context --cluster-name hp-cluster-for-inf-Beta2try1" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "dfc2f047", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!hyp create hyp-custom-endpoint \\\n", | ||
" --version 1.0 \\\n", | ||
" --env \\\n", | ||
" '{\"HF_MODEL_ID\":\"/opt/ml/model\", \\\n", | ||
" \"SAGEMAKER_PROGRAM\":\"inference.py\", \\\n", | ||
" \"SAGEMAKER_SUBMIT_DIRECTORY\":\"/opt/ml/model/code\", \\\n", | ||
" \"MODEL_CACHE_ROOT\":\"/opt/ml/model\", \\\n", | ||
" \"SAGEMAKER_ENV\":\"1\"}' \\\n", | ||
" --model-source-type fsx \\\n", | ||
" --model-location deepseek-1-5b \\\n", | ||
" --fsx-file-system-id fs-0e6a92495c35a81f2 \\\n", | ||
" --image-uri 763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04-v2.0 \\\n", | ||
" --model-volume-mount-name model-weights \\\n", | ||
" --container-port 8080 \\\n", | ||
" --resources-requests '{\"cpu\": \"4\", \"nvidia.com/gpu\": 1, \"memory\": \"32Gi\"}' \\\n", | ||
" --resources-limits '{\"nvidia.com/gpu\": 1}' \\\n", | ||
" --tls-certificate-output-s3-uri s3://tls-bucket-inf1-beta2 \\\n", | ||
" --instance-type ml.g5.8xlarge \\\n", | ||
" --endpoint-name endpoint-fsx-test-cli \\\n", | ||
" --model-name deepseek15b-fsx-test-cli" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "47a338fd", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!hyp list hyp-custom-endpoint" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "2929171e", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!hyp describe hyp-custom-endpoint --name endpoint-fsx-test-cli" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "74157664", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!hyp invoke hyp-custom-endpoint --endpoint-name endpoint-fsx-test-cli --body '{\"inputs\":\"What is the capital of USA?\"}'" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "52bfcde6", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!hyp delete hyp-custom-endpoint --name endpoint-fsx-test-cli" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "60fea9e8", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!hyp get-operator-logs hyp-custom-endpoint --since-hours 0.5" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "30a5cd60", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!hyp list-pods hyp-custom-endpoint" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "1a7a0583", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!hyp get-logs hyp-custom-endpoint --pod-name <pod-name>" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.12.2" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
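The `hyp create hyp-custom-endpoint` cell in this notebook passes three JSON arguments (`--env`, `--resources-requests`, `--resources-limits`) that are escaped by hand. One possible way to sidestep the manual quoting is to build the argument list in Python and let `json.dumps` and `shlex.join` handle it. This is only a sketch: `build_create_command` is a hypothetical helper and not part of the `hyp` CLI, and the flags and values are copied from the notebook cell above rather than taken from CLI documentation.

```python
import json
import shlex


def build_create_command(options):
    """Return an argv list for `hyp create hyp-custom-endpoint`.

    Hypothetical helper (not part of the hyp CLI). `options` maps flag
    names without the leading dashes to values. Dict values are serialized
    to compact JSON; every other value is converted with str().
    """
    argv = ["hyp", "create", "hyp-custom-endpoint"]
    for flag, value in options.items():
        if isinstance(value, dict):
            value = json.dumps(value, separators=(",", ":"))
        argv += [f"--{flag}", str(value)]
    return argv


# Values below mirror the notebook cell; adjust them for your own cluster.
options = {
    "version": "1.0",
    "env": {
        "HF_MODEL_ID": "/opt/ml/model",
        "SAGEMAKER_PROGRAM": "inference.py",
        "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code",
        "MODEL_CACHE_ROOT": "/opt/ml/model",
        "SAGEMAKER_ENV": "1",
    },
    "model-source-type": "fsx",
    "model-location": "deepseek-1-5b",
    "fsx-file-system-id": "fs-0e6a92495c35a81f2",
    "image-uri": (
        "763104351884.dkr.ecr.us-east-2.amazonaws.com/"
        "huggingface-pytorch-tgi-inference:"
        "2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04-v2.0"
    ),
    "model-volume-mount-name": "model-weights",
    "container-port": 8080,
    "resources-requests": {"cpu": "4", "nvidia.com/gpu": 1, "memory": "32Gi"},
    "resources-limits": {"nvidia.com/gpu": 1},
    "tls-certificate-output-s3-uri": "s3://tls-bucket-inf1-beta2",
    "instance-type": "ml.g5.8xlarge",
    "endpoint-name": "endpoint-fsx-test-cli",
    "model-name": "deepseek15b-fsx-test-cli",
}

argv = build_create_command(options)

# Print a shell-safe, copy-pasteable form of the command without running it.
print(shlex.join(argv))
```

On a machine where the CLI is installed and a cluster context is set, the same `argv` list could be handed to `subprocess.run(argv, check=True)` instead of being printed; passing a list avoids a second round of shell quoting.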
nit: may not need master branch here, CI would be running on main