Merge latest changes from main to 'Documentation' branch #192
Open
rsareddy0329 wants to merge 55 commits into documentation from main
+23,112
−2,388
Co-authored-by: adishaa <[email protected]>
… with minor improvements and bug fixes (#137)
… with minor improvements and bug fixes. (#139)
…ception count data (#140)
* manual release v3.0.1
… regionalized HMA URI (#141)
* Add unique time string to integ test * Update syntax
* Training CLI & SDK: example notebook and README update * Update training cli example notebook --------- Co-authored-by: Roja Reddy Sareddy <[email protected]>
* Update inference SDK examples * Update readme
* Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * CLI: Enable Telemetry * CLI: Enable Telemetry --------- Co-authored-by: Roja Reddy Sareddy <[email protected]>
* update help text to avoid truncation * update volume flag to support hostPath and pvc, before e2e testing * clean up and e2e working * Minor updates after PR * update * Added unit tests for volume, all cli unit tests passed
Co-authored-by: pintaoz <[email protected]>
* Update inference config and integ tests * Update integ tests for new canaries
* Manual release v3.0.2 * Update changelog --------- Co-authored-by: pintaoz <[email protected]>
* Update readme for volume flag * Add schema pattern check to pytorch-job template, unit test added, all test passed locally
…8s (#138) * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Fix unit test cases * Move regex to a constant. **Description** - Removed an integration test case that was being mocked. - Moved a regex to a constant. **Testing Done** Unit test cases pass; no changes were made to integration test cases and they should not be affected. * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Add ref link for version compatibility constraints **Description** Added a link to k8s documentation mentioning the constraints that rule the version compatibility of client and server. **Testing Done** No breaking changes.
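For readers unfamiliar with this kind of check, the sketch below illustrates one way a client/server version comparison could look. It is not the code from this commit: the constant name `VERSION_PATTERN`, the function names, and the default allowed skew of one minor version are all assumptions made for illustration.

```python
import re
from typing import Tuple

# Hypothetical constant name; the commit only states that the regex was moved to a constant.
# Matches strings such as "v1.29.3", "1.29" or "1.30+" and captures major and minor.
VERSION_PATTERN = re.compile(r"^v?(\d+)\.(\d+)")


def parse_version(version: str) -> Tuple[int, int]:
    """Return (major, minor) parsed from a Kubernetes-style version string."""
    match = VERSION_PATTERN.match(version.strip())
    if match is None:
        raise ValueError(f"Unrecognized Kubernetes version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_compatible(client_version: str, server_version: str, max_minor_skew: int = 1) -> bool:
    """Check that client and server share a major version and stay within the allowed minor skew.

    max_minor_skew=1 is an assumed default for this sketch, not a value taken from the PR.
    """
    client_major, client_minor = parse_version(client_version)
    server_major, server_minor = parse_version(server_version)
    if client_major != server_major:
        return False
    return abs(client_minor - server_minor) <= max_minor_skew
```

A caller could emit a warning instead of failing when `is_compatible(...)` returns `False`, which matches the later commit in this PR that reworks the warning message.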
* Fix SDK training test: Add wait time before refresh * Fix training tests in canaries
…189) Co-authored-by: pintaoz <[email protected]>
* Update documentation-with-new-changes branch with latest changes from main (#190) * Fix training test (#184) * Fix SDK training test: Add wait time before refresh * Fix training tests in canaries * Update logging information for submitting and deleting training job (#189) Co-authored-by: pintaoz <[email protected]> --------- Co-authored-by: Zhaoqi <[email protected]> Co-authored-by: pintaoz-aws <[email protected]> Co-authored-by: pintaoz <[email protected]> * Documentation Fixes (#191) Co-authored-by: Roja Reddy Sareddy <[email protected]> * update documentation with new changes branch with latest changes (#194) * Fix training test (#184) * Fix SDK training test: Add wait time before refresh * Fix training tests in canaries * Update logging information for submitting and deleting training job (#189) Co-authored-by: pintaoz <[email protected]> --------- Co-authored-by: Zhaoqi <[email protected]> Co-authored-by: pintaoz-aws <[email protected]> Co-authored-by: pintaoz <[email protected]> * Documentation Fixes (#195) * Documentation Fixes * Documentation Fixes --------- Co-authored-by: Roja Reddy Sareddy <[email protected]> * Documentation Fixes (#197) * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes --------- Co-authored-by: Roja Reddy Sareddy <[email protected]> * Documentation Fixes (#198) * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes --------- Co-authored-by: Roja Reddy Sareddy <[email protected]> * Documentation fixes (#199) * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes --------- Co-authored-by: Roja Reddy Sareddy <[email protected]> --------- Co-authored-by: Zhaoqi <[email protected]> Co-authored-by: pintaoz-aws <[email protected]> Co-authored-by: pintaoz <[email protected]> Co-authored-by: Roja Reddy Sareddy <[email protected]>
…s to view SDK config code (#188) Co-authored-by: Mohamed Zeidan <[email protected]>
* Add instance type support for ml.p6e-gb200.36xlarge Updated support for ml.p6-b200.48xlarge as well * Add ml.p6e-gb200.36xlarge to efa plugin
…holder value (#206) Co-authored-by: Mohamed Zeidan <[email protected]>
Co-authored-by: Roja Reddy Sareddy <[email protected]>
* Add labels to the top level metadata (#158) Co-authored-by: pintaoz <[email protected]> * Implemented GPU Quota Allocation Feature. Co-authored-by: aleszewi <[email protected]> * Revert "Implemented GPU Quota Allocation Feature." This reverts commit 790b8f1df59494a982463aaed9e5b3f2afa44123. * Fix: Template issue - pick user defined template version (#154) * Fix: Template issue - pick user defined template version * Fix: Template issue - pick user defined template version & add topology labels in 1.1 * Fix: Template issue - pick user defined template version & add topology labels in 1.1 --------- Co-authored-by: Roja Reddy Sareddy <[email protected]> * Fix: Add __init__ to the new schema (#163) * Fix: Template issue - pick user defined template version * Fix: Template issue - pick user defined template version & add topology labels in 1.1 * Fix: Template issue - pick user defined template version & add topology labels in 1.1 * Fix: Add __init__ to load the new schema --------- Co-authored-by: Roja Reddy Sareddy <[email protected]> * Add labels and annotations to top level metadata v1.1 (#165) * Add labels to top level metadata v1.1 * Move topology labels to annotations * Update topology parameter names * Add unit test --------- Co-authored-by: pintaoz <[email protected]> * Added GPU quota allocation. Co-authored-by: aleszewi <[email protected]> * Changed neuron key to neurondevice. (#177) Co-authored-by: Marta Aleszewicz <[email protected]> * fix: Renamed memory-in-gib to memory for consistency. 
(#179) cr: https://code.amazon.com/reviews/CR-214599587 Co-authored-by: Marta Aleszewicz <[email protected]> * Add validation to topology labels (#178) * Add validation to topology labels * Add validation to topology labels * Add validation to topology labels --------- Co-authored-by: Roja Reddy Sareddy <[email protected]> * Add integ tests for topology annotations (#180) * Add labels to top level metadata v1.1 * Move topology labels to annotations * Update topology parameter names * Add unit test * Topology integ tests * Add invalid test case * Add empty test case --------- Co-authored-by: pintaoz <[email protected]> * Add integration tests for gpu quota allocation feature (#184) * add integration tests for gpu quota allocation feature * add valueError assertions for invalid test cases * Updating the CHANGELOG and minor version --------- Co-authored-by: pintaoz-aws <[email protected]> Co-authored-by: pintaoz <[email protected]> Co-authored-by: Marta Aleszewicz <[email protected]> Co-authored-by: rsareddy0329 <[email protected]> Co-authored-by: Roja Reddy Sareddy <[email protected]> Co-authored-by: mx26pol <[email protected]> Co-authored-by: satish Kumar <[email protected]>
…gs to hyp-jumpstart-endpoint (#213) * Update generate_click_command inject logic to not expose unwanted flags to hyp-jumpstart-endpoint * Update unit tests for bug fix, change --label_selector to --label-selector
* Update generate_click_command inject logic to not expose unwanted flags to hyp-jumpstart-endpoint * Update unit tests for bug fix, change --label_selector to --label-selector * Update README, example notebooks and documentation to 1) remove model_version, 2) add --model-volume-mount-name, 3) remove tar.gz from --model-location, 4) update unique mount_path for --volume * Update README, example notebooks and documentation to remove tls-config for jumpstart * minor update to remove tar.gz from --model-location for documentation
#219) * add metadata_name argument to js and custom endpoint to match with SDK * fix integ
* Add cert mgr installation * Add cert mgr installation * update cert-mgr readme --------- Co-authored-by: Xin Wang <[email protected]>
**Description** - Removed outdated Helm installation requirement for HyperPod CLI V3 - Fixed step numbering in installation section (1, 2, 3 instead of 1, 1, 1) - Simplified installation process by removing unnecessary Helm setup steps **Testing Done** Not needed, just README updates.
* Update description for scheduler type Tested in terminal with command `hyp create hyp-pytorch-job --help` and can see new description * Update scheduler type description in v1_0
Co-authored-by: Xin Wang <[email protected]>
… with minor improvements and bug fixes. (#225)
* feat: add get_operator_logs to pytorch job * feat: add get_operator_logs to pytorch job * feat: add get_operator_logs to pytorch job * feat: add get_operator_logs to pytorch job --------- Co-authored-by: Roja Reddy Sareddy <[email protected]>
* add metadata_name argument to js and custom endpoint to match with SDK * fix integ * change container name in pytorch template * update v1_0 too * update default container name for pytorch job template
…227) * Update list_pods to only display pods of corresponding endpoint type * Use list endpoints to check endpoint type --------- Co-authored-by: pintaoz <[email protected]>
* Update warning message string for k8s version compatibility check **Description** The warning message earlier was not formatted well enough. Made it explicitly look like a warning. **Testing Done** - Added unit test case to check if the warning will be displayed or not. - Checking the warning color to be yellow.
…#231) * Implementing Task Gov. feature for SDK flow * Implemented parallel processing for list-cluster operation to improve time
* Add endpoint_name argument for list_pods() * update test name --------- Co-authored-by: pintaoz <[email protected]>
nargokul
pushed a commit
that referenced
this pull request
Aug 26, 2025
* Add version comptability check between server K8s and Client python K8s (#138) * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Fix unit test cases * Move regex to a constant. **Description** - Removed an integration test case that was being mocked. - Moved a regex to a constant. **Testing Done** Unit test cases pass no changes made to integration test cases and they should not be affected. * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Add ref link for version comptability contraints **Description** Added a link to k8s documentation mentioning the constraints that rule the version compatibility of client and server. **Testing Done** No breaking changes. * Update logging information for submitting and deleting training job (#189) Co-authored-by: pintaoz <[email protected]> * Enhance docs with table formatting and comprehensive API reference **Description** - Convert CLI parameter lists to structured tables across all documentation files for better readability - Add comprehensive docstrings and examples to SDK classes (HPEndpointBase, HyperPodPytorchJob) - Enhance Sphinx configuration with better autodoc settings and extensions - Update API reference structure and formatting - Add custom CSS styling for improved table presentation - Update documentation requirements and index structure **Testing Done** - Verified documentation builds successfully with `make html` - Confirmed table formatting renders correctly in generated HTML - Validated API documentation generates properly with enhanced docstrings - Tested responsive table styling across different screen sizes - Checked that all parameter information remains accurate and complete * FIX ALTERED CODE **Description** Fixed the code altered while updating docstrings in `hyperpod_pytorch_job.py` file. **Testing Done** The unit test cases all pass. 
* FIX TEST CASES TO SKIP IF MODULE NOT FOUND **Description** Skipping the test cases if module not found. **Testing Done** Unit test cases all pass. Integ test cases can't be run for some reason. * Update with launch-fast-follow branch and fix unit test cases. **Description** **Testing Done** * Update with launch-fast-follow branch and fix unit test cases. **Description** **Testing Done** --------- Co-authored-by: pintaoz-aws <[email protected]> Co-authored-by: pintaoz <[email protected]>
nargokul
pushed a commit
that referenced
this pull request
Aug 26, 2025
* Add version comptability check between server K8s and Client python K8s (#138) * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Fix unit test cases * Move regex to a constant. **Description** - Removed an integration test case that was being mocked. - Moved a regex to a constant. **Testing Done** Unit test cases pass no changes made to integration test cases and they should not be affected. * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Add ref link for version comptability contraints **Description** Added a link to k8s documentation mentioning the constraints that rule the version compatibility of client and server. **Testing Done** No breaking changes. * Update logging information for submitting and deleting training job (#189) Co-authored-by: pintaoz <[email protected]> * Enhance docs with table formatting and comprehensive API reference **Description** - Convert CLI parameter lists to structured tables across all documentation files for better readability - Add comprehensive docstrings and examples to SDK classes (HPEndpointBase, HyperPodPytorchJob) - Enhance Sphinx configuration with better autodoc settings and extensions - Update API reference structure and formatting - Add custom CSS styling for improved table presentation - Update documentation requirements and index structure **Testing Done** - Verified documentation builds successfully with `make html` - Confirmed table formatting renders correctly in generated HTML - Validated API documentation generates properly with enhanced docstrings - Tested responsive table styling across different screen sizes - Checked that all parameter information remains accurate and complete * FIX ALTERED CODE **Description** Fixed the code altered while updating docstrings in `hyperpod_pytorch_job.py` file. **Testing Done** The unit test cases all pass. 
* ADD CLUSTER MANAGEMENT DOCS **Description** - Created comprehensive getting started guide for HyperPod cluster management - Added tab-set format showing both CLI and SDK options for consistency - Included step-by-step workflow from initialization to monitoring - Added cross-references to CLI documentation for auto-updating links - Filled in existing SDK methods (list_clusters, set_cluster_context) **Testing Done** Verified reStructuredText tab-set syntax renders correctly * Update PR some PR comments fixed **Description** **Testing Done** * Update cluster management cli ref to use md. **Description** Using markdown for the same of uniformity. **Testing Done** * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status (#204) * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter --------- Co-authored-by: Roja Reddy Sareddy <[email protected]> * ADD CLUSTER MANAGEMENT DOCS **Description** - Created comprehensive getting started guide for HyperPod cluster management - Added tab-set format showing both CLI and SDK options for consistency - Included step-by-step workflow from initialization to monitoring - Added cross-references to CLI documentation for auto-updating links - Filled in existing SDK methods (list_clusters, set_cluster_context) **Testing Done** Verified reStructuredText tab-set syntax renders correctly * Update for Cluster Management CLI commands. **Description** Updated md after verification. **Testing Done** Verified the commands. * Add note about default region to docs. **Description** Added a note about how the region selection and flag usage works, for better UX. **Testing Done** The note shows up as we want it to. * Update update commands for hyp-cluster. **Description** Updated the hyp-cluster update command correctly. 
**Testing Done** Verified the docs are correct. * Fix a unit test case changed while fixing merge conflicts. **Description** **Testing Done** * ADD NEW PARAMS TO CLI TRAINING DOCS **Description** - Resource parameters: accelerators, vcpu, memory, accelerators-limit, vcpu-limit, memory-limit - Topology parameters: preferred-topology, required-topology **Testing Done** - Verified parameter documentation follows existing format and style - Confirmed parameter descriptions match field definitions from source code - Validated documentation builds without errors * Updated docs for cli sdk ref (#192) * Add version comptability check between server K8s and Client python K8s (#138) * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Fix unit test cases * Move regex to a constant. **Description** - Removed an integration test case that was being mocked. - Moved a regex to a constant. **Testing Done** Unit test cases pass no changes made to integration test cases and they should not be affected. * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Add ref link for version comptability contraints **Description** Added a link to k8s documentation mentioning the constraints that rule the version compatibility of client and server. **Testing Done** No breaking changes. 
* Update logging information for submitting and deleting training job (#189) Co-authored-by: pintaoz <[email protected]> * Enhance docs with table formatting and comprehensive API reference **Description** - Convert CLI parameter lists to structured tables across all documentation files for better readability - Add comprehensive docstrings and examples to SDK classes (HPEndpointBase, HyperPodPytorchJob) - Enhance Sphinx configuration with better autodoc settings and extensions - Update API reference structure and formatting - Add custom CSS styling for improved table presentation - Update documentation requirements and index structure **Testing Done** - Verified documentation builds successfully with `make html` - Confirmed table formatting renders correctly in generated HTML - Validated API documentation generates properly with enhanced docstrings - Tested responsive table styling across different screen sizes - Checked that all parameter information remains accurate and complete * FIX ALTERED CODE **Description** Fixed the code altered while updating docstrings in `hyperpod_pytorch_job.py` file. **Testing Done** The unit test cases all pass. * FIX TEST CASES TO SKIP IF MODULE NOT FOUND **Description** Skipping the test cases if module not found. **Testing Done** Unit test cases all pass. Integ test cases cant be run for some reason. * Update with launch-fast-follow branch and fix unit test cases. **Description** **Testing Done** * Update with launch-fast-follow branch and fix unit test cases. 
**Description** **Testing Done** --------- Co-authored-by: pintaoz-aws <[email protected]> Co-authored-by: pintaoz <[email protected]> * ADD CLUSTER MANAGEMENT DOCS **Description** - Created comprehensive getting started guide for HyperPod cluster management - Added tab-set format showing both CLI and SDK options for consistency - Included step-by-step workflow from initialization to monitoring - Added cross-references to CLI documentation for auto-updating links - Filled in existing SDK methods (list_clusters, set_cluster_context) **Testing Done** Verified reStructuredText tab-set syntax renders correctly * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status (#204) * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter --------- Co-authored-by: Roja Reddy Sareddy <[email protected]> * Update for Cluster Management CLI commands. **Description** Updated md after verification. **Testing Done** Verified the commands. * Add note about default region to docs. **Description** Added a note about how the region selection and flag usage works, for better UX. **Testing Done** The note shows up as we want it to. * Update for Cluster Management CLI commands. **Description** Updated md after verification. **Testing Done** Verified the commands. * Add note about default region to docs. **Description** Added a note about how the region selection and flag usage works, for better UX. **Testing Done** The note shows up as we want it to. 
* Enhance docs with table formatting and comprehensive API reference **Description** - Convert CLI parameter lists to structured tables across all documentation files for better readability - Add comprehensive docstrings and examples to SDK classes (HPEndpointBase, HyperPodPytorchJob) - Enhance Sphinx configuration with better autodoc settings and extensions - Update API reference structure and formatting - Add custom CSS styling for improved table presentation - Update documentation requirements and index structure **Testing Done** - Verified documentation builds successfully with `make html` - Confirmed table formatting renders correctly in generated HTML - Validated API documentation generates properly with enhanced docstrings - Tested responsive table styling across different screen sizes - Checked that all parameter information remains accurate and complete * ADD CLUSTER MANAGEMENT DOCS **Description** - Created comprehensive getting started guide for HyperPod cluster management - Added tab-set format showing both CLI and SDK options for consistency - Included step-by-step workflow from initialization to monitoring - Added cross-references to CLI documentation for auto-updating links - Filled in existing SDK methods (list_clusters, set_cluster_context) **Testing Done** Verified reStructuredText tab-set syntax renders correctly * Updated docs for cli sdk ref (#192) * Add version comptability check between server K8s and Client python K8s (#138) * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Fix unit test cases * Move regex to a constant. **Description** - Removed an integration test case that was being mocked. - Moved a regex to a constant. **Testing Done** Unit test cases pass no changes made to integration test cases and they should not be affected. 
* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Add ref link for version comptability contraints **Description** Added a link to k8s documentation mentioning the constraints that rule the version compatibility of client and server. **Testing Done** No breaking changes. * Update logging information for submitting and deleting training job (#189) Co-authored-by: pintaoz <[email protected]> * Enhance docs with table formatting and comprehensive API reference **Description** - Convert CLI parameter lists to structured tables across all documentation files for better readability - Add comprehensive docstrings and examples to SDK classes (HPEndpointBase, HyperPodPytorchJob) - Enhance Sphinx configuration with better autodoc settings and extensions - Update API reference structure and formatting - Add custom CSS styling for improved table presentation - Update documentation requirements and index structure **Testing Done** - Verified documentation builds successfully with `make html` - Confirmed table formatting renders correctly in generated HTML - Validated API documentation generates properly with enhanced docstrings - Tested responsive table styling across different screen sizes - Checked that all parameter information remains accurate and complete * FIX ALTERED CODE **Description** Fixed the code altered while updating docstrings in `hyperpod_pytorch_job.py` file. **Testing Done** The unit test cases all pass. * FIX TEST CASES TO SKIP IF MODULE NOT FOUND **Description** Skipping the test cases if module not found. **Testing Done** Unit test cases all pass. Integ test cases cant be run for some reason. * Update with launch-fast-follow branch and fix unit test cases. **Description** **Testing Done** * Update with launch-fast-follow branch and fix unit test cases. 
**Description** **Testing Done** --------- Co-authored-by: pintaoz-aws <[email protected]> Co-authored-by: pintaoz <[email protected]> * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status (#204) * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter --------- Co-authored-by: Roja Reddy Sareddy <[email protected]> * ADD CLUSTER MANAGEMENT DOCS **Description** - Created comprehensive getting started guide for HyperPod cluster management - Added tab-set format showing both CLI and SDK options for consistency - Included step-by-step workflow from initialization to monitoring - Added cross-references to CLI documentation for auto-updating links - Filled in existing SDK methods (list_clusters, set_cluster_context) **Testing Done** Verified reStructuredText tab-set syntax renders correctly * Update PR some PR comments fixed **Description** **Testing Done** * Update cluster management cli ref to use md. **Description** Using markdown for the same of uniformity. **Testing Done** * Update for Cluster Management CLI commands. **Description** Updated md after verification. **Testing Done** Verified the commands. * Add note about default region to docs. **Description** Added a note about how the region selection and flag usage works, for better UX. **Testing Done** The note shows up as we want it to. * Update code lines messed up while fixing merge conflicts. **Description** **Testing Done** * Update docs and README to include task gov and gpu_quota params. **Description** **Testing Done** --------- Co-authored-by: pintaoz-aws <[email protected]> Co-authored-by: pintaoz <[email protected]> Co-authored-by: rsareddy0329 <[email protected]> Co-authored-by: Roja Reddy Sareddy <[email protected]>
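The "List cluster stacks exclude ones with 'DELETE_COMPLETE' status" fix (#204) referenced above can be pictured with the following sketch. It is an illustration only, not the code from that commit: the function name `filter_stack_summaries` and the exact semantics of the optional status filter are assumptions. The dictionary keys `StackName` and `StackStatus` follow the shape of CloudFormation stack summaries.

```python
from typing import Dict, Iterable, List, Optional


def filter_stack_summaries(
    summaries: Iterable[Dict[str, str]],
    statuses: Optional[Iterable[str]] = None,
) -> List[Dict[str, str]]:
    """Filter CloudFormation-style stack summaries by status.

    Assumed behavior for this sketch:
    - With an explicit `statuses` filter, keep only stacks whose StackStatus is listed.
    - Without a filter, hide stacks that have already been deleted (DELETE_COMPLETE).
    """
    if statuses is not None:
        wanted = set(statuses)
        return [s for s in summaries if s.get("StackStatus") in wanted]
    return [s for s in summaries if s.get("StackStatus") != "DELETE_COMPLETE"]
```

In this sketch an explicit filter takes precedence over the default exclusion, so a caller who asks for `DELETE_COMPLETE` stacks still gets them.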
* Init experience baseline (#145) * js init and reset done, next step is to expand to custom * basic workflow done and handles edge cases for multiple init * minor change to rerun init console print * init experience baseline * Add unique time string to integ test (#150) * Add unique time string to integ test * Update syntax * update template into TEMPLATES constant configuration --------- Co-authored-by: Zhaoqi <[email protected]> * Cluster management (#146) * Cluster Management SDK * Remove file * Address PR comments * Fix * Updates and Cleanup * Cluster create cli (#150) * CLuster Creation CLI **Description** This update integrates the init experience with the cluster creation SDK to configure multiple atributes and create the cluster and required resources **Testing Done** For manual testing , ran hyp init cluster , hyp condigure and hyp submit and verified stack creation * Unit Tests * Validations * Create param (#153) * Update Instance Group and Rig Settings Params * Unit Tests * Add Describe and List cluster stack feature (#151) * Add describe, list cluster stacks features to CLI. **Description** - Added the desired features by using `describe_stacks` and `list_stacks` CloudFormation APIs. - Formatted the JSON output of API to make it more readable. - Added Stack status on Describe stack feature explicitly. **Testing Done** Tested both features on CLI to be working. * Add test cases for describe and list features. **Description** Added unit and integration test cases for list and describe features. **Testing Done** The test cases pass. * Update CLI command call for list and descrive features **Description** Updated the CLI commands to follow the expected nomenclature. **Testing Done** All the test cases pass and do not need any changes. * Update CLI logging for List and Describe features. **Description** Improved logging on CLI to improve the UX. **Testing Done** Test cases pass. * Add test cases for describe and list features. 
**Description** Added unit and integration test cases for list and describe features. **Testing Done** The test cases pass. * Create param (#153) * Update Instance Group and Rig Settings Params * Unit Tests * **Description** Add util to create boto3 client Improve output formatting on cli for list and describe **Testing Done** No changes required to test cases, the changes are backwards compatible * Remove excess code due to git conflicts. **Description** **Testing Done** * Remove print and use click instead. **Description** **Testing Done** * Remove print and use click instead. **Description** **Testing Done** --------- Co-authored-by: Gokul Anantha Narayanan <[email protected]> * add validate logic in configure command, bug fixes for cluster init experience, update hytorch template to add CRD default, update custom inference endpoint to check s3 and fsx required * Revert "add validate logic in configure command, bug fixes for cluster init experience, update hytorch template to add CRD default, update custom inference endpoint to check s3 and fsx required" This reverts commit 63bc2c5c284f07f642f76afbcde83923fd910c61. * Revert "Revert "add validate logic in configure command, bug fixes for cluster init experience, update hytorch template to add CRD default, update custom inference endpoint to check s3 and fsx required"" (#156) This reverts commit 09c81f3438796d6e5dbfd0475dc895f70cdaba30. * Add get cluster status method (#157) * Add describe, list cluster stacks features to CLI. **Description** - Added the desired features by using `describe_stacks` and `list_stacks` CloudFormation APIs. - Formatted the JSON output of API to make it more readable. - Added Stack status on Describe stack feature explicitly. **Testing Done** Tested both features on CLI to be working. * Add test cases for describe and list features. **Description** Added unit and integration test cases for list and describe features. **Testing Done** The test cases pass. 
* Update CLI command call for list and descrive features **Description** Updated the CLI commands to follow the expected nomenclature. **Testing Done** All the test cases pass and do not need any changes. * Add test cases for describe and list features. **Description** Added unit and integration test cases for list and describe features. **Testing Done** The test cases pass. * Create param (#153) * Update Instance Group and Rig Settings Params * Unit Tests * **Description** Add util to create boto3 client Improve output formatting on cli for list and describe **Testing Done** No changes required to test cases, the changes are backwards compatible * Remove print and use click instead. **Description** **Testing Done** * Add get status and check status method for SDK experience **Description** **Testing Done** * Fix merge conflicts **Description** **Testing Done** * Fix code duplication due to merge conflicts. **Description** **Testing Done** * Fix code duplication due to merge conflicts. **Description** **Testing Done** * Remove unwanted empty line **Description** **Testing Done** * Add unit and integration test cases for get and check status methods **Description** **Testing Done** All the tests pass. --------- Co-authored-by: Gokul Anantha Narayanan <[email protected]> * Update for Hyperpod Cluster (#155) * Update create cluster method to return full cluster detail object. (#159) **Description** Update create cluster method to return full cluster detail object. **Testing Done** Updated the unit test case for create and all test cases pass. No change needed to integration test case for now. 
* add inference template submit backend logic, fix namespace default across template (#160) * add inference template submit backend logic, fix namespace default across template * add namespace to jumpstart and custom endpoint template to simplify logic, no special handling for namespace for any templates, add unit tests for init experience * Merge branch 'master' into launch-fast-follow (#174) * Bring recipe-supp branch to staging repo (#175) * Add version compatibility check between server K8s and Client python K8s (#138) * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Fix unit test cases * Move regex to a constant. **Description** - Removed an integration test case that was being mocked. - Moved a regex to a constant. **Testing Done** Unit test cases pass; no changes were made to integration test cases, and they should not be affected. * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Add ref link for version compatibility constraints **Description** Added a link to k8s documentation mentioning the constraints that govern the version compatibility of client and server. **Testing Done** No breaking changes. * Recipe supp (#182) * Add sagemaker-hyperpod-recipes submodule * Recipe Support for Hyp --------- Co-authored-by: papriwal <[email protected]> Co-authored-by: jam-jee <[email protected]> * Update to fetch templates from S3 and other changes (#176) * Add version compatibility check between server K8s and Client python K8s (#138) * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Fix unit test cases * Move regex to a constant. **Description** - Removed an integration test case that was being mocked. - Moved a regex to a constant. **Testing Done** Unit test cases pass; no changes were made to integration test cases, and they should not be affected.
* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Add ref link for version compatibility constraints **Description** Added a link to k8s documentation mentioning the constraints that govern the version compatibility of client and server. **Testing Done** No breaking changes. * Update for Hyperpod Cluster * Fix training test (#184) * Fix SDK training test: Add wait time before refresh * Fix training tests in canaries * Add labels to the top level metadata (#158) Co-authored-by: pintaoz <[email protected]> * Update Fixes and Updating the S3 location to point to version locked templates * Update logic to point to main stack * Fixes and Tests * Fix * Address Fix * Code cleanup * Address comments * Fixes --------- Co-authored-by: papriwal <[email protected]> Co-authored-by: Zhaoqi <[email protected]> Co-authored-by: pintaoz-aws <[email protected]> Co-authored-by: pintaoz <[email protected]> * Main change: Enable hyp-pytorch-job template in init experience.
Minor change see description (#168) * add inference template submit backend logic, fix namespace default across template * update get latest version logic for init, add hyp-pytorch-job template without submit e2e or volume handler * add submit for pytorch-job, e2e working, missing volume handle * add namespace to jumpstart and custom endpoint template to simplify logic, no special handling for namespace for any templates, add unit tests for init experience * Resolve namespace logic issue, update endpoint-name for endpoint schema to required * add support for volume flag, and other special handling (list and dictionary) * revert breaking changes for jumpstart and custom endpoint template v1.0, remove generate_click_command from init command * Update params being saved in jinja file (#171) * Show complete cfn param template **Description** Showing complete CFN param template regarding cluster creation to provide a better UX and more context for the user. **Testing Done** Update unit test cases wherever needed, the related test cases all pass. * SIMPLIFY CFN TEMPLATE GENERATION **Description** Moved to putting the full CFN template in the jinja file. **Testing Done** Update unit test cases to cover the updates and all of them pass. * FIX ERROR LOGGING IN TEMPLATE PROCESSING **Description** Logging error in case we get an exception while getting the template. **Testing Done** Updated the related unit test case and the whole associated test suite passes. * ADD REGION OPTION TO `describe-cluster-stack` COMMAND **Description** - Add --region option to describe_cluster_stack command for specifying AWS region - Update function signature to accept region parameter - Pass region parameter to HpClusterStack.describe() method call **Testing Done** Update unit test cases and all the related test cases pass. * Revert "Bring recipe-supp branch to staging repo (#175)" (#181) This reverts commit 29313327a11da8b5dc66d75ffee3981ac50f60e5. 
* Fix merge conflict issues, update cluster template to add default in model.py (#186) * Fix merge conflict issues, update cluster template to add default in model.py * Update model.py to remove default for network related params * Fix: List cluster stacks failure for datetime objects (#189) Co-authored-by: Roja Reddy Sareddy <[email protected]> * Added mapping for HyperPodClusterName (#188) Co-authored-by: AviRuthen <[email protected]> * Change default region in hyp submit command (#193) * Revert "Bring recipe-supp branch to staging repo (#175)" This reverts commit 29313327a11da8b5dc66d75ffee3981ac50f60e5. * Change default region in hyp submit command Change default region to aws configure region. Tested locally by editing config file. * Print region info when using default region * Update print message in submit command * Updated to handle YAML arrays in config file (#190) * Fix CloudFormation tags parsing, array validation, and test mocking issues (#195) * Update Validation logic for the Create cluster * Update handling of json strings * Small Revert * Test fix (#199) * Adding testing for new template related code and for this branch **Description** **Testing Done** * Adding to within unit tests folder * Empty commit * fix * Fix * fix * Add for integ tests * Fix * Fix * Remove AbstractIntegrationTests * UPDATE CFN PARAM IN JINJA FILE (#198) * UPDATE CFN PARAM IN JINJA FILE **Description** Updated cfn cluster creation template. **Testing Done** * FIX UNIT TEST CASES FOR CFN PARAM **Description** Updated the unit test cases for the process cfn param util. **Testing Done** All the unit test cases pass. 
* Remove unused function and fix CloudFormation template issues **Description** - Removed redundant _process_cfn_template_content function from init_utils.py - Fixed missing InstanceGroupSettings1 and RigSettings1 parameters in CFN template by changing loop range from (2,21) to (1,21) - Removed duplicate load_config_and_validate function definition **Testing Done** - Verified CloudFormation template generates all required parameters 1-20 - Confirmed no duplicate function definitions remain - Updated unit test cases and the whole suite passes * Updated comment for Resource Name Prefix to reflect the usage better (#206) * Add default availability zone ID based on region (#194) * Add default availability zone ID based on region * Add mapping reference link * Replace AZ ID mapping with boto3 call * Update error handling for getting AZ ID * Use create_boto3_client util * Resolve conflicts * Replace hyp submit with hyp create by overriding the default for hyp create (#202) * replace hyp submit with hyp create by overriding the default for hyp create * minor change * update help text and unit test imports * update create command help message * minor syntax update to accommodate unit tests running in py3.9 * update unit test to rename submit into default_create * Updated docs for cli sdk ref (#192) * Add version compatibility check between server K8s and Client python K8s (#138) * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Fix unit test cases * Move regex to a constant. **Description** - Removed an integration test case that was being mocked. - Moved a regex to a constant. **Testing Done** Unit test cases pass; no changes were made to integration test cases, and they should not be affected.
* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Add ref link for version compatibility constraints **Description** Added a link to k8s documentation mentioning the constraints that govern the version compatibility of client and server. **Testing Done** No breaking changes. * Update logging information for submitting and deleting training job (#189) Co-authored-by: pintaoz <[email protected]> * Enhance docs with table formatting and comprehensive API reference **Description** - Convert CLI parameter lists to structured tables across all documentation files for better readability - Add comprehensive docstrings and examples to SDK classes (HPEndpointBase, HyperPodPytorchJob) - Enhance Sphinx configuration with better autodoc settings and extensions - Update API reference structure and formatting - Add custom CSS styling for improved table presentation - Update documentation requirements and index structure **Testing Done** - Verified documentation builds successfully with `make html` - Confirmed table formatting renders correctly in generated HTML - Validated API documentation generates properly with enhanced docstrings - Tested responsive table styling across different screen sizes - Checked that all parameter information remains accurate and complete * FIX ALTERED CODE **Description** Fixed the code altered while updating docstrings in `hyperpod_pytorch_job.py` file. **Testing Done** The unit test cases all pass. * FIX TEST CASES TO SKIP IF MODULE NOT FOUND **Description** Skipping the test cases if module not found. **Testing Done** Unit test cases all pass. Integ test cases can't be run for some reason. * Update with launch-fast-follow branch and fix unit test cases. **Description** **Testing Done** * Update with launch-fast-follow branch and fix unit test cases.
**Description** **Testing Done** --------- Co-authored-by: pintaoz-aws <[email protected]> Co-authored-by: pintaoz <[email protected]> * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status (#204) * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter --------- Co-authored-by: Roja Reddy Sareddy <[email protected]> * Update defaults to baseline example (#208) * Update defaults to baseline example * Init utils changes * More updates * More updates * Remove other jobs from template, change update-cluster verb to update, update help texts and readme (#209) * filter help arguments depending on current template, fix minor integ test issues by bringing change from main repo (#201) * Timeout for set_cluster_context (#211) * Timeout for set_cluster_context * Unit tests * Fix: list-clusters to display all HP clusters including those with 0 instances (#212) * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter * Fix: list-clusters to display all HP clusters including those not in 'InService' status --------- Co-authored-by: Roja Reddy Sareddy <[email protected]> * Update the enable_hp_inference_feature to be boolean.
(#213) --- The conversion from bool to string for CloudFormation is already handled --- Tested through unit tests and through manual testing * Bug fixes to HypCLI Cluster Creation (#210) * Fixed Bugs in HypCLI Cluster Creation * Updated file to match launch-fast-follow * Fully tested update to cluster creation * Update _parse_tags function to reflect more up-to-date changes * Update unit tests for hp_cluster_stack array handling and _parse_tags enhancements * Fixed failing unit test * Fix test expectation after merge - update to match actual Pydantic validation behavior * Fix config validation to handle list-to-JSON conversion in HpClusterStack * Final fix for unit tests * Fixed errors to validation --------- Co-authored-by: AviRuthen <[email protected]> * Append UUID to resource name prefix to ensure uniqueness. (#216) --- Tested with unit tests and manual testing * Docs for cluster stack creation (#207) * Add version compatibility check between server K8s and Client python K8s (#138) * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Fix unit test cases * Move regex to a constant. **Description** - Removed an integration test case that was being mocked. - Moved a regex to a constant. **Testing Done** Unit test cases pass; no changes were made to integration test cases, and they should not be affected. * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Add ref link for version compatibility constraints **Description** Added a link to k8s documentation mentioning the constraints that govern the version compatibility of client and server. **Testing Done** No breaking changes.
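As a rough illustration of the client/server version check described in the entries above, the sketch below extracts the major and minor numbers from two version strings and compares them. Everything in it is an assumption made for the example: the regex, the function names, and the tolerance of one minor version (borrowed from kubectl's documented skew policy) are not taken from the actual implementation in this repository.

```python
import re

# Assumed pattern for strings such as "v1.29.3" or "1.28+"; illustrative only.
VERSION_PATTERN = re.compile(r"^v?(\d+)\.(\d+)")

# Assumed tolerance, modeled on kubectl's "within one minor version" policy.
MAX_MINOR_SKEW = 1


def parse_major_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) parsed from the start of a version string."""
    match = VERSION_PATTERN.match(version.strip())
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_compatible(client_version: str, server_version: str) -> bool:
    """True when majors match and minors differ by at most MAX_MINOR_SKEW."""
    client_major, client_minor = parse_major_minor(client_version)
    server_major, server_minor = parse_major_minor(server_version)
    if client_major != server_major:
        return False
    return abs(client_minor - server_minor) <= MAX_MINOR_SKEW
```

Keeping the pattern in a module-level constant mirrors the "Move regex to a constant" change noted above; the real supported skew should be taken from the Kubernetes documentation that the commit links to.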
* Update logging information for submitting and deleting training job (#189) Co-authored-by: pintaoz <[email protected]> * Enhance docs with table formatting and comprehensive API reference **Description** - Convert CLI parameter lists to structured tables across all documentation files for better readability - Add comprehensive docstrings and examples to SDK classes (HPEndpointBase, HyperPodPytorchJob) - Enhance Sphinx configuration with better autodoc settings and extensions - Update API reference structure and formatting - Add custom CSS styling for improved table presentation - Update documentation requirements and index structure **Testing Done** - Verified documentation builds successfully with `make html` - Confirmed table formatting renders correctly in generated HTML - Validated API documentation generates properly with enhanced docstrings - Tested responsive table styling across different screen sizes - Checked that all parameter information remains accurate and complete * FIX ALTERED CODE **Description** Fixed the code altered while updating docstrings in `hyperpod_pytorch_job.py` file. **Testing Done** The unit test cases all pass. * ADD CLUSTER MANAGEMENT DOCS **Description** - Created comprehensive getting started guide for HyperPod cluster management - Added tab-set format showing both CLI and SDK options for consistency - Included step-by-step workflow from initialization to monitoring - Added cross-references to CLI documentation for auto-updating links - Filled in existing SDK methods (list_clusters, set_cluster_context) **Testing Done** Verified reStructuredText tab-set syntax renders correctly * Update PR: some PR comments fixed **Description** **Testing Done** * Update PR: some PR comments fixed **Description** **Testing Done** * Update cluster management getting started. **Description** **Testing Done** * Update cluster management cli ref to use md. **Description** Using markdown for the sake of uniformity.
**Testing Done** * Update cluster management getting started. **Description** Mentioning the missing file generated with `hyp init hyp-cluster` command. **Testing Done** N/A * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status (#204) * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter --------- Co-authored-by: Roja Reddy Sareddy <[email protected]> * ADD CLUSTER MANAGEMENT DOCS **Description** - Created comprehensive getting started guide for HyperPod cluster management - Added tab-set format showing both CLI and SDK options for consistency - Included step-by-step workflow from initialization to monitoring - Added cross-references to CLI documentation for auto-updating links - Filled in existing SDK methods (list_clusters, set_cluster_context) **Testing Done** Verified reStructuredText tab-set syntax renders correctly * Update for Cluster Management CLI commands. **Description** - Commented the complete autogen file for cli cluster management. - Added some updates to commands as required. **Testing Done** Verified the commands. * Update for Cluster Management CLI commands. **Description** Updated md after verification. **Testing Done** Verified the commands. * Add note about default region to docs. **Description** Added a note about how the region selection and flag usage works, for better UX. **Testing Done** The note shows up as we want it to. * Update update commands for hyp-cluster. **Description** Updated the hyp-cluster update command correctly. **Testing Done** Verified the docs are correct. * Fix a unit test case changed while fixing merge conflicts. 
**Description** **Testing Done** --------- Co-authored-by: pintaoz-aws <[email protected]> Co-authored-by: pintaoz <[email protected]> Co-authored-by: rsareddy0329 <[email protected]> Co-authored-by: Roja Reddy Sareddy <[email protected]> * Rename Stack related commands to hyp-cluster-stack instead of hyp-cluster (#214) --- Tested manually and through unit tests. * Revert "Bug fixes to HypCLI Cluster Creation (#210)" (#217) This reverts commit 2da2588edabce2fa41cebd7aa6830c4e26105818. * Task gov doc updates (#218) * Add version compatibility check between server K8s and Client python K8s (#138) * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Fix unit test cases * Move regex to a constant. **Description** - Removed an integration test case that was being mocked. - Moved a regex to a constant. **Testing Done** Unit test cases pass; no changes were made to integration test cases, and they should not be affected. * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Add ref link for version compatibility constraints **Description** Added a link to k8s documentation mentioning the constraints that govern the version compatibility of client and server. **Testing Done** No breaking changes.
* Update logging information for submitting and deleting training job (#189) Co-authored-by: pintaoz <[email protected]> * Enhance docs with table formatting and comprehensive API reference **Description** - Convert CLI parameter lists to structured tables across all documentation files for better readability - Add comprehensive docstrings and examples to SDK classes (HPEndpointBase, HyperPodPytorchJob) - Enhance Sphinx configuration with better autodoc settings and extensions - Update API reference structure and formatting - Add custom CSS styling for improved table presentation - Update documentation requirements and index structure **Testing Done** - Verified documentation builds successfully with `make html` - Confirmed table formatting renders correctly in generated HTML - Validated API documentation generates properly with enhanced docstrings - Tested responsive table styling across different screen sizes - Checked that all parameter information remains accurate and complete * FIX ALTERED CODE **Description** Fixed the code altered while updating docstrings in `hyperpod_pytorch_job.py` file. **Testing Done** The unit test cases all pass. * ADD CLUSTER MANAGEMENT DOCS **Description** - Created comprehensive getting started guide for HyperPod cluster management - Added tab-set format showing both CLI and SDK options for consistency - Included step-by-step workflow from initialization to monitoring - Added cross-references to CLI documentation for auto-updating links - Filled in existing SDK methods (list_clusters, set_cluster_context) **Testing Done** Verified reStructuredText tab-set syntax renders correctly * Update PR: some PR comments fixed **Description** **Testing Done** * Update cluster management cli ref to use md. **Description** Using markdown for the sake of uniformity.
**Testing Done** * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status (#204) * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter --------- Co-authored-by: Roja Reddy Sareddy <[email protected]> * ADD CLUSTER MANAGEMENT DOCS **Description** - Created comprehensive getting started guide for HyperPod cluster management - Added tab-set format showing both CLI and SDK options for consistency - Included step-by-step workflow from initialization to monitoring - Added cross-references to CLI documentation for auto-updating links - Filled in existing SDK methods (list_clusters, set_cluster_context) **Testing Done** Verified reStructuredText tab-set syntax renders correctly * Update for Cluster Management CLI commands. **Description** Updated md after verification. **Testing Done** Verified the commands. * Add note about default region to docs. **Description** Added a note about how the region selection and flag usage works, for better UX. **Testing Done** The note shows up as we want it to. * Update update commands for hyp-cluster. **Description** Updated the hyp-cluster update command correctly. **Testing Done** Verified the docs are correct. * Fix a unit test case changed while fixing merge conflicts. 
**Description** **Testing Done** * ADD NEW PARAMS TO CLI TRAINING DOCS **Description** - Resource parameters: accelerators, vcpu, memory, accelerators-limit, vcpu-limit, memory-limit - Topology parameters: preferred-topology, required-topology **Testing Done** - Verified parameter documentation follows existing format and style - Confirmed parameter descriptions match field definitions from source code - Validated documentation builds without errors * Updated docs for cli sdk ref (#192) * Add version compatibility check between server K8s and Client python K8s (#138) * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Fix unit test cases * Move regex to a constant. **Description** - Removed an integration test case that was being mocked. - Moved a regex to a constant. **Testing Done** Unit test cases pass; no changes were made to integration test cases, and they should not be affected. * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Add ref link for version compatibility constraints **Description** Added a link to k8s documentation mentioning the constraints that govern the version compatibility of client and server. **Testing Done** No breaking changes.
* Update logging information for submitting and deleting training job (#189) Co-authored-by: pintaoz <[email protected]> * Enhance docs with table formatting and comprehensive API reference **Description** - Convert CLI parameter lists to structured tables across all documentation files for better readability - Add comprehensive docstrings and examples to SDK classes (HPEndpointBase, HyperPodPytorchJob) - Enhance Sphinx configuration with better autodoc settings and extensions - Update API reference structure and formatting - Add custom CSS styling for improved table presentation - Update documentation requirements and index structure **Testing Done** - Verified documentation builds successfully with `make html` - Confirmed table formatting renders correctly in generated HTML - Validated API documentation generates properly with enhanced docstrings - Tested responsive table styling across different screen sizes - Checked that all parameter information remains accurate and complete * FIX ALTERED CODE **Description** Fixed the code altered while updating docstrings in `hyperpod_pytorch_job.py` file. **Testing Done** The unit test cases all pass. * FIX TEST CASES TO SKIP IF MODULE NOT FOUND **Description** Skipping the test cases if module not found. **Testing Done** Unit test cases all pass. Integ test cases can't be run for some reason. * Update with launch-fast-follow branch and fix unit test cases. **Description** **Testing Done** * Update with launch-fast-follow branch and fix unit test cases.
**Description** **Testing Done** --------- Co-authored-by: pintaoz-aws <[email protected]> Co-authored-by: pintaoz <[email protected]> * ADD CLUSTER MANAGEMENT DOCS **Description** - Created comprehensive getting started guide for HyperPod cluster management - Added tab-set format showing both CLI and SDK options for consistency - Included step-by-step workflow from initialization to monitoring - Added cross-references to CLI documentation for auto-updating links - Filled in existing SDK methods (list_clusters, set_cluster_context) **Testing Done** Verified reStructuredText tab-set syntax renders correctly * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status (#204) * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter --------- Co-authored-by: Roja Reddy Sareddy <[email protected]> * Update for Cluster Management CLI commands. **Description** Updated md after verification. **Testing Done** Verified the commands. * Add note about default region to docs. **Description** Added a note about how the region selection and flag usage works, for better UX. **Testing Done** The note shows up as we want it to. * Update for Cluster Management CLI commands. **Description** Updated md after verification. **Testing Done** Verified the commands. * Add note about default region to docs. **Description** Added a note about how the region selection and flag usage works, for better UX. **Testing Done** The note shows up as we want it to. 
* Enhance docs with table formatting and comprehensive API reference **Description** - Convert CLI parameter lists to structured tables across all documentation files for better readability - Add comprehensive docstrings and examples to SDK classes (HPEndpointBase, HyperPodPytorchJob) - Enhance Sphinx configuration with better autodoc settings and extensions - Update API reference structure and formatting - Add custom CSS styling for improved table presentation - Update documentation requirements and index structure **Testing Done** - Verified documentation builds successfully with `make html` - Confirmed table formatting renders correctly in generated HTML - Validated API documentation generates properly with enhanced docstrings - Tested responsive table styling across different screen sizes - Checked that all parameter information remains accurate and complete * ADD CLUSTER MANAGEMENT DOCS **Description** - Created comprehensive getting started guide for HyperPod cluster management - Added tab-set format showing both CLI and SDK options for consistency - Included step-by-step workflow from initialization to monitoring - Added cross-references to CLI documentation for auto-updating links - Filled in existing SDK methods (list_clusters, set_cluster_context) **Testing Done** Verified reStructuredText tab-set syntax renders correctly * Updated docs for cli sdk ref (#192) * Add version compatibility check between server K8s and Client python K8s (#138) * Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Fix unit test cases * Move regex to a constant. **Description** - Removed an integration test case that was being mocked. - Moved a regex to a constant. **Testing Done** Unit test cases pass; no changes were made to integration test cases, and they should not be affected.
* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s * Add ref link for version compatibility constraints **Description** Added a link to k8s documentation mentioning the constraints that govern the version compatibility of client and server. **Testing Done** No breaking changes. * Update logging information for submitting and deleting training job (#189) Co-authored-by: pintaoz <[email protected]> * Enhance docs with table formatting and comprehensive API reference **Description** - Convert CLI parameter lists to structured tables across all documentation files for better readability - Add comprehensive docstrings and examples to SDK classes (HPEndpointBase, HyperPodPytorchJob) - Enhance Sphinx configuration with better autodoc settings and extensions - Update API reference structure and formatting - Add custom CSS styling for improved table presentation - Update documentation requirements and index structure **Testing Done** - Verified documentation builds successfully with `make html` - Confirmed table formatting renders correctly in generated HTML - Validated API documentation generates properly with enhanced docstrings - Tested responsive table styling across different screen sizes - Checked that all parameter information remains accurate and complete * FIX ALTERED CODE **Description** Fixed the code altered while updating docstrings in `hyperpod_pytorch_job.py` file. **Testing Done** The unit test cases all pass. * FIX TEST CASES TO SKIP IF MODULE NOT FOUND **Description** Skipping the test cases if module not found. **Testing Done** Unit test cases all pass. Integ test cases can't be run for some reason. * Update with launch-fast-follow branch and fix unit test cases. **Description** **Testing Done** * Update with launch-fast-follow branch and fix unit test cases.
**Description** **Testing Done** --------- Co-authored-by: pintaoz-aws <[email protected]> Co-authored-by: pintaoz <[email protected]> * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status (#204) * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter * Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter --------- Co-authored-by: Roja Reddy Sareddy <[email protected]> * ADD CLUSTER MANAGEMENT DOCS **Description** - Created comprehensive getting started guide for HyperPod cluster management - Added tab-set format showing both CLI and SDK options for consistency - Included step-by-step workflow from initialization to monitoring - Added cross-references to CLI documentation for auto-updating links - Filled in existing SDK methods (list_clusters, set_cluster_context) **Testing Done** Verified reStructuredText tab-set syntax renders correctly * Update PR: some PR comments fixed **Description** **Testing Done** * Update cluster management cli ref to use md. **Description** Using markdown for the sake of uniformity. **Testing Done** * Update for Cluster Management CLI commands. **Description** Updated md after verification. **Testing Done** Verified the commands. * Add note about default region to docs. **Description** Added a note about how the region selection and flag usage works, for better UX. **Testing Done** The note shows up as we want it to. * Restore code lines broken while fixing merge conflicts. **Description** **Testing Done** * Update docs and README to include task gov and gpu_quota params.
**Description** **Testing Done** --------- Co-authored-by: pintaoz-aws <[email protected]> Co-authored-by: pintaoz <[email protected]> Co-authored-by: rsareddy0329 <[email protected]> Co-authored-by: Roja Reddy Sareddy <[email protected]> * update cloud formation template to 1.1, fix instance group setting format (#220) * update cloud formation template to 1.1, fix instance group setting format * fix unit test * Reorder and update description for each field in cluster creation (#221) * Reorder and update description for each field in cluster creation - Reordering the fields to match the order in the config.yaml file - Updating descriptions to match the comments in the config.yaml file - Updating default values (like resource_name_prefix changed from "hyperpod-cli-integ-test" to "hyp-eks-stack" and hyperpod_cluster_name from "hyperpod-cluster-integ-test" to "hyperpod-cluster") All unit tests passed. * update model.py * fix: validation error for json format that accommodates both single and double quotes (#224) * update cloud formation template to 1.1, fix instance group setting format * fix unit test * fix: validation error for json format that accommodates both single and double quotes * Add --debug flag to docs (#225) **Description** **Testing Done** * Update the cluster stack command to be cluster-stack instead of hyp-cluster-stack (#219) * Append UUID to resource name prefix to ensure uniqueness.
--- Tested with unit tests and manual testing * Update the cluster stack command to be `cluster-stack` instead of `hyp-cluster-stack` * Fix * Update CLI docs for validation and resource naming clarity (#226) **Description** - Clarified hyp validate performs syntactic validation only, not AWS resource validation - Added resource_name_prefix requirement for unique deployment identifiers - Updated prerequisites and examples with accurate behavior descriptions **Testing Done** - Verified validation function implementation matches documentation * Update CHANGELOG.md for launch fast follow release (#228) * Update CHANGELOG.md for launch fast follow release * Update to minor version * Add default availability zone (#229) * Add default availability zone - Add default AZ IDs - Updated field description in model.py Tested by manually entering different AZ IDs in config.yaml and added unit tests * Pick 2 AZ IDs instead of 1 during submission * Add example of entering az ID * Update description in model.py * Enable Telemetry for Cluster creation (#230) * Enable Telemetry for Cluster creation * Telemetry for CLI and updates * Fix * Implemented exec command with unit tests (#222) * Implemented exec command with unit tests * Minor UX change to help for pod and all-pods * Better help for exec command usage * Removed unnecessary comment * Abstract out some default values from the user. (#234) Also add Example Notebooks * Cleanup and fix for notebooks (#236) * Abstract out some default values from the user. Also add Example Notebooks * Cleanup and fix * Cleanup for CLI notebook * Add sphinx_click to requirements. 
(#231) **Description** **Testing Done** * Add integration tests for HP Cluster Creation (#227) * Add integration test for HP cluster creation workflow * Add utility functions for integration tests * Cleaned imports and utils * Fixed Bugs related to Integ Test * Probable fix for configure bug * Revert Previous Changes and Fixed Configure Bug * update configure import strategy * remove cluster-stack command from list and describe cli * Updated monitoring logic to use boto3 * Changed name of cluster to be monitored --------- Co-authored-by: Molly He <[email protected]> * Update setup.py (#237) * Update pyproject.toml (#238) * Update CHANGELOG.md (#239) * Test Fixes * Skip some invoke tests * Skip some invoke tests --------- Co-authored-by: Molly He <[email protected]> Co-authored-by: Zhaoqi <[email protected]> Co-authored-by: papriwal <[email protected]> Co-authored-by: jam-jee <[email protected]> Co-authored-by: pintaoz-aws <[email protected]> Co-authored-by: pintaoz <[email protected]> Co-authored-by: rsareddy0329 <[email protected]> Co-authored-by: Roja Reddy Sareddy <[email protected]> Co-authored-by: aviruthen <[email protected]> Co-authored-by: AviRuthen <[email protected]> Co-authored-by: Zhaoqi <[email protected]>
Tested locally by running the commands and saw the expected results.
PR Approval Steps
For Requester
For Reviewer
For Requester
section to double check each item.