@nargokul nargokul (Collaborator) commented Aug 26, 2025

Overview

  • This change merges the cluster creation feature into the main HyperPod CLI.

PR Approval Steps

For Requester

  1. Description
    • Check the PR title and description for clarity. It should describe the changes made and the reason behind them.
    • Ensure that the PR follows the contribution guidelines, if applicable.
  2. Security requirements
  3. Manual review
    1. Click on the Files changed tab to see the code changes. Review the changes thoroughly:
      • Code Quality: Check for coding standards, naming conventions, and readability.
      • Functionality: Ensure that the changes meet the requirements and that all necessary code paths are tested.
      • Security: Check for any security issues or vulnerabilities.
      • Documentation: Confirm that any necessary documentation (code comments, README updates, etc.) has been updated.
  4. Check for Merge Conflicts:
    • Verify if there are any merge conflicts with the base branch. GitHub will usually highlight this. If there are conflicts, you should resolve them.

For Reviewer

  1. Go through the For Requester section and double-check each item.
  2. Request Changes or Approve the PR:
    1. If the PR is ready to be merged, click Review changes and select Approve.
    2. If changes are required, select Request changes and provide feedback. Be constructive and clear in your feedback.
  3. Merging the PR
    1. Check the Merge Method:
      1. Decide on the appropriate merge method based on your repository's guidelines (e.g., Squash and merge, Rebase and merge, or Merge).
    2. Merge the PR:
      1. Click the Merge pull request button.
      2. Confirm the merge by clicking Confirm merge.

mollyheamazon and others added 30 commits August 25, 2025 17:51
* js init and reset done, next step is to expand to custom

* basic workflow done and handles edge cases for multiple init

* minor change to rerun init console print

* init experience baseline

* Add unique time string to integ test (#150)

* Add unique time string to integ test

* Update syntax

* update template into TEMPLATES constant configuration

---------

Co-authored-by: Zhaoqi <[email protected]>
* Cluster Management SDK

* Remove file

* Address PR comments

* Fix

* Updates and Cleanup
* Cluster Creation CLI

**Description**

This update integrates the init experience with the cluster creation SDK to configure multiple attributes and create the cluster and its required resources.

**Testing Done**

For manual testing, ran `hyp init cluster`, `hyp configure`, and `hyp submit`, and verified stack creation.

* Unit Tests

* Validations
* Update Instance Group and Rig Settings Params

* Unit Tests
* Add describe, list cluster stacks features to CLI.

**Description**
- Added the desired features by using `describe_stacks` and `list_stacks` CloudFormation APIs.
- Formatted the JSON output of API to make it more readable.
- Added Stack status on Describe stack feature explicitly.

**Testing Done**
Tested both features on CLI to be working.
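
The describe and list features above could be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the function names `describe_stack` and `list_stacks` and the helper `_json_default` are hypothetical, and `cfn_client` is assumed to be a boto3 CloudFormation client (or any object exposing the same `describe_stacks` and `list_stacks` calls). Pagination via `NextToken` is ignored for brevity.

```python
import json
from datetime import datetime


def _json_default(value):
    # CloudFormation responses contain datetime objects, which the json
    # module cannot serialize without help.
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


def describe_stack(cfn_client, stack_name):
    """Return the stack status and a readable JSON dump of one stack."""
    response = cfn_client.describe_stacks(StackName=stack_name)
    stack = response["Stacks"][0]
    # Surface the status explicitly instead of leaving it buried in the JSON.
    return stack["StackStatus"], json.dumps(stack, indent=2, default=_json_default)


def list_stacks(cfn_client):
    """Return a readable JSON dump of the first page of stack summaries."""
    response = cfn_client.list_stacks()
    return json.dumps(response["StackSummaries"], indent=2, default=_json_default)
```

Passing the client in as an argument keeps the formatting logic testable without AWS credentials.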

* Add test cases for describe and list features.

**Description**
Added unit and integration test cases for list and describe features.

**Testing Done**
The test cases pass.

* Update CLI command call for list and describe features

**Description**
Updated the CLI commands to follow the expected nomenclature.

**Testing Done**
All the test cases pass and do not need any changes.

* Update CLI logging for List and Describe features.

**Description**
Improved logging on CLI to improve the UX.

**Testing Done**
Test cases pass.

* Add test cases for describe and list features.

**Description**
Added unit and integration test cases for list and describe features.

**Testing Done**
The test cases pass.

* Create param (#153)

* Update Instance Group and Rig Settings Params

* Unit Tests

* **Description**
Add util to create boto3 client
Improve output formatting on cli for list and describe

**Testing Done**
No changes required to test cases; the changes are backward compatible.

* Remove excess code due to git conflicts.

**Description**

**Testing Done**

* Remove print and use click instead.

**Description**

**Testing Done**

* Remove print and use click instead.

**Description**

**Testing Done**

---------

Co-authored-by: Gokul Anantha Narayanan <[email protected]>
…xperience, update hytorch template to add CRD default, update custom inference endpoint to check s3 and fsx required
…r init experience, update hytorch template to add CRD default, update custom inference endpoint to check s3 and fsx required"

This reverts commit 63bc2c5c284f07f642f76afbcde83923fd910c61.
…r cluster init experience, update hytorch template to add CRD default, update custom inference endpoint to check s3 and fsx required"" (#156)

This reverts commit 09c81f3438796d6e5dbfd0475dc895f70cdaba30.
* Add describe, list cluster stacks features to CLI.

**Description**
- Added the desired features by using `describe_stacks` and `list_stacks` CloudFormation APIs.
- Formatted the JSON output of API to make it more readable.
- Added Stack status on Describe stack feature explicitly.

**Testing Done**
Tested both features on CLI to be working.

* Add test cases for describe and list features.

**Description**
Added unit and integration test cases for list and describe features.

**Testing Done**
The test cases pass.

* Update CLI command call for list and describe features

**Description**
Updated the CLI commands to follow the expected nomenclature.

**Testing Done**
All the test cases pass and do not need any changes.

* Add test cases for describe and list features.

**Description**
Added unit and integration test cases for list and describe features.

**Testing Done**
The test cases pass.

* Create param (#153)

* Update Instance Group and Rig Settings Params

* Unit Tests

* **Description**
Add util to create boto3 client
Improve output formatting on cli for list and describe

**Testing Done**
No changes required to test cases; the changes are backward compatible.

* Remove print and use click instead.

**Description**

**Testing Done**

* Add get status and check status method for SDK experience

**Description**

**Testing Done**

* Fix merge conflicts

**Description**

**Testing Done**

* Fix code duplication due to merge conflicts.

**Description**

**Testing Done**

* Fix code duplication due to merge conflicts.

**Description**

**Testing Done**

* Remove unwanted empty line

**Description**

**Testing Done**

* Add unit and integration test cases for get and check status methods

**Description**

**Testing Done**
All the tests pass.

---------

Co-authored-by: Gokul Anantha Narayanan <[email protected]>
)

**Description**
Update create cluster method to return full cluster detail object.

**Testing Done**
Updated the unit test case for create and all test cases pass. No change needed to integration test case for now.
…ross template (#160)

* add inference template submit backend logic, fix namespace default across template

* add namespace to jumpstart and custom endpoint template to simplify logic, no special handling for namespace for any templates, add unit tests for init experience
* Add version compatibility check between server K8s and client Python K8s (#138)

* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s

* Fix unit test cases

* Move regex to a constant.

**Description**
- Removed an integration test case that was being mocked.
- Moved a regex to a constant.

**Testing Done**
Unit test cases pass; no changes were made to integration test cases, and they should not be affected.

* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s

* Add ref link for version compatibility constraints

**Description**
Added a link to k8s documentation mentioning the constraints that rule the version compatibility of client and server.

**Testing Done**
No breaking changes.
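
A version check of the kind described in this commit might look like the sketch below. It is only an illustration under stated assumptions: the names `VERSION_PATTERN`, `MAX_MINOR_SKEW`, `parse_version`, and `is_compatible` are hypothetical, both inputs are assumed to be Kubernetes-style version strings, and the one-minor-version tolerance is a placeholder. The actual limits are whatever the linked Kubernetes documentation specifies.

```python
import re

# Hypothetical constant: matches strings such as "v1.31.2" or "1.31.2-eks-abc".
VERSION_PATTERN = re.compile(r"^v?(\d+)\.(\d+)")

# Assumption for illustration only: tolerate a difference of one minor version.
MAX_MINOR_SKEW = 1


def parse_version(version):
    """Extract (major, minor) from a Kubernetes-style version string."""
    match = VERSION_PATTERN.match(version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_compatible(server_version, client_version, max_minor_skew=MAX_MINOR_SKEW):
    """Return True when major versions match and minor skew is within bounds."""
    server_major, server_minor = parse_version(server_version)
    client_major, client_minor = parse_version(client_version)
    if server_major != client_major:
        return False
    return abs(server_minor - client_minor) <= max_minor_skew
```

Keeping the pattern in a module-level constant mirrors the "Move regex to a constant" change and avoids recompiling it on every call.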

* Recipe supp (#182)

* Add sagemaker-hyperpod-recipes submodule

* Recipe Support for Hyp

---------

Co-authored-by: papriwal <[email protected]>
Co-authored-by: jam-jee <[email protected]>
* Add version compatibility check between server K8s and client Python K8s (#138)

* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s

* Fix unit test cases

* Move regex to a constant.

**Description**
- Removed an integration test case that was being mocked.
- Moved a regex to a constant.

**Testing Done**
Unit test cases pass; no changes were made to integration test cases, and they should not be affected.

* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s

* Add ref link for version compatibility constraints

**Description**
Added a link to k8s documentation mentioning the constraints that rule the version compatibility of client and server.

**Testing Done**
No breaking changes.

* Update for Hyperpod Cluster

* Fix training test (#184)

* Fix SDK training test: Add wait time before refresh

* Fix training tests in canaries

* Add labels to the top level metadata (#158)

Co-authored-by: pintaoz <[email protected]>

* Update Fixes and Updating the S3 location to point to version locked templates

* Update logic to point to main stack

* Fixes and Tests

* Fix

* Address Fix

* Code cleanup

* Address comments

* Fixes

---------

Co-authored-by: papriwal <[email protected]>
Co-authored-by: Zhaoqi <[email protected]>
Co-authored-by: pintaoz-aws <[email protected]>
Co-authored-by: pintaoz <[email protected]>
…r change see description (#168)

* add inference template submit backend logic, fix namespace default across template

* update get latest version logic for init, add hyp-pytorch-job template without submit e2e or volume handler

* add submit for pytorch-job, e2e working, missing volume handle

* add namespace to jumpstart and custom endpoint template to simplify logic, no special handling for namespace for any templates, add unit tests for init experience

* Resolve namespace logic issue, update endpoint-name for endpoint schema to required

* add support for volume flag, and other special handling (list and dictionary)

* revert breaking changes for jumpstart and custom endpoint template v1.0, remove generate_click_command from init command
* Show complete cfn param template

**Description**
Showing complete CFN param template regarding cluster creation to provide a better UX and more context for the user.

**Testing Done**
Update unit test cases wherever needed, the related test cases all pass.

* SIMPLIFY CFN TEMPLATE GENERATION

**Description**
Moved to putting the full CFN template in the jinja file.

**Testing Done**
Update unit test cases to cover the updates and all of them pass.

* FIX ERROR LOGGING IN TEMPLATE PROCESSING

**Description**
Logging error in case we get an exception while getting the template.

**Testing Done**
Updated the related unit test case and the whole associated test suite passes.

* ADD REGION OPTION TO `describe-cluster-stack` COMMAND

**Description**
- Add --region option to describe_cluster_stack command for specifying AWS region
- Update function signature to accept region parameter
- Pass region parameter to HpClusterStack.describe() method call

**Testing Done**
Update unit test cases and all the related test cases pass.
This reverts commit 29313327a11da8b5dc66d75ffee3981ac50f60e5.
…model.py (#186)

* Fix merge conflict issues, update cluster template to add default in model.py

* Update model.py to remove default for network related params
* Revert "Bring recipe-supp branch to staging repo (#175)"

This reverts commit 29313327a11da8b5dc66d75ffee3981ac50f60e5.

* Change default region in hyp submit command

Change default region to aws configure region.

Tested locally by editing config file.

* Print region info when using default region

* Update print message in submit command
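
The region fallback described in the commits above could be sketched as follows. This is an assumption-laden illustration: `resolve_region` is a hypothetical helper, and `session` is assumed to be a `boto3.session.Session` (or any object with a `region_name` attribute), whose value reflects the region set through `aws configure`.

```python
def resolve_region(explicit_region, session):
    """Prefer an explicitly passed region, else fall back to the session default."""
    if explicit_region:
        return explicit_region
    region = session.region_name
    if region is None:
        raise ValueError("No region given and no default region is configured")
    # Tell the user which region was picked implicitly.
    print(f"Using default region: {region}")
    return region
```
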
…ssues (#195)

* Update Validation logic for the Create cluster

* Update handling of json strings

* Small Revert
* Adding testing for new template related code and for this branch

**Description**

**Testing Done**

* Adding to within unit tests folder

* Empty commit

* fix

* Fix

* fix

* Add for integ tests

* Fix

* Fix

* Remove AbstractIntegrationTests
* UPDATE CFN PARAM IN JINJA FILE

**Description**
Updated cfn cluster creation template.

**Testing Done**

* FIX UNIT TEST CASES FOR CFN PARAM

**Description**
Updated the unit test cases for the process cfn param util.

**Testing Done**
All the unit test cases pass.

* Remove unused function and fix CloudFormation template issues

**Description**
- Removed redundant _process_cfn_template_content function from init_utils.py
- Fixed missing InstanceGroupSettings1 and RigSettings1 parameters in CFN template by changing loop range from (2,21) to (1,21)
- Removed duplicate load_config_and_validate function definition

**Testing Done**
- Verified CloudFormation template generates all required parameters 1-20
- Confirmed no duplicate function definitions remain
- Updated unit test cases and the whole suite passes
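
The loop-range fix above is an off-by-one correction: a range that starts at 2 skips the first parameter, and the end value 21 is excluded, so `(1,21)` covers 1 through 20. The short Python sketch below shows the effect; `parameter_names` is a hypothetical helper written for illustration, not code from the template.

```python
def parameter_names(prefix, start=1, stop=21):
    """Build numbered parameter names; `stop` is exclusive, as with range()."""
    return [f"{prefix}{index}" for index in range(start, stop)]


fixed = parameter_names("InstanceGroupSettings")           # 1 through 20
buggy = parameter_names("InstanceGroupSettings", start=2)  # 2 through 20
```
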
* Add default availability zone ID based on region

* Add mapping reference link

* Replace AZ ID mapping with boto3 call

* Update error handling for getting AZ ID

* Use create_boto3_client util

* Resolve conflicts
…create (#202)

* replace hyp submit with hyp create by overriding the default for hyp create

* minor change

* update help text and unit test imports

* update create command help message

* minor syntax update to accommodate unit tests running in py3.9

* update unit test to rename submit into default_create
* Add version compatibility check between server K8s and client Python K8s (#138)

* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s

* Fix unit test cases

* Move regex to a constant.

**Description**
- Removed an integration test case that was being mocked.
- Moved a regex to a constant.

**Testing Done**
Unit test cases pass; no changes were made to integration test cases, and they should not be affected.

* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s

* Add ref link for version compatibility constraints

**Description**
Added a link to k8s documentation mentioning the constraints that rule the version compatibility of client and server.

**Testing Done**
No breaking changes.

* Update logging information for submitting and deleting training job (#189)

Co-authored-by: pintaoz <[email protected]>

* Enhance docs with table formatting and comprehensive API reference

**Description**
- Convert CLI parameter lists to structured tables across all documentation files for better readability
- Add comprehensive docstrings and examples to SDK classes (HPEndpointBase, HyperPodPytorchJob)
- Enhance Sphinx configuration with better autodoc settings and extensions
- Update API reference structure and formatting
- Add custom CSS styling for improved table presentation
- Update documentation requirements and index structure

**Testing Done**
- Verified documentation builds successfully with `make html`
- Confirmed table formatting renders correctly in generated HTML
- Validated API documentation generates properly with enhanced docstrings
- Tested responsive table styling across different screen sizes
- Checked that all parameter information remains accurate and complete

* FIX ALTERED CODE

**Description**
Fixed the code altered while updating docstrings in `hyperpod_pytorch_job.py` file.

**Testing Done**
The unit test cases all pass.

* FIX TEST CASES TO SKIP IF MODULE NOT FOUND

**Description**
Skipping the test cases if module not found.

**Testing Done**
Unit test cases all pass. Integ test cases can't be run for some reason.

* Update with launch-fast-follow branch and fix unit test cases.

**Description**

**Testing Done**

* Update with launch-fast-follow branch and fix unit test cases.

**Description**

**Testing Done**

---------

Co-authored-by: pintaoz-aws <[email protected]>
Co-authored-by: pintaoz <[email protected]>
papriwal and others added 19 commits August 25, 2025 18:05
* Add version compatibility check between server K8s and client Python K8s (#138)

* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s

* Fix unit test cases

* Move regex to a constant.

**Description**
- Removed an integration test case that was being mocked.
- Moved a regex to a constant.

**Testing Done**
Unit test cases pass; no changes were made to integration test cases, and they should not be affected.

* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s

* Add ref link for version compatibility constraints

**Description**
Added a link to k8s documentation mentioning the constraints that rule the version compatibility of client and server.

**Testing Done**
No breaking changes.

* Update logging information for submitting and deleting training job (#189)

Co-authored-by: pintaoz <[email protected]>

* Enhance docs with table formatting and comprehensive API reference

**Description**
- Convert CLI parameter lists to structured tables across all documentation files for better readability
- Add comprehensive docstrings and examples to SDK classes (HPEndpointBase, HyperPodPytorchJob)
- Enhance Sphinx configuration with better autodoc settings and extensions
- Update API reference structure and formatting
- Add custom CSS styling for improved table presentation
- Update documentation requirements and index structure

**Testing Done**
- Verified documentation builds successfully with `make html`
- Confirmed table formatting renders correctly in generated HTML
- Validated API documentation generates properly with enhanced docstrings
- Tested responsive table styling across different screen sizes
- Checked that all parameter information remains accurate and complete

* FIX ALTERED CODE

**Description**
Fixed the code altered while updating docstrings in `hyperpod_pytorch_job.py` file.

**Testing Done**
The unit test cases all pass.

* ADD CLUSTER MANAGEMENT DOCS

**Description**
- Created comprehensive getting started guide for HyperPod cluster management
- Added tab-set format showing both CLI and SDK options for consistency
- Included step-by-step workflow from initialization to monitoring
- Added cross-references to CLI documentation for auto-updating links
- Filled in existing SDK methods (list_clusters, set_cluster_context)

**Testing Done**
Verified reStructuredText tab-set syntax renders correctly

* Update PR: some PR comments fixed

**Description**

**Testing Done**

* Update cluster management cli ref to use md.

**Description**
Using markdown for the sake of uniformity.

**Testing Done**

* Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status (#204)

* Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status

* Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter

* Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter

---------

Co-authored-by: Roja Reddy Sareddy <[email protected]>
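
The listing behaviour in this fix might be sketched as below. It is a hedged illustration rather than the merged code: `list_cluster_stacks` is a hypothetical name, `cfn_client` is assumed to be a boto3 CloudFormation client, and `status` is assumed to be a list of stack status strings. `StackStatusFilter` is the CloudFormation `list_stacks` parameter for server-side filtering; pagination is omitted.

```python
def list_cluster_stacks(cfn_client, status=None):
    """List stack summaries, hiding deleted stacks unless a status is requested."""
    if status:
        # Let CloudFormation filter by the statuses the caller asked for.
        response = cfn_client.list_stacks(StackStatusFilter=list(status))
        return response["StackSummaries"]
    response = cfn_client.list_stacks()
    # Default view: drop stacks that have already been deleted.
    return [
        summary
        for summary in response["StackSummaries"]
        if summary["StackStatus"] != "DELETE_COMPLETE"
    ]
```

With this shape, a caller can still inspect deleted stacks by passing `status=["DELETE_COMPLETE"]` explicitly.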

* ADD CLUSTER MANAGEMENT DOCS

**Description**
- Created comprehensive getting started guide for HyperPod cluster management
- Added tab-set format showing both CLI and SDK options for consistency
- Included step-by-step workflow from initialization to monitoring
- Added cross-references to CLI documentation for auto-updating links
- Filled in existing SDK methods (list_clusters, set_cluster_context)

**Testing Done**
Verified reStructuredText tab-set syntax renders correctly

* Update for Cluster Management CLI commands.

**Description**
Updated md after verification.

**Testing Done**
Verified the commands.

* Add note about default region to docs.

**Description**
Added a note about how the region selection and flag usage works, for better UX.

**Testing Done**
The note shows up as we want it to.

* Update update commands for hyp-cluster.

**Description**
Updated the hyp-cluster update command correctly.

**Testing Done**
Verified the docs are correct.

* Fix a unit test case changed while fixing merge conflicts.

**Description**

**Testing Done**

* ADD NEW PARAMS TO CLI TRAINING DOCS

**Description**
- Resource parameters: accelerators, vcpu, memory, accelerators-limit, vcpu-limit, memory-limit
- Topology parameters: preferred-topology, required-topology

**Testing Done**
- Verified parameter documentation follows existing format and style
- Confirmed parameter descriptions match field definitions from source code
- Validated documentation builds without errors

* Updated docs for cli sdk ref (#192)

* Add version compatibility check between server K8s and client Python K8s (#138)

* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s

* Fix unit test cases

* Move regex to a constant.

**Description**
- Removed an integration test case that was being mocked.
- Moved a regex to a constant.

**Testing Done**
Unit test cases pass; no changes were made to integration test cases, and they should not be affected.

* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s

* Add ref link for version compatibility constraints

**Description**
Added a link to k8s documentation mentioning the constraints that rule the version compatibility of client and server.

**Testing Done**
No breaking changes.

* Update logging information for submitting and deleting training job (#189)

Co-authored-by: pintaoz <[email protected]>

* Enhance docs with table formatting and comprehensive API reference

**Description**
- Convert CLI parameter lists to structured tables across all documentation files for better readability
- Add comprehensive docstrings and examples to SDK classes (HPEndpointBase, HyperPodPytorchJob)
- Enhance Sphinx configuration with better autodoc settings and extensions
- Update API reference structure and formatting
- Add custom CSS styling for improved table presentation
- Update documentation requirements and index structure

**Testing Done**
- Verified documentation builds successfully with `make html`
- Confirmed table formatting renders correctly in generated HTML
- Validated API documentation generates properly with enhanced docstrings
- Tested responsive table styling across different screen sizes
- Checked that all parameter information remains accurate and complete

* FIX ALTERED CODE

**Description**
Fixed the code altered while updating docstrings in `hyperpod_pytorch_job.py` file.

**Testing Done**
The unit test cases all pass.

* FIX TEST CASES TO SKIP IF MODULE NOT FOUND

**Description**
Skipping the test cases if module not found.

**Testing Done**
Unit test cases all pass. Integ test cases can't be run for some reason.

* Update with launch-fast-follow branch and fix unit test cases.

**Description**

**Testing Done**

* Update with launch-fast-follow branch and fix unit test cases.

**Description**

**Testing Done**

---------

Co-authored-by: pintaoz-aws <[email protected]>
Co-authored-by: pintaoz <[email protected]>

* ADD CLUSTER MANAGEMENT DOCS

**Description**
- Created comprehensive getting started guide for HyperPod cluster management
- Added tab-set format showing both CLI and SDK options for consistency
- Included step-by-step workflow from initialization to monitoring
- Added cross-references to CLI documentation for auto-updating links
- Filled in existing SDK methods (list_clusters, set_cluster_context)

**Testing Done**
Verified reStructuredText tab-set syntax renders correctly

* Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status (#204)

* Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status

* Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter

* Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter

---------

Co-authored-by: Roja Reddy Sareddy <[email protected]>

* Update for Cluster Management CLI commands.

**Description**
Updated md after verification.

**Testing Done**
Verified the commands.

* Add note about default region to docs.

**Description**
Added a note about how the region selection and flag usage works, for better UX.

**Testing Done**
The note shows up as we want it to.

* Update for Cluster Management CLI commands.

**Description**
Updated md after verification.

**Testing Done**
Verified the commands.

* Add note about default region to docs.

**Description**
Added a note about how the region selection and flag usage works, for better UX.

**Testing Done**
The note shows up as we want it to.

* Enhance docs with table formatting and comprehensive API reference

**Description**
- Convert CLI parameter lists to structured tables across all documentation files for better readability
- Add comprehensive docstrings and examples to SDK classes (HPEndpointBase, HyperPodPytorchJob)
- Enhance Sphinx configuration with better autodoc settings and extensions
- Update API reference structure and formatting
- Add custom CSS styling for improved table presentation
- Update documentation requirements and index structure

**Testing Done**
- Verified documentation builds successfully with `make html`
- Confirmed table formatting renders correctly in generated HTML
- Validated API documentation generates properly with enhanced docstrings
- Tested responsive table styling across different screen sizes
- Checked that all parameter information remains accurate and complete

* ADD CLUSTER MANAGEMENT DOCS

**Description**
- Created comprehensive getting started guide for HyperPod cluster management
- Added tab-set format showing both CLI and SDK options for consistency
- Included step-by-step workflow from initialization to monitoring
- Added cross-references to CLI documentation for auto-updating links
- Filled in existing SDK methods (list_clusters, set_cluster_context)

**Testing Done**
Verified reStructuredText tab-set syntax renders correctly

* Updated docs for cli sdk ref (#192)

* Add version compatibility check between server K8s and client Python K8s (#138)

* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s

* Fix unit test cases

* Move regex to a constant.

**Description**
- Removed an integration test case that was being mocked.
- Moved a regex to a constant.

**Testing Done**
Unit test cases pass; no changes were made to integration test cases, and they should not be affected.

* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s

* Add ref link for version compatibility constraints

**Description**
Added a link to k8s documentation mentioning the constraints that rule the version compatibility of client and server.

**Testing Done**
No breaking changes.

* Update logging information for submitting and deleting training job (#189)

Co-authored-by: pintaoz <[email protected]>

* Enhance docs with table formatting and comprehensive API reference

**Description**
- Convert CLI parameter lists to structured tables across all documentation files for better readability
- Add comprehensive docstrings and examples to SDK classes (HPEndpointBase, HyperPodPytorchJob)
- Enhance Sphinx configuration with better autodoc settings and extensions
- Update API reference structure and formatting
- Add custom CSS styling for improved table presentation
- Update documentation requirements and index structure

**Testing Done**
- Verified documentation builds successfully with `make html`
- Confirmed table formatting renders correctly in generated HTML
- Validated API documentation generates properly with enhanced docstrings
- Tested responsive table styling across different screen sizes
- Checked that all parameter information remains accurate and complete

* FIX ALTERED CODE

**Description**
Fixed the code altered while updating docstrings in `hyperpod_pytorch_job.py` file.

**Testing Done**
The unit test cases all pass.

* FIX TEST CASES TO SKIP IF MODULE NOT FOUND

**Description**
Skipping the test cases if module not found.

**Testing Done**
Unit test cases all pass. Integ test cases can't be run for some reason.

* Update with launch-fast-follow branch and fix unit test cases.

**Description**

**Testing Done**

* Update with launch-fast-follow branch and fix unit test cases.

**Description**

**Testing Done**

---------

Co-authored-by: pintaoz-aws <[email protected]>
Co-authored-by: pintaoz <[email protected]>

* Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status (#204)

* Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status

* Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter

* Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter

---------

Co-authored-by: Roja Reddy Sareddy <[email protected]>

* ADD CLUSTER MANAGEMENT DOCS

**Description**
- Created comprehensive getting started guide for HyperPod cluster management
- Added tab-set format showing both CLI and SDK options for consistency
- Included step-by-step workflow from initialization to monitoring
- Added cross-references to CLI documentation for auto-updating links
- Filled in existing SDK methods (list_clusters, set_cluster_context)

**Testing Done**
Verified reStructuredText tab-set syntax renders correctly

* Update PR: some PR comments fixed

**Description**

**Testing Done**

* Update cluster management cli ref to use md.

**Description**
Using markdown for the sake of uniformity.

**Testing Done**

* Update for Cluster Management CLI commands.

**Description**
Updated md after verification.

**Testing Done**
Verified the commands.

* Add note about default region to docs.

**Description**
Added a note about how the region selection and flag usage works, for better UX.

**Testing Done**
The note shows up as we want it to.

* Update code lines messed up while fixing merge conflicts.

**Description**

**Testing Done**

* Update docs and README to include task gov and gpu_quota params.

**Description**

**Testing Done**

---------

Co-authored-by: pintaoz-aws <[email protected]>
Co-authored-by: pintaoz <[email protected]>
Co-authored-by: rsareddy0329 <[email protected]>
Co-authored-by: Roja Reddy Sareddy <[email protected]>
…rmat (#220)

* update cloud formation template to 1.1, fix instance group setting format

* fix unit test
* Reorder and update description for each field in cluster creation

- Reordering the fields to match the order in the config.yaml file

- Updating descriptions to match the comments in the config.yaml file

- Updating default values (like resource_name_prefix changed from "hyperpod-cli-integ-test" to "hyp-eks-stack" and hyperpod_cluster_name from "hyperpod-cluster-integ-test" to "hyperpod-cluster")

All unit tests passed.

* update model.py
…d double quotes (#224)

* update cloud formation template to 1.1, fix instance group setting format

* fix unit test

* fix: validation error for json format that accommodates both single and double quotes
**Description**

**Testing Done**
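
One way to accept both quoting styles, as the fix above describes, is to try strict JSON first and fall back to a Python-literal parse. The sketch below is an assumption about the approach, not the PR's code; `parse_json_like` is a hypothetical name. Note that the fallback does not understand JSON's `true`, `false`, or `null` inside single-quoted input.

```python
import ast
import json


def parse_json_like(text):
    """Parse a list or dict written as strict JSON or with single quotes."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        try:
            # literal_eval only evaluates literals, so it is safe on user input.
            value = ast.literal_eval(text)
        except (ValueError, SyntaxError, TypeError) as error:
            raise ValueError(f"Not valid JSON or a quoted literal: {text!r}") from error
    if not isinstance(value, (list, dict)):
        raise ValueError(f"Expected a JSON array or object, got {type(value).__name__}")
    return value
```
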
…luster-stack (#219)

* Append UUID to resource name prefix to ensure uniqueness.

---

Tested with unit tests and manual testing
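
Appending a UUID fragment to the prefix could look like the sketch below. The helper name `unique_prefix`, the hyphen separator, and the eight-character suffix length are assumptions made for illustration; the commit does not state the exact format.

```python
import uuid


def unique_prefix(prefix, suffix_length=8):
    """Return the prefix with a random hexadecimal suffix appended."""
    # uuid4 is random, so repeated calls are very unlikely to collide.
    suffix = uuid.uuid4().hex[:suffix_length]
    return f"{prefix}-{suffix}"
```

A shorter suffix keeps resource names readable but raises the collision probability, so the length is a trade-off.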

* Update the cluster stack command to be `cluster-stack` instead of `hyp-cluster-stack`

* Fix
**Description**
- Clarified hyp validate performs syntactic validation only, not AWS resource validation
- Added resource_name_prefix requirement for unique deployment identifiers
- Updated prerequisites and examples with accurate behavior descriptions

**Testing Done**
- Verified validation function implementation matches documentation
* Update CHANGELOG.md for launch fast follow release

* Update to minor version
* Add default availability zone

- Add default AZ IDs
- Updated field description in model.py

Tested by manually entering different AZ IDs in config.yaml and added unit tests

* Pick 2 AZ IDs instead of 1 during submission

* Add example of entering az ID

* Update description in model.py
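
Picking default AZ IDs through an API call, as these commits describe, might be sketched as follows. This is an illustration under assumptions: `default_az_ids` is a hypothetical helper, `ec2_client` is assumed to be a boto3 EC2 client for the target region, and sorting the IDs before taking the first two is a choice made here for determinism, not something the commits specify.

```python
def default_az_ids(ec2_client, count=2):
    """Return `count` availability zone IDs for the client's region."""
    response = ec2_client.describe_availability_zones(
        Filters=[{"Name": "state", "Values": ["available"]}]
    )
    # Zone IDs (for example "use1-az1") are stable across accounts,
    # unlike zone names.
    zone_ids = sorted(zone["ZoneId"] for zone in response["AvailabilityZones"])
    if len(zone_ids) < count:
        raise ValueError(f"Expected at least {count} zones, found {len(zone_ids)}")
    return zone_ids[:count]
```
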
* Enable Telemetry for Cluster creation

* Telemetry for CLI and updates

* Fix
* Implemented exec command with unit tests

* Minor UX change to help for pod and all-pods

* Better help for exec command usage

* Removed unnecessary comment
* Abstract out some default values from the user.
Also add Example Notebooks

* Cleanup and fix

* Cleanup for CLI notebook
**Description**

**Testing Done**
* Add integration test for HP cluster creation workflow

* Add utility functions for integration tests

* Cleaned imports and utils

* Fixed Bugs related to Integ Test

* Probable fix for configure bug

* Revert Previous Changes and Fixed Configure Bug

* update configure import strategy

* remove cluster-stack command from list and describe cli

* Updated monitoring logic to use boto3

* Changed name of cluster to be monitored

---------

Co-authored-by: Molly He <[email protected]>
@nargokul nargokul requested a review from a team as a code owner August 26, 2025 01:26
@nargokul nargokul changed the title Release cm Release Cluster Management Aug 26, 2025
@nargokul nargokul merged commit 5cff2a7 into main Aug 26, 2025
13 of 15 checks passed
@papriwal papriwal deleted the release_cm branch August 27, 2025 20:44
6 participants