Contributor

@jswudi commented Sep 10, 2024

Issue #, if available:

The help message for auto-resume is incorrect.

Description of changes:

HyperPod resilience job auto-resume is supported in all namespaces.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Contributor

@adheshgarg left a comment

approved

@jswudi merged commit 3659d55 into aws:main Sep 10, 2024
6 of 10 checks passed
xiaoxshe pushed a commit to xiaoxshe/sagemaker-hyperpod-cli that referenced this pull request Dec 4, 2024
* update the helm chart to create team level roles and bindings

* revert unrelated changes

* Rename quotaAllocationTarget to computeQuotaTarget

* remove kueue related resources from helm chart

* Remove parameters of kueue from chart

* flip the team role creation to false

* Revise readme to add instructions to create the role and binding
xiaoxshe added a commit that referenced this pull request Dec 4, 2024
* add recipes feature for distributed training

* improve unit test coverage for recipes feature

* add support recipes along with command line args

* add recipes

* Crescendo helm chart for role and rolebinding (#17)

* update the helm chart to create team level roles and bindings

* revert unrelated changes

* Rename quotaAllocationTarget to computeQuotaTarget

* remove kueue related resources from helm chart

* Remove parameters of kueue from chart

* flip the team role creation to false

* Revise readme to add instructions to create the role and binding

* add changelog for distributed training

* change to public submodules

* QuotaAllocation support for Hyperpod CLI (#12)

* QuotaAllocation support for Hyperpod CLI

---------

Co-authored-by: Amazon GitHub Automation
Co-authored-by: Song Jiang
Co-authored-by: Baiyang Li
Co-authored-by: baiyli

* Remove custom_launcher folder

* sync with mainline

---------

Co-authored-by: cansun
Co-authored-by: Amazon GitHub Automation
Co-authored-by: Song Jiang
Co-authored-by: Baiyang Li
Co-authored-by: baiyli
Co-authored-by: Can Sun