4 changes: 4 additions & 0 deletions helm_chart/HyperPodHelmChart/Chart.yaml
@@ -24,6 +24,10 @@ version: 0.1.0
appVersion: "1.16.0"

dependencies:
  - name: cert-manager
    version: "v1.18.2"
    repository: oci://quay.io/jetstack/charts
    condition: cert-manager.enabled
  - name: training-operators
    version: "0.1.0"
    repository: "file://charts/training-operators"
9 changes: 9 additions & 0 deletions helm_chart/HyperPodHelmChart/values.yaml
@@ -115,6 +115,15 @@ namespace:
  create: true
  name: aws-hyperpod

cert-manager:
  enabled: true
  namespace: cert-manager
  global:
    leaderElection:
      namespace: cert-manager
  crds:
    enabled: true

mlflow:
  enabled: false

15 changes: 15 additions & 0 deletions helm_chart/readme.md
@@ -33,6 +33,7 @@ More information about orchestration features for cluster admins [here](https://
| [Kubeflow Training Operator](https://www.kubeflow.org/docs/components/trainer/legacy-v1/overview/) | Installs operators for managing various machine learning training jobs, such as TensorFlow, PyTorch, and MXNet, providing native Kubernetes support for distributed training workloads. | | Yes |
| HyperPod patching | Deploys the RBAC and controller resources needed for orchestrating rolling updates and patching workflows in SageMaker HyperPod clusters. Includes pod eviction and node monitoring. | HyperPod Resiliency | Yes |
| hyperpod-inference-operator | Installs the HyperPod Inference Operator and its dependencies to the cluster, allowing cluster deployment and inferencing of JumpStart, s3-hosted, and FSx-hosted models | No |
| [cert-manager](https://github.com/cert-manager/cert-manager) | Automatically provisions and manages TLS certificates in Kubernetes clusters. Provides certificate lifecycle management, including issuance, renewal, and revocation, for secure communications. | [HyperPod training operator](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html) | Yes |

> **_Note_** The `mpijob` scheme is disabled in the Training Operator helm chart to avoid conflicting with the MPI Operator.

@@ -48,6 +49,20 @@ storage:
enabled: true
```

To enable cert-manager for TLS certificate management, pass `--set cert-manager.enabled=true` when installing or upgrading the main chart, or set the following in the `values.yaml` file:
```
cert-manager:
  enabled: true
  namespace: cert-manager
  global:
    leaderElection:
      namespace: cert-manager
  crds:
    enabled: true
```
`namespace` specifies the namespace in which cert-manager is installed.
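
Because the dependency is declared in `Chart.yaml` with `condition: cert-manager.enabled`, setting that flag to `false` makes Helm skip the bundled cert-manager chart. The fragment below is a minimal sketch of such an override, for instance when the cluster already runs its own cert-manager. The file name `custom-values.yaml` is hypothetical, and whether the other components work with an externally managed cert-manager is an assumption that should be verified.

```yaml
# custom-values.yaml (hypothetical file name, not part of this chart)
# Overrides the chart default so the cert-manager dependency is not installed.
cert-manager:
  enabled: false
```

A file like this can be supplied with Helm's `-f` (`--values`) flag when installing or upgrading the main chart.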


---

The following plugins are only required for HyperPod Resiliency if you are using the following supported devices, such as GPU/Neuron instances, unless you install these plugins on your own.