Commit d2453d6

Add notes about HMA patching
1 parent e26e9d1 commit d2453d6

2 files changed

+19
-1
lines changed

helm_chart/install_rig_dependencies.sh

Lines changed: 12 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -481,7 +481,18 @@ confirm_installation_with_user() {
481481
echo "Other pods that depend on aws-node (e.g. CoreDNS, HyperPod HealthMonitoringAgent,...) may experience 'FailedCreatePodSandBox' if the aws-node pods are not available before start up."
482482
echo "Therefore, please allow additional time for K8s to recreate the pods and/or manually recreate the pods (or let K8s recreate after cleaning up) before full cluster use."
483483
echo ""
484-
484+
485+
# Warn user about the region-specific HMA image URI
486+
echo ""
487+
echo "⚠️ Note: HyperPod HealthMonitoringAgent (HMA) is a critical dependency for node resilience."
488+
echo "HMA installation is normally handled by the standard (non-RIG) Helm Chart. See https://github.com/aws/sagemaker-hyperpod-cli/blob/main/helm_chart/HyperPodHelmChart/charts/health-monitoring-agent/values.yaml#L2"
489+
echo "The image URI for this component is region-specific. See https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#6-notes"
490+
echo "To ensure HMA works as intended, use the image URI that matches your cluster's region."
491+
echo ""
492+
echo "For installations that have already been deployed, the image URI can be corrected with a 'kubectl patch' command. For example:"
493+
echo " kubectl patch daemonset health-monitoring-agent -n aws-hyperpod --patch '{\"spec\": {\"template\": {\"spec\": {\"containers\": [{\"name\": \"health-monitoring-agent\", \"image\": \"767398015722.dkr.ecr.us-east-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.448.0_1.0.115.0\"}]}}}}'"
494+
echo ""
495+
485496
# Warn user about re-running installation
486497
echo ""
487498
echo "⚠️ Note: This installation script should only be run one time for a given HyperPod cluster. Please avoid re-running this installation to avoid duplicated Deployments and Daemonsets and unintended K8s patches to existing objects."
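
Reviewer note: the example command printed by this hunk embeds a JSON patch, and its double quotes have to survive the surrounding shell quoting when echoed. One way to avoid manual escaping is a quoted here-document, which prints its body verbatim. The following is only a sketch: `print_hma_patch_example` is a hypothetical helper name and is not part of `install_rig_dependencies.sh`.

```shell
# Hypothetical helper (assumption: not part of the actual install script).
# Prints the example patch command exactly as written, including the JSON quotes.
print_hma_patch_example() {
  # Quoting the delimiter ('EOF') disables all expansion inside the body,
  # so both single and double quotes are emitted unchanged.
  cat <<'EOF'
  kubectl patch daemonset health-monitoring-agent -n aws-hyperpod --patch '{"spec": {"template": {"spec": {"containers": [{"name": "health-monitoring-agent", "image": "767398015722.dkr.ecr.us-east-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.448.0_1.0.115.0"}]}}}}'
EOF
}

print_hma_patch_example
```

Compared with escaping each inner quote, the here-document keeps the printed command identical to what a user would paste into a terminal.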

helm_chart/readme.md

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -106,6 +106,13 @@ Notes:
106106
Other pods that depend on aws-node (e.g. CoreDNS, HyperPod HealthMonitoringAgent,...) may experience 'FailedCreatePodSandBox' if the aws-node pods are not available before start up.
107107
Therefore, please allow additional time for K8s to recreate the pods and/or manually recreate the pods (or let K8s recreate after cleaning up) before cluster use.
108108
109+
⚠️ Note: HyperPod HealthMonitoringAgent (HMA) is a critical dependency for node resilience.
110+
HMA installation is normally handled by the standard (non-RIG) Helm Chart. See https://github.com/aws/sagemaker-hyperpod-cli/blob/main/helm_chart/HyperPodHelmChart/charts/health-monitoring-agent/values.yaml#L2
111+
The image URI for this component is region-specific. See https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#6-notes
112+
To ensure HMA works as intended, use the image URI that matches your cluster's region.
113+
114+
For installations that have already been deployed, the image URI can be corrected with a 'kubectl patch' command. For example:
115+
kubectl patch daemonset health-monitoring-agent -n aws-hyperpod --patch '{"spec": {"template": {"spec": {"containers": [{"name": "health-monitoring-agent", "image": "767398015722.dkr.ecr.us-east-1.amazonaws.com/hyperpod-health-monitoring-agent:1.0.448.0_1.0.115.0"}]}}}}'
109116
110117
## 5. Create Team Role
111118
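
Reviewer note: since the notes say the HMA image URI is region-specific, it may help to build the patch from a variable and to check which region an image URI points at. The sketch below assumes the URI follows the usual ECR host layout (`<account>.dkr.ecr.<region>.amazonaws.com/...`). `build_hma_patch` and `ecr_region_of` are hypothetical helper names, not part of this repository, and the correct URI for a given region still has to come from the notes linked above.

```shell
# Hypothetical helpers (assumptions: names and ECR host layout are illustrative).

# Build the JSON patch body for the health-monitoring-agent container.
build_hma_patch() {
  local image_uri="$1"
  printf '{"spec": {"template": {"spec": {"containers": [{"name": "health-monitoring-agent", "image": "%s"}]}}}}' "$image_uri"
}

# Extract the region from an ECR image URI such as
# <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
ecr_region_of() {
  local uri="$1"
  local host="${uri%%/*}"                  # drop everything after the registry host
  host="${host#*.dkr.ecr.}"                # drop "<account>.dkr.ecr."
  printf '%s\n' "${host%%.amazonaws.com*}" # drop ".amazonaws.com" and any suffix
}

# Possible usage against a live cluster (left commented out here):
#   current="$(kubectl get daemonset health-monitoring-agent -n aws-hyperpod \
#     -o jsonpath='{.spec.template.spec.containers[?(@.name=="health-monitoring-agent")].image}')"
#   echo "Current HMA image region: $(ecr_region_of "$current")"
#   kubectl patch daemonset health-monitoring-agent -n aws-hyperpod \
#     --patch "$(build_hma_patch "$IMAGE_URI")"
```

Here `IMAGE_URI` stands for the region-appropriate URI chosen by the operator. Comparing the extracted region with the cluster's region before patching gives a quick sanity check.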
