Commit e155a05

papriwal, pintaoz-aws, pintaoz, rsareddy0329, and Roja Reddy Sareddy authored and committed
Docs for cluster stack creation (#207)
* Add version compatibility check between server K8s and client Python K8s (#138)
* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s
* Fix unit test cases
* Move regex to a constant.

  **Description**
  - Removed an integration test case that was being mocked.
  - Moved a regex to a constant.

  **Testing Done**
  Unit test cases pass; no changes were made to integration test cases, and they should not be affected.
* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s
* Add ref link for version compatibility constraints

  **Description**
  Added a link to the k8s documentation describing the constraints that govern version compatibility between client and server.

  **Testing Done**
  No breaking changes.
* Update logging information for submitting and deleting training job (#189)

  Co-authored-by: pintaoz
* Enhance docs with table formatting and comprehensive API reference

  **Description**
  - Convert CLI parameter lists to structured tables across all documentation files for better readability
  - Add comprehensive docstrings and examples to SDK classes (HPEndpointBase, HyperPodPytorchJob)
  - Enhance Sphinx configuration with better autodoc settings and extensions
  - Update API reference structure and formatting
  - Add custom CSS styling for improved table presentation
  - Update documentation requirements and index structure

  **Testing Done**
  - Verified documentation builds successfully with `make html`
  - Confirmed table formatting renders correctly in generated HTML
  - Validated API documentation generates properly with enhanced docstrings
  - Tested responsive table styling across different screen sizes
  - Checked that all parameter information remains accurate and complete
* FIX ALTERED CODE

  **Description**
  Fixed the code altered while updating docstrings in the `hyperpod_pytorch_job.py` file.

  **Testing Done**
  The unit test cases all pass.
* ADD CLUSTER MANAGEMENT DOCS

  **Description**
  - Created comprehensive getting started guide for HyperPod cluster management
  - Added tab-set format showing both CLI and SDK options for consistency
  - Included step-by-step workflow from initialization to monitoring
  - Added cross-references to CLI documentation for auto-updating links
  - Filled in existing SDK methods (list_clusters, set_cluster_context)

  **Testing Done**
  Verified reStructuredText tab-set syntax renders correctly
* Update PR: some PR comments fixed

  **Description**

  **Testing Done**
* Update PR: some PR comments fixed

  **Description**

  **Testing Done**
* Update cluster management getting started.

  **Description**

  **Testing Done**
* Update cluster management cli ref to use md.

  **Description**
  Using markdown for the sake of uniformity.

  **Testing Done**
* Update cluster management getting started.

  **Description**
  Mentioned the missing file generated with the `hyp init hyp-cluster` command.

  **Testing Done**
  N/A
* Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status (#204)
* Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status
* Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter
* Fix: List cluster stacks exclude ones with 'DELETE_COMPLETE' status, support status parameter

  ---------

  Co-authored-by: Roja Reddy Sareddy
* ADD CLUSTER MANAGEMENT DOCS

  **Description**
  - Created comprehensive getting started guide for HyperPod cluster management
  - Added tab-set format showing both CLI and SDK options for consistency
  - Included step-by-step workflow from initialization to monitoring
  - Added cross-references to CLI documentation for auto-updating links
  - Filled in existing SDK methods (list_clusters, set_cluster_context)

  **Testing Done**
  Verified reStructuredText tab-set syntax renders correctly
* Update for Cluster Management CLI commands.

  **Description**
  - Commented the complete autogen file for cli cluster management.
  - Added some updates to commands as required.

  **Testing Done**
  Verified the commands.
* Update for Cluster Management CLI commands.

  **Description**
  Updated md after verification.

  **Testing Done**
  Verified the commands.
* Add note about default region to docs.

  **Description**
  Added a note about how the region selection and flag usage works, for better UX.

  **Testing Done**
  The note shows up as we want it to.
* Update update commands for hyp-cluster.

  **Description**
  Updated the hyp-cluster update command correctly.

  **Testing Done**
  Verified the docs are correct.
* Fix a unit test case changed while fixing merge conflicts.

  **Description**

  **Testing Done**

---------

Co-authored-by: pintaoz-aws
Co-authored-by: pintaoz
Co-authored-by: rsareddy0329
Co-authored-by: Roja Reddy Sareddy
1 parent 70307a6 commit e155a05

20 files changed: +948 -13 lines changed

doc/cli/cli_index.rst

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
CLI Reference
=============

Complete reference for the SageMaker HyperPod Command Line Interface.

.. toctree::
   :hidden:
   :maxdepth: 2

   cluster_management/cli_cluster_management
   training/cli_training
   inference/cli_inference

.. container::

   .. grid:: 1 1 3 3
      :gutter: 3

      .. grid-item-card:: Cluster Management CLI
         :link: cluster_management/cli_cluster_management
         :link-type: doc
         :class-card: sd-border-secondary

         Cluster stack management commands, options and parameters.

      .. grid-item-card:: Training CLI
         :link: training/cli_training
         :link-type: doc
         :class-card: sd-border-secondary

         Training CLI commands, options and parameters.

      .. grid-item-card:: Inference CLI
         :link: inference/cli_inference
         :link-type: doc
         :class-card: sd-border-secondary

         Inference CLI commands, options and parameters.

doc/cli_reference.md renamed to doc/cli/cli_reference.md

Lines changed: 9 additions & 0 deletions
@@ -8,6 +8,7 @@

 cli_training
 cli_inference
+cli_cluster_management
 ```

 Complete reference for the SageMaker HyperPod Command Line Interface.
@@ -32,5 +33,13 @@ Training CLI commands, options and parameters.
 Inference CLI commands, options and parameters.
 :::

+:::{grid-item-card} Cluster Management CLI
+:link: cli_cluster_management
+:link-type: ref
+:class-card: sd-border-secondary
+
+Cluster stack management commands, options and parameters.
+:::
+
 ::::
 ::::
Lines changed: 326 additions & 0 deletions
@@ -0,0 +1,326 @@
(cli_cluster_management)=

# Cluster Management

Complete reference for SageMaker HyperPod cluster management parameters and configuration options.

```{note}
**Region Configuration**: For commands that accept the `--region` option, if no region is explicitly provided, the command will use the default region from your AWS credentials configuration.
```
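
To see which region a command is likely to fall back to, the lookup can be approximated in the shell. This is only a sketch: `resolve_region` is a hypothetical helper, and the order it uses (explicit value, then the `AWS_DEFAULT_REGION` environment variable, then the region stored in the active AWS CLI profile) is an assumption about how the default is resolved, not documented `hyp` behavior.

```bash
# Hypothetical helper: print the region a command would most likely use.
# Assumed order: explicit argument, AWS_DEFAULT_REGION, then the region
# configured for the active AWS CLI profile.
resolve_region() {
  if [ -n "${1:-}" ]; then
    printf '%s\n' "$1"
  elif [ -n "${AWS_DEFAULT_REGION:-}" ]; then
    printf '%s\n' "$AWS_DEFAULT_REGION"
  elif command -v aws >/dev/null 2>&1; then
    aws configure get region
  fi
}

# An explicit value always wins, mirroring an explicit --region flag.
resolve_region us-west-2
```

Calling `resolve_region` with no argument shows the fallback value, which is a quick way to confirm what a command without `--region` would target.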

* [Initialize Configuration](#hyp-init)
* [Create Cluster Stack](#hyp-create)
* [Update Cluster](#hyp-update-hyp-cluster)
* [List Cluster Stacks](#hyp-list-hyp-cluster)
* [Describe Cluster Stack](#hyp-describe-hyp-cluster)
* [List HyperPod Clusters](#hyp-list-cluster)
* [Set Cluster Context](#hyp-set-cluster-context)
* [Get Cluster Context](#hyp-get-cluster-context)
* [Get Monitoring](#hyp-get-monitoring)

* [Configure Parameters](#hyp-configure)
* [Validate Configuration](#hyp-validate)
* [Reset Configuration](#hyp-reset)

## hyp init

Initialize a template scaffold in the current directory.

#### Syntax

```bash
hyp init TEMPLATE [DIRECTORY] [OPTIONS]
```

#### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `TEMPLATE` | CHOICE | Yes | Template type (hyp-cluster, hyp-pytorch-job, hyp-custom-endpoint, hyp-jumpstart-endpoint) |
| `DIRECTORY` | PATH | No | Target directory (default: current directory) |
| `--version` | TEXT | No | Schema version to use |

## hyp create

Create a new HyperPod cluster stack using the provided configuration.

#### Syntax

```bash
hyp create [OPTIONS]
```

#### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `--region` | TEXT | No | AWS region where the cluster stack will be created |
| `--debug` | FLAG | No | Enable debug logging |

## hyp update hyp-cluster

Update an existing HyperPod cluster configuration.

#### Syntax

```bash
hyp update hyp-cluster [OPTIONS]
```

#### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `--cluster-name` | TEXT | Yes | Name of the cluster to update |
| `--instance-groups` | TEXT | No | JSON string of instance group configurations |
| `--instance-groups-to-delete` | TEXT | No | JSON string of instance groups to delete |
| `--region` | TEXT | No | AWS region of the cluster |
| `--node-recovery` | TEXT | No | Node recovery setting (Automatic or None) |
| `--debug` | FLAG | No | Enable debug logging |
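
Because `--instance-groups` takes a JSON string, shell quoting is the usual stumbling block. The sketch below assembles the value in a variable and prints the resulting command instead of running it, so the quoting can be checked first. The group name, instance type, count, and cluster name are placeholder values; the three JSON keys are the ones used in the update example in this reference.

```bash
# Placeholder values; substitute your own instance group settings.
group_name="worker-nodes"
instance_type="ml.m5.large"
instance_count=2

# Build the JSON array passed to --instance-groups.
instance_groups=$(printf '[{"InstanceCount":%d,"InstanceGroupName":"%s","InstanceType":"%s"}]' \
  "$instance_count" "$group_name" "$instance_type")

# Print the command for review; single quotes keep the JSON intact when pasted.
printf "hyp update hyp-cluster --cluster-name my-cluster --instance-groups '%s' --region us-west-2\n" \
  "$instance_groups"
```

Keeping the JSON in one variable avoids escaping double quotes inline and makes it easy to reuse the same value across commands.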

## hyp list hyp-cluster

List all HyperPod cluster stacks (CloudFormation stacks).

#### Syntax

```bash
hyp list hyp-cluster [OPTIONS]
```

#### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `--region` | TEXT | No | AWS region to list stacks from |
| `--status` | TEXT | No | Filter by stack status. Format: "['CREATE_COMPLETE', 'UPDATE_COMPLETE']" |
| `--debug` | FLAG | No | Enable debug logging |
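
The `--status` value is a single string containing a bracketed, single-quoted list, which is easy to mistype. As an illustration, the hypothetical helper `format_status_filter` below builds that string from plain status names and prints the command rather than executing it; the region is a placeholder.

```bash
# Hypothetical helper: format stack statuses the way --status expects,
# e.g. ['CREATE_COMPLETE', 'UPDATE_COMPLETE'].
format_status_filter() {
  out=""
  for status in "$@"; do
    if [ -n "$out" ]; then
      out="$out, "
    fi
    out="$out'$status'"
  done
  printf '[%s]\n' "$out"
}

status_filter=$(format_status_filter CREATE_COMPLETE UPDATE_COMPLETE)

# Double quotes around the value preserve the inner single quotes.
printf 'hyp list hyp-cluster --region us-west-2 --status "%s"\n' "$status_filter"
```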

## hyp describe hyp-cluster

Describe a specific HyperPod cluster stack.

#### Syntax

```bash
hyp describe hyp-cluster STACK-NAME [OPTIONS]
```

#### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `STACK-NAME` | TEXT | Yes | Name of the CloudFormation stack to describe |
| `--region` | TEXT | No | AWS region of the stack |
| `--debug` | FLAG | No | Enable debug logging |

## hyp list-cluster

List SageMaker HyperPod clusters with capacity information.

#### Syntax

```bash
hyp list-cluster [OPTIONS]
```

#### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `--region` | TEXT | No | AWS region to list clusters from |
| `--output` | TEXT | No | Output format ("table" or "json", default: "json") |
| `--clusters` | TEXT | No | Comma-separated list of specific cluster names |
| `--namespace` | TEXT | No | Namespace to check capacity for (can be used multiple times) |
| `--debug` | FLAG | No | Enable debug logging |
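
The two list-style options here take different shapes: `--clusters` is one comma-separated value, while `--namespace` is repeated once per namespace. A small sketch of assembling both, using placeholder cluster and namespace names and printing the command instead of running it:

```bash
# Placeholder names; --clusters takes a single comma-separated value.
clusters="cluster-a,cluster-b"

# --namespace can be repeated, so emit one flag per namespace.
namespace_flags=""
for ns in team-a team-b; do
  namespace_flags="$namespace_flags --namespace $ns"
done

printf 'hyp list-cluster --clusters %s%s --output table\n' "$clusters" "$namespace_flags"
```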

## hyp set-cluster-context

Connect to a HyperPod EKS cluster and set kubectl context.

#### Syntax

```bash
hyp set-cluster-context [OPTIONS]
```

#### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `--cluster-name` | TEXT | Yes | Name of the HyperPod cluster to connect to |
| `--region` | TEXT | No | AWS region of the cluster |
| `--namespace` | TEXT | No | Kubernetes namespace to connect to |
| `--debug` | FLAG | No | Enable debug logging |

## hyp get-cluster-context

Get context information for the currently connected cluster.

#### Syntax

```bash
hyp get-cluster-context [OPTIONS]
```

#### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `--debug` | FLAG | No | Enable debug logging |

## hyp get-monitoring

Get monitoring configurations for the HyperPod cluster.

#### Syntax

```bash
hyp get-monitoring [OPTIONS]
```

#### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `--grafana` | FLAG | No | Return Grafana dashboard URL |
| `--prometheus` | FLAG | No | Return Prometheus workspace URL |
| `--list` | FLAG | No | Return list of available metrics |

## hyp configure

Configure cluster parameters interactively or via command line.

#### Syntax

```bash
hyp configure [OPTIONS]
```

#### Parameters

This command dynamically supports all configuration parameters available in the current template's schema. Common parameters include:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `--resource-name-prefix` | TEXT | No | Prefix for all AWS resources |
| `--stage` | TEXT | No | Deployment stage ("gamma" or "prod") |
| `--vpc-cidr` | TEXT | No | VPC CIDR block |
| `--kubernetes-version` | TEXT | No | Kubernetes version for EKS cluster |
| `--node-recovery` | TEXT | No | Node recovery setting ("Automatic" or "None") |
| `--env` | JSON | No | Environment variables as JSON object |
| `--args` | JSON | No | Command arguments as JSON array |
| `--command` | JSON | No | Command to run as JSON array |
| `--tags` | JSON | No | Resource tags as JSON object |

**Note:** The exact parameters available depend on your current template type and version. Run `hyp configure --help` to see all available options for your specific configuration.
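
For the JSON-typed options, it can help to check the value before handing it to the CLI. The sketch below is illustrative only: the tag keys and values are made up, the `python3 -m json.tool` check is an optional convenience that is skipped when Python is unavailable, and whether `--tags` is accepted depends on your template, as noted above. The command is printed, not executed.

```bash
# Placeholder tag keys and values, for illustration only.
tags='{"team":"ml-platform","cost-center":"1234"}'

# Optional sanity check: confirm the value parses as JSON before using it.
if command -v python3 >/dev/null 2>&1; then
  printf '%s' "$tags" | python3 -m json.tool >/dev/null && echo "tags: valid JSON"
fi

# Single quotes around the value keep the inner double quotes intact.
printf "hyp configure --resource-name-prefix my-cluster --tags '%s'\n" "$tags"
```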

## hyp validate

Validate the current cluster configuration.

#### Syntax

```bash
hyp validate
```

#### Parameters

No parameters required. This command validates the `config.yaml` file in the current directory against the appropriate schema.

## hyp reset

Reset the current directory's `config.yaml` to default values.

#### Syntax

```bash
hyp reset
```

#### Parameters

No parameters required.

## Parameter Reference

### Common Parameters Across Commands

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `--region` | TEXT | AWS region | Current AWS profile region |
| `--help` | FLAG | Show command help | - |
| `--verbose` | FLAG | Enable verbose output | false |

### Configuration File Parameters

The `config.yaml` file supports the following parameters:

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `template` | TEXT | Template name | "hyp-cluster" |
| `namespace` | TEXT | Kubernetes namespace | "kube-system" |
| `stage` | TEXT | Deployment stage | "gamma" |
| `resource_name_prefix` | TEXT | Resource name prefix | "sagemaker-hyperpod-eks" |
| `vpc_cidr` | TEXT | VPC CIDR block | "10.192.0.0/16" |
| `kubernetes_version` | TEXT | Kubernetes version | "1.31" |
| `node_recovery` | TEXT | Node recovery setting | "Automatic" |
| `create_vpc_stack` | BOOLEAN | Create new VPC | true |
| `create_eks_cluster_stack` | BOOLEAN | Create new EKS cluster | true |
| `create_hyperpod_cluster_stack` | BOOLEAN | Create HyperPod cluster | true |

**Note:** The actual available configuration parameters depend on the specific template schema version. Use `hyp init hyp-cluster` to see all available parameters for your version.

## Examples

### Basic Cluster Stack Creation

```bash
# Start with a clean directory
mkdir my-hyperpod-cluster
cd my-hyperpod-cluster

# Initialize cluster configuration
hyp init hyp-cluster

# Configure basic parameters
hyp configure --resource-name-prefix my-cluster --stage prod

# Validate configuration
hyp validate

# Create cluster stack
hyp create --region us-west-2
```

### Update Existing Cluster

```bash
# Update instance groups
hyp update hyp-cluster \
  --cluster-name my-cluster \
  --instance-groups '[{"InstanceCount":2,"InstanceGroupName":"worker-nodes","InstanceType":"ml.m5.large"}]' \
  --region us-west-2
```

### List and Describe

```bash
# List all cluster stacks
hyp list hyp-cluster --region us-west-2

# Describe specific cluster stack
hyp describe hyp-cluster my-stack-name --region us-west-2

# List HyperPod clusters with capacity info
hyp list-cluster --region us-west-2 --output table

# Connect to cluster
hyp set-cluster-context --cluster-name my-cluster --region us-west-2

# Get current context
hyp get-cluster-context
```
