diff --git a/README.md b/README.md
index 100fdff007..4b8df80440 100644
--- a/README.md
+++ b/README.md
@@ -151,6 +151,7 @@ Build, deploy, and manage cloud infrastructure with Infrastructure as Code best
| Server Name | Description | Install |
|-------------|-------------|---------|
+| [Amazon SageMaker HyperPod MCP Server](src/sagemaker-hyperpod-mcp-server) | SageMaker HyperPod cluster management and application deployment | [](https://cursor.com/en/install-mcp?name=awslabs.sagemaker-hyperpod-mcp-server&config=eyJhdXRvQXBwcm92ZSI6W10sImRpc2FibGVkIjpmYWxzZSwiY29tbWFuZCI6InV2eCBhd3NsYWJzLnNhZ2VtYWtlci1oeXBlcnBvZC1tY3Atc2VydmVyQGxhdGVzdCAtLWFsbG93LXdyaXRlIC0tYWxsb3ctc2Vuc2l0aXZlLWRhdGEtYWNjZXNzIiwiZW52Ijp7IkZBU1RNQ1BfTE9HX0xFVkVMIjoiRVJST1IifSwidHJhbnNwb3J0VHlwZSI6InN0ZGlvIn0%3D)
[](https://insiders.vscode.dev/redirect/mcp/install?name=SageMaker%20HyperPod%20MCP%20Server&config=%7B%22autoApprove%22%3A%5B%5D%2C%22disabled%22%3Afalse%2C%22command%22%3A%22uvx%22%2C%22args%22%3A%5B%22awslabs.sagemaker-hyperpod-mcp-server%40latest%22%2C%22--allow-write%22%2C%22--allow-sensitive-data-access%22%5D%2C%22env%22%3A%7B%22FASTMCP_LOG_LEVEL%22%3A%22ERROR%22%7D%2C%22transportType%22%3A%22stdio%22%7D) |
| [Amazon EKS MCP Server](src/eks-mcp-server) | Kubernetes cluster management and application deployment | [](https://cursor.com/en/install-mcp?name=awslabs.eks-mcp-server&config=eyJhdXRvQXBwcm92ZSI6W10sImRpc2FibGVkIjpmYWxzZSwiY29tbWFuZCI6InV2eCBhd3NsYWJzLmVrcy1tY3Atc2VydmVyQGxhdGVzdCAtLWFsbG93LXdyaXRlIC0tYWxsb3ctc2Vuc2l0aXZlLWRhdGEtYWNjZXNzIiwiZW52Ijp7IkZBU1RNQ1BfTE9HX0xFVkVMIjoiRVJST1IifSwidHJhbnNwb3J0VHlwZSI6InN0ZGlvIn0%3D)
[](https://insiders.vscode.dev/redirect/mcp/install?name=EKS%20MCP%20Server&config=%7B%22autoApprove%22%3A%5B%5D%2C%22disabled%22%3Afalse%2C%22command%22%3A%22uvx%22%2C%22args%22%3A%5B%22awslabs.eks-mcp-server%40latest%22%2C%22--allow-write%22%2C%22--allow-sensitive-data-access%22%5D%2C%22env%22%3A%7B%22FASTMCP_LOG_LEVEL%22%3A%22ERROR%22%7D%2C%22transportType%22%3A%22stdio%22%7D) |
| [Amazon ECS MCP Server](src/ecs-mcp-server) | Container orchestration and ECS application deployment | [](https://cursor.com/en/install-mcp?name=awslabs.ecs-mcp-server&config=eyJjb21tYW5kIjoidXZ4IC0tZnJvbSBhd3NsYWJzLWVjcy1tY3Atc2VydmVyIGVjcy1tY3Atc2VydmVyIiwiZW52Ijp7IkFXU19QUk9GSUxFIjoieW91ci1hd3MtcHJvZmlsZSIsIkFXU19SRUdJT04iOiJ5b3VyLWF3cy1yZWdpb24iLCJGQVNUTUNQX0xPR19MRVZFTCI6IkVSUk9SIiwiRkFTVE1DUF9MT0dfRklMRSI6Ii9wYXRoL3RvL2Vjcy1tY3Atc2VydmVyLmxvZyIsIkFMTE9XX1dSSVRFIjoiZmFsc2UiLCJBTExPV19TRU5TSVRJVkVfREFUQSI6ImZhbHNlIn19)
[](https://insiders.vscode.dev/redirect/mcp/install?name=ECS%20MCP%20Server&config=%7B%22command%22%3A%22uvx%22%2C%22args%22%3A%5B%22--from%22%2C%22awslabs-ecs-mcp-server%22%2C%22ecs-mcp-server%22%5D%2C%22env%22%3A%7B%22AWS_PROFILE%22%3A%22your-aws-profile%22%2C%22AWS_REGION%22%3A%22your-aws-region%22%2C%22FASTMCP_LOG_LEVEL%22%3A%22ERROR%22%2C%22FASTMCP_LOG_FILE%22%3A%22%2Fpath%2Fto%2Fecs-mcp-server.log%22%2C%22ALLOW_WRITE%22%3A%22false%22%2C%22ALLOW_SENSITIVE_DATA%22%3A%22false%22%7D%7D) |
| [Finch MCP Server](src/finch-mcp-server) | Local container building with ECR integration | [](https://cursor.com/en/install-mcp?name=awslabs.finch-mcp-server&config=eyJjb21tYW5kIjoidXZ4IGF3c2xhYnMuZmluY2gtbWNwLXNlcnZlckBsYXRlc3QiLCJlbnYiOnsiQVdTX1BST0ZJTEUiOiJkZWZhdWx0IiwiQVdTX1JFR0lPTiI6InVzLXdlc3QtMiIsIkZBU1RNQ1BfTE9HX0xFVkVMIjoiSU5GTyJ9LCJ0cmFuc3BvcnRUeXBlIjoic3RkaW8iLCJkaXNhYmxlZCI6ZmFsc2UsImF1dG9BcHByb3ZlIjpbXX0%3D)
[](https://insiders.vscode.dev/redirect/mcp/install?name=Finch%20MCP%20Server&config=%7B%22command%22%3A%22uvx%22%2C%22args%22%3A%5B%22awslabs.finch-mcp-server%40latest%22%5D%2C%22env%22%3A%7B%22AWS_PROFILE%22%3A%22default%22%2C%22AWS_REGION%22%3A%22us-west-2%22%2C%22FASTMCP_LOG_LEVEL%22%3A%22INFO%22%7D%2C%22transportType%22%3A%22stdio%22%2C%22disabled%22%3Afalse%2C%22autoApprove%22%3A%5B%5D%7D) |
@@ -310,6 +311,7 @@ Interact with AWS HealthAI services.
| Server Name | Description | Install |
|-------------|-------------|---------|
+| [Amazon SageMaker HyperPod MCP Server](src/sagemaker-hyperpod-mcp-server) | SageMaker HyperPod cluster management and application deployment | [](https://cursor.com/en/install-mcp?name=awslabs.sagemaker-hyperpod-mcp-server&config=eyJhdXRvQXBwcm92ZSI6W10sImRpc2FibGVkIjpmYWxzZSwiY29tbWFuZCI6InV2eCBhd3NsYWJzLnNhZ2VtYWtlci1oeXBlcnBvZC1tY3Atc2VydmVyQGxhdGVzdCAtLWFsbG93LXdyaXRlIC0tYWxsb3ctc2Vuc2l0aXZlLWRhdGEtYWNjZXNzIiwiZW52Ijp7IkZBU1RNQ1BfTE9HX0xFVkVMIjoiRVJST1IifSwidHJhbnNwb3J0VHlwZSI6InN0ZGlvIn0%3D)
[](https://insiders.vscode.dev/redirect/mcp/install?name=SageMaker%20HyperPod%20MCP%20Server&config=%7B%22autoApprove%22%3A%5B%5D%2C%22disabled%22%3Afalse%2C%22command%22%3A%22uvx%22%2C%22args%22%3A%5B%22awslabs.sagemaker-hyperpod-mcp-server%40latest%22%2C%22--allow-write%22%2C%22--allow-sensitive-data-access%22%5D%2C%22env%22%3A%7B%22FASTMCP_LOG_LEVEL%22%3A%22ERROR%22%7D%2C%22transportType%22%3A%22stdio%22%7D) |
| [Amazon EKS MCP Server](src/eks-mcp-server) | Kubernetes cluster management and app deployment | [](https://cursor.com/en/install-mcp?name=awslabs.eks-mcp-server&config=eyJhdXRvQXBwcm92ZSI6W10sImRpc2FibGVkIjpmYWxzZSwiY29tbWFuZCI6InV2eCBhd3NsYWJzLmVrcy1tY3Atc2VydmVyQGxhdGVzdCAtLWFsbG93LXdyaXRlIC0tYWxsb3ctc2Vuc2l0aXZlLWRhdGEtYWNjZXNzIiwiZW52Ijp7IkZBU1RNQ1BfTE9HX0xFVkVMIjoiRVJST1IifSwidHJhbnNwb3J0VHlwZSI6InN0ZGlvIn0%3D)
[](https://insiders.vscode.dev/redirect/mcp/install?name=EKS%20MCP%20Server&config=%7B%22autoApprove%22%3A%5B%5D%2C%22disabled%22%3Afalse%2C%22command%22%3A%22uvx%22%2C%22args%22%3A%5B%22awslabs.eks-mcp-server%40latest%22%2C%22--allow-write%22%2C%22--allow-sensitive-data-access%22%5D%2C%22env%22%3A%7B%22FASTMCP_LOG_LEVEL%22%3A%22ERROR%22%7D%2C%22transportType%22%3A%22stdio%22%7D) |
| [Amazon ECS MCP Server](src/ecs-mcp-server) | Containerize and deploy applications to ECS | [](https://cursor.com/en/install-mcp?name=awslabs.ecs-mcp-server&config=eyJjb21tYW5kIjoidXZ4IC0tZnJvbSBhd3NsYWJzLWVjcy1tY3Atc2VydmVyIGVjcy1tY3Atc2VydmVyIiwiZW52Ijp7IkFXU19QUk9GSUxFIjoieW91ci1hd3MtcHJvZmlsZSIsIkFXU19SRUdJT04iOiJ5b3VyLWF3cy1yZWdpb24iLCJGQVNUTUNQX0xPR19MRVZFTCI6IkVSUk9SIiwiRkFTVE1DUF9MT0dfRklMRSI6Ii9wYXRoL3RvL2Vjcy1tY3Atc2VydmVyLmxvZyIsIkFMTE9XX1dSSVRFIjoiZmFsc2UiLCJBTExPV19TRU5TSVRJVkVfREFUQSI6ImZhbHNlIn19)
[](https://insiders.vscode.dev/redirect/mcp/install?name=ECS%20MCP%20Server&config=%7B%22command%22%3A%22uvx%22%2C%22args%22%3A%5B%22--from%22%2C%22awslabs-ecs-mcp-server%22%2C%22ecs-mcp-server%22%5D%2C%22env%22%3A%7B%22AWS_PROFILE%22%3A%22your-aws-profile%22%2C%22AWS_REGION%22%3A%22your-aws-region%22%2C%22FASTMCP_LOG_LEVEL%22%3A%22ERROR%22%2C%22FASTMCP_LOG_FILE%22%3A%22%2Fpath%2Fto%2Fecs-mcp-server.log%22%2C%22ALLOW_WRITE%22%3A%22false%22%2C%22ALLOW_SENSITIVE_DATA%22%3A%22false%22%7D%7D) |
| [Finch MCP Server](src/finch-mcp-server) | Local container building with ECR push | [](https://cursor.com/en/install-mcp?name=awslabs.finch-mcp-server&config=eyJjb21tYW5kIjoidXZ4IGF3c2xhYnMuZmluY2gtbWNwLXNlcnZlckBsYXRlc3QiLCJlbnYiOnsiQVdTX1BST0ZJTEUiOiJkZWZhdWx0IiwiQVdTX1JFR0lPTiI6InVzLXdlc3QtMiIsIkZBU1RNQ1BfTE9HX0xFVkVMIjoiSU5GTyJ9LCJ0cmFuc3BvcnRUeXBlIjoic3RkaW8iLCJkaXNhYmxlZCI6ZmFsc2UsImF1dG9BcHByb3ZlIjpbXX0%3D)
[](https://insiders.vscode.dev/redirect/mcp/install?name=Finch%20MCP%20Server&config=%7B%22command%22%3A%22uvx%22%2C%22args%22%3A%5B%22awslabs.finch-mcp-server%40latest%22%5D%2C%22env%22%3A%7B%22AWS_PROFILE%22%3A%22default%22%2C%22AWS_REGION%22%3A%22us-west-2%22%2C%22FASTMCP_LOG_LEVEL%22%3A%22INFO%22%7D%2C%22transportType%22%3A%22stdio%22%2C%22disabled%22%3Afalse%2C%22autoApprove%22%3A%5B%5D%7D) |
diff --git a/docusaurus/docs/servers/sagemaker-hyperpod-mcp-server.md b/docusaurus/docs/servers/sagemaker-hyperpod-mcp-server.md
new file mode 100644
index 0000000000..11f408696d
--- /dev/null
+++ b/docusaurus/docs/servers/sagemaker-hyperpod-mcp-server.md
@@ -0,0 +1,16 @@
+---
+title: Amazon SageMaker HyperPod MCP Server
+---
+
+import ReadmeContent from "../../../src/sagemaker-hyperpod-mcp-server/README.md";
+
+<ReadmeContent />
diff --git a/docusaurus/sidebars.ts b/docusaurus/sidebars.ts
index ab31203015..9599167adf 100644
--- a/docusaurus/sidebars.ts
+++ b/docusaurus/sidebars.ts
@@ -46,6 +46,7 @@ const sidebars: SidebarsConfig = {
'servers/cdk-mcp-server',
'servers/cfn-mcp-server',
'servers/terraform-mcp-server',
+ 'servers/sagemaker-hyperpod-mcp-server',
'servers/eks-mcp-server',
'servers/ecs-mcp-server',
'servers/finch-mcp-server',
diff --git a/docusaurus/static/assets/server-cards.json b/docusaurus/static/assets/server-cards.json
index 8a5579417e..58e6baf0ba 100644
--- a/docusaurus/static/assets/server-cards.json
+++ b/docusaurus/static/assets/server-cards.json
@@ -199,6 +199,26 @@
"vibe-coding"
]
},
+ {
+ "category": "Infrastructure & Deployment",
+ "description": "SageMaker HyperPod cluster management and application deployment",
+ "icon": "\ud83c\udfd7\ufe0f",
+ "id": "sagemaker-hyperpod-mcp-server",
+ "name": "Amazon SageMaker HyperPod MCP Server",
+ "source_path": "src/sagemaker-hyperpod-mcp-server/",
+ "subcategory": "Container Platforms",
+ "tags": [
+ "sagemaker hyperpod",
+ "eks",
+ "cluster-management",
+ "application-deployment",
+ "vibe-coding",
+ "container-platforms"
+ ],
+ "workflows": [
+ "vibe-coding"
+ ]
+ },
{
"category": "Infrastructure & Deployment",
"description": "Kubernetes cluster management and application deployment",
diff --git a/src/sagemaker-hyperpod-mcp-server/.gitignore b/src/sagemaker-hyperpod-mcp-server/.gitignore
new file mode 100644
index 0000000000..9a75a1cf7c
--- /dev/null
+++ b/src/sagemaker-hyperpod-mcp-server/.gitignore
@@ -0,0 +1,63 @@
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# Virtual environments
+.venv
+env/
+venv/
+ENV/
+
+# IDE
+.idea/
+.vscode/
+*.swp
+*.swo
+
+# Testing
+.tox/
+.coverage
+.coverage.*
+htmlcov/
+.pytest_cache/
+
+# Ruff
+.ruff_cache/
+
+# Build
+*.manifest
+*.spec
+.pybuilder/
+target/
+build
+
+# Environments
+.env
+.env.local
+.env.*.local
+
+# PyPI
+.pypirc
+
+# MCP param
+*-params.json
diff --git a/src/sagemaker-hyperpod-mcp-server/.python-version b/src/sagemaker-hyperpod-mcp-server/.python-version
new file mode 100644
index 0000000000..c8cfe39591
--- /dev/null
+++ b/src/sagemaker-hyperpod-mcp-server/.python-version
@@ -0,0 +1 @@
+3.10
diff --git a/src/sagemaker-hyperpod-mcp-server/CHANGELOG.md b/src/sagemaker-hyperpod-mcp-server/CHANGELOG.md
new file mode 100644
index 0000000000..92b0342e5f
--- /dev/null
+++ b/src/sagemaker-hyperpod-mcp-server/CHANGELOG.md
@@ -0,0 +1,12 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## Unreleased
+
+### Added
+
+- Initial project setup
diff --git a/src/sagemaker-hyperpod-mcp-server/LICENSE b/src/sagemaker-hyperpod-mcp-server/LICENSE
new file mode 100644
index 0000000000..67db858821
--- /dev/null
+++ b/src/sagemaker-hyperpod-mcp-server/LICENSE
@@ -0,0 +1,175 @@
+
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
diff --git a/src/sagemaker-hyperpod-mcp-server/NOTICE b/src/sagemaker-hyperpod-mcp-server/NOTICE
new file mode 100644
index 0000000000..92c323e491
--- /dev/null
+++ b/src/sagemaker-hyperpod-mcp-server/NOTICE
@@ -0,0 +1,2 @@
+awslabs.sagemaker-hyperpod-mcp-server
+Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
diff --git a/src/sagemaker-hyperpod-mcp-server/README.md b/src/sagemaker-hyperpod-mcp-server/README.md
new file mode 100644
index 0000000000..0f8e892f0c
--- /dev/null
+++ b/src/sagemaker-hyperpod-mcp-server/README.md
@@ -0,0 +1,470 @@
+# Amazon SageMaker HyperPod MCP Server
+
+The Amazon SageMaker HyperPod MCP server provides AI code assistants with resource management tools and real-time cluster state visibility. It gives large language models (LLMs) the tooling and contextual awareness needed to offer tailored guidance throughout application development, from initial setup through ongoing management.
+
+Integrating the HyperPod MCP server into an AI code assistant improves the development workflow at every phase: it assists with initial cluster setup using the same managed CloudFormation templates as the AWS SageMaker HyperPod console UI, and it supports ongoing cluster management through high-level workflows and guidance. Together, these capabilities let complex operations be driven through natural language interactions in the assistant.
+
+## Key features
+
+* Enables users of AI code assistants to interact with HyperPod cluster deployment workflows, utilizing the same managed CloudFormation templates used by the HyperPod console UI for consistent and approved deployments.
+* Provides the ability to interface with HyperPod cluster stacks and resources via managed CloudFormation templates and user-provided custom parameter values.
+* Supports full lifecycle management of HyperPod cluster nodes, enabling listing, describing, updating software, and deleting operations.
+
+## Prerequisites
+
+* [Install Python 3.10+](https://www.python.org/downloads/release/python-3100/)
+* [Install the `uv` package manager](https://docs.astral.sh/uv/getting-started/installation/)
+* [Install and configure the AWS CLI with credentials](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html)
+* [Install pre-commit](https://pre-commit.com/)
+  - Pre-commit runs automatically before each commit. You can also run it manually on all files with `pre-commit run --all-files`. If any hook fails, the commit is aborted and you will need to fix the issues before committing again.
+
+## Setup
+
+Add these IAM policies to the IAM role or user that you use to manage your HyperPod cluster resources.
+
+### Read-Only Operations Policy
+
+For read operations, the following permissions are required:
+
+```json
+{
+ "Version": "2012-10-17",
+ "Statement": [
+ {
+ "Effect": "Allow",
+ "Action": [
+ "sagemaker:ListClusters",
+ "sagemaker:DescribeCluster",
+ "sagemaker:ListClusterNodes",
+ "sagemaker:DescribeClusterNode",
+ "cloudformation:DescribeStacks"
+ ],
+ "Resource": "*"
+ }
+ ]
+}
+```
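+
+With the read-only policy attached, you can sanity-check the permissions with a couple of AWS CLI calls before wiring up the MCP server (output will vary with your account; a sketch):
+
+```
+aws sagemaker list-clusters --max-results 10
+aws cloudformation describe-stacks
+```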
+
+### Write Operations Policy
+
+For write operations, we recommend the following IAM policies to ensure successful deployment of HyperPod clusters using the managed CloudFormation templates:
+
+* [**IAMFullAccess**](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/IAMFullAccess.html): Enables creation and management of the IAM roles and policies required for cluster operation. After cluster creation, if no new IAM roles need to be created, we recommend reducing the scope of this policy's permissions.
+* [**AmazonVPCFullAccess**](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonVPCFullAccess.html): Allows creation and configuration of VPC resources including subnets, route tables, internet gateways, and NAT gateways
+* [**AWSCloudFormationFullAccess**](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AWSCloudFormationFullAccess.html): Provides permissions to create, update, and delete CloudFormation stacks that orchestrate the deployment
+* [**AmazonSageMakerFullAccess**](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html): Required for creating and managing HyperPod clusters and cluster nodes
+* [**AmazonS3FullAccess**](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonS3FullAccess.html): Required for creating the S3 buckets that store lifecycle scripts and related assets
+* [**AWSLambda_FullAccess**](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AWSLambda_FullAccess.html): Required for interacting with the Lambda functions that manage HyperPod clusters and other resources
+* [**CloudWatchLogsFullAccess**](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/CloudWatchLogsFullAccess.html): Required for operations on CloudWatch logs
+* [**AmazonFSxFullAccess**](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonFSxFullAccess.html): Required for operations on FSx file systems
+* **EKS Full Access (provided below)**: Required for interacting with the EKS clusters that orchestrate HyperPod
+
+ ```
+ {
+ "Version": "2012-10-17",
+ "Statement": [
+ {
+ "Effect": "Allow",
+ "Action": "eks:*",
+ "Resource": "*"
+ }
+ ]
+ }
+ ```
+
+
+**Important Security Note**: Users should exercise caution when `--allow-write` and `--allow-sensitive-data-access` modes are enabled with these broad permissions, as this combination grants significant privileges to the MCP server. Only enable these flags when necessary and in trusted environments. For production use, consider creating more restrictive custom policies.
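+
+The managed policies listed above can be attached to a role with the AWS CLI. A dry-run sketch (the role name is a placeholder; remove the `echo` to actually attach the policies):
+
+```
+ROLE=HyperPodAdminRole   # placeholder: your cluster-admin role
+for P in IAMFullAccess AmazonVPCFullAccess AWSCloudFormationFullAccess \
+         AmazonSageMakerFullAccess AmazonS3FullAccess AWSLambda_FullAccess \
+         CloudWatchLogsFullAccess AmazonFSxFullAccess; do
+  echo aws iam attach-role-policy --role-name "$ROLE" \
+    --policy-arn "arn:aws:iam::aws:policy/$P"
+done
+```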
+
+## Quickstart
+
+This quickstart guide walks you through configuring the Amazon SageMaker HyperPod MCP Server for use with the [Amazon Q Developer CLI](https://github.com/aws/amazon-q-developer-cli). By following these steps, you'll set up your development environment to leverage the HyperPod MCP Server's tools for managing your Amazon SageMaker HyperPod clusters and resources.
+
+**Set up pre-commit**
+1. `pip install pre-commit`
+2. `pre-commit install`
+3. Pre-commit runs before each commit. Use `pre-commit run --all-files` to trigger it without committing.
+
+
+**Set up VS Code**
+
+| VS Code |
+|:-------:|
+| [](https://insiders.vscode.dev/redirect/mcp/install?name=HyperPod%20MCP%20Server&config=%7B%22autoApprove%22%3A%5B%5D%2C%22disabled%22%3Afalse%2C%22command%22%3A%22uvx%22%2C%22args%22%3A%5B%22awslabs.sagemaker-hyperpod-mcp-server%40latest%22%2C%22--allow-write%22%2C%22--allow-sensitive-data-access%22%5D%2C%22env%22%3A%7B%22FASTMCP_LOG_LEVEL%22%3A%22ERROR%22%7D%2C%22transportType%22%3A%22stdio%22%7D) |
+
+**Set up the Amazon Q Developer CLI**
+
+1. Install the [Amazon Q Developer CLI](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/command-line-installing.html).
+2. The Q Developer CLI supports MCP servers for tools and prompts out of the box. Edit the Q Developer CLI's MCP configuration file, `mcp.json`, following [these instructions](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/command-line-mcp-configuration.html).
+
+The example below includes both the `--allow-write` flag for mutating operations and the `--allow-sensitive-data-access` flag for accessing logs and events (see the Arguments section for more details):
+
+ **For Mac/Linux:**
+
+ ```
+ {
+ "mcpServers": {
+ "awslabs.sagemaker-hyperpod-mcp-server": {
+ "command": "uvx",
+ "args": [
+ "awslabs.sagemaker-hyperpod-mcp-server@latest",
+ "--allow-write",
+ "--allow-sensitive-data-access"
+ ],
+ "env": {
+ "FASTMCP_LOG_LEVEL": "ERROR"
+ },
+ "autoApprove": [],
+ "disabled": false
+ }
+ }
+ }
+ ```
+
+ **For Windows:**
+
+ ```
+ {
+ "mcpServers": {
+ "awslabs.sagemaker-hyperpod-mcp-server": {
+ "command": "uvx",
+ "args": [
+ "--from",
+ "awslabs.sagemaker-hyperpod-mcp-server@latest",
+ "awslabs.sagemaker-hyperpod-mcp-server.exe",
+ "--allow-write",
+ "--allow-sensitive-data-access"
+ ],
+ "env": {
+ "FASTMCP_LOG_LEVEL": "ERROR"
+ },
+ "autoApprove": [],
+ "disabled": false
+ }
+ }
+ }
+ ```
+
+3. Verify your setup by running the `/tools` command in the Q Developer CLI to see the available HyperPod MCP tools.
+
+Note that this is a basic quickstart. You can enable additional capabilities, such as combining other MCP servers, like the [AWS Documentation MCP Server](https://awslabs.github.io/mcp/servers/aws-documentation-mcp-server/), in a single MCP configuration. For an example, see the [Installation and Setup](https://github.com/awslabs/mcp?tab=readme-ov-file#installation-and-setup) guide in AWS MCP Servers on GitHub. For a real-world implementation with application code alongside an MCP server, see the [Server Developer](https://modelcontextprotocol.io/quickstart/server) guide in the Anthropic documentation.
+
+## Configurations
+
+### Arguments
+
+The `args` field in the MCP server definition specifies the command-line arguments passed to the server when it starts. These arguments control how the server is executed and configured. For example:
+
+**For Mac/Linux:**
+```json
+{
+ "mcpServers": {
+ "awslabs.sagemaker-hyperpod-mcp-server": {
+ "command": "uvx",
+ "args": [
+ "awslabs.sagemaker-hyperpod-mcp-server@latest",
+ "--allow-write",
+ "--allow-sensitive-data-access"
+ ],
+ "env": {
+ "AWS_PROFILE": "your-profile",
+ "AWS_REGION": "us-east-1"
+ }
+ }
+ }
+}
+```
+
+**For Windows:**
+```json
+{
+ "mcpServers": {
+ "awslabs.sagemaker-hyperpod-mcp-server": {
+ "command": "uvx",
+ "args": [
+ "--from",
+ "awslabs.sagemaker-hyperpod-mcp-server@latest",
+ "awslabs.sagemaker-hyperpod-mcp-server.exe",
+ "--allow-write",
+ "--allow-sensitive-data-access"
+ ],
+ "env": {
+ "AWS_PROFILE": "your-profile",
+ "AWS_REGION": "us-east-1"
+ }
+ }
+ }
+}
+```
+
+#### Command Format
+
+The command format differs between operating systems:
+
+**For Mac/Linux:**
+* `awslabs.sagemaker-hyperpod-mcp-server@latest` - The package and version specifier; `@latest` fetches the most recent release.
+
+**For Windows:**
+* `--from awslabs.sagemaker-hyperpod-mcp-server@latest awslabs.sagemaker-hyperpod-mcp-server.exe` - Windows requires the `--from` flag to specify the package and the `.exe` extension.
+
+Both formats enable MCP server startup and tool registration.
+
+#### `--allow-write` (optional)
+
+Enables write access mode, which allows mutating operations (e.g., creating, updating, or deleting resources) through the `manage_hyperpod_stacks` and `manage_hyperpod_cluster_nodes` tools.
+
+* Default: false (The server runs in read-only mode by default)
+* Example: Add `--allow-write` to the `args` list in your MCP server definition.
+
+#### `--allow-sensitive-data-access` (optional)
+
+Enables access to sensitive data such as logs, events, and cluster details. This flag is required for tools that access potentially sensitive information.
+
+* Default: false (Access to sensitive data is restricted by default)
+* Example: Add `--allow-sensitive-data-access` to the `args` list in your MCP server definition.
+
+### Environment variables
+
+The `env` field in the MCP server definition allows you to configure environment variables that control the behavior of the HyperPod MCP server. For example:
+
+```json
+{
+ "mcpServers": {
+ "awslabs.sagemaker-hyperpod-mcp-server": {
+ "env": {
+ "FASTMCP_LOG_LEVEL": "ERROR",
+ "AWS_PROFILE": "my-profile",
+ "AWS_REGION": "us-west-2"
+ }
+ }
+ }
+}
+```
+
+#### `FASTMCP_LOG_LEVEL` (optional)
+
+Sets the logging verbosity for the server.
+
+* Valid values: "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"
+* Default: "WARNING"
+* Example: `"FASTMCP_LOG_LEVEL": "ERROR"`
+
+#### `AWS_PROFILE` (optional)
+
+Specifies the AWS profile to use for authentication.
+
+* Default: None (If not set, uses default AWS credentials).
+* Example: `"AWS_PROFILE": "my-profile"`
+
+#### `AWS_REGION` (optional)
+
+Specifies the AWS region where HyperPod clusters are managed, which will be used for all AWS service operations.
+
+* Default: None (If not set, uses default AWS region).
+* Example: `"AWS_REGION": "us-west-2"`
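
As a rough sketch, the server's helper logic resolves these variables from the environment along these lines (simplified for illustration; the supported-region list mirrors the server's constants):

```python
import os

# Regions currently supported by the HyperPod MCP server.
SUPPORTED_REGIONS = ('us-east-1', 'us-east-2', 'us-west-1', 'us-west-2')


def get_aws_region():
    """Return AWS_REGION from the environment if it is a supported region, else None."""
    region = os.environ.get('AWS_REGION')
    return region if region in SUPPORTED_REGIONS else None


def get_aws_profile():
    """Return AWS_PROFILE from the environment if set, else None."""
    return os.environ.get('AWS_PROFILE')
```

If `AWS_REGION` is unset or outside the supported list, the server falls back to the default region resolution of the AWS SDK.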
+
+## Tools
+
+The following tools are provided by the HyperPod MCP server for managing Amazon SageMaker HyperPod clusters and resources. Each tool performs a specific action that can be invoked to automate common tasks in your HyperPod clusters.
+
+### HyperPod Cluster Management
+
+#### `manage_hyperpod_stacks`
+
+Provides an interface to HyperPod CloudFormation stacks, with operations for deploying, describing, and deleting HyperPod clusters and their underlying infrastructure. **Note**: Cluster creation typically takes around 30 minutes to complete.
+
+Features:
+
+* Interfaces with HyperPod cluster deployments using the same managed CloudFormation templates as the HyperPod console UI.
+* Allows users to specify parameter override values as a JSON object for more customized HyperPod stack creation.
+* Describes existing HyperPod CloudFormation stacks, providing details like status, outputs, and creation time.
+* Deletes HyperPod CloudFormation stacks and their associated resources, ensuring proper cleanup.
+* Ensures safety by only modifying/deleting stacks that were originally created by this tool.
+* Does not create, modify, or provision CloudFormation templates - only interfaces with existing managed templates.
+
+Parameters:
+
+* operation (deploy, describe, delete)
+* stack_name
+* region_name, profile_name
+* params_file (for deploy)
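
The parameter requirements above can be sketched as a small validation helper (an illustrative sketch only, not the server's actual implementation):

```python
STACK_OPERATIONS = ('deploy', 'describe', 'delete')


def validate_stack_params(operation, stack_name, params_file=None):
    """Check that the required parameters are present for a stack operation."""
    if operation not in STACK_OPERATIONS:
        raise ValueError(f'Unknown operation: {operation}')
    if not stack_name:
        raise ValueError('stack_name is required for all operations')
    # The params_file is only meaningful when deploying a new stack.
    if operation == 'deploy' and params_file is None:
        raise ValueError('params_file is required for the deploy operation')
```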
+
+### HyperPod Cluster Node Operations
+
+#### `manage_hyperpod_cluster_nodes`
+
+Manages SageMaker HyperPod clusters and nodes with both read and write operations.
+
+Features:
+
+* Provides a consolidated interface for all cluster and node-related operations.
+* Supports listing clusters with filtering by name, creation time, and training plan ARN.
+* Supports listing nodes with filtering by creation time and instance group name.
+* Returns detailed information about specific nodes in a cluster.
+* Initiates software updates for all nodes or specific instance groups in a cluster.
+* Deletes multiple nodes from a cluster in a single operation.
+
+Operations:
+
+* **list_clusters**: Lists SageMaker HyperPod clusters with options for pagination and filtering.
+* **list_nodes**: Lists nodes in a SageMaker HyperPod cluster with options for pagination and filtering.
+* **describe_node**: Gets detailed information about a specific node in a SageMaker HyperPod cluster.
+* **update_software**: Updates the software for a SageMaker HyperPod cluster.
+* **batch_delete**: Deletes multiple nodes from a SageMaker HyperPod cluster in a single operation.
+
+Parameters:
+
+* operation (list_clusters, list_nodes, describe_node, update_software, batch_delete)
+* cluster_name (required for all operations except list_clusters)
+* node_id (required for describe_node operation)
+* node_ids (required for batch_delete operation)
+* Additional parameters specific to each operation
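
The per-operation requirements above can be expressed as a small validation helper (a simplified illustration, not the server's actual code):

```python
READ_OPERATIONS = {'list_clusters', 'list_nodes', 'describe_node'}
WRITE_OPERATIONS = {'update_software', 'batch_delete'}


def validate_node_params(operation, cluster_name=None, node_id=None, node_ids=None):
    """Check that the required parameters are present for a given node operation."""
    if operation not in READ_OPERATIONS | WRITE_OPERATIONS:
        raise ValueError(f'Unknown operation: {operation}')
    # cluster_name is required for everything except list_clusters.
    if operation != 'list_clusters' and cluster_name is None:
        raise ValueError('cluster_name is required for all operations except list_clusters')
    if operation == 'describe_node' and node_id is None:
        raise ValueError('node_id is required for the describe_node operation')
    if operation == 'batch_delete' and not node_ids:
        raise ValueError('node_ids is required for the batch_delete operation')
```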
+
+
+## Security & permissions
+
+### Features
+
+The HyperPod MCP Server implements the following security features:
+
+1. **AWS Authentication**: Uses AWS credentials from the environment for secure authentication.
+2. **SSL Verification**: Enforces SSL verification for all AWS API calls.
+3. **Resource Tagging**: Tags all created resources for traceability.
+4. **Least Privilege**: Uses IAM roles with appropriate permissions for CloudFormation templates.
+5. **Stack Protection**: Ensures CloudFormation stacks can only be modified by the tool that created them.
+
+### Considerations
+
+When using the HyperPod MCP Server, consider the following:
+
+* **AWS Credentials**: The server needs permission to create and manage HyperPod resources.
+* **Network Security**: Configure VPC and security groups properly for HyperPod clusters.
+* **Authentication**: Use appropriate authentication mechanisms for AWS resources.
+* **Authorization**: Configure IAM properly for AWS resources.
+* **Data Protection**: Encrypt sensitive data in HyperPod clusters.
+* **Logging and Monitoring**: Enable logging and monitoring for HyperPod clusters.
+
+### Permissions
+
+The HyperPod MCP Server can be used in production environments with proper security controls in place. The server runs in read-only mode by default, which is recommended and generally safer for production environments; only enable write access when necessary. Below are the HyperPod MCP server tools available in read-only versus write-access mode:
+
+* **Read-only mode (default)**: `manage_hyperpod_stacks` (with operation="describe"), `manage_hyperpod_cluster_nodes` (with operations="list_clusters", "list_nodes", "describe_node").
+* **Write-access mode** (requires `--allow-write`): `manage_hyperpod_stacks` (with operations "deploy", "delete"), `manage_hyperpod_cluster_nodes` (with operations "update_software", "batch_delete").
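
The effect of the `--allow-write` gate can be sketched roughly as follows (a simplified illustration of the access model, not the server's actual implementation):

```python
# Operations that mutate resources and therefore require --allow-write.
WRITE_OPERATIONS = {'deploy', 'delete', 'update_software', 'batch_delete'}


def operation_allowed(operation, allow_write=False):
    """Return True if the operation may run under the current access mode."""
    return allow_write or operation not in WRITE_OPERATIONS
```

Read-only operations such as `describe` or `list_clusters` always pass; write operations pass only when the flag was supplied at startup.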
+
+#### `autoApprove` (optional)
+
+An array within the MCP server definition that lists tool names to be automatically approved by the MCP client, bypassing user confirmation for those specific tools. For example:
+
+**For Mac/Linux:**
+```
+{
+ "mcpServers": {
+ "awslabs.sagemaker-hyperpod-mcp-server": {
+ "command": "uvx",
+ "args": [
+ "awslabs.sagemaker-hyperpod-mcp-server@latest"
+ ],
+ "env": {
+ "AWS_PROFILE": "hyperpod-mcp-readonly-profile",
+ "AWS_REGION": "us-east-1",
+ "FASTMCP_LOG_LEVEL": "INFO"
+ },
+ "autoApprove": [
+ "manage_hyperpod_stacks",
+ "manage_hyperpod_cluster_nodes"
+ ]
+ }
+ }
+}
+```
+
+**For Windows:**
+```
+{
+ "mcpServers": {
+ "awslabs.sagemaker-hyperpod-mcp-server": {
+ "command": "uvx",
+ "args": [
+ "--from",
+ "awslabs.sagemaker-hyperpod-mcp-server@latest",
+ "awslabs.sagemaker-hyperpod-mcp-server.exe"
+ ],
+ "env": {
+ "AWS_PROFILE": "hyperpod-mcp-readonly-profile",
+ "AWS_REGION": "us-east-1",
+ "FASTMCP_LOG_LEVEL": "INFO"
+ },
+ "autoApprove": [
+ "manage_hyperpod_stacks",
+ "manage_hyperpod_cluster_nodes"
+ ]
+ }
+ }
+}
+```
+
+### Role Scoping Recommendations
+
+In accordance with security best practices, we recommend the following:
+
+1. **Create dedicated IAM roles** to be used by the HyperPod MCP Server with the principle of "least privilege."
+2. **Use separate roles** for read-only and write operations.
+3. **Implement resource tagging** to limit actions to resources created by the server.
+4. **Enable AWS CloudTrail** to audit all API calls made by the server.
+5. **Regularly review** the permissions granted to the server's IAM role.
+6. **Use IAM Access Analyzer** to identify unused permissions that can be removed.
+
+### Sensitive Information Handling
+
+**IMPORTANT**: Do not pass secrets or sensitive information via allowed input mechanisms:
+
+* Do not include secrets or credentials in CloudFormation templates.
+* Do not pass sensitive information directly in the prompt to the model.
+* Avoid using MCP tools for creating secrets, as this would require providing the secret data to the model.
+
+**CloudFormation Template Security**:
+
+* Only use CloudFormation templates from trustworthy sources.
+* The server relies on CloudFormation API validation for template content and does not perform its own validation.
+* Audit CloudFormation templates before applying them to your cluster.
+
+**Instead of passing secrets through MCP**:
+
+* Use AWS Secrets Manager or Parameter Store to store sensitive information.
+* Configure proper IAM roles for service accounts.
+* Use IAM roles for service accounts (IRSA) for AWS service access from pods.
+
+
+### File System Access and Operating Mode
+
+**Important**: This MCP server is intended for **STDIO mode only** as a local server using a single user's credentials. The server runs with the same permissions as the user who started it and has complete access to the file system.
+
+#### Security and Access Considerations
+
+- **Full File System Access**: The server can read from and write to any location on the file system where the user has permissions
+- **Host File System Sharing**: When using this server, the host file system is directly accessible
+- **Do Not Modify for Network Use**: This server is designed for local STDIO use only; network operation introduces additional security risks
+
+#### Common File Operations
+
+The MCP server can write a templated params JSON file to a user-specified absolute file path during HyperPod cluster creation.
+
+
+## General Best Practices
+
+* **Resource Naming**: Use descriptive names for HyperPod clusters and resources.
+* **Error Handling**: Check for errors in tool responses and handle them appropriately.
+* **Resource Cleanup**: Delete unused resources to avoid unnecessary costs.
+* **Monitoring**: Monitor cluster and resource status regularly.
+* **Security**: Follow AWS security best practices for HyperPod clusters.
+* **Backup**: Regularly backup important HyperPod resources.
+
+## General Troubleshooting
+
+* **Permission Errors**: Verify that your AWS credentials have the necessary permissions.
+* **CloudFormation Errors**: Check the CloudFormation console for stack creation errors.
+* **SageMaker API Errors**: Verify that the HyperPod cluster is running and accessible.
+* **Network Issues**: Check VPC and security group configurations.
+* **Client Errors**: Verify that the MCP client is configured correctly.
+* **Log Level**: Increase the log level to DEBUG for more detailed logs.
+
+For general HyperPod issues, consult the [Amazon SageMaker HyperPod documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod.html).
+
+## Version
+
+Current MCP server version: 0.1.0
diff --git a/src/sagemaker-hyperpod-mcp-server/awslabs/__init__.py b/src/sagemaker-hyperpod-mcp-server/awslabs/__init__.py
new file mode 100644
index 0000000000..5c624673e0
--- /dev/null
+++ b/src/sagemaker-hyperpod-mcp-server/awslabs/__init__.py
@@ -0,0 +1,16 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# This file is part of the awslabs namespace.
+# It is intentionally minimal to support PEP 420 namespace packages.
diff --git a/src/sagemaker-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/__init__.py b/src/sagemaker-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/__init__.py
new file mode 100644
index 0000000000..2bcd57027a
--- /dev/null
+++ b/src/sagemaker-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/__init__.py
@@ -0,0 +1,17 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""awslabs.sagemaker-hyperpod-mcp-server"""
+
+__version__ = '0.1.0'
diff --git a/src/sagemaker-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/aws_helper.py b/src/sagemaker-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/aws_helper.py
new file mode 100644
index 0000000000..abfb213b9d
--- /dev/null
+++ b/src/sagemaker-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/aws_helper.py
@@ -0,0 +1,153 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""AWS helper for the HyperPod MCP Server."""
+
+import boto3
+import os
+import time
+from awslabs.sagemaker_hyperpod_mcp_server.consts import SUPPORTED_REGIONS
+from botocore.config import Config
+from loguru import logger
+from pydantic import validate_call
+from typing import Any, Dict, Optional, cast, get_args
+
+
+# TODO: Import version from package
+__version__ = '0.1.0'
+
+
+class AwsHelper:
+ """Helper class for AWS operations.
+
+ This class provides utility methods for interacting with AWS services,
+ including region and profile management and client creation.
+
+ This class implements a singleton pattern with a client cache to avoid
+ creating multiple clients for the same service. The cache includes TTL-based
+ expiration and size limits to prevent memory issues and handle credential rotation.
+ """
+
+ # Singleton instance
+ _instance = None
+
+ # Client cache with AWS service name as key
+ _client_cache: Dict[str, Any] = {}
+
+ # Cache metadata for TTL and size management
+ _cache_metadata: Dict[str, float] = {} # key -> timestamp
+ _cache_ttl: int = 1800 # 30 minutes TTL
+ _cache_max_size: int = 100 # Maximum 100 cache entries
+
+ @staticmethod
+ def get_aws_region() -> Optional[SUPPORTED_REGIONS]:
+ """Get the AWS region from the environment if set."""
+ region = os.environ.get('AWS_REGION')
+ return cast(SUPPORTED_REGIONS, region) if region in get_args(SUPPORTED_REGIONS) else None
+
+ @staticmethod
+ def get_aws_profile() -> Optional[str]:
+ """Get the AWS profile from the environment if set."""
+ return os.environ.get('AWS_PROFILE')
+
+ @classmethod
+ @validate_call
+ def create_boto3_client(
+ cls, service_name: str, region_name: Optional[SUPPORTED_REGIONS] = None
+ ) -> Any:
+ """Create or retrieve a cached boto3 client with the appropriate profile and region.
+
+ The client is configured with a custom user agent suffix 'awslabs/mcp/sagemaker-hyperpod-mcp-server/{version}'
+ to identify API calls made by the HyperPod MCP Server. Clients are cached to improve performance
+ and reduce resource usage.
+
+ Args:
+ service_name: The AWS service name (e.g., 'sagemaker')
+ region_name: Optional region name override
+
+ Returns:
+ A boto3 client for the specified service
+
+ Raises:
+ Exception: If there's an error creating the client
+ """
+ try:
+ # Get region from parameter or environment if set
+ region: Optional[SUPPORTED_REGIONS] = (
+ region_name if region_name is not None else cls.get_aws_region()
+ )
+
+ # Get profile from environment if set
+ profile = cls.get_aws_profile()
+
+ # Use service name as the cache key
+ cache_key = f'{service_name}+{region_name}'
+
+ # Check if client is already in cache and not expired
+ current_time = time.time()
+ if cache_key in cls._client_cache:
+ # Check TTL expiration (lazy expiration)
+ if cache_key in cls._cache_metadata:
+ cache_time = cls._cache_metadata[cache_key]
+ if current_time - cache_time < cls._cache_ttl:
+ logger.info(
+ f'Using cached boto3 client for {service_name} in {region_name}'
+ )
+ return cls._client_cache[cache_key]
+ else:
+ # Expired - remove from cache
+ logger.info(
+ f'Cache expired for {service_name} in {region_name}, creating new client'
+ )
+ del cls._client_cache[cache_key]
+ del cls._cache_metadata[cache_key]
+ else:
+ # No metadata, treat as expired
+ del cls._client_cache[cache_key]
+
+ # Create config with user agent suffix
+ config = Config(
+ user_agent_extra=f'awslabs/mcp/sagemaker-hyperpod-mcp-server/{__version__}'
+ )
+
+ # Create session with profile if specified
+ if profile:
+ session = boto3.Session(profile_name=profile)
+ if region is not None:
+ client = session.client(service_name, region_name=region, config=config)
+ else:
+ client = session.client(service_name, config=config)
+ else:
+ if region is not None:
+ client = boto3.client(service_name, region_name=region, config=config)
+ else:
+ client = boto3.client(service_name, config=config)
+
+ # Enforce cache size limit before adding new entry
+ if len(cls._client_cache) >= cls._cache_max_size:
+ # Remove oldest entry (simple FIFO eviction)
+ oldest_key = min(cls._cache_metadata.keys(), key=lambda k: cls._cache_metadata[k])
+ logger.info(f'Cache size limit reached, evicting oldest entry: {oldest_key}')
+ del cls._client_cache[oldest_key]
+ del cls._cache_metadata[oldest_key]
+
+ # Cache the client with timestamp metadata
+ cls._client_cache[cache_key] = client
+ cls._cache_metadata[cache_key] = current_time
+
+ logger.info(f'Created and cached new boto3 client for {service_name} in {region_name}')
+ return client
+ except Exception as e:
+ # Re-raise with more context
+ raise Exception(f'Failed to create boto3 client for {service_name}: {str(e)}') from e
diff --git a/src/sagemaker-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/consts.py b/src/sagemaker-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/consts.py
new file mode 100644
index 0000000000..3cd68f8b69
--- /dev/null
+++ b/src/sagemaker-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/consts.py
@@ -0,0 +1,62 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Constants for the HyperPod MCP Server."""
+
+from typing import Literal, TypeAlias
+
+
+# HyperPod Stack Management Operations
+STACK_DEPLOY_OPERATION = 'deploy'
+STACK_DESCRIBE_OPERATION = 'describe'
+STACK_DELETE_OPERATION = 'delete'
+
+# HyperPod Node Management Operations
+LIST_CLUSTERS_OPERATION = 'list_clusters'
+LIST_NODES_OPERATION = 'list_nodes'
+DESCRIBE_NODE_OPERATION = 'describe_node'
+UPDATE_SOFTWARE_OPERATION = 'update_software'
+BATCH_DELETE_OPERATION = 'batch_delete'
+
+# AWS CloudFormation
+CFN_CAPABILITY_IAM = 'CAPABILITY_IAM'
+CFN_CAPABILITY_NAMED_IAM = 'CAPABILITY_NAMED_IAM'
+CAPABILITY_AUTO_EXPAND = 'CAPABILITY_AUTO_EXPAND'
+CFN_ON_FAILURE_DELETE = 'DELETE'
+CFN_STACK_TAG_KEY = 'CreatedBy'
+CFN_STACK_TAG_VALUE = 'HyperPodMCPServer'
+HYPERPOD_CFN_TEMPLATE_URL_EKS = 'https://aws-sagemaker-hyperpod-cluster-setup-us-east-1-prod.s3.us-east-1.amazonaws.com/templates/main-stack-eks-based-template.yaml'
+HYPERPOD_CFN_TEMPLATE_URL_SLURM = 'https://aws-sagemaker-hyperpod-cluster-setup-us-east-1-prod.s3.us-east-1.amazonaws.com/templates-slurm/main-stack-slurm-based-template.yaml'
+
+# Error message templates
+STACK_NOT_OWNED_ERROR_TEMPLATE = (
+ 'Stack {stack_name} exists but was not created by {tool_name}. '
+ 'For safety reasons, this tool will only {operation} stacks that were created by itself. '
+ 'To manage this stack, please use the AWS Console, CLI, or the tool that created it.'
+)
+
+
+STACK_OPERATIONS = Literal['deploy', 'describe', 'delete']
+
+SUPPORTED_REGIONS = Literal['us-east-1', 'us-east-2', 'us-west-1', 'us-west-2']
+
+CLUSTER_ORCHESTRATORS = Literal['eks', 'slurm']
+
+NODE_OPERATIONS: TypeAlias = Literal[
+ 'list_clusters',
+ 'list_nodes',
+ 'describe_node',
+ 'update_software',
+ 'batch_delete',
+]
diff --git a/src/sagemaker-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/hyperpod_cluster_node_handler.py b/src/sagemaker-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/hyperpod_cluster_node_handler.py
new file mode 100644
index 0000000000..c1b326f657
--- /dev/null
+++ b/src/sagemaker-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/hyperpod_cluster_node_handler.py
@@ -0,0 +1,1427 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""HyperPod cluster node handler for the HyperPod MCP Server."""
+
+import os
+from awslabs.sagemaker_hyperpod_mcp_server.aws_helper import AwsHelper
+from awslabs.sagemaker_hyperpod_mcp_server.consts import (
+ BATCH_DELETE_OPERATION,
+ DESCRIBE_NODE_OPERATION,
+ LIST_CLUSTERS_OPERATION,
+ LIST_NODES_OPERATION,
+ NODE_OPERATIONS,
+ SUPPORTED_REGIONS,
+ UPDATE_SOFTWARE_OPERATION,
+)
+from awslabs.sagemaker_hyperpod_mcp_server.logging_helper import LogLevel, log_with_request_id
+from awslabs.sagemaker_hyperpod_mcp_server.models import (
+ BatchDeleteClusterNodesError,
+ BatchDeleteClusterNodesResponse,
+ ClusterEbsVolumeConfig,
+ ClusterInstancePlacement,
+ ClusterInstanceStatusDetails,
+ ClusterInstanceStorageConfig,
+ ClusterLifeCycleConfig,
+ ClusterNodeDetails,
+ ClusterNodeSummary,
+ ClusterSummary,
+ DeploymentConfiguration,
+ DescribeClusterNodeResponse,
+ ListClusterNodesResponse,
+ ListClustersResponse,
+ UpdateClusterSoftwareInstanceGroupSpecification,
+ UpdateClusterSoftwareResponse,
+ VpcConfig,
+)
+from mcp.server.fastmcp import Context
+from mcp.types import TextContent
+from pydantic import Field, validate_call
+from typing import Any, List, Literal, Optional, Union
+
+
+class HyperPodClusterNodeHandler:
+ """Handler for HyperPod cluster node operations in the HyperPod MCP Server.
+
+ This class provides tools for interacting with SageMaker HyperPod cluster nodes.
+ """
+
+ def __init__(
+ self,
+ mcp,
+ allow_write: bool = False,
+ allow_sensitive_data_access: bool = False,
+ ):
+ """Initialize the HyperPod cluster node handler.
+
+ Args:
+ mcp: The MCP server instance
+ allow_write: Whether to enable write access (default: False)
+ allow_sensitive_data_access: Whether to allow access to sensitive data (default: False)
+ """
+ self.mcp = mcp
+ self.allow_write = allow_write
+ self.allow_sensitive_data_access = allow_sensitive_data_access
+
+ # Register tools
+ # temp workaround for update cluster, remove once update is fixed
+ self.mcp.tool(name='describe_hp_cluster')(self.describe_hp_cluster)
+ self.mcp.tool(name='update_hp_cluster')(self.update_hp_cluster)
+
+ self.mcp.tool(name='manage_hyperpod_cluster_nodes')(self.manage_hyperpod_cluster_nodes)
+
+ def get_sagemaker_client(
+ self,
+ ctx: Context,
+ region_name: Optional[SUPPORTED_REGIONS] = None,
+ profile_name: Optional[str] = None,
+ ):
+ """Get a SageMaker client for the specified region and profile.
+
+ Args:
+ ctx: The MCP context
+ region_name: Optional AWS region name
+ profile_name: Optional AWS profile name. Using the correct profile is important
+ for successful API calls, especially for SageMaker HyperPod operations.
+
+ Returns:
+ A boto3 SageMaker client
+ """
+ # Set AWS_PROFILE environment variable if profile_name is provided
+ if profile_name:
+ log_with_request_id(ctx, LogLevel.INFO, f'Using AWS profile: {profile_name}')
+ os.environ['AWS_PROFILE'] = profile_name
+
+ return AwsHelper.create_boto3_client('sagemaker', region_name=region_name)
+
+ @validate_call
+ async def manage_hyperpod_cluster_nodes(
+ self,
+ ctx: Context,
+ operation: NODE_OPERATIONS = Field(
+ description='Operation to perform: list_clusters, list_nodes, describe_node, update_software, or batch_delete. Choose "list_clusters" or "list_nodes" or "describe_node" for read-only operations when write access is disabled.',
+ ),
+ cluster_name: Optional[str] = Field(
+ None,
+ description='The name of the cluster. Required for all operations except "list_clusters".',
+ ),
+ node_id: Optional[str] = Field(
+ None,
+ description='The ID of the SageMaker HyperPod cluster node. Required for "describe_node" operation.',
+ ),
+ node_ids: Optional[List[str]] = Field(
+ None,
+ description='The list of node IDs to delete from the cluster. Required for "batch_delete" operation.',
+ ),
+ # Parameters for list_clusters operation
+ max_results: Optional[int] = Field(
+ 10,
+ description='The maximum number of results to return in the response. Default: 10. Used for "list_clusters" and "list_nodes" operations.',
+ ge=1,
+ le=100,
+ ),
+ next_token: Optional[str] = Field(
+ None,
+ description='If the response to a previous request was truncated, the response includes a NextToken. To retrieve the next set of results, use the token in the next request. Used for "list_clusters" and "list_nodes" operations.',
+ ),
+ name_contains: Optional[str] = Field(
+ None,
+ description='A filter that returns only clusters whose name contains the specified string. Used for "list_clusters" operation.',
+ ),
+ # Parameters for list_nodes operation
+ creation_time_after: Optional[str] = Field(
+ None,
+ description='Filter for nodes/clusters created after the specified time. Accepts formats: ISO 8601 (e.g., 2014-10-01T20:30:00Z), date only (e.g., 2014-10-01), or Unix time in seconds. Used for "list_clusters" and "list_nodes" operations.',
+ ),
+ creation_time_before: Optional[str] = Field(
+ None,
+ description='Filter for nodes/clusters created before the specified time. Accepts formats: ISO 8601 (e.g., 2014-10-01T20:30:00Z), date only (e.g., 2014-10-01), or Unix time in seconds. Used for "list_clusters" and "list_nodes" operations.',
+ ),
+ instance_group_name_contains: Optional[str] = Field(
+ None,
+ description='Filter for nodes in instance groups whose name contains the specified string. Used for "list_nodes" operation.',
+ ),
+ sort_by: Optional[Literal['CREATION_TIME', 'NAME']] = Field(
+ default='CREATION_TIME', description='The field to sort results by...'
+ ),
+ sort_order: Optional[Literal['Ascending', 'Descending']] = Field(
+ default='Ascending',
+ description='The sort order for results. The default is Ascending. Used for "list_clusters" and "list_nodes" operations.',
+ ),
+ training_plan_arn: Optional[str] = Field(
+ None,
+ description='The Amazon Resource Name (ARN) of the training plan to filter clusters by. Used for "list_clusters" operation.',
+ ),
+ # Parameters for update_software operation
+ deployment_config: Optional[DeploymentConfiguration] = Field(
+ None,
+ description='The configuration to use when updating the AMI versions. Used for "update_software" operation.',
+ ),
+ instance_groups: Optional[List[UpdateClusterSoftwareInstanceGroupSpecification]] = Field(
+ None,
+ description='The array of instance groups for which to update AMI versions. Used for "update_software" operation.',
+ ),
+ # Common parameters
+ region_name: Optional[SUPPORTED_REGIONS] = Field(
+ 'us-east-1',
+ description='AWS region name. Default is us-east-1.',
+ ),
+ profile_name: Optional[str] = Field(
+ None,
+ description='AWS profile name. If not provided, uses the default profile.',
+ ),
+ ) -> Union[
+ ListClustersResponse,
+ ListClusterNodesResponse,
+ DescribeClusterNodeResponse,
+ UpdateClusterSoftwareResponse,
+ BatchDeleteClusterNodesResponse,
+ ]:
+ """Manage SageMaker HyperPod clusters and nodes with both read and write operations.
+
+ This tool provides operations for managing SageMaker HyperPod clusters and nodes, including listing clusters,
+ listing nodes, describing a specific node, updating cluster software, and deleting nodes. It serves as a consolidated
+ interface for all cluster and node-related operations, simplifying the management of HyperPod resources.
+
+ ## Operations
+ - **list_clusters**: List SageMaker HyperPod clusters with options for pagination and filtering
+ - **list_nodes**: List nodes in a SageMaker HyperPod cluster with options for pagination and filtering
+ - **describe_node**: Get detailed information about a specific node in a SageMaker HyperPod cluster
+ - **update_software**: Update the software for a SageMaker HyperPod cluster
+ - **batch_delete**: Delete multiple nodes from a SageMaker HyperPod cluster in a single operation
+
+ ## Response Information
+ The response type varies based on the operation:
+ - list_clusters: Returns ListClustersResponse with a list of clusters
+ - list_nodes: Returns ListClusterNodesResponse with a list of nodes
+ - describe_node: Returns DescribeClusterNodeResponse with detailed node information
+ - update_software: Returns UpdateClusterSoftwareResponse with the cluster ARN
+ - batch_delete: Returns BatchDeleteClusterNodesResponse with details of the deletion operation
+
+ ## Important Notes
+ - ALWAYS show the important notes for operations batch_delete and update_software BEFORE executing the operations
+ - For update_software:
+ The UpgradeClusterSoftware API call may impact your SageMaker HyperPod cluster uptime and availability. Plan accordingly to mitigate potential disruptions to your workloads
+ - For batch_delete:
+ - BEFORE running the tool, ALWAYS remind the user of all of the following
+ - To safeguard your work, back up your data to Amazon S3 or an FSx for Lustre file system before invoking
+ the API on a worker node group. This will help prevent any potential data loss from the instance root volume.
+ For more information about backup, see Use the backup script provided by SageMaker HyperPod:
+ https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-backup-restore.html
+ - If you want to invoke this API on an existing cluster, you'll first need to patch the cluster by running
+ the UpdateClusterSoftware API. For more information about patching a cluster, see Update the SageMaker
+ HyperPod platform software of a cluster:
+ https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-update-software.html
+ - Deleting nodes will permanently remove them from the cluster
+ - This operation cannot be undone
+ - Ensure you have selected the correct nodes before proceeding
+ - This operation requires write access to be enabled for the handler
+
+ ## Usage Tips
+ - Use "list_clusters" operation to get an overview of all available clusters in a specified region
+ - Use "list_nodes" operation to get an overview of all nodes in a specific cluster
+ - Use "describe_node" operation to get detailed information about a specific node
+ - Use "update_software" operation to update the software on all nodes or specific instance groups
+ - Use "batch_delete" operation to delete multiple nodes in a single request
+ - Specify region_name to operate on a cluster in a specific region
+ - Specify profile_name to use a specific AWS profile with appropriate permissions
+
+ ## Fallback Options
+ - If this tool fails, advise using AWS SageMaker CLI alternatives:
+ - List clusters: `aws sagemaker list-clusters --region <region>`
+ - List nodes: `aws sagemaker list-cluster-nodes --cluster-name <cluster-name> --region <region>`
+ - Describe node: `aws sagemaker describe-cluster-node --cluster-name <cluster-name> --node-id <node-id> --region <region>`
+ - Update software: `aws sagemaker update-cluster-software --cluster-name <cluster-name> --region <region>`
+ - Delete nodes: `aws sagemaker batch-delete-cluster-nodes --cluster-name <cluster-name> --node-ids <node-ids> --region <region>`
+ - Alternatively, advise using the SageMaker HyperPod console for cluster and node management
+
+ Args:
+ ctx: MCP context
+ operation: Operation to perform (list_clusters, list_nodes, describe_node, update_software, or batch_delete)
+ cluster_name: The name of the cluster (required for all operations except list_clusters)
+ node_id: The ID of the node (required for describe_node operation)
+ node_ids: List of node IDs to delete (required for batch_delete operation)
+ max_results: Maximum number of results to return (for list_clusters and list_nodes operations)
+ next_token: Token for pagination (for list_clusters and list_nodes operations)
+ name_contains: Filter clusters by name (for list_clusters operation)
+ creation_time_after: Filter by creation time after (for list_clusters and list_nodes operations)
+ creation_time_before: Filter by creation time before (for list_clusters and list_nodes operations)
+ instance_group_name_contains: Filter by instance group name (for list_nodes operation)
+ sort_by: Sort field (for list_clusters and list_nodes operations)
+ sort_order: Sort order (for list_clusters and list_nodes operations)
+ training_plan_arn: Filter clusters by training plan ARN (for list_clusters operation)
+ deployment_config: Configuration for the update process (for update_software operation)
+ instance_groups: Specific instance groups to update (for update_software operation)
+ region_name: AWS region name (default: us-east-1)
+ profile_name: AWS profile name (optional)
+
+ Returns:
+ Union[ListClustersResponse, ListClusterNodesResponse, DescribeClusterNodeResponse, UpdateClusterSoftwareResponse, BatchDeleteClusterNodesResponse]:
+ Response specific to the operation performed
+ """
+ try:
+ # Validate operation-specific required parameters
+ if operation != 'list_clusters' and cluster_name is None:
+ raise ValueError(
+ 'cluster_name is required for all operations except list_clusters'
+ )
+ if operation == 'describe_node' and node_id is None:
+ raise ValueError('node_id is required for describe_node operation')
+ if operation == 'batch_delete' and (node_ids is None or len(node_ids) == 0):
+ raise ValueError('node_ids is required for batch_delete operation')
+
+ # Set default values for None parameters to satisfy type checker
+ if max_results is None:
+ max_results = 10
+
+ # Check if write access is disabled and trying to perform a mutating operation
+ if not self.allow_write and operation in [
+ UPDATE_SOFTWARE_OPERATION,
+ BATCH_DELETE_OPERATION,
+ ]:
+ error_message = f'Operation {operation} is not allowed without write access'
+ log_with_request_id(ctx, LogLevel.ERROR, error_message)
+
+ # Return appropriate response type based on operation
+ if operation == UPDATE_SOFTWARE_OPERATION:
+ return UpdateClusterSoftwareResponse(
+ isError=True,
+ content=[TextContent(type='text', text=error_message)],
+ cluster_arn='',
+ )
+ elif operation == BATCH_DELETE_OPERATION:
+ # Ensure cluster_name is not None for the response
+ safe_cluster_name = cluster_name if cluster_name is not None else ''
+ return BatchDeleteClusterNodesResponse(
+ isError=True,
+ content=[TextContent(type='text', text=error_message)],
+ cluster_name=safe_cluster_name,
+ successful=[],
+ failed=None,
+ )
+
+ # Dispatch to the appropriate operation handler
+ if operation == LIST_CLUSTERS_OPERATION:
+ return await self._list_hp_clusters(
+ ctx=ctx,
+ max_results=max_results,
+ next_token=next_token,
+ name_contains=name_contains,
+ creation_time_after=creation_time_after,
+ creation_time_before=creation_time_before,
+ sort_by=sort_by,
+ sort_order=sort_order,
+ training_plan_arn=training_plan_arn,
+ region_name=region_name,
+ profile_name=profile_name,
+ )
+ elif operation == LIST_NODES_OPERATION:
+ # Ensure cluster_name is not None
+ if cluster_name is None:
+ raise ValueError('cluster_name is required for list_nodes operation')
+ return await self._list_hp_cluster_nodes(
+ ctx=ctx,
+ cluster_name=cluster_name,
+ creation_time_after=creation_time_after,
+ creation_time_before=creation_time_before,
+ instance_group_name_contains=instance_group_name_contains,
+ max_results=max_results,
+ next_token=next_token,
+ sort_by=sort_by,
+ sort_order=sort_order,
+ region_name=region_name,
+ profile_name=profile_name,
+ )
+ elif operation == DESCRIBE_NODE_OPERATION:
+ # Ensure cluster_name and node_id are not None
+ if cluster_name is None:
+ raise ValueError('cluster_name is required for describe_node operation')
+ if node_id is None:
+ raise ValueError('node_id is required for describe_node operation')
+ return await self._describe_hp_cluster_node(
+ ctx=ctx,
+ cluster_name=cluster_name,
+ node_id=node_id,
+ region_name=region_name,
+ profile_name=profile_name,
+ )
+ elif operation == UPDATE_SOFTWARE_OPERATION:
+ # Ensure cluster_name is not None
+ if cluster_name is None:
+ raise ValueError('cluster_name is required for update_software operation')
+ return await self._update_hp_cluster_software(
+ ctx=ctx,
+ cluster_name=cluster_name,
+ deployment_config=deployment_config,
+ instance_groups=instance_groups,
+ region_name=region_name,
+ profile_name=profile_name,
+ )
+ elif operation == BATCH_DELETE_OPERATION:
+ # Ensure cluster_name and node_ids are not None
+ if cluster_name is None:
+ raise ValueError('cluster_name is required for batch_delete operation')
+ if node_ids is None:
+ raise ValueError('node_ids is required for batch_delete operation')
+ return await self._batch_delete_hp_cluster_nodes(
+ ctx=ctx,
+ cluster_name=cluster_name,
+ node_ids=node_ids,
+ region_name=region_name,
+ profile_name=profile_name,
+ )
+ else:
+ error_message = f'Invalid operation: {operation}. Must be one of: {LIST_CLUSTERS_OPERATION}, {LIST_NODES_OPERATION}, {DESCRIBE_NODE_OPERATION}, {UPDATE_SOFTWARE_OPERATION}, {BATCH_DELETE_OPERATION}'
+ log_with_request_id(ctx, LogLevel.ERROR, error_message)
+ # Default to ListClusterNodesResponse for invalid operations
+ return ListClusterNodesResponse(
+ isError=True,
+ content=[TextContent(type='text', text=error_message)],
+ nodes=[],
+ next_token=None,
+ )
+ except ValueError as e:
+ # Re-raise ValueError for parameter validation errors
+ log_with_request_id(ctx, LogLevel.ERROR, f'Parameter validation error: {str(e)}')
+ raise
+ except Exception as e:
+ error_message = f'Error in manage_hyperpod_cluster_nodes: {str(e)}'
+ log_with_request_id(ctx, LogLevel.ERROR, error_message)
+ # Default to ListClusterNodesResponse for general exceptions
+ return ListClusterNodesResponse(
+ isError=True,
+ content=[TextContent(type='text', text=error_message)],
+ nodes=[],
+ next_token=None,
+ )
+
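The dispatcher's parameter checks above can be summarized as a small standalone sketch. The helper name `validate_node_operation` is hypothetical; the real handler raises `ValueError` with the same messages.

```python
# Hypothetical sketch of the dispatcher's parameter validation rules.
# Mirrors the checks in manage_hyperpod_cluster_nodes; names are illustrative.

def validate_node_operation(operation, cluster_name=None, node_id=None, node_ids=None):
    """Raise ValueError when required parameters for an operation are missing."""
    if operation != 'list_clusters' and cluster_name is None:
        raise ValueError('cluster_name is required for all operations except list_clusters')
    if operation == 'describe_node' and node_id is None:
        raise ValueError('node_id is required for describe_node operation')
    if operation == 'batch_delete' and not node_ids:
        raise ValueError('node_ids is required for batch_delete operation')


# A missing cluster_name is rejected for any non-list_clusters operation:
try:
    validate_node_operation('list_nodes')
except ValueError as e:
    print(e)  # cluster_name is required for all operations except list_clusters
```

Note that an empty `node_ids` list is treated the same as `None`, matching the `len(node_ids) == 0` check in the handler.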
+ async def _list_hp_clusters(
+ self,
+ ctx: Context,
+ max_results: int = Field(
+ 10,
+ description='The maximum number of clusters to return in the response. Default: 10.',
+ ge=1,
+ le=100,
+ ),
+ next_token: Optional[str] = Field(
+ None,
+ description='If the response to a previous ListClusters request was truncated, the response includes a NextToken. To retrieve the next set of clusters, use the token in the next request.',
+ ),
+ name_contains: Optional[str] = Field(
+ None,
+ description='A filter that returns only clusters whose name contains the specified string.',
+ ),
+ creation_time_after: Optional[str] = Field(
+ None,
+ description='A filter that returns only clusters created after the specified time. Accepts formats: ISO 8601 (e.g., 2014-10-01T20:30:00.000Z), date only (e.g., 2014-10-01), or Unix time in seconds.',
+ ),
+ creation_time_before: Optional[str] = Field(
+ None,
+ description='A filter that returns only clusters created before the specified time. Accepts formats: ISO 8601 (e.g., 2014-10-01T20:30:00.000Z), date only (e.g., 2014-10-01), or Unix time in seconds.',
+ ),
+ sort_by: Optional[Literal['NAME', 'CREATION_TIME']] = Field(
+ default='CREATION_TIME',
+ description='The field to sort results by. The default is CREATION_TIME.',
+ ),
+ sort_order: Optional[Literal['Ascending', 'Descending']] = Field(
+ default='Ascending',
+ description='The sort order for results. The default is Ascending.',
+ ),
+ training_plan_arn: Optional[str] = Field(
+ None,
+ description='The Amazon Resource Name (ARN) of the training plan to filter clusters by.',
+ ),
+ region_name: Optional[SUPPORTED_REGIONS] = Field(
+ 'us-east-1',
+ description='AWS region name. Default is us-east-1.',
+ ),
+ profile_name: Optional[str] = Field(
+ None,
+ description='AWS profile name. If not provided, uses the default profile.',
+ ),
+ ) -> ListClustersResponse:
+ """List SageMaker HyperPod clusters.
+
+ This tool lists SageMaker HyperPod clusters with options for pagination and filtering.
+ It returns information about each cluster including name, ARN, status, creation time,
+ and training plan ARNs.
+
+ ## Response Information
+ The response includes a summary of each cluster with cluster name, ARN, status,
+ creation time, and training plan ARNs.
+
+ ## Usage Tips
+ - Use max_results and next_token for pagination when there are many clusters
+ - Use name_contains to filter clusters by name
+ - Use creation_time_after and creation_time_before to filter by creation time; input should be formatted like 2014-10-01T20:30:00.000Z, 2014-10-01T12:30:00.000-08:00, 2014-10-01, or 1412195400
+ - Use training_plan_arn to filter clusters by training plan
+ - Use sort_by and sort_order to control the order of results
+ - Specify region_name to list clusters in a specific region
+ - Specify profile_name to use a specific AWS profile with appropriate permissions
+ for SageMaker HyperPod operations
+
+ Args:
+ ctx: MCP context
+ max_results: Maximum number of clusters to return (default: 10)
+ next_token: Token for pagination (optional)
+ name_contains: Filter clusters by name (optional)
+ creation_time_after: Filter by creation time after as string (example format: 2014-10-01T20:30:00.000Z, 2014-10-01T12:30:00.000-08:00, 2014-10-01, 1412195400) (optional)
+ creation_time_before: Filter by creation time before as string (example format: 2014-10-01T20:30:00.000Z, 2014-10-01T12:30:00.000-08:00, 2014-10-01, 1412195400) (optional)
+ sort_by: Sort field (default: CREATION_TIME)
+ sort_order: Sort order (default: Ascending)
+ training_plan_arn: Filter clusters by training plan ARN (optional)
+ region_name: AWS region name (default: us-east-1)
+ profile_name: AWS profile name (optional)
+
+ Returns:
+ ListClustersResponse with list of clusters
+ """
+ try:
+ # Get SageMaker client
+ sagemaker_client = self.get_sagemaker_client(
+ ctx, region_name=region_name, profile_name=profile_name
+ )
+
+ # Prepare parameters for list_clusters API call
+ params: dict[str, Any] = {}
+
+ # Add parameters only if they are provided
+ if max_results is not None:
+ params['MaxResults'] = max_results
+ if next_token is not None:
+ params['NextToken'] = next_token
+ if name_contains is not None:
+ params['NameContains'] = name_contains
+ if creation_time_after is not None:
+ params['CreationTimeAfter'] = creation_time_after
+ if creation_time_before is not None:
+ params['CreationTimeBefore'] = creation_time_before
+ if sort_by is not None:
+ params['SortBy'] = sort_by
+ if sort_order is not None:
+ params['SortOrder'] = sort_order
+ if training_plan_arn is not None:
+ params['TrainingPlanArn'] = training_plan_arn
+
+ # Call SageMaker API to list clusters
+ log_with_request_id(
+ ctx, LogLevel.INFO, f'Calling SageMaker list_clusters API with params: {params}'
+ )
+ try:
+ response = sagemaker_client.list_clusters(**params)
+ log_with_request_id(
+ ctx, LogLevel.INFO, f'SageMaker list_clusters API response: {response}'
+ )
+ except Exception as e:
+ log_with_request_id(
+ ctx, LogLevel.ERROR, f'SageMaker list_clusters API error: {str(e)}'
+ )
+ raise
+
+ # Extract clusters from response
+ clusters = []
+ for cluster in response.get('ClusterSummaries', []):
+ cluster_summary = ClusterSummary(
+ cluster_name=cluster.get('ClusterName', ''),
+ cluster_arn=cluster.get('ClusterArn', ''),
+ cluster_status=cluster.get('ClusterStatus', ''),
+ creation_time=str(cluster.get('CreationTime', '')),
+ training_plan_arns=cluster.get('TrainingPlanArns'),
+ )
+ clusters.append(cluster_summary)
+
+ # Get next token for pagination
+ next_token_response = response.get('NextToken')
+
+ # Log success
+ log_with_request_id(
+ ctx,
+ LogLevel.INFO,
+ f'Successfully listed {len(clusters)} SageMaker HyperPod clusters',
+ )
+
+ # Return success response
+ return ListClustersResponse(
+ isError=False,
+ content=[
+ TextContent(
+ type='text',
+ text=f'Successfully listed {len(clusters)} SageMaker HyperPod clusters',
+ )
+ ],
+ clusters=clusters,
+ next_token=next_token_response,
+ )
+
+ except Exception as e:
+ # Log error
+ error_msg = f'Failed to list SageMaker HyperPod clusters: {str(e)}'
+ log_with_request_id(ctx, LogLevel.ERROR, error_msg)
+
+ # Return error response
+ return ListClustersResponse(
+ isError=True,
+ content=[TextContent(type='text', text=error_msg)],
+ clusters=[],
+ next_token=None,
+ )
+
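`list_clusters` is paginated via `NextToken`, which is why the handler forwards `max_results` and `next_token` through. A minimal sketch of draining all pages; `fake_list_clusters` is an illustrative stub standing in for `sagemaker_client.list_clusters(**params)`:

```python
# Sketch of NextToken pagination against a stubbed list_clusters.
# PAGES maps an incoming token to the page the API would return.

PAGES = {
    None: {'ClusterSummaries': [{'ClusterName': 'a'}], 'NextToken': 't1'},
    't1': {'ClusterSummaries': [{'ClusterName': 'b'}]},  # last page: no NextToken
}

def fake_list_clusters(NextToken=None, MaxResults=10):
    return PAGES[NextToken]

def all_clusters():
    clusters, token = [], None
    while True:
        page = fake_list_clusters(NextToken=token)
        clusters.extend(c['ClusterName'] for c in page.get('ClusterSummaries', []))
        token = page.get('NextToken')
        if token is None:
            return clusters

print(all_clusters())  # ['a', 'b']
```

The handler returns one page per call and surfaces `NextToken` to the client instead of looping, but the token handoff is the same.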
+ async def describe_hp_cluster(
+ self,
+ ctx: Context,
+ cluster_name: str = Field(
+ ...,
+ description='The name of the cluster to describe.',
+ ),
+ region_name: SUPPORTED_REGIONS = Field(
+ description='AWS region name. Default is us-east-1.',
+ ),
+ profile_name: Optional[str] = Field(
+ None,
+ description='AWS profile name. If not provided, uses the default profile.',
+ ),
+ ):
+ """Describe a SageMaker HyperPod cluster.
+
+ Args:
+ ctx: MCP context
+ cluster_name: REQUIRED - Target cluster for describe cluster api
+ region_name: REQUIRED - AWS region name
+ profile_name: AWS profile name (optional)
+
+ ## Fallback Options:
+ - If this tool fails, advise using AWS SageMaker CLI option: `aws sagemaker describe-cluster --cluster-name <cluster-name> --region <region>`
+ - Or as another alternative, advise checking directly in the SageMaker HyperPod console (Amazon SageMaker AI → HyperPod Clusters → Cluster Management → select cluster)
+
+ Returns:
+ describe cluster response
+ """
+ sagemaker_client = self.get_sagemaker_client(
+ ctx, region_name=region_name, profile_name=profile_name
+ )
+ params = {'ClusterName': cluster_name}
+ response = sagemaker_client.describe_cluster(**params)
+ return response
+
+ async def update_hp_cluster(
+ self,
+ ctx: Context,
+ cluster_name: str = Field(
+ ...,
+ description='The name of the cluster to update.',
+ ),
+ instance_groups: list = Field(
+ ...,
+ description='List of instance groups to update.',
+ ),
+ region_name: SUPPORTED_REGIONS = Field(
+ description='AWS region name. Default is us-east-1.',
+ ),
+ profile_name: Optional[str] = Field(
+ None,
+ description='AWS profile name. If not provided, uses the default profile.',
+ ),
+ ):
+ """Update a SageMaker HyperPod clusters.
+
+ Notes:
+ - Before using this tool, get the most recent cluster instance group configurations by calling the describe_hp_cluster tool first
+ - Modify the instance group configuration based on the user's request
+ - IMPORTANT: use "InstanceCount" (NOT "CurrentCount" or "TargetCount") for the desired target count
+ - Pass the configuration back in the instance_groups parameter
+ - Example instance_groups parameter:
+ "instance_groups": [
+ {
+ "OverrideVpcConfig": {
+ "SecurityGroupIds": ["<>"],
+ "Subnets": ["<>"]
+ },
+ "InstanceCount": <>,
+ "InstanceGroupName": "<>",
+ "InstanceStorageConfigs": [
+ {"EbsVolumeConfig": {"VolumeSizeInGB": <>}}
+ ],
+ "LifeCycleConfig": {
+ "SourceS3Uri": "<>",
+ "OnCreate": "<>"
+ },
+ "InstanceType": "<>",
+ "ThreadsPerCore": <>,
+ "ExecutionRole": "<>"
+ }
+ ]
+
+ ## Fallback Options:
+ - If this tool fails, advise using AWS SageMaker CLI option: `aws sagemaker update-cluster --region <region>` with all appropriate parameters
+ - Or as another alternative, advise making updates directly in the SageMaker HyperPod console (Amazon SageMaker AI → HyperPod Clusters → Cluster Management → select cluster → Edit)
+ - To verify results: use CLI `aws sagemaker describe-cluster --cluster-name <cluster-name> --region <region>` or verify directly in the console
+
+ Args:
+ ctx: MCP context
+ cluster_name: REQUIRED: cluster name to update
+ instance_groups: REQUIRED: instance group configurations
+ region_name: REQUIRED - AWS region name
+ profile_name: AWS profile name (optional)
+
+ Returns:
+ update cluster response
+ """
+ if not self.allow_write:
+ error_msg = 'Write access is not enabled for this handler. Cannot update cluster.'
+ log_with_request_id(ctx, LogLevel.ERROR, error_msg)
+ return {'isError': True, 'errorMessage': error_msg}
+
+ # First try-catch: Create SageMaker client and prepare parameters
+ try:
+ sagemaker_client = self.get_sagemaker_client(
+ ctx, region_name=region_name, profile_name=profile_name
+ )
+ params = {'ClusterName': cluster_name, 'InstanceGroups': instance_groups}
+ except Exception as e:
+ error_msg = f'Failed to prepare SageMaker client or parameters: {str(e)}'
+ log_with_request_id(ctx, LogLevel.ERROR, error_msg)
+ return {'isError': True, 'errorMessage': error_msg}
+
+ # Second try-catch: Make the API call
+ try:
+ log_with_request_id(
+ ctx,
+ LogLevel.INFO,
+ f'Calling SageMaker update_cluster API with params: {params}',
+ )
+ response = sagemaker_client.update_cluster(**params)
+ log_with_request_id(
+ ctx,
+ LogLevel.INFO,
+ f'SageMaker update_cluster API response: {response}',
+ )
+ except Exception as e:
+ error_msg = f'SageMaker update_cluster API error: {str(e)}'
+ log_with_request_id(ctx, LogLevel.ERROR, error_msg)
+ return {'isError': True, 'errorMessage': error_msg}
+
+ # Log success
+ log_with_request_id(
+ ctx,
+ LogLevel.INFO,
+ f'Successfully updated SageMaker HyperPod cluster: {cluster_name}',
+ )
+
+ return response
+
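The notes above say to feed describe_hp_cluster output back through update_hp_cluster with `InstanceCount` rather than the read-only `CurrentCount`/`TargetCount` fields. A sketch of that translation; the helper name `to_update_spec` is hypothetical, and field names follow the SageMaker DescribeCluster/UpdateCluster shapes:

```python
import copy

def to_update_spec(described_group, new_count):
    """Turn one DescribeCluster instance group into an UpdateCluster spec."""
    group = copy.deepcopy(described_group)
    # UpdateCluster expects the desired size as InstanceCount, not the
    # CurrentCount/TargetCount fields that DescribeCluster reports.
    group.pop('CurrentCount', None)
    group.pop('TargetCount', None)
    group['InstanceCount'] = new_count
    return group

described = {'InstanceGroupName': 'workers', 'InstanceType': 'ml.t3.medium',
             'CurrentCount': 2, 'TargetCount': 2}
print(to_update_spec(described, 4))
```

Deep-copying keeps the original describe response intact, which matters if the caller needs to retry or diff against it.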
+ async def _describe_hp_cluster_node(
+ self,
+ ctx: Context,
+ cluster_name: str = Field(
+ ...,
+ description='The name of the cluster.',
+ ),
+ node_id: str = Field(
+ ...,
+ description='The ID of the SageMaker HyperPod cluster node.',
+ min_length=1,
+ max_length=256,
+ pattern=r'i-[a-f0-9]{8}(?:[a-f0-9]{9})?',
+ ),
+ region_name: Optional[SUPPORTED_REGIONS] = Field(
+ 'us-east-1',
+ description='AWS region name. Default is us-east-1.',
+ ),
+ profile_name: Optional[str] = Field(
+ None,
+ description='AWS profile name. If not provided, uses the default profile.',
+ ),
+ ) -> DescribeClusterNodeResponse:
+ """Describe a SageMaker HyperPod cluster node.
+
+ This tool describes a specific node in a SageMaker HyperPod cluster.
+ It returns detailed information about the node including instance group name, instance ID, instance status,
+ instance type, launch time, last software update time, and other configuration details.
+
+ ## Response Information
+ The response includes detailed information about the node including:
+ - Instance group name and ID
+ - Instance status and type
+ - Launch time and last software update time
+ - Storage configurations
+ - Network configurations
+ - Placement information
+ - And more
+
+ ## Usage Tips
+ - Use this tool to get detailed information about a specific node in a cluster
+ - You need both the cluster name and node ID to identify the node
+ - Specify region_name to describe a node in a specific region
+ - Specify profile_name to use a specific AWS profile with appropriate permissions
+ for SageMaker HyperPod operations
+
+ Args:
+ ctx: MCP context
+ cluster_name: The name of the cluster
+ node_id: The ID of the SageMaker HyperPod cluster node
+ region_name: AWS region name (default: us-east-1)
+ profile_name: AWS profile name (optional)
+
+ Returns:
+ DescribeClusterNodeResponse with node details
+ """
+ try:
+ # Get SageMaker client
+ sagemaker_client = self.get_sagemaker_client(
+ ctx, region_name=region_name, profile_name=profile_name
+ )
+
+ # Prepare parameters for describe_cluster_node API call
+ params = {'ClusterName': cluster_name, 'NodeId': node_id}
+
+ # Call SageMaker API to describe cluster node
+ log_with_request_id(
+ ctx,
+ LogLevel.INFO,
+ f'Calling SageMaker describe_cluster_node API with params: {params}',
+ )
+ try:
+ response = sagemaker_client.describe_cluster_node(**params)
+ log_with_request_id(
+ ctx, LogLevel.INFO, f'SageMaker describe_cluster_node API response: {response}'
+ )
+ except Exception as e:
+ log_with_request_id(
+ ctx, LogLevel.ERROR, f'SageMaker describe_cluster_node API error: {str(e)}'
+ )
+ raise
+
+ # Extract node details from response
+ node_details_data = response.get('NodeDetails', {})
+
+ # Extract instance status details
+ instance_status_data = node_details_data.get('InstanceStatus', {})
+ instance_status_details = ClusterInstanceStatusDetails(
+ status=instance_status_data.get(
+ 'Status', 'Pending'
+ ), # Default to Pending if not provided
+ message=instance_status_data.get('Message'),
+ )
+
+ # Process instance storage configs
+ instance_storage_configs = []
+ for storage_config in node_details_data.get('InstanceStorageConfigs', []):
+ # Process EBS volume config
+ ebs_volume_config = None
+ if 'EbsVolumeConfig' in storage_config:
+ ebs_volume_config = ClusterEbsVolumeConfig(
+ volume_size_in_gb=storage_config['EbsVolumeConfig'].get('VolumeSizeInGB')
+ )
+
+ # Create instance storage config
+ instance_storage_config = ClusterInstanceStorageConfig(
+ ebs_volume_config=ebs_volume_config
+ )
+ instance_storage_configs.append(instance_storage_config)
+
+ # Process life cycle config
+ life_cycle_config = None
+ if (
+ 'LifeCycleConfig' in node_details_data
+ and node_details_data['LifeCycleConfig'].get('OnCreate')
+ and node_details_data['LifeCycleConfig'].get('SourceS3Uri')
+ ):
+ life_cycle_config = ClusterLifeCycleConfig(
+ on_create=node_details_data['LifeCycleConfig'].get('OnCreate'),
+ source_s3_uri=node_details_data['LifeCycleConfig'].get('SourceS3Uri'),
+ )
+
+ # Process override VPC config
+ override_vpc_config = None
+ if 'OverrideVpcConfig' in node_details_data:
+ override_vpc_config = VpcConfig(
+ security_group_ids=node_details_data['OverrideVpcConfig'].get(
+ 'SecurityGroupIds'
+ ),
+ subnets=node_details_data['OverrideVpcConfig'].get('Subnets'),
+ )
+
+ # Process placement
+ placement = None
+ if 'Placement' in node_details_data:
+ placement = ClusterInstancePlacement(
+ availability_zone=node_details_data['Placement'].get('AvailabilityZone'),
+ availability_zone_id=node_details_data['Placement'].get('AvailabilityZoneId'),
+ )
+
+ # Create node details
+ node_details = ClusterNodeDetails(
+ instance_group_name=node_details_data.get('InstanceGroupName', ''),
+ instance_id=node_details_data.get('InstanceId', ''),
+ instance_status=instance_status_details,
+ instance_storage_configs=instance_storage_configs
+ if instance_storage_configs
+ else None,
+ instance_type=node_details_data.get('InstanceType', ''),
+ last_software_update_time=str(node_details_data.get('LastSoftwareUpdateTime'))
+ if node_details_data.get('LastSoftwareUpdateTime')
+ else None,
+ launch_time=str(node_details_data.get('LaunchTime'))
+ if node_details_data.get('LaunchTime')
+ else None,
+ life_cycle_config=life_cycle_config,
+ override_vpc_config=override_vpc_config,
+ placement=placement,
+ private_dns_hostname=node_details_data.get('PrivateDnsHostname'),
+ private_primary_ip=node_details_data.get('PrivatePrimaryIp'),
+ private_primary_ipv6=node_details_data.get('PrivatePrimaryIpv6'),
+ threads_per_core=node_details_data.get('ThreadsPerCore'),
+ )
+
+ # Log success
+ log_with_request_id(
+ ctx,
+ LogLevel.INFO,
+ f'Successfully described SageMaker HyperPod cluster node: {node_id}',
+ )
+
+ # Return success response
+ return DescribeClusterNodeResponse(
+ isError=False,
+ content=[
+ TextContent(
+ type='text',
+ text=f'Successfully described SageMaker HyperPod cluster node: {node_id}',
+ )
+ ],
+ node_details=node_details,
+ )
+
+ except Exception as e:
+ # Log error
+ error_msg = f'Failed to describe SageMaker HyperPod cluster node: {str(e)}'
+ log_with_request_id(ctx, LogLevel.ERROR, error_msg)
+
+ # Return error response
+ return DescribeClusterNodeResponse(
+ isError=True,
+ content=[TextContent(type='text', text=error_msg)],
+ node_details=None,
+ )
+
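The `node_id` field above constrains values with a regex. A quick sketch of what that pattern accepts, namely `i-` followed by 8 or 17 lowercase hex characters (the EC2 instance-ID shapes):

```python
import re

# Same pattern as the node_id Field above: 'i-' + 8 or 17 lowercase hex chars.
NODE_ID_RE = re.compile(r'i-[a-f0-9]{8}(?:[a-f0-9]{9})?')

def is_valid_node_id(node_id):
    return NODE_ID_RE.fullmatch(node_id) is not None

print(is_valid_node_id('i-0123abcd'))           # True (8 hex chars)
print(is_valid_node_id('i-0123456789abcdef0'))  # True (17 hex chars)
print(is_valid_node_id('i-12345'))              # False (wrong length)
```

`fullmatch` is used here because Pydantic's pattern matching anchors at the start only; the Field's additional `min_length`/`max_length` bounds do the rest.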
+ async def _list_hp_cluster_nodes(
+ self,
+ ctx: Context,
+ cluster_name: str = Field(
+ ...,
+ description='The name of the cluster.',
+ ),
+ creation_time_after: Optional[str] = Field(
+ None,
+ description='Filter for nodes created after the specified time. Accepts formats: ISO 8601 (e.g., 2014-10-01T20:30:00Z), date only (e.g., 2014-10-01), or Unix time in seconds.',
+ ),
+ creation_time_before: Optional[str] = Field(
+ None,
+ description='Filter for nodes created before the specified time. Accepts formats: ISO 8601 (e.g., 2014-10-01T20:30:00Z), date only (e.g., 2014-10-01), or Unix time in seconds.',
+ ),
+ instance_group_name_contains: Optional[str] = Field(
+ None,
+ description='Filter for nodes in instance groups whose name contains the specified string.',
+ ),
+ max_results: int = Field(
+ 10,
+ description='The maximum number of nodes to return in the response. Default: 10.',
+ ge=1,
+ le=100,
+ ),
+ next_token: Optional[str] = Field(
+ None,
+ description='If the response to a previous ListClusterNodes request was truncated, the response includes a NextToken. To retrieve the next set of nodes, use the token in the next request.',
+ ),
+ sort_by: Optional[Literal['CREATION_TIME', 'NAME']] = Field(
+ default='CREATION_TIME',
+ description='The field to sort results by. The default is CREATION_TIME.',
+ ),
+ sort_order: Optional[Literal['Ascending', 'Descending']] = Field(
+ default='Ascending',
+ description='The sort order for results. The default is Ascending.',
+ ),
+ region_name: Optional[SUPPORTED_REGIONS] = Field(
+ 'us-east-1',
+ description='AWS region name. Default is us-east-1.',
+ ),
+ profile_name: Optional[str] = Field(
+ None,
+ description='AWS profile name. If not provided, uses the default profile.',
+ ),
+ ) -> ListClusterNodesResponse:
+ """List SageMaker HyperPod cluster nodes.
+
+ This tool lists nodes in a SageMaker HyperPod cluster with options for pagination and filtering.
+ It returns information about each node including instance group name, instance ID, instance status,
+ instance type, launch time, and last software update time.
+
+ ## Response Information
+ The response includes a summary of each node with instance group name, instance ID, instance status,
+ instance type, launch time, and last software update time.
+
+ ## Usage Tips
+ - Use max_results and next_token for pagination when there are many nodes
+ - Use instance_group_name_contains to filter nodes by instance group name
+ - Use creation_time_after and creation_time_before to filter by creation time
+ - Use sort_by and sort_order to control the order of results
+ - Specify region_name to list nodes in a specific region
+ - Specify profile_name to use a specific AWS profile with appropriate permissions
+ for SageMaker HyperPod operations
+
+ Args:
+ ctx: MCP context
+ cluster_name: The name of the cluster
+ creation_time_after: Filter by creation time after as string (optional)
+ creation_time_before: Filter by creation time before as string (optional)
+ instance_group_name_contains: Filter by instance group name (optional)
+ max_results: Maximum number of nodes to return (default: 10)
+ next_token: Token for pagination (optional)
+ sort_by: Sort field (default: CREATION_TIME)
+ sort_order: Sort order (default: Ascending)
+ region_name: AWS region name (default: us-east-1)
+ profile_name: AWS profile name (optional)
+
+ Returns:
+ ListClusterNodesResponse with list of nodes
+ """
+ try:
+ # Get SageMaker client
+ sagemaker_client = self.get_sagemaker_client(
+ ctx, region_name=region_name, profile_name=profile_name
+ )
+
+ # Prepare parameters for list_cluster_nodes API call
+ params: dict[str, Any] = {'ClusterName': cluster_name}
+
+ # Add parameters only if they are provided
+ if max_results is not None:
+ params['MaxResults'] = max_results
+ if next_token is not None:
+ params['NextToken'] = next_token
+ if instance_group_name_contains is not None:
+ params['InstanceGroupNameContains'] = instance_group_name_contains
+ if creation_time_after is not None:
+ params['CreationTimeAfter'] = creation_time_after
+ if creation_time_before is not None:
+ params['CreationTimeBefore'] = creation_time_before
+ if sort_by is not None:
+ params['SortBy'] = sort_by
+ if sort_order is not None:
+ params['SortOrder'] = sort_order
+
+ # Call SageMaker API to list cluster nodes
+ log_with_request_id(
+ ctx,
+ LogLevel.INFO,
+ f'Calling SageMaker list_cluster_nodes API with params: {params}',
+ )
+ try:
+ response = sagemaker_client.list_cluster_nodes(**params)
+ log_with_request_id(
+ ctx, LogLevel.INFO, f'SageMaker list_cluster_nodes API response: {response}'
+ )
+ except Exception as e:
+ log_with_request_id(
+ ctx, LogLevel.ERROR, f'SageMaker list_cluster_nodes API error: {str(e)}'
+ )
+ raise
+
+ # Extract nodes from response
+ nodes = []
+ for node in response.get('ClusterNodeSummaries', []):
+ # Extract instance status details
+ instance_status_data = node.get('InstanceStatus', {})
+ instance_status_details = ClusterInstanceStatusDetails(
+ status=instance_status_data.get(
+ 'Status', 'Pending'
+ ), # Default to Pending if not provided
+ message=instance_status_data.get('Message'),
+ )
+
+ node_summary = ClusterNodeSummary(
+ instance_group_name=node.get('InstanceGroupName', ''),
+ instance_id=node.get('InstanceId', ''),
+ instance_status=instance_status_details,
+ instance_type=node.get('InstanceType', ''),
+ launch_time=str(node.get('LaunchTime', '')),
+ last_software_update_time=str(node.get('LastSoftwareUpdateTime', ''))
+ if node.get('LastSoftwareUpdateTime')
+ else None,
+ )
+ nodes.append(node_summary)
+
+ # Get next token for pagination
+ next_token_response = response.get('NextToken')
+
+ # Log success
+ log_with_request_id(
+ ctx,
+ LogLevel.INFO,
+ f'Successfully listed {len(nodes)} SageMaker HyperPod cluster nodes',
+ )
+
+ # Return success response
+ return ListClusterNodesResponse(
+ isError=False,
+ content=[
+ TextContent(
+ type='text',
+ text=f'Successfully listed {len(nodes)} SageMaker HyperPod cluster nodes',
+ )
+ ],
+ nodes=nodes,
+ next_token=next_token_response,
+ )
+
+ except Exception as e:
+ # Log error
+ error_msg = f'Failed to list SageMaker HyperPod cluster nodes: {str(e)}'
+ log_with_request_id(ctx, LogLevel.ERROR, error_msg)
+
+ # Return error response
+ return ListClusterNodesResponse(
+ isError=True,
+ content=[TextContent(type='text', text=error_msg)],
+ nodes=[],
+ next_token=None,
+ )
+
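Both list handlers build the boto3 kwargs by skipping `None` values, as in the if-chains above. That pattern can be sketched as one helper (the name `build_params` is illustrative, not part of the server):

```python
def build_params(required, **optional):
    """Merge required params with only the optional ones that are set."""
    params = dict(required)
    params.update({k: v for k, v in optional.items() if v is not None})
    return params

params = build_params({'ClusterName': 'my-cluster'},
                      MaxResults=10, NextToken=None, SortBy='CREATION_TIME')
print(params)  # {'ClusterName': 'my-cluster', 'MaxResults': 10, 'SortBy': 'CREATION_TIME'}
```

Dropping `None` keys matters because boto3 validates parameter types, so passing `NextToken=None` verbatim would fail parameter validation rather than being ignored.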
+ async def _update_hp_cluster_software(
+ self,
+ ctx: Context,
+ cluster_name: str = Field(
+ ...,
+ description='The name or ARN of the SageMaker HyperPod cluster to update for security patching.',
+ min_length=0,
+ max_length=256,
+ pattern=r'(arn:aws[a-z\-]*:sagemaker:[a-z0-9\-]*:[0-9]{12}:cluster/[a-z0-9]{12})|([a-zA-Z0-9](-*[a-zA-Z0-9]){0,62})',
+ ),
+ deployment_config: Optional[DeploymentConfiguration] = Field(
+ None,
+ description='The configuration to use when updating the AMI versions.',
+ ),
+ instance_groups: Optional[List[UpdateClusterSoftwareInstanceGroupSpecification]] = Field(
+ None,
+ description='The array of instance groups for which to update AMI versions.',
+ ),
+ region_name: Optional[SUPPORTED_REGIONS] = Field(
+ 'us-east-1',
+ description='AWS region name. Default is us-east-1.',
+ ),
+ profile_name: Optional[str] = Field(
+ None,
+ description='AWS profile name. If not provided, uses the default profile.',
+ ),
+ ) -> UpdateClusterSoftwareResponse:
+ """Update the software for a SageMaker HyperPod cluster.
+
+        This tool initiates a software update for a SageMaker HyperPod cluster:
+        all nodes by default, or only the specified instance groups.
+
+ ## Response Information
+ The response includes the ARN of the cluster being updated.
+
+ ## Usage Tips
+ - Use this tool to update the software on all nodes in a SageMaker HyperPod cluster
+ - Specify instance_groups to update only specific instance groups in the cluster
+ - Configure deployment_config to control how the update is performed:
+ - Use auto_rollback_configuration to specify alarms that trigger rollback
+ - Use rolling_update_policy to control batch sizes during updates
+ - Use wait_interval_in_seconds to control the wait time between updates
+ - The update process may take some time to complete
+ - You can check the status of the update using the list_hp_cluster_nodes tool
+ - Specify region_name to update a cluster in a specific region
+ - Specify profile_name to use a specific AWS profile with appropriate permissions
+ for SageMaker HyperPod operations
+
+ Args:
+ ctx: MCP context
+ cluster_name: The name or ARN of the cluster to update
+ deployment_config: Configuration for the update process (optional)
+ instance_groups: Specific instance groups to update (optional)
+ region_name: AWS region name (default: us-east-1)
+ profile_name: AWS profile name (optional)
+
+ Returns:
+ UpdateClusterSoftwareResponse with cluster ARN
+ """
+ try:
+ # Get SageMaker client
+ sagemaker_client = self.get_sagemaker_client(
+ ctx, region_name=region_name, profile_name=profile_name
+ )
+
+ # Prepare parameters for update_cluster_software API call
+ params: dict[str, Any] = {'ClusterName': cluster_name}
+
+ # Add deployment configuration if provided
+ if deployment_config:
+ deployment_config_dict: dict[str, Any] = {}
+
+ # Add auto rollback configuration if provided
+ if deployment_config.auto_rollback_configuration:
+ auto_rollback_config = []
+ for alarm in deployment_config.auto_rollback_configuration:
+ auto_rollback_config.append({'AlarmName': alarm.alarm_name})
+ if auto_rollback_config:
+ deployment_config_dict['AutoRollbackConfiguration'] = auto_rollback_config
+
+ # Add rolling update policy if provided
+ if deployment_config.rolling_update_policy:
+ rolling_update_policy = {}
+
+ # Add maximum batch size if provided
+ if deployment_config.rolling_update_policy.maximum_batch_size:
+ maximum_batch_size = {
+ 'Type': deployment_config.rolling_update_policy.maximum_batch_size.type,
+ 'Value': deployment_config.rolling_update_policy.maximum_batch_size.value,
+ }
+ rolling_update_policy['MaximumBatchSize'] = maximum_batch_size
+
+ # Add rollback maximum batch size if provided
+ if deployment_config.rolling_update_policy.rollback_maximum_batch_size:
+ rollback_maximum_batch_size = {
+ 'Type': deployment_config.rolling_update_policy.rollback_maximum_batch_size.type,
+ 'Value': deployment_config.rolling_update_policy.rollback_maximum_batch_size.value,
+ }
+ rolling_update_policy['RollbackMaximumBatchSize'] = (
+ rollback_maximum_batch_size
+ )
+
+ if rolling_update_policy:
+ deployment_config_dict['RollingUpdatePolicy'] = rolling_update_policy
+
+ # Add wait interval in seconds if provided
+ if deployment_config.wait_interval_in_seconds is not None:
+ deployment_config_dict['WaitIntervalInSeconds'] = (
+ deployment_config.wait_interval_in_seconds
+ )
+
+ # Add deployment config to params if not empty
+ if deployment_config_dict:
+ params['DeploymentConfig'] = deployment_config_dict
+
+ # Add instance groups if provided
+ if instance_groups:
+ instance_groups_list = []
+ for group in instance_groups:
+ instance_groups_list.append({'InstanceGroupName': group.instance_group_name})
+ if instance_groups_list:
+ params['InstanceGroups'] = instance_groups_list
+
+ # Call SageMaker API to update cluster software
+ log_with_request_id(
+ ctx,
+ LogLevel.INFO,
+ f'Calling SageMaker update_cluster_software API with params: {params}',
+ )
+ try:
+ response = sagemaker_client.update_cluster_software(**params)
+ log_with_request_id(
+ ctx,
+ LogLevel.INFO,
+ f'SageMaker update_cluster_software API response: {response}',
+ )
+ except Exception as e:
+ log_with_request_id(
+ ctx, LogLevel.ERROR, f'SageMaker update_cluster_software API error: {str(e)}'
+ )
+ raise
+
+ # Extract cluster ARN from response
+ cluster_arn = response.get('ClusterArn', '')
+
+ # Log success
+ log_with_request_id(
+ ctx,
+ LogLevel.INFO,
+ f'Successfully initiated software update for SageMaker HyperPod cluster: {cluster_name}',
+ )
+
+ # Return success response
+ return UpdateClusterSoftwareResponse(
+ isError=False,
+ content=[
+ TextContent(
+ type='text',
+ text=f'Successfully initiated software update for SageMaker HyperPod cluster: {cluster_name}',
+ )
+ ],
+ cluster_arn=cluster_arn,
+ )
+
+ except Exception as e:
+ # Log error
+ error_msg = f'Failed to update software for SageMaker HyperPod cluster: {str(e)}'
+ log_with_request_id(ctx, LogLevel.ERROR, error_msg)
+
+ # Return error response
+ return UpdateClusterSoftwareResponse(
+ isError=True,
+ content=[TextContent(type='text', text=error_msg)],
+ cluster_arn='',
+ )
+
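The method above translates snake_case model fields into the PascalCase request shape that boto3's `update_cluster_software` call expects, omitting empty sub-structures so the API never receives an empty `DeploymentConfig`. A standalone sketch of a fully populated request (cluster name, alarm name, group name, and values are illustrative, not taken from a real cluster):

```python
# Illustrative only: the request dict assembled by the handler above for
# SageMaker's update_cluster_software API. All names and values are made up.
params = {
    'ClusterName': 'my-hyperpod-cluster',
    'DeploymentConfig': {
        # Alarms that, when triggered, roll the update back.
        'AutoRollbackConfiguration': [{'AlarmName': 'hp-update-errors'}],
        # Batch sizing for the rolling update and for rollback.
        'RollingUpdatePolicy': {
            'MaximumBatchSize': {'Type': 'INSTANCE_COUNT', 'Value': 2},
            'RollbackMaximumBatchSize': {'Type': 'INSTANCE_COUNT', 'Value': 1},
        },
        # Pause between batches.
        'WaitIntervalInSeconds': 300,
    },
    # Restrict the update to specific instance groups.
    'InstanceGroups': [{'InstanceGroupName': 'worker-group-1'}],
}
```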
+ async def _batch_delete_hp_cluster_nodes(
+ self,
+ ctx: Context,
+ cluster_name: str = Field(
+ ...,
+ description='The name of the cluster.',
+ min_length=0,
+ max_length=256,
+ pattern=r'(arn:aws[a-z\-]*:sagemaker:[a-z0-9\-]*:[0-9]{12}:cluster/[a-z0-9]{12})|([a-zA-Z0-9](-*[a-zA-Z0-9]){0,62})',
+ ),
+ node_ids: List[str] = Field(
+ ...,
+ description='The list of node IDs to delete from the cluster.',
+ min_length=1,
+ max_length=99,
+ ),
+ region_name: Optional[SUPPORTED_REGIONS] = Field(
+ 'us-east-1',
+ description='AWS region name. Default is us-east-1.',
+ ),
+ profile_name: Optional[str] = Field(
+ None,
+ description='AWS profile name. If not provided, uses the default profile.',
+ ),
+ ) -> BatchDeleteClusterNodesResponse:
+ """Delete multiple nodes from a SageMaker HyperPod cluster.
+
+ This tool deletes multiple nodes from a SageMaker HyperPod cluster in a single operation.
+ It returns information about the deleted nodes and any failures that occurred during deletion.
+
+ ## Response Information
+ The response includes the cluster name, a list of successfully deleted node IDs,
+ and details about any failed node deletions.
+
+ ## Note
+ - For SageMaker HyperPod clusters using the Slurm workload manager, you cannot remove instances that are
+ configured as Slurm controller nodes.
+ - If you need to delete more than 99 instances, contact Support for assistance.
+
+ ## Usage Tips
+ - Use this tool to delete multiple nodes from a cluster in a single operation
+ - You can delete up to 99 nodes in a single request
+ - If some node deletions fail, the response will include details about the failures
+ - Specify region_name to delete nodes in a specific region
+ - Specify profile_name to use a specific AWS profile with appropriate permissions
+ for SageMaker HyperPod operations
+
+ Args:
+ ctx: MCP context
+ cluster_name: The name of the cluster
+ node_ids: List of node IDs to delete from the cluster
+ region_name: AWS region name (default: us-east-1)
+ profile_name: AWS profile name (optional)
+
+ Returns:
+ BatchDeleteClusterNodesResponse with details of the deletion operation
+ """
+ try:
+ # Get SageMaker client
+ sagemaker_client = self.get_sagemaker_client(
+ ctx, region_name=region_name, profile_name=profile_name
+ )
+
+ # Prepare parameters for batch_delete_cluster_nodes API call
+ params = {'ClusterName': cluster_name, 'NodeIds': node_ids}
+
+ # Call SageMaker API to batch delete cluster nodes
+ log_with_request_id(
+ ctx,
+ LogLevel.INFO,
+ f'Calling SageMaker batch_delete_cluster_nodes API with params: {params}',
+ )
+ try:
+ response = sagemaker_client.batch_delete_cluster_nodes(**params)
+ log_with_request_id(
+ ctx,
+ LogLevel.INFO,
+ f'SageMaker batch_delete_cluster_nodes API response: {response}',
+ )
+ except Exception as e:
+ log_with_request_id(
+ ctx,
+ LogLevel.ERROR,
+ f'SageMaker batch_delete_cluster_nodes API error: {str(e)}',
+ )
+ raise
+
+ # Extract successful and failed deletions from response
+ successful_node_ids = response.get('Successful', [])
+ failed_deletions = response.get('Failed', [])
+
+ # Convert failed deletions to BatchDeleteClusterNodesError objects
+ failed_deletions_list = []
+ for failure in failed_deletions:
+ failed_deletions_list.append(
+ BatchDeleteClusterNodesError(
+ code=failure.get('Code', ''),
+ message=failure.get('Message', ''),
+ node_id=failure.get('NodeId', ''),
+ )
+ )
+
+ # Log success
+ log_with_request_id(
+ ctx,
+ LogLevel.INFO,
+ f'Successfully deleted {len(successful_node_ids)} nodes from SageMaker HyperPod cluster: {cluster_name}',
+ )
+ if failed_deletions_list:
+ log_with_request_id(
+ ctx,
+ LogLevel.WARNING,
+ f'Failed to delete {len(failed_deletions_list)} nodes from SageMaker HyperPod cluster: {cluster_name}',
+ )
+
+ # Return success response
+ return BatchDeleteClusterNodesResponse(
+ isError=False,
+ content=[
+ TextContent(
+ type='text',
+ text=f'Successfully deleted {len(successful_node_ids)} nodes from SageMaker HyperPod cluster: {cluster_name}. Failed deletions: {len(failed_deletions_list)}',
+ )
+ ],
+ cluster_name=cluster_name,
+ successful=successful_node_ids,
+ failed=failed_deletions_list if failed_deletions_list else None,
+ )
+
+ except Exception as e:
+ # Log error
+ error_msg = f'Failed to delete nodes from SageMaker HyperPod cluster: {str(e)}'
+ log_with_request_id(ctx, LogLevel.ERROR, error_msg)
+
+ # Return error response
+ return BatchDeleteClusterNodesResponse(
+ isError=True,
+ content=[TextContent(type='text', text=error_msg)],
+ cluster_name=cluster_name,
+ successful=[],
+ failed=None,
+ )
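`batch_delete_cluster_nodes` reports partial failures in the response body rather than raising, which is why the handler above reads the `Successful` and `Failed` lists separately. A minimal sketch of that parsing (the response literal, node IDs, and error code are invented for illustration):

```python
# Illustrative only: the partial-failure shape of a batch_delete_cluster_nodes
# response. Node IDs and the error code below are made up for this sketch.
response = {
    'Successful': ['i-0123456789abcdef0', 'i-0fedcba9876543210'],
    'Failed': [
        {
            'Code': 'NodeIdNotFound',
            'Message': 'Node not found in cluster.',
            'NodeId': 'i-00000000000000000',
        }
    ],
}

# Mirror the handler: successes are plain IDs, failures carry code + message.
succeeded = response.get('Successful', [])
failed = [(f.get('NodeId', ''), f.get('Code', '')) for f in response.get('Failed', [])]
```

Because a single request can partially succeed, callers should always inspect both lists instead of treating a non-exception return as full success.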
diff --git a/src/sagemaker-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/hyperpod_stack_handler.py b/src/sagemaker-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/hyperpod_stack_handler.py
new file mode 100644
index 0000000000..172e1dab48
--- /dev/null
+++ b/src/sagemaker-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/hyperpod_stack_handler.py
@@ -0,0 +1,702 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""HyperPod stack handler for the HyperPod MCP Server."""
+
+import json
+import yaml # type: ignore
+from awslabs.sagemaker_hyperpod_mcp_server.aws_helper import AwsHelper
+from awslabs.sagemaker_hyperpod_mcp_server.consts import (
+ CAPABILITY_AUTO_EXPAND,
+ CFN_CAPABILITY_IAM,
+ CFN_CAPABILITY_NAMED_IAM,
+ CFN_ON_FAILURE_DELETE,
+ CFN_STACK_TAG_KEY,
+ CFN_STACK_TAG_VALUE,
+ CLUSTER_ORCHESTRATORS,
+ HYPERPOD_CFN_TEMPLATE_URL_EKS,
+ HYPERPOD_CFN_TEMPLATE_URL_SLURM,
+ STACK_DELETE_OPERATION,
+ STACK_DEPLOY_OPERATION,
+ STACK_DESCRIBE_OPERATION,
+ STACK_NOT_OWNED_ERROR_TEMPLATE,
+ STACK_OPERATIONS,
+ SUPPORTED_REGIONS,
+)
+from awslabs.sagemaker_hyperpod_mcp_server.logging_helper import LogLevel, log_with_request_id
+from awslabs.sagemaker_hyperpod_mcp_server.models import (
+ DeleteStackResponse,
+ DeployStackResponse,
+ DescribeStackResponse,
+)
+from mcp.server.fastmcp import Context
+from mcp.types import TextContent
+from pydantic import Field, validate_call
+from typing import Dict, List, Optional, Tuple, Union
+from yaml.loader import SafeLoader # type: ignore
+
+
+# Custom YAML loader for CloudFormation templates
+class CloudFormationLoader(SafeLoader):
+ """Custom YAML loader that handles CloudFormation intrinsic functions."""
+
+
+# Add constructors for CloudFormation intrinsic functions
+def construct_cfn_tag(loader, tag_suffix, node):
+ """Generic constructor for CloudFormation intrinsic functions."""
+ if isinstance(node, yaml.ScalarNode):
+ return {tag_suffix: loader.construct_scalar(node)}
+ elif isinstance(node, yaml.SequenceNode):
+ return {tag_suffix: loader.construct_sequence(node)}
+ elif isinstance(node, yaml.MappingNode):
+ return {tag_suffix: loader.construct_mapping(node)}
+ else:
+ return None
+
+
+# Register constructors for common CloudFormation intrinsic functions
+for tag in [
+ 'Ref',
+ 'Condition',
+ 'GetAtt',
+ 'Equals',
+ 'If',
+ 'Not',
+ 'And',
+ 'Or',
+ 'FindInMap',
+ 'Base64',
+ 'Join',
+ 'Sub',
+ 'Select',
+ 'Split',
+ 'ImportValue',
+ 'GetAZs',
+ 'Transform',
+ 'ForEach',
+]:
+ CloudFormationLoader.add_constructor(
+ f'!{tag}', lambda loader, node, tag=tag: construct_cfn_tag(loader, tag, node)
+ )
+
+
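The registered constructors turn short-form intrinsic tags (`!Ref`, `!Sub`, ...) into plain dicts so CloudFormation templates parse under a safe loader instead of raising on unknown tags. A self-contained sketch of the same technique, using a minimal loader and a hypothetical template:

```python
# Minimal sketch of the loader technique above: map CloudFormation short-form
# tags onto plain dicts so SafeLoader-based parsing succeeds. The template is
# a made-up example, not one of the HyperPod templates.
import yaml
from yaml import SafeLoader


class CfnLoader(SafeLoader):
    """Stand-in for the CloudFormationLoader defined above."""


def construct_cfn_tag(loader, tag_suffix, node):
    if isinstance(node, yaml.ScalarNode):
        return {tag_suffix: loader.construct_scalar(node)}
    if isinstance(node, yaml.SequenceNode):
        return {tag_suffix: loader.construct_sequence(node)}
    return {tag_suffix: loader.construct_mapping(node)}


for tag in ['Ref', 'Sub', 'GetAtt']:
    CfnLoader.add_constructor(
        f'!{tag}', lambda loader, node, tag=tag: construct_cfn_tag(loader, tag, node)
    )

TEMPLATE = """
Resources:
  Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub '${AWS::StackName}-artifacts'
Outputs:
  BucketRef:
    Value: !Ref Bucket
"""
doc = yaml.load(TEMPLATE, Loader=CfnLoader)
```

Note this intentionally does not evaluate the intrinsics; it only preserves them as data (e.g. `!Ref Bucket` becomes `{'Ref': 'Bucket'}`), which is all a template-inspection path needs.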
+class HyperPodStackHandler:
+ """Handler for Amazon HyperPod CloudFormation stack operations.
+
+ This class provides tools for creating, managing, and deleting CloudFormation
+ stacks for HyperPod clusters.
+ """
+
+ def __init__(self, mcp, allow_write: bool = False):
+ """Initialize the HyperPod stack handler.
+
+ Args:
+ mcp: The MCP server instance
+ allow_write: Whether to enable write access (default: False)
+ """
+ self.mcp = mcp
+ self.allow_write = allow_write
+
+ # Register tools
+ self.mcp.tool(name='manage_hyperpod_stacks')(self.manage_hyperpod_stacks)
+
+ @validate_call
+ def _ensure_stack_ownership(
+ self,
+ ctx: Context,
+ stack_name: str,
+ region_name: SUPPORTED_REGIONS,
+ operation: str,
+ ) -> Tuple[bool, Optional[Dict], Optional[str]]:
+ """Ensure that a stack exists and was created by this tool.
+
+ Args:
+ ctx: The MCP context
+ stack_name: Name of the stack to verify
+ region_name: region to perform the API call in
+ operation: Operation being performed (for error messages)
+
+ Returns:
+ Tuple of (success, stack_details, error_message)
+ - success: True if the stack exists and was created by this tool
+ - stack_details: Stack details if the stack exists, None otherwise
+ - error_message: Error message if the stack doesn't exist or wasn't created by this tool, None if successful
+ """
+ try:
+ # Create CloudFormation client
+ cfn_client = AwsHelper.create_boto3_client('cloudformation', region_name)
+
+ # Get stack details
+ stack_details = cfn_client.describe_stacks(StackName=stack_name)
+ stack = stack_details['Stacks'][0]
+
+ # Verify the stack was created by our tool
+ tags = stack.get('Tags', [])
+ is_our_stack = False
+ for tag in tags:
+ if tag.get('Key') == CFN_STACK_TAG_KEY and tag.get('Value') == CFN_STACK_TAG_VALUE:
+ is_our_stack = True
+ break
+
+ if not is_our_stack:
+ error_message = STACK_NOT_OWNED_ERROR_TEMPLATE.format(
+ stack_name=stack_name, tool_name=CFN_STACK_TAG_VALUE, operation=operation
+ )
+ log_with_request_id(ctx, LogLevel.ERROR, error_message)
+ return False, stack, error_message
+
+ return True, stack, None
+ except Exception as e:
+ if 'does not exist' in str(e):
+ error_message = f'Stack {stack_name} not found or cannot be accessed: {str(e)}'
+ else:
+ error_message = f'Error verifying stack ownership: {str(e)}'
+
+ log_with_request_id(ctx, LogLevel.ERROR, error_message)
+ return False, None, error_message
+
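The ownership guard above reduces to a tag-membership test over the `describe_stacks` result. A standalone sketch of that check, where the tag key and value are hypothetical stand-ins for `CFN_STACK_TAG_KEY` / `CFN_STACK_TAG_VALUE`:

```python
# Minimal sketch of the stack-ownership check above: a stack counts as "ours"
# only if its tags contain the tool's marker tag. The key/value here are
# assumptions standing in for CFN_STACK_TAG_KEY / CFN_STACK_TAG_VALUE.
TAG_KEY, TAG_VALUE = 'CreatedBy', 'hyperpod-mcp-server'


def is_our_stack(stack: dict) -> bool:
    """Return True if a describe_stacks entry carries the marker tag."""
    return any(
        tag.get('Key') == TAG_KEY and tag.get('Value') == TAG_VALUE
        for tag in stack.get('Tags', [])
    )
```

A stack created outside the tool simply lacks the tag, so destructive operations are refused before any mutating CloudFormation call is made.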
+ @validate_call
+ async def manage_hyperpod_stacks(
+ self,
+ ctx: Context,
+ operation: STACK_OPERATIONS = Field(
+ description='Operation to perform: deploy, describe, or delete. Choose "describe" for read-only operations when write access is disabled.',
+ ),
+ region_name: SUPPORTED_REGIONS = Field(
+ description='AWS region name. Default is us-east-1.',
+ ),
+ stack_name: str = Field(
+ description='Name of the CloudFormation stack (for deploy, describe and delete operations).',
+ ),
+ cluster_orchestrator: CLUSTER_ORCHESTRATORS = Field(
+ 'eks',
+ description='Cluster orchestrator type. Must be either "eks" or "slurm". Default is "eks".',
+ ),
+ params_file: Optional[str] = Field(
+ None,
+            description="""Absolute path to the CloudFormation template parameters file (for deploy operations).
+ IMPORTANT: Assistant must provide the full absolute path to the template file, as the MCP client and server might not run from the same location.""",
+ ),
+ profile_name: Optional[str] = Field(
+ None,
+ description='AWS profile name. If not provided, uses the default profile.',
+ ),
+ ) -> Union[
+ 'DeployStackResponse',
+ 'DescribeStackResponse',
+ 'DeleteStackResponse',
+ ]:
+ r"""Manage SageMaker HyperPod Cluster through CloudFormation stacks.
+
+        This tool provides operations for managing HyperPod CloudFormation stacks, including creating parameters for the CloudFormation template,
+        deploying stacks, retrieving HyperPod stack and deployment information, and deleting HyperPod stacks. It serves as the primary
+        mechanism for creating and managing HyperPod clusters through CloudFormation, enabling standardized
+        cluster creation, configuration updates, and resource cleanup.
+
+ ## Notes
+        - Tell the user about the working directory, which is the current directory. The tool will use this directory to store all required files for the user.
+        - After asking a question, do NOT do anything until you have the user's response; do NOT run manage_hyperpod_stacks yet.
+ - Use this tool instead of direct AWS CLI commands for creating and managing HyperPod resources.
+ - Use this tool's standardized parameters for creating HyperPod clusters with proper configuration.
+ - DO NOT create HyperPod clusters by generating CloudFormation templates from scratch.
+        - When the user asks to create a HyperPod cluster, NEVER ask to check which HyperPod clusters the user currently has.
+        - CRITICAL: when the user asks to delete a HyperPod cluster, NEVER ask how the cluster was created; just proceed with the 'delete' operation. The corresponding CloudFormation stack name should be in this format: "-stack". If no such stack exists, the HyperPod cluster might not have been created via the MCP tools here.
+
+ ## Parameter Collection Process
+ IMPORTANT: ALWAYS first ask for ALL operation-specific REQUIRED parameters from the user BEFORE making any tool calls. NEVER assume or generate parameter values.
+ IMPORTANT: ALWAYS ask one question at a time.
+
+ For 'deploy' operation:
+        - region_name: REQUIRED - ask the user for the region of deployment. Limit user input to the major US regions. Ensure this argument matches the AvailabilityZoneIds parameter key.
+ - allow following selections:
+ • us-east-1 (N. Virginia)
+ • us-east-2 (Ohio)
+ • us-west-1 (N. California)
+ • us-west-2 (Oregon)
+        - stack_name: REQUIRED - generate a stack name and present it to the user. It should be in this format: "-stack".
+        - cluster_orchestrator: REQUIRED - ask the user to specify "eks" or "slurm".
+        - params_file: REQUIRED - the parameters file should follow the format below. Ask the user to customize the parameters marked as "" one by one. At the end, ask the user if they want to add an additional instance group.
+          - when cluster_orchestrator is "slurm", each InstanceGroupSettings ParameterValue should also include an InstanceGroupType of value Compute, Controller, or Login; place it right after InstanceType. At least 1 Controller and 1 Compute node group are required. ONLY 1 Controller and 1 Login group are allowed across ALL specified InstanceGroupSettings.
+          - when asking questions regarding InstanceGroupSettings, ask the user for both the number of instances and the instance type at the same time. Parameters file naming format: "-params.json"
+ [
+ {
+ "ParameterKey": "HyperPodClusterName",
+            "ParameterValue": "" // can be any natural English words, fewer than 32 characters, no spaces. Default should be sagemaker-hyperpod
+ },
+ {
+ "ParameterKey": "ResourceNamePrefix",
+ "ParameterValue": "hp--" // Generate a new 6-digit hash randomly and silently, do NOT use execute_bash commands.
+ },
+ {
+            "ParameterKey": "Stage", // do not reveal this to the user, but always include this key and value.
+ "ParameterValue": "prod"
+ },
+ {
+ "ParameterKey": "AvailabilityZoneIds",
+ "ParameterValue": "