feat: Amazon SageMaker HyperPod MCP Server #1208

jade710 · 2025-08-28T21:26:57Z

Fixes

Summary

Changes

The Amazon SageMaker HyperPod MCP Server lets AI code assistants work with AWS SageMaker HyperPod clusters through natural language. It provides tools that streamline interactions with HyperPod clusters, from setup workflows to ongoing management. It offers a secure interface for deploying and managing clusters with the same managed CloudFormation templates used by the AWS SageMaker HyperPod console UI (which have already been reviewed and approved), and for managing cluster nodes.

The MCP server implements 2 core tools: one for CloudFormation-based cluster deployment and management, and one for node operations, including listing, describing, updating software, and batch deletion. The server follows security-first principles: it runs in read-only mode by default, requires explicit flags for write operations, and integrates with standard AWS authentication.
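The read-only default described above can be illustrated with a minimal sketch. The `--allow-write` flag name and the operation names below are assumptions made for illustration, not necessarily what this server uses.

```python
import argparse
from typing import Sequence

# Hypothetical operation names, for illustration only.
READ_ONLY_OPERATIONS = {"list_clusters", "list_nodes", "describe_node"}
WRITE_OPERATIONS = {"update_software", "batch_delete"}


def parse_args(argv: Sequence[str]) -> argparse.Namespace:
    """Parse server flags; write access stays off unless explicitly requested."""
    parser = argparse.ArgumentParser()
    # Assumed flag name; the real server's flag may differ.
    parser.add_argument("--allow-write", action="store_true", default=False)
    return parser.parse_args(argv)


def check_operation_allowed(operation: str, allow_write: bool) -> None:
    """Raise PermissionError when a mutating operation is requested in read-only mode."""
    if operation in READ_ONLY_OPERATIONS:
        return
    if operation in WRITE_OPERATIONS:
        if not allow_write:
            raise PermissionError(f"'{operation}' requires the --allow-write flag")
        return
    raise ValueError(f"Unknown operation: {operation}")
```

Checking the flag in one place before any AWS call keeps the guard easy to audit and to test without credentials.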

User experience

HyperPod Cluster Setup

Provide a cluster quick-setup experience through AI agents that is consistent with the AWS SageMaker HyperPod console
Use the same managed CloudFormation templates as the AWS SageMaker HyperPod console UI for consistent, approved deployments
Allow users to specify custom instance groups, resource locations, and naming

HyperPod Cluster Management

Interact with HyperPod clusters and their dependent infrastructure resources via CloudFormation stacks
Enable AI agents to assist with scaling HyperPod cluster instance groups per user needs
Help with stack deletion workflows with proper resource cleanup to avoid orphaned resources

HyperPod Cluster Node Management

Support listing all nodes in a HyperPod cluster and providing node-specific details
Update software across all nodes or specific instance groups to maintain security and performance
Batch delete nodes that are no longer needed while preserving critical data

With the HyperPod MCP Server, users can have conversations like:

User: "Create a new HyperPod cluster with 2 ml.m5.xlarge instances"
AI Assistant: "I'll help you set up a HyperPod cluster deployment. Let me start by asking you some questions to configure your cluster properly."

User: "Add 2 more instances to my HyperPod cluster instance group 1"
AI Assistant: "I'll help you add 2 more instances to instance group 1 (ig1) in your HyperPod cluster. This will increase the target count from 1 to 3 instances."

User: "Update my cluster software"
AI Assistant: "I'll help you update the software for your HyperPod cluster. This will update the AMI versions across all nodes in the cluster."

Checklist

If your change doesn't seem to apply, please leave them unchecked.

I have reviewed the contributing guidelines
I have performed a self-review of this change
Changes have been tested
Changes are documented

Is this a breaking change? (Y/N) N

RFC issue number: #1182

Checklist:

Migration process documented
Implement warnings (if it can live side by side)

Acknowledgment

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.

codecov · 2025-08-28T21:27:46Z

Codecov Report

❌ Patch coverage is 86.96925% with 89 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.41%. Comparing base (5cf3b2c) to head (9e96583).
⚠️ Report is 5 commits behind head on main.

Files with missing lines	Patch %	Lines
...perpod_mcp_server/hyperpod_cluster_node_handler.py	78.79%	21 Missing and 39 partials ⚠️
...aker_hyperpod_mcp_server/hyperpod_stack_handler.py	89.88%	12 Missing and 5 partials ⚠️
...wslabs/sagemaker_hyperpod_mcp_server/aws_helper.py	81.66%	8 Missing and 3 partials ⚠️
...bs/sagemaker_hyperpod_mcp_server/logging_helper.py	95.45%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1208      +/-   ##
==========================================
- Coverage   89.44%   89.41%   -0.04%     
==========================================
  Files         724      732       +8     
  Lines       50959    51642     +683     
  Branches     8144     8246     +102     
==========================================
+ Hits        45581    46175     +594     
- Misses       3467     3508      +41     
- Partials     1911     1959      +48

github-actions · 2025-09-19T01:34:11Z

This pull request is now marked as stale because it hasn't seen activity for a while. Add a comment or it will be closed soon. If you wish to exclude this issue from being marked as stale, add the "backlog" label.

jade710 · 2025-09-19T16:32:05Z

review ongoing

krokoko · 2025-09-24T19:54:13Z

Need to update the codeowners file with github usernames of people owning this MCP server

src/sagemaker-hyperpod-mcp-server/README.md

krokoko · 2025-09-24T20:16:36Z

src/sagemaker-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/aws_helper.py

+    _instance = None
+
+    # Client cache with AWS service name as key
+    _client_cache: Dict[str, Any] = {}


For the cache, could we improve it by adding TTL-based expiration and a maximum size limit?
Also, what happens if we change credentials? Is the cache invalidated?

Will revise in the next commit. Note that changing credentials might not invalidate the cache immediately; stale clients could persist until the TTL expires.
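A minimal sketch of what a TTL-bounded, size-limited client cache could look like follows. The class name, defaults, and key layout are assumptions for illustration, not the code in `aws_helper.py`. Keying entries on service, region, and profile means a profile switch gets a fresh client, while credentials rotated under the same profile are only picked up once the TTL expires, which matches the caveat above.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Tuple


class TTLClientCache:
    """Hypothetical cache with TTL expiry and least-recently-used eviction."""

    def __init__(
        self,
        ttl_seconds: float = 900.0,
        max_size: int = 16,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._max_size = max_size
        self._clock = clock
        # Maps key -> (creation time, client); order tracks recency of use.
        self._entries: "OrderedDict[Hashable, Tuple[float, Any]]" = OrderedDict()

    def __len__(self) -> int:
        return len(self._entries)

    def get_or_create(self, key: Hashable, factory: Callable[[], Any]) -> Any:
        """Return a cached client, or build one if missing or expired."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None:
            created_at, client = entry
            if now - created_at < self._ttl:
                self._entries.move_to_end(key)  # mark as recently used
                return client
            del self._entries[key]  # expired
        client = factory()
        self._entries[key] = (now, client)
        # Evict least recently used entries beyond the size limit.
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)
        return client

    def clear(self) -> None:
        """Drop all clients, e.g. after an explicit credential change."""
        self._entries.clear()
```

A caller might use a key such as `("sagemaker", region_name, profile_name)` and pass a factory that builds the boto3 client; calling `clear()` gives an explicit way to invalidate the cache when credentials are known to have changed.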

krokoko · 2025-09-24T20:19:18Z

...r-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/hyperpod_cluster_node_handler.py

This file is huge. Is there any way to break it down into smaller ones?

For now, we are trying to keep the functionality in one place.

krokoko · 2025-09-24T20:20:52Z

...r-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/hyperpod_cluster_node_handler.py

+        return AwsHelper.create_boto3_client('sagemaker', region_name=region_name)
+
+    @validate_call
+    async def manage_hyperpod_cluster_nodes(


This function signature is pretty complex. Is there any way to make it easier for the client?

The tool consolidates multiple HyperPod operations and serves as the entry point for cluster node management, so refactoring it might not be trivial.
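One possible middle ground, sketched here under assumptions, keeps a single entry point but moves the per-operation logic into a dispatch table. `NodeRequest`, `HANDLERS`, `dispatch`, and the operation names are hypothetical; the client methods are the boto3 SageMaker calls for these operations, and the client is passed in so the sketch runs without AWS access.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class NodeRequest:
    """Hypothetical request object grouping the tool's optional parameters."""

    operation: str
    cluster_name: Optional[str] = None
    node_id: Optional[str] = None
    node_ids: Optional[List[str]] = None


def _list_nodes(client: Any, req: NodeRequest) -> Any:
    return client.list_cluster_nodes(ClusterName=req.cluster_name)


def _describe_node(client: Any, req: NodeRequest) -> Any:
    return client.describe_cluster_node(ClusterName=req.cluster_name, NodeId=req.node_id)


def _update_software(client: Any, req: NodeRequest) -> Any:
    return client.update_cluster_software(ClusterName=req.cluster_name)


def _batch_delete(client: Any, req: NodeRequest) -> Any:
    return client.batch_delete_cluster_nodes(
        ClusterName=req.cluster_name, NodeIds=req.node_ids
    )


Handler = Callable[[Any, NodeRequest], Any]

# Operation name -> (required request fields, handler). Names are illustrative.
HANDLERS: Dict[str, Tuple[Tuple[str, ...], Handler]] = {
    "list_nodes": (("cluster_name",), _list_nodes),
    "describe_node": (("cluster_name", "node_id"), _describe_node),
    "update_software": (("cluster_name",), _update_software),
    "batch_delete": (("cluster_name", "node_ids"), _batch_delete),
}


def dispatch(client: Any, req: NodeRequest) -> Any:
    """Check the fields an operation needs, then call its handler."""
    if req.operation not in HANDLERS:
        raise ValueError(f"Unsupported operation: {req.operation}")
    required, handler = HANDLERS[req.operation]
    missing = [name for name in required if not getattr(req, name)]
    if missing:
        raise ValueError(f"{req.operation} requires: {', '.join(missing)}")
    return handler(client, req)
```

The tool's public signature can stay as it is for MCP clients; the table only makes each operation's required parameters explicit and lets handlers move to separate modules later.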

krokoko · 2025-09-24T20:23:26Z

src/sagemaker-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/server.py

+from awslabs.sagemaker_hyperpod_mcp_server.hyperpod_stack_handler import HyperPodStackHandler
+from loguru import logger
+from mcp.server.fastmcp import FastMCP
+


We should configure loguru as per the guidelines:

logger.remove()
logger.add(sys.stderr, level=os.getenv('FASTMCP_LOG_LEVEL', 'WARNING'))

will revise in next commit

krokoko · 2025-09-24T20:36:54Z

...r-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/hyperpod_cluster_node_handler.py

+            Union[ListClustersResponse, ListClusterNodesResponse, DescribeClusterNodeResponse, UpdateClusterSoftwareResponse, BatchDeleteClusterNodesResponse]:
+            Response specific to the operation performed
+        """
+        try:


Not a blocker now, but I think we could improve input validation in general. For instance, we could have a pydantic model with enhanced validation methods in a separate file and use it. It could look like:

from awslabs.sagemaker_hyperpod_mcp_server.validators import InputParameterValidation, InputValidators

async def manage_hyperpod_cluster_nodes(
    self,
    ctx: Context,
    operation: NODE_OPERATIONS = Field(...),
    cluster_name: Optional[str] = Field(None, ...),
    node_id: Optional[str] = Field(None, ...),
    node_ids: Optional[List[str]] = Field(None, ...),
    # ... other parameters
) -> Union[ListClustersResponse, ...]:
    """Manage SageMaker HyperPod clusters."""
    try:
        validation_data = InputParameterValidation(
            cluster_name=cluster_name,
            node_id=node_id,
            node_ids=node_ids,
            operation=operation,
            region_name=region_name,
            profile_name=profile_name
        )

Relatedly, it would be great to create a standardized error response system.
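To make the two suggestions concrete, here is a minimal sketch that pairs input validation with a single error shape. It uses standard-library dataclasses rather than pydantic so that it runs without extra dependencies; `ErrorResponse`, `validate_inputs`, the operation names, and the cluster-name pattern are all assumptions for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# Assumed naming rule for illustration only; not taken from the SageMaker API.
_CLUSTER_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9-]{0,62}$")


@dataclass
class ErrorResponse:
    """Hypothetical standardized error payload shared by all tools."""

    error_code: str
    message: str
    details: Dict[str, Any] = field(default_factory=dict)
    is_error: bool = True


def validate_inputs(
    operation: str,
    cluster_name: Optional[str] = None,
    node_id: Optional[str] = None,
    node_ids: Optional[List[str]] = None,
) -> Optional[ErrorResponse]:
    """Return an ErrorResponse describing every problem, or None if inputs are valid."""
    problems: Dict[str, str] = {}
    if operation != "list_clusters":
        if not cluster_name:
            problems["cluster_name"] = "required for this operation"
        elif not _CLUSTER_NAME.match(cluster_name):
            problems["cluster_name"] = "contains unsupported characters"
    if operation == "describe_node" and not node_id:
        problems["node_id"] = "required for describe_node"
    if operation == "batch_delete" and not node_ids:
        problems["node_ids"] = "must contain at least one node ID"
    if problems:
        return ErrorResponse(
            error_code="ValidationError",
            message=f"Invalid input for '{operation}'",
            details=problems,
        )
    return None
```

Collecting all problems before returning lets the assistant report every invalid parameter in one round trip instead of failing on the first.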

github-actions · 2025-10-09T01:34:12Z

This pull request is now marked as stale because it hasn't seen activity for a while. Add a comment or it will be closed soon. If you wish to exclude this issue from being marked as stale, add the "backlog" label.

github-actions · 2025-10-12T01:33:49Z

Closing this pull request as it hasn't seen activity for a while. Please add a comment @mentioning a maintainer to reopen. If you wish to exclude this issue from being marked as stale, add the "backlog" label.

jade710 · 2025-10-17T18:59:18Z

Need to update the codeowners file with github usernames of people owning this MCP server

where is this file?

Yunlin Qi and others added 10 commits August 27, 2025 10:04

feat: hyperpod mcp server full implementation

98c2acb

docs: add documentation page for hyperpod mcp server

aaeca16

Merge branch 'awslabs:main' into main

a7b1d1e

chore: use sequential try-catch blocks for update_hp_cluster tool

67d6262

chore: revise stack handler fallback instructions

00c50ee

chore: update deployment time to better reflect real-world deployment time, add params files to gitignore

f7240a6

chore: remove docker support for now

e53d2ca

chore: update server namespace to sagemaker-hyperpod-mcp-server

93b74c4

chore: update main README for new sagemaker-hyperpod-mcp-server

aa76239

chore: add instruction for param file naming

b8cecad

jade710 requested review from a team as code owners August 28, 2025 21:26

github-project-automation bot added this to awslabs/mcp Project Aug 28, 2025

github-project-automation bot moved this to To triage in awslabs/mcp Project Aug 28, 2025

github-advanced-security bot found potential problems Aug 28, 2025

src/sagemaker-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/hyperpod_kb_handler.py (fixed)

jade710 and others added 5 commits August 29, 2025 10:52

Merge branch 'awslabs:main' into main

c8f417e

fix: address pyright build issues

f4b6862

fix: update uv version to 0.8.10

c6b24e3

fix: more fixing for pyright build

0a2248b

fix: code cleanups for pyright build

8a75220

scottschreckengaust self-assigned this Sep 3, 2025

chore: revise README for functionalities

923949e

github-actions bot added the "stale" label Sep 19, 2025

Merge branch 'main' into main

465f970

chore: revise readme

74ab7a5

github-actions bot removed the "stale" label Sep 20, 2025

anitalewis mentioned this pull request Sep 22, 2025

RFC: Add New AWS SageMaker HyperPod MCP Server #1182

Open

Merge branch 'main' into main

42fda28

krokoko reviewed Sep 24, 2025

src/sagemaker-hyperpod-mcp-server/README.md (resolved)

krokoko reviewed Sep 24, 2025

src/sagemaker-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/hyperpod_kb_handler.py (outdated, resolved)

krokoko reviewed Sep 24, 2025

...r-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/hyperpod_cluster_node_handler.py (outdated, resolved)

krokoko reviewed Sep 24, 2025

...r-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/hyperpod_cluster_node_handler.py (outdated, resolved)

krokoko reviewed Sep 24, 2025

github-actions bot added the "stale" label Oct 9, 2025

github-actions bot closed this Oct 12, 2025

github-project-automation bot moved this from To triage to Done in awslabs/mcp Project Oct 12, 2025

jade710 and others added 3 commits October 17, 2025 09:08

Merge branch 'awslabs:main' into main

7bf5ff9

feat: add Slurm orchestrator option in create cluster

ff84be4

chore: remove knowledge base tool for now

164c4a4

krokoko reopened this Oct 17, 2025

Yunlin Qi and others added 3 commits October 17, 2025 11:03

chore: revise unit tests accordingly

f3ef3c9

chore: address review comments

5a280c7

Merge branch 'awslabs:main' into main

9e96583

github-actions bot removed the "stale" label Oct 18, 2025

krokoko added the "new mcp server" label Oct 22, 2025
