feat: Amazon SageMaker HyperPod MCP Server #1208
… time, add params files to gitignore
src/sagemaker-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/hyperpod_kb_handler.py
Codecov Report
@@            Coverage Diff             @@
##             main    #1208      +/-   ##
==========================================
- Coverage   89.44%   89.41%   -0.04%
==========================================
  Files         724      732       +8
  Lines       50959    51642     +683
  Branches     8144     8246     +102
==========================================
+ Hits        45581    46175     +594
- Misses       3467     3508      +41
- Partials     1911     1959      +48
This pull request is now marked as stale because it hasn't seen activity for a while. Add a comment or it will be closed soon. If you wish to exclude this issue from being marked as stale, add the "backlog" label.
Review ongoing.
We need to update the CODEOWNERS file with the GitHub usernames of the people who own this MCP server.
_instance = None

# Client cache with AWS service name as key
_client_cache: Dict[str, Any] = {}
For the cache, could we improve it by adding TTL-based expiration and a maximum size limit?
Also, what happens if we change credentials? Is the cache invalidated?
Will revise in the next commit. Note that changing credentials might not invalidate the cache until the TTL expires.
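To make the suggestion concrete, here is a minimal sketch of a client cache with TTL-based expiration and a maximum size, using only the standard library. `TTLClientCache`, `client_cache_key`, the default sizes, and the idea of keying on `(service, region, profile)` are all assumptions for illustration; none of them come from the PR.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional, Tuple


class TTLClientCache:
    """Hypothetical LRU cache whose entries also expire after ttl_seconds."""

    def __init__(
        self,
        max_size: int = 16,
        ttl_seconds: float = 900.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: 'OrderedDict[Hashable, Tuple[float, Any]]' = OrderedDict()

    def get_or_create(self, key: Hashable, factory: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            self._entries.move_to_end(key)  # mark as most recently used
            return hit[1]
        value = factory()  # missing or expired: build a fresh client
        self._entries[key] = (now, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return value

    def clear(self) -> None:
        """Drop every cached client, e.g. after credentials are rotated."""
        self._entries.clear()


def client_cache_key(service: str, region: Optional[str], profile: Optional[str]) -> Tuple:
    """Assumed key shape: a different profile or region never reuses a client."""
    return (service, region, profile)
```

With this shape, switching to another profile produces a different key and therefore a new client, while credentials rotated under the same profile stay cached until the TTL expires or `clear()` is called, which matches the caveat in the reply above.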
src/sagemaker-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/hyperpod_kb_handler.py
...r-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/hyperpod_cluster_node_handler.py
This file is huge. Is there any way to break it down into smaller ones?
For now, we are trying to keep the functionality in one place.
...r-hyperpod-mcp-server/awslabs/sagemaker_hyperpod_mcp_server/hyperpod_cluster_node_handler.py
return AwsHelper.create_boto3_client('sagemaker', region_name=region_name)

@validate_call
async def manage_hyperpod_cluster_nodes(
This function signature is pretty complex. Is there any way to make it easier for the client?
The tool consolidates multiple HyperPod operations and serves as the single entry point for cluster node management, so refactoring it might not be trivial.
from awslabs.sagemaker_hyperpod_mcp_server.hyperpod_stack_handler import HyperPodStackHandler
from loguru import logger
from mcp.server.fastmcp import FastMCP
We should configure loguru as per the guidelines:
import os
import sys

logger.remove()
logger.add(sys.stderr, level=os.getenv('FASTMCP_LOG_LEVEL', 'WARNING'))
Will revise in the next commit.
Union[ListClustersResponse, ListClusterNodesResponse, DescribeClusterNodeResponse, UpdateClusterSoftwareResponse, BatchDeleteClusterNodesResponse]:
Response specific to the operation performed
"""
try:
Not a blocker now, but I think we could improve input validation in general. For instance, we could have a pydantic model with enhanced validation methods in a separate file and use it. It could look like:
from awslabs.sagemaker_hyperpod_mcp_server.validators import InputParameterValidation, InputValidators

async def manage_hyperpod_cluster_nodes(
    self,
    ctx: Context,
    operation: NODE_OPERATIONS = Field(...),
    cluster_name: Optional[str] = Field(None, ...),
    node_id: Optional[str] = Field(None, ...),
    node_ids: Optional[List[str]] = Field(None, ...),
    # ... other parameters
) -> Union[ListClustersResponse, ...]:
    """Manage SageMaker HyperPod clusters."""
    try:
        validation_data = InputParameterValidation(
            cluster_name=cluster_name,
            node_id=node_id,
            node_ids=node_ids,
            operation=operation,
            region_name=region_name,
            profile_name=profile_name
        )
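As a rough illustration of what such a validation model might check, the sketch below enforces per-operation required fields. It deliberately swaps in a standard-library dataclass for the pydantic model proposed above so that it runs without dependencies. The operation names, the `REQUIRED_FIELDS` table, and the `NodeRequest` class are hypothetical and are not taken from the PR.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical operation names and the fields each one is assumed to require.
REQUIRED_FIELDS: Dict[str, Tuple[str, ...]] = {
    'list_clusters': (),
    'list_nodes': ('cluster_name',),
    'describe_node': ('cluster_name', 'node_id'),
    'update_software': ('cluster_name',),
    'batch_delete': ('cluster_name', 'node_ids'),
}


@dataclass(frozen=True)
class NodeRequest:
    """Validated inputs for one node-management call (illustrative only)."""

    operation: str
    cluster_name: Optional[str] = None
    node_id: Optional[str] = None
    node_ids: Optional[List[str]] = None

    def __post_init__(self) -> None:
        # Reject operations the tool does not know about.
        if self.operation not in REQUIRED_FIELDS:
            raise ValueError(f'unknown operation: {self.operation!r}')
        # Treat None, empty strings, and empty lists as missing.
        missing = [
            name for name in REQUIRED_FIELDS[self.operation] if not getattr(self, name)
        ]
        if missing:
            raise ValueError(f'{self.operation} requires: {", ".join(missing)}')
```

Constructing `NodeRequest(operation='batch_delete', cluster_name='my-cluster')` would raise a `ValueError` that names `node_ids`, so the handler can fail before making any AWS call.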
Related, it would be great to create a standardized error response system
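One possible shape for a standardized error response is sketched below. The `ErrorResponse` fields, the error codes, and the `to_error_response` helper are assumptions made for illustration; the PR does not define them.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class ErrorResponse:
    """Hypothetical uniform error payload returned by every tool."""

    code: str
    message: str
    details: Dict[str, Any] = field(default_factory=dict)
    is_error: bool = True


def to_error_response(exc: Exception, operation: str) -> Dict[str, Any]:
    """Map an exception to the uniform payload using assumed error codes."""
    if isinstance(exc, ValueError):
        code = 'INVALID_INPUT'
    elif isinstance(exc, PermissionError):
        code = 'ACCESS_DENIED'
    else:
        code = 'INTERNAL_ERROR'
    response = ErrorResponse(code=code, message=str(exc), details={'operation': operation})
    return asdict(response)
```

A handler could wrap its body in a single `try`/`except Exception as exc` and return `to_error_response(exc, operation)`, so clients always see the same keys regardless of which operation failed.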
This pull request is now marked as stale because it hasn't seen activity for a while. Add a comment or it will be closed soon. If you wish to exclude this issue from being marked as stale, add the "backlog" label.
Closing this pull request as it hasn't seen activity for a while. Please add a comment @mentioning a maintainer to reopen. If you wish to exclude this issue from being marked as stale, add the "backlog" label.
Where is this file?
Fixes
Summary
Changes
The Amazon SageMaker HyperPod MCP Server enables AI code assistants to interact with Amazon SageMaker HyperPod clusters through natural language. It provides tools that streamline work with HyperPod clusters, from setup workflows to ongoing management. It offers a secure interface for interacting with clusters that use the same managed CloudFormation templates as the SageMaker HyperPod console UI (which have already been reviewed and approved) and for managing cluster nodes.
The MCP server implements two core tools: one for CloudFormation-based cluster deployment and management, and one for node operations (listing, describing, updating software, and batch deletion). The server follows security-first principles: it runs in read-only mode by default, requires explicit flags for write operations, and integrates with standard AWS authentication.
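The read-only default described above could be enforced along the following lines. This is a minimal sketch, not the server's actual code: the `--allow-write` flag name, the `WRITE_OPERATIONS` set, and the `ensure_allowed` helper are assumptions for illustration.

```python
import argparse
from typing import List, Optional, Set

# Hypothetical names for the operations that mutate cluster state.
WRITE_OPERATIONS: Set[str] = {'update_software', 'batch_delete'}


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse server flags; write access stays off unless explicitly requested."""
    parser = argparse.ArgumentParser(description='HyperPod MCP server (sketch)')
    parser.add_argument(
        '--allow-write',
        action='store_true',
        help='Permit operations that modify clusters (assumed flag name).',
    )
    return parser.parse_args(argv)


def ensure_allowed(operation: str, allow_write: bool) -> None:
    """Raise before any AWS call if a write is attempted in read-only mode."""
    if operation in WRITE_OPERATIONS and not allow_write:
        raise PermissionError(
            f'{operation} modifies the cluster; restart the server with --allow-write'
        )
```

Checking the flag at the top of each tool handler keeps the guard in one place and makes the read-only default hard to bypass by accident.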
User experience
HyperPod Cluster Setup
HyperPod Cluster Management
HyperPod Cluster Node Management
With the HyperPod MCP Server, users can have conversations like:
User: "Create a new HyperPod cluster with 2 ml.m5.xlarge instances"
AI Assistant: "I'll help you set up a HyperPod cluster deployment. Let me start by asking you some questions to configure your cluster properly."
User: "Add 2 more instances to my HyperPod cluster instance group 1"
AI Assistant: "I'll help you add 2 more instances to instance group 1 (ig1) in your HyperPod cluster. This will increase the target count from 1 to 3 instances."
User: "Update my cluster software"
AI Assistant: "I'll help you update the software for your HyperPod cluster. This will update the AMI versions across all nodes in the cluster."
Checklist
If your change doesn't seem to apply, please leave them unchecked.
Is this a breaking change? (Y/N) N
RFC issue number: #1182
Checklist:
Acknowledgment
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.