Conversation

@jade710 jade710 commented Aug 28, 2025

Fixes

Summary

Changes

The Amazon SageMaker HyperPod MCP Server lets AI code assistants interact with Amazon SageMaker HyperPod clusters through natural language. It provides tools that streamline work with HyperPod clusters, from setup workflows to ongoing management. It offers a secure interface for deploying and managing clusters with the same managed CloudFormation templates that the SageMaker HyperPod console UI uses (which have already been reviewed and approved), and for managing cluster nodes.

The MCP server implements 2 core tools: one for CloudFormation-based cluster deployment and management, and one for node operations, including listing, describing, updating software, and batch deletion. The server follows security-first principles: it runs in read-only mode by default, requires explicit flags for write operations, and integrates with AWS authentication.
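
A minimal sketch of how a read-only default with an explicit write flag could be enforced. This is not the server's actual code: the `allow_write` setting, the operation names, and `ensure_allowed` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical operation names; the real tools may use different identifiers.
READ_OPERATIONS = frozenset({"list_clusters", "list_nodes", "describe_node"})
WRITE_OPERATIONS = frozenset({"update_software", "batch_delete"})


@dataclass(frozen=True)
class ServerConfig:
    """Runtime settings; writes stay disabled unless explicitly enabled."""

    allow_write: bool = False


def ensure_allowed(operation: str, config: ServerConfig) -> None:
    """Raise if the operation is unknown, or mutates state while writes are disabled."""
    if operation in READ_OPERATIONS:
        return
    if operation not in WRITE_OPERATIONS:
        raise ValueError(f"Unknown operation: {operation}")
    if not config.allow_write:
        raise PermissionError(
            f"Operation '{operation}' requires write access, "
            "but the server is running in read-only mode."
        )
```

Calling the guard at the top of every tool handler keeps the check in one place, so a new write operation cannot skip it by accident.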

User experience

HyperPod Cluster Setup

  • Provide a cluster quick-setup experience through AI agents that is consistent with the Amazon SageMaker HyperPod console
  • Use the same managed CloudFormation templates as the Amazon SageMaker HyperPod console UI for consistent and approved deployments
  • Allow users to specify customized instance groups, resource locations, and naming

HyperPod Cluster Management

  • Interact with HyperPod clusters and their dependent infrastructure resources via CloudFormation stacks
  • Enable AI agents to assist with scaling HyperPod cluster instance groups per user needs
  • Help with stack deletion workflows with proper resource cleanup to avoid orphaned resources

HyperPod Cluster Node Management

  • Support listing all nodes in a HyperPod cluster and providing node-specific details
  • Update software across all nodes or specific instance groups to maintain security and performance
  • Batch delete nodes that are no longer needed while preserving critical data

With the HyperPod MCP Server, users can have conversations like:

User: "Create a new HyperPod cluster with 2 ml.m5.xlarge instances"
AI Assistant: "I'll help you set up a HyperPod cluster deployment. Let me start by asking you some questions to configure your cluster properly."

User: "Add 2 more instances to my HyperPod cluster instance group 1"
AI Assistant: "I'll help you add 2 more instances to instance group 1 (ig1) in your HyperPod cluster. This will increase the target count from 1 to 3 instances."

User: "Update my cluster software"
AI Assistant: "I'll help you update the software for your HyperPod cluster. This will update the AMI versions across all nodes in the cluster."

Checklist

If an item doesn't apply to your change, please leave it unchecked.

  • I have reviewed the contributing guidelines
  • I have performed a self-review of this change
  • Changes have been tested
  • Changes are documented

Is this a breaking change? (Y/N) N

RFC issue number: #1182

Checklist:

  • Migration process documented
  • Implement warnings (if it can live side by side)

Acknowledgment

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.

codecov bot commented Aug 28, 2025

Codecov Report

❌ Patch coverage is 86.96925% with 89 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.41%. Comparing base (5cf3b2c) to head (9e96583).
⚠️ Report is 5 commits behind head on main.

Files with missing lines                                Patch %   Lines
...perpod_mcp_server/hyperpod_cluster_node_handler.py   78.79%    21 Missing and 39 partials ⚠️
...aker_hyperpod_mcp_server/hyperpod_stack_handler.py   89.88%    12 Missing and 5 partials ⚠️
...wslabs/sagemaker_hyperpod_mcp_server/aws_helper.py   81.66%    8 Missing and 3 partials ⚠️
...bs/sagemaker_hyperpod_mcp_server/logging_helper.py   95.45%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1208      +/-   ##
==========================================
- Coverage   89.44%   89.41%   -0.04%     
==========================================
  Files         724      732       +8     
  Lines       50959    51642     +683     
  Branches     8144     8246     +102     
==========================================
+ Hits        45581    46175     +594     
- Misses       3467     3508      +41     
- Partials     1911     1959      +48     
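
The patch coverage figure can be reproduced from the numbers in this report. The sketch below assumes that the 683 added lines in the diff are the patch lines Codecov measured and that partially covered lines count as not covered. Both assumptions are consistent with the reported totals but are not stated in the report itself.

```python
# Per-file (missing, partial) counts copied from the "Files with missing lines" table.
uncovered_by_file = {
    "hyperpod_cluster_node_handler.py": (21, 39),
    "hyperpod_stack_handler.py": (12, 5),
    "aws_helper.py": (8, 3),
    "logging_helper.py": (0, 1),
}
lines_added = 683  # "Lines ... +683" in the coverage diff

missing = sum(m for m, _ in uncovered_by_file.values())
partials = sum(p for _, p in uncovered_by_file.values())
uncovered = missing + partials  # assumption: a partial line counts as not covered
patch_coverage = 100 * (lines_added - uncovered) / lines_added

print(missing, partials, uncovered)  # 41 48 89
print(f"{patch_coverage:.5f}%")      # 86.96925%
```

The per-file sums (41 missing, 48 partials) also match the "Misses +41" and "Partials +48" rows of the diff.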

@scottschreckengaust scottschreckengaust self-assigned this Sep 3, 2025
@github-actions
Contributor

This pull request is now marked as stale because it hasn't seen activity for a while. Add a comment or it will be closed soon. If you wish to exclude this issue from being marked as stale, add the "backlog" label.

@github-actions github-actions bot added the "stale" label (These are items that have been around for a long time without progress) Sep 19, 2025
Author

jade710 commented Sep 19, 2025

review ongoing

@github-actions github-actions bot removed the "stale" label Sep 20, 2025
Contributor

krokoko commented Sep 24, 2025

We need to update the CODEOWNERS file with the GitHub usernames of the people who own this MCP server.

_instance = None

# Client cache with AWS service name as key
_client_cache: Dict[str, Any] = {}
Contributor

For the cache, could we improve it by adding TTL-based expiration and a maximum size limit?
Also, what happens if we change credentials? Is the cache invalidated?

Author

Will revise in the next commit. Note that changing credentials might not invalidate the cache immediately; stale entries could remain until the TTL expires.
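
One possible shape for the cache discussed in this thread, offered as a sketch and not as the planned revision. `TTLClientCache`, `cache_key`, and the default TTL and size are hypothetical. Including the region and profile in the key means that switching profiles produces a new client instead of reusing a stale one. Credentials rotated under the same profile would still need an explicit `clear()` or the TTL to elapse.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional, Tuple


class TTLClientCache:
    """Small LRU cache whose entries expire ttl_seconds after creation."""

    def __init__(
        self,
        ttl_seconds: float = 900.0,
        max_size: int = 16,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._max_size = max_size
        self._clock = clock  # injectable so tests can control time
        self._entries: "OrderedDict[Hashable, Tuple[float, Any]]" = OrderedDict()

    def get_or_create(self, key: Hashable, factory: Callable[[], Any]) -> Any:
        """Return the cached value for key, or build and store a new one."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            self._entries.move_to_end(key)  # mark as most recently used
            return entry[1]
        value = factory()  # missing or expired: build a fresh client
        self._entries[key] = (now, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return value

    def clear(self) -> None:
        """Drop every entry, for example after credentials change."""
        self._entries.clear()


def cache_key(service: str, region: Optional[str], profile: Optional[str]) -> Tuple:
    """Key on everything that selects a distinct client, not just the service name."""
    return (service, region, profile)
```

In the helper, the factory passed to `get_or_create` would be the call that builds the boto3 client for that service, region, and profile.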

Contributor

This file is huge. Is there any way to break it down into smaller ones?

Author

For now, we are trying to keep the functionality in one place.

return AwsHelper.create_boto3_client('sagemaker', region_name=region_name)

@validate_call
async def manage_hyperpod_cluster_nodes(
Contributor

This function signature is pretty complex. Is there any way to make it easier for the client?

Author

The tool consolidates multiple HyperPod operations and serves as the entry point for cluster node management, so refactoring it might not be trivial.
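
One way to keep a single entry point while shrinking what each code path has to handle is a dispatch table of small handlers. The sketch below is an illustration and not the PR's implementation: the operation names, `HANDLERS`, and `dispatch` are hypothetical. The client methods and their `ClusterName`, `NodeId`, and `NodeIds` parameters are the SageMaker API operations for cluster nodes as exposed by boto3.

```python
from typing import Any, Callable, Dict, List


# Each handler takes the SageMaker client plus only the arguments its operation needs.
def _list_nodes(client: Any, cluster_name: str) -> Dict[str, Any]:
    return client.list_cluster_nodes(ClusterName=cluster_name)


def _describe_node(client: Any, cluster_name: str, node_id: str) -> Dict[str, Any]:
    return client.describe_cluster_node(ClusterName=cluster_name, NodeId=node_id)


def _update_software(client: Any, cluster_name: str) -> Dict[str, Any]:
    return client.update_cluster_software(ClusterName=cluster_name)


def _batch_delete(client: Any, cluster_name: str, node_ids: List[str]) -> Dict[str, Any]:
    return client.batch_delete_cluster_nodes(ClusterName=cluster_name, NodeIds=node_ids)


# Hypothetical operation names mapped to their handlers.
HANDLERS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "list_nodes": _list_nodes,
    "describe_node": _describe_node,
    "update_software": _update_software,
    "batch_delete": _batch_delete,
}


def dispatch(client: Any, operation: str, **params: Any) -> Dict[str, Any]:
    """Route an operation to its handler, ignoring parameters left as None."""
    try:
        handler = HANDLERS[operation]
    except KeyError:
        raise ValueError(f"Unsupported operation: {operation}") from None
    supplied = {name: value for name, value in params.items() if value is not None}
    # A parameter the handler does not accept raises TypeError, which surfaces misuse early.
    return handler(client, **supplied)
```

The public tool can keep its current signature and forward to `dispatch`, so the client-facing contract does not change while each operation becomes testable on its own.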

from awslabs.sagemaker_hyperpod_mcp_server.hyperpod_stack_handler import HyperPodStackHandler
from loguru import logger
from mcp.server.fastmcp import FastMCP

Contributor

We should configure loguru as per the guidelines:

import os
import sys
logger.remove()  # drop loguru's default handler
logger.add(sys.stderr, level=os.getenv('FASTMCP_LOG_LEVEL', 'WARNING'))  # log to stderr at the configured level

Author

Will revise in the next commit.

Union[ListClustersResponse, ListClusterNodesResponse, DescribeClusterNodeResponse, UpdateClusterSoftwareResponse, BatchDeleteClusterNodesResponse]:
Response specific to the operation performed
"""
try:
Contributor

Not a blocker now, but I think we could improve input validation in general. For instance, we could have a pydantic model with enhanced validation methods in a separate file and use it. It could look like:

from awslabs.sagemaker_hyperpod_mcp_server.validators import InputParameterValidation, InputValidators

async def manage_hyperpod_cluster_nodes(
    self,
    ctx: Context,
    operation: NODE_OPERATIONS = Field(...),
    cluster_name: Optional[str] = Field(None, ...),
    node_id: Optional[str] = Field(None, ...),
    node_ids: Optional[List[str]] = Field(None, ...),
    # ... other parameters
) -> Union[ListClustersResponse, ...]:
    """Manage SageMaker HyperPod clusters."""
    
    try:
        validation_data = InputParameterValidation(
            cluster_name=cluster_name,
            node_id=node_id,
            node_ids=node_ids,
            operation=operation,
            region_name=region_name,
            profile_name=profile_name
        )
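
The `InputParameterValidation` model referenced above is not shown in this comment. As a rough sketch of the kind of cross-field rules it could hold, the example below uses a standard-library dataclass instead of Pydantic so that it runs without dependencies. The class name `NodeOperationInput`, the operation names, and the cluster name pattern are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed naming rule: alphanumerics and hyphens, starting and ending with an
# alphanumeric character, at most 63 characters.
_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}")

# Hypothetical operation names that act on a specific cluster.
_REQUIRES_CLUSTER = frozenset(
    {"list_nodes", "describe_node", "update_software", "batch_delete"}
)


@dataclass(frozen=True)
class NodeOperationInput:
    """Validated parameters for one node management call."""

    operation: str
    cluster_name: Optional[str] = None
    node_id: Optional[str] = None
    node_ids: Optional[List[str]] = None

    def __post_init__(self) -> None:
        # Collect every problem so the caller sees all of them at once.
        errors = []
        if self.operation in _REQUIRES_CLUSTER:
            if not self.cluster_name:
                errors.append(f"cluster_name is required for {self.operation}")
            elif not _NAME_PATTERN.fullmatch(self.cluster_name):
                errors.append("cluster_name has an invalid format")
        if self.operation == "describe_node" and not self.node_id:
            errors.append("node_id is required for describe_node")
        if self.operation == "batch_delete" and not self.node_ids:
            errors.append("node_ids must be a non-empty list for batch_delete")
        if errors:
            raise ValueError("; ".join(errors))
```

With Pydantic, the same checks would typically move into a model-level validator, and the tool body would only construct the model and catch the validation error.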

Contributor

Relatedly, it would be great to create a standardized error response system.
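
A standardized error response could be as small as one shared structure plus a mapping from exception types to stable codes. The sketch below is only a possible starting point: `ErrorResponse`, `to_error_response`, and the code strings are hypothetical and not part of the PR.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class ErrorResponse:
    """Uniform error payload returned by every tool."""

    code: str
    message: str
    operation: str
    details: Dict[str, Any] = field(default_factory=dict)
    is_error: bool = True

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)


# Hypothetical mapping from exception types to stable, documented codes.
_CODES = {
    ValueError: "INVALID_INPUT",
    PermissionError: "WRITE_NOT_ALLOWED",
    TimeoutError: "TIMEOUT",
}


def to_error_response(operation: str, exc: Exception) -> ErrorResponse:
    """Convert an exception into the shared error shape."""
    for exc_type, code in _CODES.items():
        if isinstance(exc, exc_type):
            return ErrorResponse(code=code, message=str(exc), operation=operation)
    # Anything unmapped gets a generic code plus the exception type for debugging.
    return ErrorResponse(
        code="INTERNAL_ERROR",
        message=str(exc),
        operation=operation,
        details={"exception_type": type(exc).__name__},
    )
```

Each tool would then wrap its body in one `try`/`except` and return `to_error_response(...)`, so clients can branch on `code` instead of parsing messages.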

Contributor

github-actions bot commented Oct 9, 2025

This pull request is now marked as stale because it hasn't seen activity for a while. Add a comment or it will be closed soon. If you wish to exclude this issue from being marked as stale, add the "backlog" label.

@github-actions github-actions bot added the "stale" label Oct 9, 2025
@github-actions
Contributor

Closing this pull request as it hasn't seen activity for a while. Please add a comment @mentioning a maintainer to reopen. If you wish to exclude this issue from being marked as stale, add the "backlog" label.

@github-actions github-actions bot closed this Oct 12, 2025
@github-project-automation github-project-automation bot moved this from To triage to Done in awslabs/mcp Project Oct 12, 2025
@krokoko krokoko reopened this Oct 17, 2025
Author

jade710 commented Oct 17, 2025

> Need to update the codeowners file with github usernames of people owning this MCP server

Where is this file?

@github-actions github-actions bot removed the "stale" label Oct 18, 2025
@krokoko krokoko added the "new mcp server" label (A new MCP server ideally linked to an issue) Oct 22, 2025
Labels

new mcp server: A new MCP server ideally linked to an issue

Projects

Status: Done

3 participants