Description
Is this related to an existing feature request or issue?
No
Summary
The Amazon SageMaker HyperPod MCP Server enables AI code assistants to interact with Amazon SageMaker HyperPod clusters through natural language. It provides tools that streamline work with HyperPod clusters, from assisting with setup workflows to ongoing management. It offers a secure interface for deploying and managing clusters with the same managed CloudFormation templates that the SageMaker HyperPod console UI uses (which have already been reviewed and approved), and for managing cluster nodes.
The MCP server implements 2 core tools that cover CloudFormation-based cluster deployment and management and comprehensive node operations, including listing, describing, updating software, and batch deletion. The server follows security-first principles: it runs in read-only mode by default, requires explicit flags for write operations, and integrates with standard AWS authentication.
Use case
HyperPod Cluster Setup
- Provide a quick setup experience for EKS-orchestrated clusters through AI agents, consistent with the Amazon SageMaker HyperPod console
- Utilize the same managed CloudFormation templates used by the Amazon SageMaker HyperPod console UI for consistent and approved deployments
- Allow users to specify customized instance groups, resource locations, and naming
HyperPod Cluster Management
- Interact with HyperPod clusters and their dependent infrastructure resources via CloudFormation stacks
- Enable AI agents to assist with scaling HyperPod cluster instance groups per user needs
- Help with stack deletion workflows with proper resource cleanup to avoid orphaned resources
HyperPod Cluster Node Management
- Support listing all nodes in a HyperPod cluster and providing node-specific details
- Update software across all nodes or specific instance groups to maintain security and performance
- Batch delete nodes that are no longer needed while preserving critical data
With the HyperPod MCP Server, users can have conversations like:
User: "Create a new HyperPod cluster with 2 ml.m5.xlarge instances"
AI Assistant: "I'll help you set up a HyperPod cluster deployment. Let me start by asking you some questions to configure your cluster properly."
User: "Add 2 more instances to my HyperPod cluster instance group 1"
AI Assistant: "I'll help you add 2 more instances to instance group 1 (ig1) in your HyperPod cluster. This will increase the target count from 1 to 3 instances."
User: "Update my cluster software"
AI Assistant: "I'll help you update the software for your HyperPod cluster. This will update the AMI versions across all nodes in the cluster."
Proposal
What We're Building
A Python-based MCP server that exposes 2 comprehensive tools for AI assistants:
HyperPod Stack Management Tools
manage_hyperpod_stacks
- Provides an interface to CloudFormation stacks for HyperPod clusters, with operations for initiating deployments, describing stacks, and deleting stacks. It supports parameter overrides for tailored cluster configurations and leverages the same managed CloudFormation templates used by the HyperPod console UI.
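To make the parameter-override idea concrete, here is a minimal sketch of how user overrides might be merged into a managed template's defaults. The function name `build_stack_parameters` is hypothetical and not part of the proposal; the only thing taken from AWS is the `ParameterKey`/`ParameterValue` list shape that CloudFormation's `CreateStack` API accepts. Rejecting unknown keys is an assumption about desirable behavior, not something the proposal specifies.

```python
from typing import Dict, List


def build_stack_parameters(
    defaults: Dict[str, str], overrides: Dict[str, str]
) -> List[Dict[str, str]]:
    """Merge user overrides into template defaults.

    Returns the list-of-dicts shape that CloudFormation expects for the
    `Parameters` argument of CreateStack.
    """
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        # Assumption: keys the managed template does not declare are
        # rejected rather than silently passed through.
        raise ValueError(f"Unknown parameter(s): {', '.join(unknown)}")

    # Overrides win over defaults; sorting keeps the output deterministic.
    merged = {**defaults, **overrides}
    return [
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in sorted(merged.items())
    ]
```

A caller would pass the result as `Parameters=...` when creating the stack. The actual parameter names come from the managed template and are not listed in this proposal.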
HyperPod Cluster Node Tools
manage_hyperpod_cluster_nodes
- Provides comprehensive node management capabilities including listing clusters, listing nodes, describing specific nodes, updating cluster software, and batch deleting nodes
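The node operations above, combined with the read-only default described in the summary, could be dispatched roughly as follows. This is a sketch under stated assumptions: the operation names (`list_nodes`, `batch_delete`, and so on) and the function `manage_cluster_nodes` are hypothetical, and the client is passed in so the sketch does not depend on any SDK being installed. The client method names are intended to mirror the boto3 SageMaker client (`list_clusters`, `list_cluster_nodes`, `describe_cluster_node`, `update_cluster_software`, `batch_delete_cluster_nodes`) and should be checked against the SDK version in use.

```python
from typing import Any, Dict, List, Optional

# Hypothetical operation names; the proposal does not define them.
READ_OPERATIONS = {"list_clusters", "list_nodes", "describe_node"}
WRITE_OPERATIONS = {"update_software", "batch_delete"}


def manage_cluster_nodes(
    client: Any,
    operation: str,
    allow_write: bool = False,
    cluster_name: Optional[str] = None,
    node_ids: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Route one node operation to a SageMaker-like client."""
    if operation not in READ_OPERATIONS | WRITE_OPERATIONS:
        raise ValueError(f"Unsupported operation: {operation}")
    # Mutating operations are refused unless writes were explicitly enabled.
    if operation in WRITE_OPERATIONS and not allow_write:
        raise PermissionError(f"'{operation}' requires write access")

    if operation == "list_clusters":
        return client.list_clusters()

    # Every remaining operation targets a specific cluster.
    if not cluster_name:
        raise ValueError("cluster_name is required")
    if operation == "list_nodes":
        return client.list_cluster_nodes(ClusterName=cluster_name)
    if operation == "update_software":
        return client.update_cluster_software(ClusterName=cluster_name)

    # describe_node and batch_delete both need node IDs.
    if not node_ids:
        raise ValueError("node_ids is required")
    if operation == "describe_node":
        return client.describe_cluster_node(
            ClusterName=cluster_name, NodeId=node_ids[0]
        )
    return client.batch_delete_cluster_nodes(
        ClusterName=cluster_name, NodeIds=node_ids
    )
```

Checking the write gate before any client call means a read-only server never reaches a mutating API, even if the request is otherwise well formed.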
Technical Approach
- Uses boto3 for AWS API interactions with SageMaker and CloudFormation services
- Implements the MCP protocol via the FastMCP framework
- Supports standard AWS authentication (IAM roles, credentials)
- Configurable via command-line flags (--allow-write, --aws-profile, --aws-region)
- Includes comprehensive error handling and structured logging
- Does not create, modify, or provision CloudFormation templates - only interfaces with existing managed templates
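The command-line flags listed above could be parsed along these lines. This is a minimal sketch using only the standard library; the flag names come from the proposal, while the assumptions that `--allow-write` is a boolean switch and that `--aws-profile` and `--aws-region` each take a single value are mine. The function name `parse_args` is hypothetical.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse server flags; writes stay disabled unless explicitly enabled."""
    parser = argparse.ArgumentParser(description="HyperPod MCP server (sketch)")
    parser.add_argument(
        "--allow-write",
        action="store_true",  # defaults to False, i.e. read-only mode
        help="Enable operations that create, update, or delete resources",
    )
    parser.add_argument(
        "--aws-profile",
        default=None,
        help="Named AWS profile to use (assumed to take one value)",
    )
    parser.add_argument(
        "--aws-region",
        default=None,
        help="AWS region to target (assumed to take one value)",
    )
    return parser.parse_args(argv)
```

Using `store_true` makes read-only the default without extra logic: omitting the flag leaves `allow_write` as `False`.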
Integration
Works with the following MCP-compatible AI assistant:
- Amazon Q Developer - Primary development and testing environment
Scope and Limitations
Important Clarifications:
- This MCP server does not create, modify, or provision CloudFormation templates
- It exclusively uses the same managed CloudFormation templates that are available through the HyperPod console UI
- The server provides an interface layer to assist with deployment workflows, not direct infrastructure provisioning
- All cluster deployments utilize AWS-managed, pre-approved templates and configurations
Out of scope
The following capabilities are out of scope for this proposal; some may be included in future releases:
- Creating HyperPod cluster with Slurm orchestration
- Advanced network configurations and VPC provisioning
- Automated training plan setup functionalities
- User interfaces (web dashboards, mobile apps, or frontend components) - purely backend MCP functionality
- Custom CloudFormation template creation or modification
Potential challenges
- Staying in sync with SageMaker HyperPod API updates: the HyperPod service can introduce new APIs or change existing ones, so the MCP tools that rely on them need to be robust to API updates and provide fallback mechanisms, such as CLI call alternatives
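One way to picture the fallback mechanism mentioned above is a small wrapper that tries a primary call and switches to an alternative when the primary fails in a recoverable way. This is only a sketch: the name `call_with_fallback` is hypothetical, and the choice of which exception types count as recoverable (here `AttributeError` and `TypeError`, as might occur when an installed SDK lacks a newer operation or parameter) is an assumption, not something the proposal defines.

```python
import logging
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    recoverable: Tuple[Type[BaseException], ...] = (AttributeError, TypeError),
) -> T:
    """Run `primary`; on a recoverable error, log it and run `fallback`."""
    try:
        return primary()
    except recoverable as exc:
        # Only the listed error types trigger the fallback; anything else
        # (for example a permissions error) propagates to the caller.
        logger.warning("Primary call failed (%s); using fallback", exc)
        return fallback()
```

In this scheme the primary callable would wrap the SDK call and the fallback would wrap the CLI alternative, both producing the same result shape.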
Dependencies and Integrations
Core Dependencies:
- AWS SDK for Python (Boto3)
- FastMCP
- Pydantic
AWS Services Integration:
- Amazon SageMaker (primary HyperPod cluster management)
- AWS CloudFormation (infrastructure as code deployment)
- AWS IAM (authentication and authorization)
- Amazon VPC (networking for HyperPod clusters)
- Amazon EC2 (compute instance management)
- Amazon EKS (Kubernetes integration)
- Amazon S3 (lifecycle scripts for HyperPod clusters)
- Amazon FSx for Lustre (high-performance file systems)
Alternative solutions