RFC: Add New AWS SageMaker HyperPod MCP Server #1182

@jade710

Description

Is this related to an existing feature request or issue?

No

Summary

The Amazon SageMaker HyperPod MCP Server enables AI code assistants to interact with AWS SageMaker HyperPod clusters through natural language. It provides tools that streamline work with HyperPod clusters, from setup workflows to ongoing management. It offers a secure interface for deploying clusters with the same managed CloudFormation templates used by the AWS SageMaker HyperPod console UI (which have already been reviewed and approved) and for managing cluster nodes.

The MCP server implements 2 core tools: one for CloudFormation-based cluster deployment and management, and one for node operations, including listing, describing, updating software, and batch deletion. The server follows security-first principles: read-only mode by default, explicit flags for write operations, and standard AWS authentication integration.

Use case

HyperPod Cluster Setup

  • Provide a quick-setup experience for EKS-orchestrated clusters through AI agents, consistent with the AWS SageMaker HyperPod console
  • Utilize the same managed CloudFormation templates used by the AWS SageMaker HyperPod console UI for consistent and approved deployments
  • Allow users to specify customized instance groups, resource locations, and naming

HyperPod Cluster Management

  • Interact with HyperPod clusters and their dependent infrastructure resources via CloudFormation stacks
  • Enable AI agents to assist with scaling HyperPod cluster instance groups per user needs
  • Help with stack deletion workflows with proper resource cleanup to avoid orphaned resources
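
As an illustration of the scaling use case, here is a minimal sketch of how a requested change could be turned into CloudFormation parameter overrides. It assumes the managed template exposes the instance count as a stack parameter; the key name `InstanceGroupTargetCount` and both helper functions are hypothetical and not taken from the actual templates. Only the `ParameterKey`/`ParameterValue` list shape is the standard format that CloudFormation stack operations accept.

```python
from typing import Dict, List, Union


def scaled_target_count(current: int, delta: int) -> int:
    """Return the new target count after adding `delta` instances."""
    new_count = current + delta
    if new_count < 0:
        raise ValueError("target count cannot be negative")
    return new_count


def to_cfn_parameters(overrides: Dict[str, Union[str, int]]) -> List[Dict[str, str]]:
    """Convert a plain dict into the Parameters list used by CloudFormation.

    CloudFormation expects every parameter value as a string.
    """
    return [
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in sorted(overrides.items())
    ]


# Hypothetical parameter name; the real managed template may differ.
new_count = scaled_target_count(current=1, delta=2)
parameters = to_cfn_parameters({"InstanceGroupTargetCount": new_count})
```

With a current count of 1 and a request for 2 more instances, the resulting override carries a target count of 3, matching the second sample conversation in this RFC.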

HyperPod Cluster Node Management

  • Support listing all nodes in a HyperPod cluster and providing node-specific details
  • Update software across all nodes or specific instance groups to maintain security and performance
  • Batch delete nodes that are no longer needed while preserving critical data

With the HyperPod MCP Server, users can have conversations like:

User: "Create a new HyperPod cluster with 2 ml.m5.xlarge instances"
AI Assistant: "I'll help you set up a HyperPod cluster deployment. Let me start by asking you some questions to configure your cluster properly."

User: "Add 2 more instances to my HyperPod cluster instance group 1"
AI Assistant: "I'll help you add 2 more instances to instance group 1 (ig1) in your HyperPod cluster. This will increase the target count from 1 to 3 instances."

User: "Update my cluster software"
AI Assistant: "I'll help you update the software for your HyperPod cluster. This will update the AMI versions across all nodes in the cluster."

Proposal

What We're Building

A Python-based MCP server that exposes 2 comprehensive tools for AI assistants:

HyperPod Stack Management Tools
  • manage_hyperpod_stacks - Provides an interface to CloudFormation stacks for HyperPod clusters, with operations for initiating deployments, describing stacks, and deleting stacks. It supports parameter overrides for tailored cluster configurations and uses the same managed CloudFormation templates as the HyperPod console UI.
HyperPod Cluster Node Tools
  • manage_hyperpod_cluster_nodes - Provides comprehensive node management capabilities including listing clusters, listing nodes, describing specific nodes, updating cluster software, and batch deleting nodes
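
To make the node tool's shape concrete, the following is a rough sketch of how its operations might be dispatched with the read-only default enforced. It is not the proposed implementation: the operation names (`list_clusters`, `list_nodes`, `describe_node`, `update_software`, `batch_delete`) and the function signature are assumptions. The `client` argument is assumed to behave like `boto3.client("sagemaker")`, whose `list_clusters`, `list_cluster_nodes`, `describe_cluster_node`, `update_cluster_software`, and `batch_delete_cluster_nodes` methods are called here.

```python
from typing import Any, List, Optional

# Operations that change cluster state and therefore need the write flag.
WRITE_OPERATIONS = {"update_software", "batch_delete"}


def manage_hyperpod_cluster_nodes(
    client: Any,
    operation: str,
    cluster_name: Optional[str] = None,
    node_ids: Optional[List[str]] = None,
    allow_write: bool = False,
) -> dict:
    """Dispatch one node operation to a SageMaker-like client."""
    # Refuse mutating operations unless writes were explicitly enabled.
    if operation in WRITE_OPERATIONS and not allow_write:
        raise PermissionError(f"'{operation}' requires write access")

    if operation == "list_clusters":
        return client.list_clusters()

    # Every remaining operation targets a specific cluster.
    if not cluster_name:
        raise ValueError("cluster_name is required")

    if operation == "list_nodes":
        return client.list_cluster_nodes(ClusterName=cluster_name)
    if operation == "describe_node":
        if not node_ids or len(node_ids) != 1:
            raise ValueError("describe_node needs exactly one node id")
        return client.describe_cluster_node(
            ClusterName=cluster_name, NodeId=node_ids[0]
        )
    if operation == "update_software":
        return client.update_cluster_software(ClusterName=cluster_name)
    if operation == "batch_delete":
        if not node_ids:
            raise ValueError("batch_delete needs at least one node id")
        return client.batch_delete_cluster_nodes(
            ClusterName=cluster_name, NodeIds=node_ids
        )
    raise ValueError(f"unknown operation: {operation}")
```

Passing the client in as an argument keeps the guard logic testable with a stub and leaves credential handling to the caller.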

Technical Approach

  • Uses boto3 for AWS API interactions with SageMaker and CloudFormation services
  • Implements MCP protocol via FastMCP framework
  • Supports standard AWS authentication (IAM roles, credentials)
  • Configurable via command-line flags (--allow-write, --aws-profile, --aws-region)
  • Includes comprehensive error handling and structured logging
  • Does not create, modify, or provision CloudFormation templates - only interfaces with existing managed templates
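
The three command-line flags listed above could be parsed along these lines. This is a sketch under stated assumptions rather than the server's actual entry point: the program name `hyperpod-mcp-server` and the `session_kwargs` helper are hypothetical. The returned keys match the `profile_name` and `region_name` arguments of `boto3.Session`, which is left out so the example runs without AWS dependencies.

```python
import argparse
from typing import Dict


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags named in this RFC."""
    parser = argparse.ArgumentParser(prog="hyperpod-mcp-server")  # hypothetical name
    parser.add_argument(
        "--allow-write",
        action="store_true",  # absent flag means read-only mode
        help="enable operations that modify AWS resources",
    )
    parser.add_argument("--aws-profile", default=None, help="named AWS profile")
    parser.add_argument("--aws-region", default=None, help="AWS region to target")
    return parser


def session_kwargs(args: argparse.Namespace) -> Dict[str, str]:
    """Map parsed flags to keyword arguments suitable for boto3.Session."""
    kwargs: Dict[str, str] = {}
    if args.aws_profile:
        kwargs["profile_name"] = args.aws_profile
    if args.aws_region:
        kwargs["region_name"] = args.aws_region
    return kwargs


# With no flags, the server stays read-only and uses default credentials.
defaults = build_parser().parse_args([])
```

Using `store_true` for `--allow-write` means write access is off unless the flag is present, which is one simple way to get read-only behavior by default.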

Integration

Works with the following MCP-compatible AI assistant:

  • Amazon Q Developer - Primary development and testing environment

Scope and Limitations

Important Clarifications:

  • This MCP server does not create, modify, or provision CloudFormation templates
  • It exclusively uses the same managed CloudFormation templates that are available through the HyperPod console UI
  • The server provides an interface layer to assist with deployment workflows, not direct infrastructure provisioning
  • All cluster deployments utilize AWS-managed, pre-approved templates and configurations

Out of scope

The following capabilities may be included in future releases:

  • Creating HyperPod clusters with Slurm orchestration
  • Advanced network configurations and VPC provisioning
  • Automated training plan setup functionalities
  • User interfaces (web dashboards, mobile apps, or frontend components) - purely backend MCP functionality
  • Custom CloudFormation template creation or modification

Potential challenges

  • Staying in sync with SageMaker HyperPod API updates: the HyperPod service can introduce new APIs or change existing ones. The MCP tools that depend on these APIs need to be robust to such changes and provide fallback mechanisms, such as equivalent CLI calls.
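
One possible shape for such a fallback is a small wrapper that tries the primary call and switches to an alternative when a chosen class of error occurs. This is an illustrative sketch only: `call_with_fallback` is a hypothetical helper, and treating `AttributeError` (for example, a method missing from an older installed SDK) as the default signal of API drift is an assumption, not a decision made in this RFC.

```python
import logging
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    recoverable: Tuple[Type[BaseException], ...] = (AttributeError,),
) -> T:
    """Run `primary`; if it raises a recoverable error, run `fallback`.

    Errors outside `recoverable` propagate unchanged so real failures
    (permissions, validation, throttling) are not hidden.
    """
    try:
        return primary()
    except recoverable as exc:
        logger.warning("primary call failed (%s); using fallback", exc)
        return fallback()


def _missing_api() -> str:
    # Stand-in for an SDK method that is absent in the installed version.
    raise AttributeError("operation not available in this SDK version")


result = call_with_fallback(_missing_api, lambda: "from-cli")
```

In practice the fallback callable could wrap a CLI invocation, while the primary callable wraps the SDK call.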

Dependencies and Integrations

Core Dependencies:

  • AWS SDK for Python (Boto3)
  • FastMCP
  • Pydantic

AWS Services Integration:

  • Amazon SageMaker (primary HyperPod cluster management)
  • AWS CloudFormation (infrastructure as code deployment)
  • AWS IAM (authentication and authorization)
  • Amazon VPC (networking for HyperPod clusters)
  • Amazon EC2 (compute instance management)
  • Amazon EKS (Kubernetes integration)
  • Amazon S3 (lifecycle scripts for HyperPod clusters)
  • Amazon FSx for Lustre (high-performance file systems)

Alternative solutions
