From cc9fb875922f367750487c70fb454f37a71b8263 Mon Sep 17 00:00:00 2001
From: Min Badar
Date: Fri, 15 May 2026 22:55:19 -0700
Subject: [PATCH] feat(sagemaker-ai): add HyperPod debugging skills

Add new skills for diagnosing and troubleshooting HyperPod clusters:
- hyperpod-cluster-debugger: cluster-wide diagnostics
- hyperpod-nccl: NCCL failure diagnosis
- hyperpod-node-debugger: per-node issue triage
- hyperpod-performance-debugger: performance bottleneck analysis
- hyperpod-slurm-debugger: Slurm scheduler issues

Also updates hyperpod-ssm, hyperpod-version-checker, and hyperpod-issue-report
with related improvements. Updates README with new skill documentation.
---
 plugins/sagemaker-ai/README.md | 45 +-
 .../skills/hyperpod-cluster-debugger/SKILL.md | 198 ++
 .../references/capacity-planning.md | 124 +
 .../references/cloudformation-errors.md | 84 +
 .../references/cluster-diagnostics-detail.md | 463 +++
 .../references/cluster-operations.md | 270 ++
 .../references/iam-permissions.md | 40 +
 .../references/lifecycle-scripts.md | 111 +
 .../scripts/diagnose-cluster.sh | 1621 +++++++++++
 .../references/troubleshooting.md | 2 +-
 .../skills/hyperpod-nccl/SKILL.md | 187 ++
 .../references/debugging-guide.md | 1011 +++++++
 .../references/error-patterns-quick-ref.md | 47 +
 .../hyperpod-nccl/references/operations.md | 393 +++
 .../references/performance-testing.md | 247 ++
 .../hyperpod-nccl/scripts/nccl-diagnose.sh | 2563 +++++++++++++++++
 .../skills/hyperpod-node-debugger/SKILL.md | 269 ++
 .../references/node-diagnostics-detail.md | 1074 +++++++
 .../references/node-issue-catalog.md | 141 +
 .../scripts/check-efa-sg.sh | 355 +++
 .../scripts/check-node-reachability.sh | 389 ++++
 .../scripts/check-vpc-config.sh | 508 ++++
 .../scripts/triage-cluster.sh | 1258 ++++++++
 .../hyperpod-performance-debugger/SKILL.md | 185 ++
 .../references/perf-details.md | 202 ++
 .../scripts/perf-snapshot.sh | 666 +++++
 .../skills/hyperpod-slurm-debugger/SKILL.md | 243 ++
 .../references/slurm-details.md | 318 ++
 .../scripts/slurm-diagnose.sh | 802 ++++++
 .../sagemaker-ai/skills/hyperpod-ssm/SKILL.md | 14 +-
 .../skills/hyperpod-ssm/scripts/ssm-exec.sh | 28 +-
 .../scripts/hyperpod_check_versions.sh | 10 +-
 32 files changed, 13840 insertions(+), 28 deletions(-)
 create mode 100644 plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/SKILL.md
 create mode 100644 plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/capacity-planning.md
 create mode 100644 plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/cloudformation-errors.md
 create mode 100644 plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/cluster-diagnostics-detail.md
 create mode 100644 plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/cluster-operations.md
 create mode 100644 plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/iam-permissions.md
 create mode 100644 plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/lifecycle-scripts.md
 create mode 100755 plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/scripts/diagnose-cluster.sh
 create mode 100644 plugins/sagemaker-ai/skills/hyperpod-nccl/SKILL.md
 create mode 100644 plugins/sagemaker-ai/skills/hyperpod-nccl/references/debugging-guide.md
 create mode 100644 plugins/sagemaker-ai/skills/hyperpod-nccl/references/error-patterns-quick-ref.md
 create mode 100644 plugins/sagemaker-ai/skills/hyperpod-nccl/references/operations.md
 create mode 100644 plugins/sagemaker-ai/skills/hyperpod-nccl/references/performance-testing.md
 create mode 100755 plugins/sagemaker-ai/skills/hyperpod-nccl/scripts/nccl-diagnose.sh
 create mode 100644 plugins/sagemaker-ai/skills/hyperpod-node-debugger/SKILL.md
 create mode 100644 plugins/sagemaker-ai/skills/hyperpod-node-debugger/references/node-diagnostics-detail.md
 create mode 100644 plugins/sagemaker-ai/skills/hyperpod-node-debugger/references/node-issue-catalog.md
 create mode 100755 plugins/sagemaker-ai/skills/hyperpod-node-debugger/scripts/check-efa-sg.sh
 create mode 100755 plugins/sagemaker-ai/skills/hyperpod-node-debugger/scripts/check-node-reachability.sh
 create mode 100755 plugins/sagemaker-ai/skills/hyperpod-node-debugger/scripts/check-vpc-config.sh
 create mode 100755 plugins/sagemaker-ai/skills/hyperpod-node-debugger/scripts/triage-cluster.sh
 create mode 100644 plugins/sagemaker-ai/skills/hyperpod-performance-debugger/SKILL.md
 create mode 100644 plugins/sagemaker-ai/skills/hyperpod-performance-debugger/references/perf-details.md
 create mode 100755 plugins/sagemaker-ai/skills/hyperpod-performance-debugger/scripts/perf-snapshot.sh
 create mode 100644 plugins/sagemaker-ai/skills/hyperpod-slurm-debugger/SKILL.md
 create mode 100644 plugins/sagemaker-ai/skills/hyperpod-slurm-debugger/references/slurm-details.md
 create mode 100755 plugins/sagemaker-ai/skills/hyperpod-slurm-debugger/scripts/slurm-diagnose.sh

diff --git a/plugins/sagemaker-ai/README.md b/plugins/sagemaker-ai/README.md
index 764821ff..c867fcb5 100644
--- a/plugins/sagemaker-ai/README.md
+++ b/plugins/sagemaker-ai/README.md
@@ -3,24 +3,29 @@
 This plugin brings deep AWS AI/ML expertise directly into your coding assistant, covering the surface area of [Amazon SageMaker AI](https://aws.amazon.com/sagemaker/ai/); currently, skills are provided to assist with the following capability areas:

 - **Model Customization** — End-to-end guided workflows for fine-tuning foundation models, from use case definition through data preparation, training, evaluation, and deployment on Amazon SageMaker AI.
-- **HyperPod Cluster Operations** — Remote command execution on nodes via SSM, version checking, and diagnostic reporting for SageMaker HyperPod training clusters.
+- **HyperPod Cluster Operations** — Remote command execution on nodes via SSM, version checking, diagnostic reporting, and deep debugging for SageMaker HyperPod training clusters.

 ## Agent Skills

-| # | Skill | Description | Documentation |
-| -- | -------------------------- | ------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------- |
-| 1 | `planning` | Builds a dynamic, step-by-step plan tailored to your intents | [SKILL.md](skills/planning/SKILL.md) |
-| 2 | `directory-management` | Manages project directory setup, artifact organization, and plan association for new or existing projects | [SKILL.md](skills/directory-management/SKILL.md) |
-| 3 | `use-case-specification` | Guided, conversational process to define your model customization use case goals, key stakeholders, and success criteria | [SKILL.md](skills/use-case-specification/SKILL.md) |
-| 4 | `dataset-evaluation` | Dataset quality validation, format detection, and data requirements analysis | [SKILL.md](skills/dataset-evaluation/SKILL.md) |
-| 5 | `dataset-transformation` | Dataset format conversion and preparation for SageMaker-compatible training formats | [SKILL.md](skills/dataset-transformation/SKILL.md) |
-| 6 | `finetuning-setup` | Fine-tuning technique selection (SFT, DPO, RLVR, etc.) and base model selection | [SKILL.md](skills/finetuning-setup/SKILL.md) |
-| 7 | `finetuning` | Hyperparameter configuration and training job execution | [SKILL.md](skills/finetuning/SKILL.md) |
-| 8 | `model-evaluation` | Evaluation design, benchmark selection, LLM-as-a-judge, and model comparison | [SKILL.md](skills/model-evaluation/SKILL.md) |
-| 9 | `model-deployment` | Deployment configuration and endpoint setup (SageMaker or Bedrock) | [SKILL.md](skills/model-deployment/SKILL.md) |
-| 10 | `hyperpod-ssm` | Remote command execution and file transfer on HyperPod cluster nodes via SSM | [SKILL.md](skills/hyperpod-ssm/SKILL.md) |
-| 11 | `hyperpod-version-checker` | Check and compare software component versions across HyperPod cluster nodes | [SKILL.md](skills/hyperpod-version-checker/SKILL.md) |
-| 12 | `hyperpod-issue-report` | Generate diagnostic reports for HyperPod troubleshooting and support cases | [SKILL.md](skills/hyperpod-issue-report/SKILL.md) |
+| # | Skill | Description | Documentation |
+| -- | ------------------------------- | ------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------- |
+| 1 | `planning` | Builds a dynamic, step-by-step plan tailored to your intents | [SKILL.md](skills/planning/SKILL.md) |
+| 2 | `directory-management` | Manages project directory setup, artifact organization, and plan association for new or existing projects | [SKILL.md](skills/directory-management/SKILL.md) |
+| 3 | `use-case-specification` | Guided, conversational process to define your model customization use case goals, key stakeholders, and success criteria | [SKILL.md](skills/use-case-specification/SKILL.md) |
+| 4 | `dataset-evaluation` | Dataset quality validation, format detection, and data requirements analysis | [SKILL.md](skills/dataset-evaluation/SKILL.md) |
+| 5 | `dataset-transformation` | Dataset format conversion and preparation for SageMaker-compatible training formats | [SKILL.md](skills/dataset-transformation/SKILL.md) |
+| 6 | `finetuning-setup` | Fine-tuning technique selection (SFT, DPO, RLVR, etc.) and base model selection | [SKILL.md](skills/finetuning-setup/SKILL.md) |
+| 7 | `finetuning` | Hyperparameter configuration and training job execution | [SKILL.md](skills/finetuning/SKILL.md) |
+| 8 | `model-evaluation` | Evaluation design, benchmark selection, LLM-as-a-judge, and model comparison | [SKILL.md](skills/model-evaluation/SKILL.md) |
+| 9 | `model-deployment` | Deployment configuration and endpoint setup (SageMaker or Bedrock) | [SKILL.md](skills/model-deployment/SKILL.md) |
+| 10 | `hyperpod-ssm` | Remote command execution and file transfer on HyperPod cluster nodes via SSM | [SKILL.md](skills/hyperpod-ssm/SKILL.md) |
+| 11 | `hyperpod-version-checker` | Check and compare software component versions across HyperPod cluster nodes | [SKILL.md](skills/hyperpod-version-checker/SKILL.md) |
+| 12 | `hyperpod-issue-report` | Generate diagnostic reports for HyperPod troubleshooting and support cases | [SKILL.md](skills/hyperpod-issue-report/SKILL.md) |
+| 13 | `hyperpod-cluster-debugger` | Diagnose cluster-wide HyperPod problems — creation failures, EFA health, lifecycle scripts, capacity | [SKILL.md](skills/hyperpod-cluster-debugger/SKILL.md) |
+| 14 | `hyperpod-nccl` | Diagnose NCCL failures — training hangs, AllReduce timeouts, EFA errors, rendezvous failures | [SKILL.md](skills/hyperpod-nccl/SKILL.md) |
+| 15 | `hyperpod-node-debugger` | Diagnose per-node issues — GPU hardware, EFA, disk/memory pressure, container runtime | [SKILL.md](skills/hyperpod-node-debugger/SKILL.md) |
+| 16 | `hyperpod-performance-debugger` | Diagnose performance issues — uneven NCCL bandwidth, filesystem throughput, straggler nodes | [SKILL.md](skills/hyperpod-performance-debugger/SKILL.md) |
+| 17 | `hyperpod-slurm-debugger` | Diagnose Slurm scheduler issues — nodes stuck down/drain, jobs pending, GRES miscounts, auto-resume | [SKILL.md](skills/hyperpod-slurm-debugger/SKILL.md) |

 ## MCP Servers

@@ -99,12 +104,22 @@ The HyperPod skills provide operational tooling for Amazon SageMaker HyperPod AI
 - **`hyperpod-ssm`** — Run commands and transfer files on cluster nodes via AWS Systems Manager (SSM), without needing direct SSH access.
 - **`hyperpod-version-checker`** — Check and compare software component versions (drivers, libraries, frameworks) across cluster nodes to identify drift or incompatibilities.
 - **`hyperpod-issue-report`** — Generate comprehensive issue reports that collect system state, logs, and configuration details for troubleshooting or support case submission.
+- **`hyperpod-cluster-debugger`** — Diagnose cluster-wide problems including creation/deployment failures, EFA health checks, lifecycle script errors, and capacity issues.
+- **`hyperpod-nccl`** — Diagnose NCCL failures and training-pod issues such as AllReduce timeouts, EFA/libfabric errors, rendezvous failures, and container OOM.
+- **`hyperpod-node-debugger`** — Diagnose per-node issues including GPU hardware faults (XID, ECC, NVLink), EFA, disk/memory pressure, and container runtime problems.
+- **`hyperpod-performance-debugger`** — Diagnose performance bottlenecks such as uneven NCCL bandwidth across nodes, filesystem throughput issues, and straggler nodes.
+- **`hyperpod-slurm-debugger`** — Diagnose Slurm scheduler and node-daemon issues including nodes stuck in down/drain, jobs pending, GRES miscounts, and auto-resume failures.

 ### Examples

 - "Check the GPU memory usage on all nodes in my HyperPod cluster using SSM"
 - "Check driver versions on my HyperPod cluster"
 - "Generate an issue report for my HyperPod cluster"
+- "My HyperPod cluster creation failed, help me debug it"
+- "Training is hanging with NCCL timeout errors"
+- "A node in my cluster is unhealthy, diagnose it"
+- "My training is slower than expected across nodes"
+- "Slurm jobs are stuck pending even though nodes show idle"

 ## Supported Environments

diff --git a/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/SKILL.md b/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/SKILL.md
new file mode 100644
index 000000000..bba49c5a
--- /dev/null
+++ b/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/SKILL.md
@@ -0,0 +1,198 @@
+---
+name: hyperpod-cluster-debugger
+description: Diagnose and remediate cluster-wide HyperPod (EKS or Slurm) problems — creation / deployment failures (CloudFormation, EFA health check, lifecycle scripts, capacity), EKS access, node replacement, CloudFormation nested-stack errors, post-maintenance rollback state, dangling nodes, autoscaler conflicts. Includes `--validate` pre-flight. Read-only.
+metadata:
+  version: "0.0.1"
+---
+
+# HyperPod Cluster Debugger
+
+**Operating policy.** Run read-only diagnostics yourself. Never run a command that changes cluster, node, or workload state — present each one as a **Suggested command (run this yourself)** block and wait for the customer to run it. Destructive order: **investigate → reboot → replace** (replace destroys root + secondary volumes; not supported on Slurm controller nodes).
+
+**Before any state-changing CLI: ask if it's IaC-managed.** HyperPod clusters, SGs, EKS access entries, and IAM are usually provisioned via CloudFormation / CDK / Terraform. If yes, the fix belongs in IaC — running the CLI will drift and the next deploy reverts it. Use the CLI only when IaC is unavailable (locked out, predates IaC, mid-review).
+
+`scripts/diagnose-cluster.sh` is read-only: it collects state via AWS APIs (and SSM for Slurm controller health) and prints each issue as `[FAIL] ... → references/.md §`.
+
+| Reference | Open when |
+| ------------------------------------------------------------------------- | ------------------------------------------------------------------- |
+| [cluster-diagnostics-detail.md](references/cluster-diagnostics-detail.md) | Per-finding remediation runbook (§ A–L) |
+| [cluster-operations.md](references/cluster-operations.md) | Operational deep-dives (EFA SG, EKS access, SSM, Slurm, filesystem) |
+| [cloudformation-errors.md](references/cloudformation-errors.md) | § H needs the full per-resource CFN error catalog |
+| [capacity-planning.md](references/capacity-planning.md) | § B or `--validate` flags capacity / subnet sizing |
+| [lifecycle-scripts.md](references/lifecycle-scripts.md) | § C points at a specific lifecycle failure |
+| [iam-permissions.md](references/iam-permissions.md) | Full IAM policy for the diagnostic |
+
+---
+
+## Workflow
+
+1. Collect HyperPod cluster name (not EKS name), region, exact error string.
+2. Run `scripts/diagnose-cluster.sh` (or `--validate` for pre-create).
+3. For every `[FAIL]` line, `Read` the referenced section.
+4. Present finding, root cause, and the Suggested-command block verbatim. Wait for customer approval.
+5. Re-run the diagnostic to confirm.
+
+---
+
+## Step 1: Run diagnostics
+
+```bash
+# Diagnose an existing cluster:
+bash scripts/diagnose-cluster.sh --cluster --region
+
+# Pre-flight (no cluster needed) — validates SGs, subnets, IAM, VPC endpoints,
+# optionally S3 lifecycle scripts and per-AZ capacity:
+bash scripts/diagnose-cluster.sh --validate --region \
+  --sg-ids --subnet-ids [--iam-role ] \
+  [--s3-uri s3:///path/] [--instance-type ml.p5.48xlarge]
+```
+
+Pass `--instance-type` when the target instance type is known — enables the per-AZ capacity check (warns if none of the provided subnets are in an AZ that offers that type, which causes insufficient-capacity failures at creation time).
+
+Tags: `[PASS]` · `[FAIL]` (counted, has `→ references/...` pointer) · `[WARN]` · `[INFO]`. Priorities: **P0** blocks operation · **P1** degraded · **P2** informational.
+
+---
+
+## Step 2: Match signal → section
+
+**Error messages / events:**
+
+| Signal | Section |
+| ---------------------------------------------------------------------------- | -------------------------------------------------------------- |
+| `"EFA health checks did not run successfully"` (public-doc verbatim signal) | **[A: EFA Health Checks](#a-efa-health-checks)** |
+| Insufficient-capacity or AZ-mismatch failure at creation | **[B: Capacity & AZ](#b-capacity--az)** |
+| Lifecycle-script failure or timeout during provisioning | **[C: Lifecycle Scripts](#c-lifecycle-scripts)** |
+| kubectl auth error (server asks for credentials / no API group list) | **[D: EKS Access](#d-eks-access--kubectl)** |
+| `InService` but not all instances visible | **[E: Cluster Provisioning](#e-cluster-provisioning)** |
+| `"Target is not connected"` / SSM errors | **[F: SSM Connectivity](#f-ssm-connectivity)** |
+| Node replacement not happening / `batch-replace` not working | **[G: Node Replacement](#g-node-replacement)** |
+| `"Embedded stack failed"` / any CloudFormation error | **[H: CloudFormation Errors](#h-cloudformation-errors)** |
+| `UpdateClusterSoftware` failed or cluster in post-maintenance rollback state | **[J: AMI & Cluster Updates](#j-ami--cluster-updates)** |
+| Dangling / orphaned nodes in EKS vs `list-cluster-nodes` | **[K: Dangling Nodes & Cleanup](#k-dangling-nodes--cleanup)** |
+| Cluster Autoscaler breaks after HyperPod attached | **[L: Autoscaler Compatibility](#l-autoscaler-compatibility)** |
+| Slow I/O, FSx throughput saturated | [cluster-operations.md § 9](references/cluster-operations.md) |
+| Slurm node name → instance ID lookup | **[I: Utilities](#i-utilities)** |
+
+---
+
+## A: EFA Health Checks
+
+SG missing self-reference. Add inbound + outbound self-ref to every SG on the cluster, plus least-privilege egress for the AWS APIs the node needs (HTTPS 443 to S3 / ECR / SageMaker / SSM / STS / CloudWatch Logs — via VPC-endpoint prefix-lists when possible). Full procedure: [cluster-diagnostics-detail.md § A](references/cluster-diagnostics-detail.md#a-efa-health-checks).
+
+## B: Capacity & AZ
+
+Instance type unavailable in the requested AZ. Verify with `describe-instance-type-offerings`, then change AZ, use Flexible Training Plans, or request ODCR. Full: [§ B](references/cluster-diagnostics-detail.md#b-capacity--az) · strategy: [capacity-planning.md](references/capacity-planning.md).
+
+## C: Lifecycle Scripts
+
+Script failed or timed out during provisioning. Read CloudWatch under `/aws/sagemaker/Clusters//` — common causes: missing S3 VPC endpoint, IAM gap, CRLF line endings, instance-group name mismatch. Full: [§ C](references/cluster-diagnostics-detail.md#c-lifecycle-scripts) · layout: [lifecycle-scripts.md](references/lifecycle-scripts.md).
+
+## D: EKS Access / kubectl
+
+IAM identity not in EKS access entries. Verify with `sts get-caller-identity`, create an access entry with admin policy, update kubeconfig. Full: [§ D](references/cluster-diagnostics-detail.md#d-eks-access--kubectl).
+
+## E: Cluster Provisioning
+
+`InService` without all instances is expected under Continuous Provisioning — failures surface as events, not cluster errors. For stuck `Creating`/`Updating`/`Deleting`: check CFN nested stacks (§ H), IAM, capacity, events; if stuck `Deleting` check VPC ENI dependencies. Full: [§ E](references/cluster-diagnostics-detail.md#e-cluster-provisioning).
+
+## F: SSM Connectivity
+
+`Target is not connected`: use `sagemaker-cluster:_-` format (not raw EC2 ID), install session-manager-plugin, confirm node `Running`. Check IAM + VPC endpoints on timeouts. Full: [§ F](references/cluster-diagnostics-detail.md#f-ssm-connectivity).
+
+## G: Node Replacement
+
+Auto-repair: confirm `NodeRecovery=Automatic`, check Health Monitoring Agent (HMA) logs + node labels / Slurm reason, confirm capacity. Manual: reboot first, replace only if reboot fails. Replace requires the cluster to have been patched via `UpdateClusterSoftware` at least once and cannot target a Slurm controller node. Full: [§ G](references/cluster-diagnostics-detail.md#g-node-replacement).
+
+## H: CloudFormation Errors
+
+`Embedded stack failed` hides the real error. Drill into nested stacks via Events tab (filter Failed) until you reach a non-stack resource. CLI: `describe-stack-events --query 'StackEvents[?ResourceStatus==\`CREATE_FAILED\`]'`. Also covers SLR creation failures and permission-boundary denials. Full: [§ H](references/cluster-diagnostics-detail.md#h-cloudformation-errors) · catalog: [cloudformation-errors.md](references/cloudformation-errors.md).
+
+## I: Utilities
+
+Map Slurm node names (`ip-10-x-y-z`) to HyperPod instance IDs via `list-cluster-nodes` or on-node `/opt/ml/config/resource_config.json`. Full: [§ I](references/cluster-diagnostics-detail.md#i-utilities).
+
+## J: AMI & Cluster Updates
+
+`UpdateClusterSoftware` fails and rolls back, or the cluster stays in a post-maintenance rollback state. Common causes: lifecycle script incompatible with new AMI, HMA version too old, insufficient rolling-update capacity. If the cluster has active nodes, collect diagnostics and escalate rather than delete-and-recreate. Full: [§ J](references/cluster-diagnostics-detail.md#j-ami--cluster-updates).
+
+## K: Dangling Nodes & Cleanup
+
+Nodes in `kubectl get nodes` but not in `list-cluster-nodes` (ghost EKS nodes), or the inverse (HyperPod nodes that never registered kubelet). Script flags both. Full: [§ K](references/cluster-diagnostics-detail.md#k-dangling-nodes--cleanup).
+
+## L: Autoscaler Compatibility
+
+Cluster Autoscaler errors on HyperPod provider IDs and breaks autoscaling for all node groups. No officially endorsed workaround — escalate to AWS Support. Karpenter does not conflict with HyperPod nodes by default. Full: [§ L](references/cluster-diagnostics-detail.md#l-autoscaler-compatibility).
+
+---
+
+## Prerequisites
+
+- `aws` CLI v2.13+ authenticated to the cluster's account
+- `jq`, `python3`, `bash` 4.2+
+- `kubectl` authenticated to the EKS cluster (EKS checks skipped if absent)
+- `session-manager-plugin` (Slurm controller health checks only)
+
+IAM policy: [references/iam-permissions.md](references/iam-permissions.md).
+
+## Defaults
+
+- **Region** — required: pass `--region` or set `$AWS_DEFAULT_REGION`.
+- **Mode** — `--cluster ` (diagnose) or `--validate` (pre-create).
+- **Event window** — up to 500 most recent events (5 × 100, paginated).
+- **Colors** — auto-disabled on non-TTY; `--no-color` to force off.
+
+## Error handling
+
+| Failure | Script | Tell the customer |
+| --------------------------------------------------- | ---------------------------------------------------------- | ----------------------------------------------------- |
+| `aws sts get-caller-identity` fails | Exit 1 | "Fix AWS credentials and rerun." |
+| Cluster not found | Exit 1 after listing region's clusters | "Confirm HyperPod cluster name (not EKS) and region." |
+| `sagemaker:*` / `ec2:*` / `eks:*` / `logs:*` denied | Warn, add `Missing IAM permission for `, continue | "Grant the listed IAM action and rerun." |
+| `kubectl` absent or unauthenticated | Skip EKS checks (access entries, add-ons, aws-auth, nodes) | "Install/authenticate kubectl." |
+| `session-manager-plugin` absent (Slurm) | Skip Slurm controller probe | "Install session-manager-plugin." |
+| SSM throttled / times out (180s) | Retry with backoff; warn and continue if still failing | "Rerun later — script is idempotent." |
+| CloudWatch log group not found | Skip CloudWatch check | "CloudWatch not configured on this cluster." |
+
+Exit codes: `0` no critical failures · `1` one or more critical failures (cluster not found, fatal prerequisite missing, or any `[FAIL]` in diagnose or `--validate` mode). `[WARN]` lines do not affect the exit code.
+
+## Skill delegation
+
+| Need | Use |
+| ------------------------------- | -------------------------- |
+| Shell on nodes | `hyperpod-ssm` |
+| Version comparison across nodes | `hyperpod-version-checker` |
+
+## Escalate to AWS Support
+
+Escalate when:
+
+1. EFA health checks fail despite correct SG rules.
+2. Capacity errors persist despite a valid Flexible Training Plan / ODCR.
+3. Node replacement fails repeatedly without clear events / log signal.
+4. Cluster stuck in a non-terminal state (`Creating`, `Updating`, or a post-maintenance rollback state) for an extended period.
+5. CloudFormation root-cause is an internal service error.
+
+### Before opening the case
+
+Run these commands and attach the output. Goal: AWS Support has everything at case open.
+
+```bash
+# 1. Cluster identity + status (confirms region, ARN, orchestrator, instance groups)
+aws sagemaker describe-cluster --cluster-name --region
+
+# 2. Full cluster-level diagnostic bundle
+bash scripts/diagnose-cluster.sh --cluster --region > diag.txt
+
+# 3. Per-node log/config bundle to S3 (delegates to hyperpod-issue-report skill)
+# See skills/hyperpod-issue-report/SKILL.md for the exact invocation.
+```
+
+### Include in the case
+
+- Cluster name + ARN (or `ClusterId` suffix) and AWS region
+- `ClusterStatus` + `FailureMessage` from `describe-cluster`
+- Timestamp window (UTC start / end) of the failure
+- Exact error strings observed (copy verbatim from events / logs / console)
+- Affected instance IDs / `NodeLogicalId`s / instance group names
+- `diag.txt` from step 2 above
+- S3 URI of the `hyperpod-issue-report` bundle from step 3
diff --git a/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/capacity-planning.md b/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/capacity-planning.md
new file mode 100644
index 000000000..0c19592d
--- /dev/null
+++ b/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/capacity-planning.md
@@ -0,0 +1,124 @@
+# Capacity Planning
+
+Companion to [SKILL.md](../SKILL.md) § B and `--validate`. Capacity errors are one of the most common creation failures.
+
+---
+
+## Capacity options
+
+### On-demand
+
+Fine for small instance types and short experiments. **Not guaranteed** for large GPU types (p4d, p5, p5e, trn1, trn2). No physical-proximity guarantees — sub-optimal for distributed training.
+
+```bash
+# Which AZs have this instance type. The EC2 API uses bare instance-type
+# names, so strip the SageMaker `ml.` prefix before filtering.
+aws ec2 describe-instance-type-offerings \
+  --location-type availability-zone \
+  --filters "Name=instance-type,Values=p5.48xlarge" \
+  --region us-west-2 \
+  --query 'InstanceTypeOfferings[*].Location' --output table
+```
+
+### Flexible Training Plans
+
+Guaranteed capacity for a reserved period, discounted pricing, co-located instances. Requires advance planning.
+
+```bash
+aws sagemaker list-training-plans \
+  --filters Name=Status,Value=Active \
+  --region \
+  --query 'TrainingPlanSummaries[*].{Name:TrainingPlanName,Type:InstanceType,Count:TotalInstanceCount,AZ:AvailabilityZone,Status:Status,Start:StartTime,End:EndTime}' \
+  --output table
+```
+
+Use in cluster config:
+
+```bash
+aws sagemaker create-cluster \
+  --cluster-name my-cluster \
+  --instance-groups '[{
+    "InstanceGroupName": "gpu-workers",
+    "InstanceType": "ml.p5.48xlarge",
+    "InstanceCount": 4,
+    "ExecutionRole": "arn:aws:iam:::role/HyperPodRole",
+    "TrainingPlanArn": "arn:aws:sagemaker:::training-plan/",
+    "LifeCycleConfig": {"SourceS3Uri": "s3://sagemaker-lifecycle-/", "OnCreate": "on_create.sh"}
+  }]' \
+  --vpc-config '{"SecurityGroupIds":["sg-xxx"],"Subnets":["subnet-xxx"]}' \
+  --region
+```
+
+**Critical:** the subnet must be in the **same AZ** as the training plan's `AvailabilityZone`.
+
+### Reserved capacity (via account team)
+
+For large or long-term capacity. Contact the AWS account team — customized placement and pricing, longer lead time.
+
+---
+
+## AZ selection
+
+Instance-type availability varies by AZ, and AZ names (`us-west-2a`) map to different physical zones per account. When coordinating with AWS Support or the account team about reserved capacity, use **AZ IDs** (`usw2-az1`), not AZ names — they're consistent across accounts.
+ +```bash +# AZ name → ID: +aws ec2 describe-availability-zones --region \ + --query 'AvailabilityZones[*].{Name:ZoneName,ID:ZoneId,State:State}' --output table + +# Your subnet's AZ: +aws ec2 describe-subnets --subnet-ids --region \ + --query 'Subnets[0].{AZ:AvailabilityZone,AZ_ID:AvailabilityZoneId}' + +# Instance-type offerings by AZ-ID: +aws ec2 describe-instance-type-offerings \ + --location-type availability-zone-id \ + --filters "Name=instance-type,Values=" \ + --region \ + --query 'InstanceTypeOfferings[*].Location' +``` + +If your subnet's AZ doesn't appear in the offerings list, create a new subnet in an AZ that does. + +--- + +## Service quotas + +Check `ml. for cluster usage` quotas before creating a cluster. EKS on HyperPod also consumes ENIs and subnet IPs — size subnets generously; CIDRs cannot be changed after creation. + +```bash +# SageMaker HyperPod quotas: +aws service-quotas list-service-quotas \ + --service-code sagemaker --region \ + --query 'Quotas[?contains(QuotaName,`cluster`) || contains(QuotaName,`HyperPod`)].{Name:QuotaName,Value:Value,Code:QuotaCode}' \ + --output table + +# Subnet free IPs: +aws ec2 describe-subnets --subnet-ids --region \ + --query 'Subnets[0].{CIDR:CidrBlock,FreeIPs:AvailableIpAddressCount}' +``` + +Request quota increases proactively — processing time varies by quota and region. + +--- + +## Troubleshooting + +### `Insufficient capacity` + +1. Check which AZs have the instance type (commands above) +2. Verify your subnet is in one of those AZs +3. If no AZ has capacity: try a different region/type or contact account team +4. Using a Training Plan: verify `TrainingPlanArn` and that the subnet AZ matches the plan AZ + +### `No subnets in the capacity AZ` + +Cluster specifies subnets, but none are in the AZ where AWS has capacity. Create a subnet in that AZ and add it to the cluster config. + +### Stuck in `Creating` with no events + +Likely waiting for capacity. 
Check `list-cluster-events`; if no events after >1 hour, contact AWS Support. + +### Partial provisioning + +Capacity was available for some instances but not all. With `NodeProvisioningMode=Continuous` the cluster keeps retrying. Check events for the failing instance group; consider reducing `InstanceCount` or using `MinInstanceCount` for elastic scaling. diff --git a/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/cloudformation-errors.md b/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/cloudformation-errors.md new file mode 100644 index 00000000..d22a0a13 --- /dev/null +++ b/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/cloudformation-errors.md @@ -0,0 +1,84 @@ +# CloudFormation Error Reference + +Deep-dive companion to [SKILL.md](../SKILL.md) § H. HyperPod console deployments create nested CloudFormation stacks; the root-cause error is typically in a nested stack's leaf resource. + +--- + +## Navigate to the real failure + +1. CloudFormation console → correct region → find the failed HyperPod stack (`CREATE_FAILED` or `ROLLBACK_COMPLETE`) +2. **Events tab** → filter by `CREATE_FAILED` → note the earliest failure +3. **Resources tab** → find `AWS::CloudFormation::Stack` entries with `CREATE_FAILED` +4. Click the Physical ID → opens the nested stack +5. Repeat until you reach a stack with only leaf resources +6. 
The **Status reason** on the failed leaf resource is the root cause + +CLI alternative (per stack — nested stacks need to be iterated): + +```bash +aws cloudformation describe-stack-events --stack-name --region \ + --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`].{Time:Timestamp,Resource:LogicalResourceId,Type:ResourceType,Reason:ResourceStatusReason}' \ + --output table +``` + +--- + +## Resource error catalog + +### AWS::SageMaker::Cluster + +| Status reason | Root cause | Fix | +| -------------------------------------------------- | -------------------------------------- | ------------------------------------------------------------------- | +| `Insufficient capacity in the Availability Zone` | No on-demand instances available in AZ | Different AZ, Flexible Training Plans, or reserved capacity | +| `No subnets in the capacity AZ` | Cluster subnet not in capacity AZ | Create subnet in the AZ where instances are available | +| `EFA health checks did not run successfully` | SG missing self-referencing rules | Add inbound + outbound self-ref rules (protocol: All, source: self) | +| `Lifecycle scripts did not run successfully` | Script error, S3 access, or timeout | Check CloudWatch: `/aws/sagemaker/Clusters//` | +| `The security group 'sg-xxx' does not exist` | Wrong SG ID or different region | Verify SG exists in same region and VPC | +| `The subnet 'subnet-xxx' does not exist` | Wrong subnet ID or different region | Verify subnet exists in same region | +| `You are not authorized to perform this operation` | Execution role missing permissions | Add required SageMaker + VPC permissions to the execution role | + +### AWS::IAM::Role + +| Status reason | Root cause | Fix | +| ----------------------------------------- | -------------------------------------------------------------------- | ------------------------------------------------------------ | +| `Cannot exceed quota for PoliciesPerRole` | Managed-policy-per-role quota reached (default 10; can be 
increased) | Consolidate into inline policies or request a quota increase | +| `Invalid principal in policy` | Wrong service in trust policy | Use `"Service": "sagemaker.amazonaws.com"` in trust policy | +| `MalformedPolicyDocument` | JSON syntax error | Validate JSON; check trailing commas and quotes | +| `EntityAlreadyExists` | Role name already taken | Use unique name or import existing role | + +### AWS::EC2::VPC / Subnet / SecurityGroup + +| Status reason | Root cause | Fix | +| ---------------------------------------------------- | ---------------------------------------------------------------- | -------------------------------------------------------- | +| `The CIDR 'x.x.x.x/y' conflicts with another subnet` | Overlapping CIDR in same VPC | Use non-overlapping CIDR blocks | +| `InvalidPermission.Duplicate` | SG rule already exists | Treat as success (template idempotency) | +| `RulesPerSecurityGroupLimitExceeded` | Per-SG rule quota reached (default 60 per direction; adjustable) | Consolidate with CIDR ranges or request a quota increase | + +### AWS::FSx::FileSystem + +| Status reason | Root cause | Fix | +| ----------------------------------------------- | ----------------------------------- | ------------------------------------------ | +| `The subnet is not in a supported AZ` | FSx Lustre not available in that AZ | Use a subnet in an AZ that supports Lustre | +| `The security group does not belong to the VPC` | SG and subnet in different VPCs | Use an SG created in the same VPC as the subnet | + +### Custom::Resource / AWS::Lambda::Function + +Lambda-backed custom resources fail with the underlying Lambda error. Find the function name in the Resources tab, then: + +```bash +aws logs tail /aws/lambda/ --region --since 1h +``` + +--- + +## Rolled-back stacks + +When a stack rolls back, CloudFormation deletes what it created. 
List them: + +```bash +aws cloudformation list-stacks \ + --stack-status-filter ROLLBACK_COMPLETE DELETE_COMPLETE \ + --region \ + --query 'StackSummaries[?contains(StackName,`HyperPod`) || contains(StackName,`hyperpod`)].{Name:StackName,Status:StackStatus,Time:CreationTime}' \ + --output table +``` diff --git a/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/cluster-diagnostics-detail.md b/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/cluster-diagnostics-detail.md new file mode 100644 index 00000000..aad0a1dc --- /dev/null +++ b/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/references/cluster-diagnostics-detail.md @@ -0,0 +1,463 @@ +# Cluster Diagnostics — Detailed Procedures + +Full diagnostic and fix procedures for each section referenced from [SKILL.md](../SKILL.md). + +--- + +## A: EFA Health Checks + +**Signals:** `"EFA health checks did not run successfully. Ensure that your VPC and security groups are properly configured before attempting to create a new cluster."` + +**Root cause:** Security group missing self-referencing rules — a common cluster-creation failure. + +### Diagnose + +```bash +bash scripts/diagnose-cluster.sh --cluster --region + +# Or directly: +SG=$(aws sagemaker describe-cluster --cluster-name --region \ + --query 'VpcConfig.SecurityGroupIds[0]' --output text) +aws ec2 describe-security-groups --group-ids $SG --region \ + --query 'SecurityGroups[0].{Inbound:IpPermissions,Outbound:IpPermissionsEgress}' \ + --output json +``` + +Look for self-referencing rules where source/destination is the SG itself. + +### Fix — apply to every SG on the cluster + +Customer-run. Apply the two self-ref rules to each SG in `describe-cluster → VpcConfig.SecurityGroupIds`, then add **least-privilege egress** for the AWS APIs the node needs to reach. Idempotent: `InvalidPermission.Duplicate` = already exists, treat as success. 
+ +```bash +SG= +REGION= + +# Inbound self-ref (inter-node communication, EFA) +aws ec2 authorize-security-group-ingress --group-id $SG --region $REGION \ + --ip-permissions '[{"IpProtocol":"-1","UserIdGroupPairs":[{"GroupId":"'"$SG"'"}]}]' + +# Outbound self-ref (EFA RDMA) +aws ec2 authorize-security-group-egress --group-id $SG --region $REGION \ + --ip-permissions '[{"IpProtocol":"-1","UserIdGroupPairs":[{"GroupId":"'"$SG"'"}]}]' +``` + +**Egress for AWS APIs.** The node needs HTTPS (443) outbound to reach the AWS services HyperPod uses: S3 (lifecycle scripts), ECR (container images), SageMaker (HyperPod control plane), SSM / SSMMessages / EC2Messages (Session Manager), STS, and CloudWatch Logs. The narrowest practical rule is **TCP 443 to the VPC-endpoint prefix-lists** for those services (`com.amazonaws..` resolves to a `pl-XXXXXXXX` ID via `aws ec2 describe-prefix-lists`), referenced in `authorize-security-group-egress --ip-permissions` as `PrefixListIds`. See the AWS docs on [VPC endpoint prefix lists](https://docs.aws.amazon.com/vpc/latest/privatelink/vpce-gateway.html#vpc-endpoints-security) for the exact CLI shape. `aws ec2 describe-vpc-endpoints` lists which services the cluster VPC already has endpoints for. + +Self-ref opens all ports between instances in this SG (intended for intra-cluster EFA). For multi-SG clusters see [cluster-operations.md § 1](cluster-operations.md#1-efa-security-group-multi-sg-clusters). + +--- + +## B: Capacity & AZ + +**Signals:** `"We currently do not have sufficient capacity in the Availability Zone you requested"` (public doc); also seen: subnets not in the AZ where capacity is available. + +```bash +aws ec2 describe-instance-type-offerings \ + --location-type availability-zone \ + --filters "Name=instance-type,Values=" \ + --region \ + --query 'InstanceTypeOfferings[*].Location' --output table +``` + +Fix: add subnet in an AZ where the type is available, or use Flexible Training Plans / ODCR. 
Full strategy: [capacity-planning.md](capacity-planning.md). + +--- + +## C: Lifecycle Scripts + +**Signals:** cluster-creation event indicates lifecycle script execution error or timeout; creation fails during provisioning. + +```bash +CLUSTER_ID=$(aws sagemaker describe-cluster --cluster-name --region \ + --query 'ClusterArn' --output text | cut -d/ -f2) +LOG_GROUP="/aws/sagemaker/Clusters//${CLUSTER_ID}" + +aws logs describe-log-streams --log-group-name "$LOG_GROUP" --region \ + --query 'logStreams[?starts_with(logStreamName,`LifecycleConfig`)].logStreamName' --output table + +aws logs get-log-events --log-group-name "$LOG_GROUP" \ + --log-stream-name "LifecycleConfig//" \ + --region --query 'events[*].message' --output text +``` + +| Log error | Fix | +| ---------------------------------------- | ----------------------------------------------------------- | +| `Connect timeout on endpoint URL: s3://` | Add S3 Gateway VPC endpoint to subnet route table | +| `AccessDenied` on S3 | Add `s3:GetObject` + `s3:ListBucket` to execution role | +| Script never exits / timeout | Add `set -euo pipefail`; test locally; add network timeouts | +| `ASCII text, with CRLF line terminators` | `dos2unix script.sh` before uploading | +| `provisioning_parameters.json` mismatch | Instance group names must match between config and API call | + +Full S3 layout, node-type detection, and on-node debug: [lifecycle-scripts.md](lifecycle-scripts.md). + +--- + +## D: EKS Access / kubectl + +**Signals:** `"couldn't get current server API group list: the server has asked for the client to provide credentials"`, `kubectl get nodes` fails or returns nothing. 
+ +```bash +# Your identity +aws sts get-caller-identity + +# EKS cluster behind the HyperPod cluster +EKS_ARN=$(aws sagemaker describe-cluster --cluster-name --region \ + --query 'Orchestrator.Eks.ClusterArn' --output text) +EKS_NAME=$(echo $EKS_ARN | awk -F'/' '{print $NF}') + +# Existing access entries +aws eks list-access-entries --cluster-name $EKS_NAME --region + +# Auth mode +aws eks describe-cluster --name $EKS_NAME --region \ + --query 'cluster.accessConfig.authenticationMode' --output text +``` + +### Suggested command — grant yourself EKS access (run this yourself) + +**Preconditions:** `$MY_ARN` is the IAM **role ARN**, not the assumed-role session ARN. EKS auth mode is `API` or `API_AND_CONFIG_MAP`. + +**Command:** + +```bash +# Caution: under an assumed role this returns the session ARN +# (arn:aws:sts::ACCOUNT:assumed-role/ROLE/SESSION). In that case set MY_ARN +# to the underlying IAM role ARN (arn:aws:iam::ACCOUNT:role/ROLE) instead. +MY_ARN=$(aws sts get-caller-identity --query 'Arn' --output text) + +aws eks create-access-entry \ + --cluster-name $EKS_NAME --region --principal-arn $MY_ARN + +aws eks associate-access-policy \ + --cluster-name $EKS_NAME --region --principal-arn $MY_ARN \ + --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy \ + --access-scope '{"type": "cluster"}' + +aws eks update-kubeconfig --name $EKS_NAME --region +kubectl get nodes +``` + +**Blast radius:** `AmazonEKSClusterAdminPolicy` grants cluster-wide admin on the EKS cluster — use a narrower policy (`AmazonEKSEditPolicy` / `AmazonEKSViewPolicy` + namespace scope) for day-to-day operators. `update-kubeconfig` adds or updates the cluster entry in your kubeconfig and switches the current `kubectl` context to it. + +If the EKS cluster's auth mode is `CONFIG_MAP` only, access entries are not available. Switching auth mode is a cluster-level, administrator-level change — review the EKS access-entries documentation before proceeding and coordinate with anyone who depends on the existing `aws-auth` ConfigMap. + +--- + +## E: Cluster Provisioning + +**Signals:** Cluster `InService` but instances not visible, `kubectl get nodes` returns nothing, `list-cluster-nodes` shows fewer nodes than expected. 
+ +With **Continuous Provisioning**, the cluster goes `InService` before all instances are created. Instance creation is asynchronous; failures appear as events. + +```bash +aws sagemaker describe-cluster --cluster-name --region \ + --query '{Status:ClusterStatus,Groups:InstanceGroups[*].{Name:InstanceGroupName,Count:CurrentCount,Target:InstanceCount,Status:InstanceGroupStatus}}' \ + --output table + +aws sagemaker list-cluster-events --cluster-name --region \ + --query 'ClusterEventSummaries[*].{Time:EventTime,Type:EventType,Message:Message}' \ + --output table + +aws sagemaker list-cluster-nodes --cluster-name --region \ + --query 'ClusterNodeSummaries[*].{ID:InstanceId,Group:InstanceGroupName,Status:InstanceStatus.Status}' \ + --output table +``` + +| Observation | Cause | Action | +| --------------------------------------------------------- | ----------------------------------- | ------------------------------------- | +| `CurrentCount < InstanceCount`, events show provisioning | Continuous provisioning in progress | Wait; monitor events | +| Events: `"Insufficient capacity"` | No capacity in AZ | See **[B](#b-capacity--az)** | +| Events: lifecycle script failure | Script error | See **[C](#c-lifecycle-scripts)** | +| Events: `"EFA health checks"` | SG misconfiguration | See **[A](#a-efa-health-checks)** | +| Nodes in `list-cluster-nodes` but not `kubectl get nodes` | EKS registration issue | Check lifecycle logs, kubelet via SSM | + +See [cluster-operations.md § 5](cluster-operations.md#5-continuous-provisioning-eks-only). + +--- + +## F: SSM Connectivity + +**Signals:** `"Target is not connected"`, SSM session fails. + +> **For interactive shell or repeated SSM access, use the [`hyperpod-ssm`](../../hyperpod-ssm/SKILL.md) skill** — it wraps the cluster-ID derivation, target-format construction, and session start shown below. The block here is for one-off connectivity diagnosis; `hyperpod-ssm` is the right tool for actually working on nodes. 
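For a one-off check, the cluster-ID derivation and target construction mentioned above can be sketched as two small helpers. This is a minimal sketch, not the `hyperpod-ssm` implementation: the function names are hypothetical, and the target layout `sagemaker-cluster:<cluster-id>_<instance-group-name>-<instance-id>` is assumed from the HyperPod SSM documentation.

```bash
# Sketch only: helper names are hypothetical, not part of the hyperpod-ssm skill.

# Extract the cluster ID (last path segment) from a HyperPod cluster ARN,
# e.g. arn:aws:sagemaker:us-west-2:111122223333:cluster/abc123 -> abc123
cluster_id_from_arn() {
  printf '%s\n' "${1##*/}"
}

# Build the SSM target for one node. Assumed format:
#   sagemaker-cluster:<cluster-id>_<instance-group-name>-<instance-id>
# Arguments: $1 = cluster ID, $2 = instance group name, $3 = instance ID
ssm_target() {
  printf 'sagemaker-cluster:%s_%s-%s\n' "$1" "$2" "$3"
}

# Example use (customer-run; supply your own ARN, group, instance ID, region):
#   TARGET=$(ssm_target "$(cluster_id_from_arn "$CLUSTER_ARN")" "$GROUP" "$INSTANCE_ID")
#   aws ssm start-session --target "$TARGET" --region "$REGION"
```

If `start-session` still reports `Target is not connected` with a correctly formed target, the problem is more likely on the node side (SSM agent or network egress) than in the target string.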
+ +--- + +## G: Node Replacement + +### G.1: Auto-replacement not triggering + +Diagnose (read-only): + +```bash +# Is NodeRecovery enabled? +aws sagemaker describe-cluster --cluster-name --region \ + --query 'InstanceGroups[*].{Group:InstanceGroupName,Recovery:NodeRecovery}' --output table + +# Replacement activity +aws sagemaker list-cluster-events --cluster-name --region \ + --query 'ClusterEventSummaries[?contains(Message,`replace`) || contains(Message,`reboot`) || contains(Message,`hardware`) || contains(Message,`recovery`)]' \ + --output table + +# Health-monitoring-agent logs (pattern: SagemakerHealthMonitoringAgent//) +CLUSTER_ID=$(aws sagemaker describe-cluster --cluster-name --region \ + --query 'ClusterArn' --output text | cut -d/ -f2) +aws logs describe-log-streams \ + --log-group-name "/aws/sagemaker/Clusters//${CLUSTER_ID}" \ + --region \ + --query 'logStreams[?starts_with(logStreamName,`SagemakerHealthMonitoringAgent`)].logStreamName' \ + --output table + +# EKS node health labels — the sagemaker.amazonaws.com/node-health-status +# label on each node indicates the action HyperPod has decided on. +kubectl get nodes --show-labels +kubectl describe node + +# Slurm clusters: node name, state, and the reason recorded for that state +sinfo -o "%N %T %30E" +``` + +**Common blockers:** `NodeRecovery=None`, health agent hasn't detected the fault yet (wait for the next check cycle), lifecycle script failing on new instance (same log group, `LifecycleConfig/...` stream), no capacity (see [B](#b-capacity--az)), cluster not `InService`. + +### Suggested command — enable NodeRecovery (run this yourself) + +> **Destructive — replaces the whole `InstanceGroups` list.** Any group omitted from the payload is deleted; any field drift (instance type, count, lifecycle config) is applied as-is. Re-run `describe-cluster` first and copy every existing field into the payload below before adding `NodeRecovery=Automatic`. If unsure, use the SageMaker console — it preserves existing fields by default. Agent: never run this command yourself; present it to the customer to run. 
+ +**Preconditions:** `NodeRecovery=None` confirmed above. **Derive every field for every instance group from the current `describe-cluster` output** — `update-cluster` replaces the whole `InstanceGroups` list; any field drift is applied as-is. + +**Command:** + +```bash +aws sagemaker update-cluster --cluster-name --region \ + --instance-groups '[{"InstanceGroupName":"","InstanceType":"ml.p5.48xlarge", + "InstanceCount":, + "LifeCycleConfig":{"SourceS3Uri":"","OnCreate":"