feat(sagemaker-ai): add HyperPod debugging skills #169

Merged
ssuday merged 2 commits into awslabs:main from badmin-aws:feat/hyperpod-debugging-skills
May 16, 2026

Conversation

Contributor

badmin-aws commented May 16, 2026

Summary

Add 5 new HyperPod debugging skills to the sagemaker-ai plugin:

  • hyperpod-cluster-debugger — Cluster-wide diagnostics (creation failures, EFA health, lifecycle scripts, capacity)
  • hyperpod-nccl — NCCL failure diagnosis (training hangs, AllReduce timeouts, EFA errors, rendezvous failures)
  • hyperpod-node-debugger — Per-node issue triage (GPU hardware, EFA, disk/memory pressure, container runtime)
  • hyperpod-performance-debugger — Performance bottleneck analysis (uneven NCCL bandwidth, filesystem throughput, straggler nodes)
  • hyperpod-slurm-debugger — Slurm scheduler issues (nodes stuck down/drain, jobs pending, GRES miscounts, auto-resume)

Also includes:

  • Updates to hyperpod-ssm, hyperpod-version-checker, and hyperpod-issue-report
  • Updated README with new skill documentation

Testing

  • mise run build passes (lint, format, manifests, cross-refs, security scans)
  • Reference integrity validated (no broken links, orphan fixed)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.

Add new skills for diagnosing and troubleshooting HyperPod clusters:
- hyperpod-cluster-debugger: cluster-wide diagnostics
- hyperpod-nccl: NCCL failure diagnosis
- hyperpod-node-debugger: per-node issue triage
- hyperpod-performance-debugger: performance bottleneck analysis
- hyperpod-slurm-debugger: Slurm scheduler issues

Also updates hyperpod-ssm, hyperpod-version-checker, and
hyperpod-issue-report with related improvements.

Updates README with new skill documentation.
badmin-aws requested review from a team as code owners May 16, 2026 06:04
Comment thread plugins/sagemaker-ai/skills/hyperpod-slurm-debugger/scripts/slurm-diagnose.sh Dismissed
Comment thread plugins/sagemaker-ai/skills/hyperpod-slurm-debugger/scripts/slurm-diagnose.sh Dismissed
badmin-aws marked this pull request as draft May 16, 2026 06:15
ssuday self-requested a review May 16, 2026 06:17
badmin-aws marked this pull request as ready for review May 16, 2026 06:27
Contributor

theagenticguy left a comment

lgtm

ssuday added this pull request to the merge queue May 16, 2026
Merged via the queue into awslabs:main with commit 95381e8 May 16, 2026
24 checks passed
4 participants