All notable changes to this project will be documented in this file.
- fix: add SSH keepalive and handshake timeout (#772) — SSH connections now send keepalive probes every 30 seconds to prevent session drops during long operations (e.g., `kubeadm init`). A 15-second handshake timeout prevents `connectOrDie` from blocking indefinitely against hosts that accept TCP but never complete the SSH handshake.
- fix: HA NLB hairpin routing (#746, #762) — Control-plane nodes now use `localhost:6443` for kubectl instead of the NLB endpoint, avoiding AWS NLB hairpin/loopback timeouts.
- fix: switch HA NLB to internal scheme (#760) — NLB uses internal scheme to keep traffic within the VPC.
- fix: handle InvalidInternetGatewayID.NotFound in IGW detach (#772) — The detach step now recognizes `InvalidInternetGatewayID.NotFound` alongside `Gateway.NotAttached` and skips retries.
- fix: handle NotFound errors in NLB/listener/target-group deletion (#772) — All NLB cleanup paths now check for `LoadBalancerNotFound`, `ListenerNotFound`, and `TargetGroupNotFound`, treating already-deleted resources as success.
- fix: NLB cleanup in periodic VPC cleaner (#762) — `DeleteVPCResources` now deletes NLB listeners, target groups, and load balancers before attempting subnet/IGW/VPC deletion, preventing `DependencyViolation` errors from NLB-owned ENIs.
- fix: revoke cross-referencing SG rules before deletion (#766) — Security groups that reference each other are now cleaned up by revoking all ingress/egress rules before attempting deletion.
- fix: treat InvalidVpcID.NotFound as success in VPC cleanup (#769) — VPCs that no longer exist are treated as successfully cleaned up.
- fix: suppress NotFound warnings in all cleanup delete functions (#772) — The periodic cleanup job no longer logs misleading warnings when IGWs, security groups, subnets, or route tables are already gone.
- ci: update periodic cleanup and add manual trigger (#758, #765) — Periodic cleanup workflow uses the latest holodeck binary and supports manual dispatch.
A complete overhaul of multi-node cluster networking for production workloads:
- Add cluster networking cache constants and struct fields (#720, @ArangoGutierrez)
- Add public subnet for cluster mode (#723, @ArangoGutierrez)
- Add Transport abstraction for SSH connections (#724, @ArangoGutierrez)
- Separate control-plane and worker security groups (#725, @ArangoGutierrez)
- Cleanup dual security groups on cluster delete (#726, @ArangoGutierrez)
- Wire SSM transport for private-subnet cluster nodes (#727, @ArangoGutierrez)
- Wire cluster networking — private subnets, NAT, hostForNode, real tests (#728, @ArangoGutierrez)
- Add ELBv2/NLB support for HA clusters (#614, @ArangoGutierrez)
User-defined provisioning templates with full lifecycle phase support:
- Add `CustomTemplate` type and `TemplatePhase` enum to API (#701, @ArangoGutierrez)
- Implement custom template loader and executor (#702, @ArangoGutierrez)
- Add custom template input validation (#703, @ArangoGutierrez)
- Integrate custom templates into dependency resolver (#706, @ArangoGutierrez)
First-class support for RPM-based Linux distributions including Rocky Linux 9, Amazon Linux 2023, and Fedora 42 across all runtime stacks (Docker, containerd, CRI-O):
- Add RPM support to Docker template (#677, @ArangoGutierrez)
- Add RPM support to NVIDIA driver template (#676, @ArangoGutierrez)
- Add RPM support to Kubernetes templates (#678, @ArangoGutierrez)
- Add RPM support to kernel template (#681, @ArangoGutierrez)
- Add RPM support to CRI-O template (#680, @ArangoGutierrez)
- Add RPM support to container-toolkit package template (#679, @ArangoGutierrez)
- Add RPM docs, e2e validation, and Fedora 42 support (#693, @ArangoGutierrez)
- Add DNF/YUM package manager support to provisioner
- Add AMI architecture detection and cross-validation (#664, @ArangoGutierrez)
- Infer AMI architecture from instance type for ARM64 support (#669, @ArangoGutierrez)
- Propagate image architecture in cluster mode (#661, @ArangoGutierrez)
- Add ARM64 GPU end-to-end test on merge to main (#670, @ArangoGutierrez)
- Detect Kubernetes arch at runtime instead of defaulting to amd64 (#663, @ArangoGutierrez)
- Use runtime arch for NVIDIA CUDA repository URL (#662, @ArangoGutierrez)
- Make architecture field case-insensitive for backward compatibility
- Add full multinode Kubernetes cluster support (#562, @ArangoGutierrez)
- Parallelize node provisioning, join info, and source/dest check (#660, @ArangoGutierrez)
- Fix cluster-mode OS resolution to use per-node specs
- Add component provenance tracking to environment status (#635, @ArangoGutierrez)
- Support multiple installation sources for container runtimes (package, runfile, git) (#637, @ArangoGutierrez)
- Support multiple installation sources for NVIDIA drivers (package, runfile, git) (#636, @ArangoGutierrez)
- Add git ref resolution for CTK installation from GitHub sources
- Add Kubernetes installation from custom sources
- Complete CLI with full CRUD operations (create, delete, list, status, dryrun) (#563, #621, @ArangoGutierrez)
- Add retry logic with exponential backoff (#616, @ArangoGutierrez)
- Add cleanup mode for standalone VPC cleanup to GitHub Action (#4938a2ee, @ArangoGutierrez)
- Replace bash cleanup script with native Go implementation (@ArangoGutierrez)
- Idempotent provisioning templates with enhanced error handling (#570, @ArangoGutierrez)
- Add Ubuntu 20.04 to OS AMI registry
- Add GitHub workflow to mark issues as stale after 90 days of inactivity (#695, @ArangoGutierrez)
- Wait for NAT Gateway available state before creating routes, fixing race condition (#735, @ArangoGutierrez)
- Verify API server against local IP before switching to NLB (#721, @ArangoGutierrez)
- Remove local kubeadm config cleanup that races in cluster mode (#718, @ArangoGutierrez)
- Make kubeadm config local path unique per environment (#717, @ArangoGutierrez)
- Guard substring slice and redact join credentials in logs (#654, @ArangoGutierrez)
- Surface NLB errors instead of swallowing them (#645, @ArangoGutierrez)
- Disable Source/Dest Check on ENI for single-node deployments
- Restrict security group CIDR to detected public IP (#615, @ArangoGutierrez)
- Fail instead of silently falling back to `0.0.0.0/0` for security group (#650, @ArangoGutierrez)
- Implement cleanup on partial creation failures (#612, @ArangoGutierrez)
- Propagate image architecture in cluster mode (#661)
- Close SSH client before reassign, close pipe reader (#659, @ArangoGutierrez)
- Split SSH sessions in `createKindConfig` (#657, @ArangoGutierrez)
- Remove unreachable code in `connectOrDie`, wrap last error (#655, @ArangoGutierrez)
- Correct retry count in `connectOrDie` error message (#628, @ArangoGutierrez)
- Guard nil provider for SSH provider mode (#643, @ArangoGutierrez)
- Close SSH client in `GetKubeConfig` (#640, @ArangoGutierrez)
- Use TOFU known_hosts for interactive SSH sessions (#653, @ArangoGutierrez)
- Add mutex to TOFU known_hosts to prevent race condition (#644, @ArangoGutierrez)
- Use TOFU host key verification in all SSH connections (#625, #630, @ArangoGutierrez)
- Log SFTP `MkdirAll` error instead of discarding (#642, @ArangoGutierrez)
- Validate node labels and IPs before shell interpolation (#656, @ArangoGutierrez)
- Security and concurrency: template input validation, error wrapping (#623, @ArangoGutierrez)
- Critical bugs: `storageSizeGB` race, `log.Fatalf` in goroutine, file permissions (#622, @ArangoGutierrez)
- Create kubeconfig with 0600 permissions (#629, @ArangoGutierrez)
- Handle `crypto/rand` failure in retry jitter (#641, @ArangoGutierrez)
- Validate instance type before creating resources (#672, @ArangoGutierrez)
- Preserve error chain with `errors.Join`, copy tags for goroutines (#651, @ArangoGutierrez)
- Return errors from `GenerateInstanceID`, validate ID format (#648, @ArangoGutierrez)
- Replace `context.TODO()` with proper timeouts in `create.go` (#611, @ArangoGutierrez)
- Add context propagation and timeout support to cleanup (#608, @ArangoGutierrez)
- Address ignored errors with logging and documentation (#610, @ArangoGutierrez)
- Use `%w` for proper error wrapping (#609, @ArangoGutierrez)
- Validate instance type before creating resources (#672)
- Pre-release validation fixes for kubeadm, networking, and CTK (#716, @ArangoGutierrez)
- E2E fixes for RPM-based distros (Rocky Linux 9)
- Validate feature gates, track branches, and endpoint host (#647, @ArangoGutierrez)
- Address audit fixes for heterogeneous cluster support
- Restore CTK/K8s validation and fix test schema
- Replace shared channels with per-invocation context cancellation (#631, @ArangoGutierrez)
- Nil out completed cancel entries in `activeCancels` (#646, @ArangoGutierrez)
- Replace signal goroutine with `signal.NotifyContext` (#649, @ArangoGutierrez)
- Use valid hex instance IDs in delete and status tests (#713, @ArangoGutierrez)
- Inject sleep function to speed up AWS provider tests (~200x faster) (#739, @ArangoGutierrez)
- Parallelize node provisioning, join info, and source/dest check in cluster mode (#660, @ArangoGutierrez)
- Pre-compile templates, hoist regex, filter `DescribeLoadBalancers` (#652, @ArangoGutierrez)
- Split E2E tests into smoke (pre-merge) and full (post-merge) tiers (#740, @ArangoGutierrez)
- Add ARM64 GPU end-to-end test on merge to main (#670, @ArangoGutierrez)
- Add workflow to mark issues as stale with no activity (#695, @ArangoGutierrez)
- Pin holodeck action to v0.2.18 in periodic CI workflow
- Update multinode cluster guide with production-grade networking (#734, @ArangoGutierrez)
- Comprehensive documentation refresh (#707, @ArangoGutierrez)
- Add multi-source installation guides and examples (#634, @ArangoGutierrez)
- Add OS selection guide
- Add CTK multi-source installation guide
- Add GitHub project management templates and AI instructions
- Updated AWS SDK Go v2 to latest (multiple bumps: `aws-sdk-go-v2`, `service/ec2`, `service/ssm`, `service/elasticloadbalancingv2`, `config`)
- Updated Kubernetes libraries: `k8s.io/apimachinery`, `sigs.k8s.io/controller-runtime` to v0.23.3
- Updated `golang.org/x/crypto` to v0.49.0 and `golang.org/x/sync` to v0.20.0
- Updated Go base image from 1.25.5 to 1.26.1 (bookworm)
- Updated GitHub Actions: `docker/build-push-action` v7, `docker/setup-buildx-action` v4, `docker/metadata-action` v6, `docker/login-action` v4, `docker/setup-qemu-action` v4, `aws-actions/configure-aws-credentials` v6, `actions/upload-artifact` v7
- Updated `github.com/onsi/ginkgo/v2` and `github.com/onsi/gomega` to latest
- Updated `github.com/pkg/sftp` to v1.13.10
See GitHub Release