
feat(longhorn): generalize install for non-AWS providers #270

Open

aktech wants to merge 4 commits into main from feat/longhorn-shared-package

Conversation

@aktech (Member) commented Apr 28, 2026

Closes #269.

[screenshot: longhorn]

Moves the Longhorn install out of the AWS provider into pkg/storage/longhorn/. The existing and hetzner providers now accept a longhorn: block with the same shape AWS already exposes.

cluster:
  existing:        # also: hetzner
    longhorn: {}   # defaults: enabled, 2 replicas

Enabled by default on AWS and Hetzner (both provision the cluster, and their RWO-only EBS / hcloud-volumes classes need an RWX class for shared workloads). Opt-in on existing (an omitted block means no install), since the user's pre-existing cluster may already have its own StorageClass. enabled: false opts out anywhere.

Validated end-to-end on a Hetzner k3s cluster: nic deploy provisioned the iSCSI DaemonSet and the Longhorn release, downstream RWX PVCs bound to a Longhorn share-manager, and JupyterHub user pods mounted shared-storage cross-node — replacing the in-cluster NFS-on-RWO workaround that fails on overlayfs.

Lift the Longhorn install logic out of the AWS provider into a shared
package so the existing-cluster and hetzner providers can opt in. On
clusters without managed RWX (Hetzner hcloud-volumes is RWO-only,
on-prem k3s, kind/k3d), charts that need RWX previously had to fall
back to the in-cluster NFS-on-RWO workaround, which is fragile on
overlayfs hosts. Closes #269.

Shared package pkg/storage/longhorn:
- Install(ctx, kubeconfigBytes, *Config) — idempotent helm install or
  upgrade. iSCSI prerequisite DaemonSet runs on first install only;
  upgrade path skips it (saves the 3-minute readiness wait).
- Config + IsEnabled / Replicas helpers. nil *Config = do not install.
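
A minimal sketch of the nil-safe config/helper shape described above, assuming a *bool for enabled and a ReplicaCount field name (the real struct in pkg/storage/longhorn exports four fields, so this is illustrative only):

package longhorn

// Config mirrors the `longhorn:` yaml block. A nil *Config means
// "do not install"; provider-specific enable-by-default policies live in
// the providers, not here.
type Config struct {
	Enabled        *bool `yaml:"enabled"`         // assumption: pointer so an omitted key differs from `false`
	ReplicaCount   int   `yaml:"replicas"`
	DedicatedNodes bool  `yaml:"dedicated_nodes"` // assumption: yaml key name
}

const defaultReplicaCount = 2

// IsEnabled is nil-receiver-safe: a nil block means "not requested".
func (c *Config) IsEnabled() bool {
	if c == nil {
		return false
	}
	return c.Enabled == nil || *c.Enabled
}

// Replicas falls back to the package default when unset or when the whole
// block is nil.
func (c *Config) Replicas() int {
	if c == nil || c.ReplicaCount == 0 {
		return defaultReplicaCount
	}
	return c.ReplicaCount
}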

aws provider:
- Longhorn field type is now *longhorn.Config; aws.LonghornConfig is a
  type alias so existing yaml under cluster.aws.longhorn keeps working.
- LonghornEnabled() retains AWS's nil-block-as-enabled default.
- Provider.Deploy calls longhorn.Install directly.

existing provider:
- New opt-in longhorn block (nil = no install).
- Deploy installs Longhorn when enabled.
- GetStorageClass / InfraSettings.StorageClass returns 'longhorn' when
  Longhorn is enabled and storage_class is unset; explicit
  storage_class always wins.

hetzner provider:
- Same opt-in. Deploy reads the kubeconfig hetzner-k3s wrote and
  installs Longhorn after cluster create.
- InfraSettings.StorageClass flips from hcloud-volumes to longhorn
  when enabled.
@aktech force-pushed the feat/longhorn-shared-package branch from 82ca29a to b798497 on April 28, 2026 at 13:59
@aktech marked this pull request as ready for review on April 30, 2026 at 11:06
aktech added 3 commits April 30, 2026 12:38:

- Inline the two construction sites — the helpers hid no unexported fields (longhorn.Config exports all four) and the inline form is shorter.
- Drop the duplicate storageClassLonghorn const now that the shared package exports the same value.
- hcloud-volumes is RWO-only, so charts that need RWX (jupyterhub shared-storage, etc.) require Longhorn. Match the AWS pattern: install unless 'longhorn.enabled: false' is set explicitly.
@viniciusdc added the "needs: review 👀 This PR is complete and ready for reviewing" label on Apr 30, 2026
WithAction("installed").
WithMetadata("chart_version", ChartVersion))

return nil
@tylerpotts (Contributor) commented May 4, 2026

Replacing an earlier comment of mine on this same hunk that suggested flipping defaultClass to false — that was the wrong direction; I deleted it and am posting this corrected version.

Hardcoding persistence.defaultClass: true is correct: when a user opts into Longhorn, Longhorn should be the cluster's default StorageClass. The actual gap is that this install path doesn't displace the previous default. On any cluster that already ships a default SC (Hetzner's hcloud-volumes, vSphere's thin, etc.), Longhorn's chart annotates longhorn as default but leaves the prior default's annotation in place, so two SCs end up flagged as default at once — which Kubernetes treats as undefined behavior (API server picks one nondeterministically and logs a warning).

Observed on a fresh Hetzner k3s cluster provisioned through this code path:

NAME                       PROVISIONER          AGE
hcloud-volumes (default)   csi.hetzner.cloud    12m
longhorn (default)         driver.longhorn.io   11m
longhorn-static            driver.longhorn.io   11m

Downstream PVCs without an explicit storageClassName then get sprayed across both classes by API-server roulette. AWS sidesteps this because gp2 isn't pre-annotated as default, but Hetzner (and most existing on-prem clusters with a pre-installed CSI) hit it.

Suggested fix

After the Helm install succeeds and the longhorn SC is created with the default annotation, list all StorageClasses and strip the default-class annotation from any other SC that still carries it. Existing PVCs are unaffected (they're already bound); only new PVCs without an explicit class are redirected to Longhorn — which is the intended behavior when Longhorn is enabled.

A small helper following the package's existing style (mirroring ensureISCSI / ensureISCSIWithClient):

// stripPreviousDefaultStorageClasses removes the
// `storageclass.kubernetes.io/is-default-class` annotation from every
// StorageClass other than keepName. Kubernetes treats multiple default SCs
// as undefined behavior, so after Longhorn registers itself as default we
// need to displace whatever class previously held that role (e.g.
// hcloud-volumes on Hetzner).
func stripPreviousDefaultStorageClasses(ctx context.Context, kubeconfigBytes []byte, keepName string) error {
	client, err := newK8sClient(kubeconfigBytes)
	if err != nil {
		return fmt.Errorf("failed to create Kubernetes client: %w", err)
	}
	return stripPreviousDefaultStorageClassesWithClient(ctx, client, keepName)
}

func stripPreviousDefaultStorageClassesWithClient(ctx context.Context, client kubernetes.Interface, keepName string) error {
	const defaultAnnotation = "storageclass.kubernetes.io/is-default-class"

	classes, err := client.StorageV1().StorageClasses().List(ctx, metav1.ListOptions{})
	if err != nil {
		return fmt.Errorf("failed to list StorageClasses: %w", err)
	}

	for i := range classes.Items {
		sc := &classes.Items[i]
		if sc.Name == keepName {
			continue
		}
		if sc.Annotations[defaultAnnotation] != "true" {
			continue
		}

		patch := []byte(`{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}`)
		if _, err := client.StorageV1().StorageClasses().Patch(ctx, sc.Name, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
			return fmt.Errorf("failed to clear default-class annotation on StorageClass %q: %w", sc.Name, err)
		}

		status.Send(ctx, status.NewUpdate(status.LevelInfo,
			fmt.Sprintf("Cleared default-class annotation from previous default StorageClass %q", sc.Name)).
			WithResource("storageclass").
			WithAction("updating"))
	}

	return nil
}

Then call it in Install right before the success status update — e.g. between the span.SetAttributes block and the status.Send(...LevelSuccess...) call near this line:

if err := stripPreviousDefaultStorageClasses(ctx, kubeconfigBytes, StorageClassName); err != nil {
	span.RecordError(err)
	return fmt.Errorf("failed to clear previous default StorageClass: %w", err)
}

(Worth doing the same in upgrade for clusters that gained Longhorn after a previous default was already in place.)

New imports needed: k8s.io/apimachinery/pkg/types (for MergePatchType); metav1 and kubernetes.Interface are already in the package via iscsi.go, and newK8sClient is reused as-is.

@dcmcand (Contributor) left a comment

Solid extraction, ~400 lines of AWS-specific Longhorn install logic moved into pkg/storage/longhorn with the abstraction boundary respected (no provider imports leaking into storage). Tests are thorough: table-driven, fake kubernetes.Interface injection, iSCSI wait/timeout paths covered. The nil-receiver-safe IsEnabled() / Replicas() idiom is nice, and keeping the divergent enabled-by-default policies at the provider layer (rather than baking them into the shared package) is the right call.

Two blockers worth resolving before merge:

  1. longhorn.Install accepts *Config with no documented nil-precondition; span attrs are inconsistent on whether nil is possible.
  2. Hetzner reads the kubeconfig directly from disk while AWS and existing both go through p.GetKubeconfig - inconsistent and fragile.

Plus a few questions about destroy-path symmetry (ADR-0002 requires Longhorn uninstall before cluster teardown, and neither hetzner nor existing wire that up), the iSCSI-on-upgrade skip, and the silent DryRun return on existing.

One doc follow-up: ADR-0002 still says "Longhorn is installed via Helm inside the AWS provider's Deploy()" - worth adding a 2026-04 update or superseding ADR.

span.SetAttributes(
attribute.String("chart_version", ChartVersion),
attribute.Int("replica_count", cfg.Replicas()),
attribute.Bool("dedicated_nodes", cfg != nil && cfg.DedicatedNodes),

Blocker. The asymmetry here reveals that cfg could be nil:

attribute.Int("replica_count", cfg.Replicas()),           // nil-safe via receiver
attribute.Bool("dedicated_nodes", cfg != nil && cfg.DedicatedNodes),

In practice all current callers gate on IsEnabled() first, but Install is exported with no documented precondition. A nil cfg would silently install with defaults rather than erroring or no-oping. Either add an explicit early return at the top of Install, or document the precondition with a one-line comment on the function.
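
A hedged sketch of the early-return option (the real Install body with the helm install/upgrade, iSCSI prerequisite, and status updates is elided here):

// Install installs or upgrades the Longhorn release. A nil cfg is treated
// as "do not install" rather than silently installing defaults, so the
// exported function stays safe even for callers that skip IsEnabled().
func Install(ctx context.Context, kubeconfigBytes []byte, cfg *Config) error {
	if cfg == nil {
		return nil
	}

	// ... existing install/upgrade logic ...
	return nil
}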

// CSI is RWO-only; Longhorn provides the RWX StorageClass that downstream
// charts (e.g. jupyterhub shared-storage for group dirs) need.
if hCfg.LonghornEnabled() {
kubeconfigBytes, err := os.ReadFile(kubeconfigPath) //nolint:gosec // Path constructed from known cache dir + project name

Blocker. AWS and existing both call p.GetKubeconfig(ctx, projectName, clusterConfig) here; only Hetzner reads kubeconfigPath directly via os.ReadFile. The path-variable approach is fragile to refactors and bypasses the abstraction. Provider.GetKubeconfig is already defined a few lines down (line 337) and does the same read with proper tracing. Use it for consistency.
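
A hedged sketch of the shape being asked for, reusing the accessor and error-handling pattern already quoted in this thread (the exact hetzner variable names are assumptions):

kubeconfigBytes, err := p.GetKubeconfig(ctx, projectName, clusterConfig)
if err != nil {
	span.RecordError(err)
	return fmt.Errorf("failed to get kubeconfig for Longhorn install: %w", err)
}
if err := longhorn.Install(ctx, kubeconfigBytes, hCfg.Longhorn); err != nil {
	span.RecordError(err)
	return fmt.Errorf("failed to install Longhorn: %w", err)
}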

span.RecordError(err)
return fmt.Errorf("failed to read kubeconfig for Longhorn install: %w", err)
}
if err := longhorn.Install(ctx, kubeconfigBytes, hCfg.Longhorn); err != nil {

Question. ADR-0002 documents that "Longhorn must be uninstalled before the cluster is torn down. PVs backed by Longhorn can block node group deletion." Now that Hetzner installs Longhorn here, the Destroy path (line 245+) needs a symmetric uninstall before tofu destroy to avoid orphaned hcloud-volumes. The shared pkg/storage/longhorn doesn't currently export an Uninstall. Worth adding it and calling it from Destroy (and in existing's Destroy if Longhorn was installed). If asymmetry is intentional, please drop a comment.
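
If the symmetry is added, a minimal sketch of what an exported Uninstall could look like, using the Helm v3 SDK directly; the release name "longhorn", the "longhorn-system" namespace, and the temp-kubeconfig plumbing are assumptions, and the package may already have an internal helm helper worth reusing instead:

// Uninstall removes the Longhorn Helm release so Longhorn-backed PVs are
// released before cluster teardown (per ADR-0002).
func Uninstall(ctx context.Context, kubeconfigBytes []byte) error {
	// Helm's action.Configuration needs a RESTClientGetter, which is easiest
	// to build from a kubeconfig file on disk.
	tmp, err := os.CreateTemp("", "nic-kubeconfig-*")
	if err != nil {
		return fmt.Errorf("failed to create temp kubeconfig: %w", err)
	}
	defer os.Remove(tmp.Name())
	if _, err := tmp.Write(kubeconfigBytes); err != nil {
		return fmt.Errorf("failed to write temp kubeconfig: %w", err)
	}
	_ = tmp.Close()

	path := tmp.Name()
	namespace := "longhorn-system" // assumption: matches the install path
	flags := genericclioptions.NewConfigFlags(false)
	flags.KubeConfig = &path
	flags.Namespace = &namespace

	actionCfg := new(action.Configuration)
	if err := actionCfg.Init(flags, namespace, os.Getenv("HELM_DRIVER"), func(string, ...interface{}) {}); err != nil {
		return fmt.Errorf("failed to init helm configuration: %w", err)
	}

	if _, err := action.NewUninstall(actionCfg).Run("longhorn"); err != nil { // assumption: release name
		return fmt.Errorf("failed to uninstall Longhorn release: %w", err)
	}
	return nil
}

New imports for the sketch: helm.sh/helm/v3/pkg/action and k8s.io/cli-runtime/pkg/genericclioptions; context, fmt, and os are presumably already imported in the package.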

// LonghornConfig is kept as a package-local alias so existing yaml under
// `longhorn:` still unmarshals into the same shape; the underlying type now
// lives in pkg/storage/longhorn so non-AWS providers can share install logic.
type LonghornConfig = longhorn.Config

Question. The comment says this alias exists "so existing yaml under longhorn: still unmarshals into the same shape", but YAML shape comes from the struct tags on longhorn.Config directly - the alias isn't required for unmarshaling and isn't referenced outside tests. Looks like dead surface area; consider removing.

// open-iscsi/iscsi-initiator-utils (Amazon Linux 2023, k3s minimal images).
func ensureISCSI(ctx context.Context, kubeconfigBytes []byte) error {
tracer := otel.Tracer("nebari-infrastructure-core")
ctx, span := tracer.Start(ctx, "longhorn.ensureISCSI")

Question. This skips iSCSI on Helm upgrade because "it was already provisioned on first install". That assumption breaks if someone runs helm uninstall longhorn while leaving the cluster intact - next deploy goes straight to Helm install with no iSCSI. The DaemonSet is idempotent and the wait is only ~3 minutes, so running ensureISCSI on the upgrade path too would be cheap and remove the implicit assumption. Intentional?
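
If the skip is not intentional, the change is small; a hedged sketch of running the prerequisite on the upgrade path too:

// The DaemonSet apply inside ensureISCSI is idempotent, so on a healthy
// cluster this amounts to a readiness check rather than a fresh rollout.
if err := ensureISCSI(ctx, kubeconfigBytes); err != nil {
	return fmt.Errorf("failed to ensure iSCSI prerequisite: %w", err)
}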

attribute.Bool("dry_run", opts.DryRun),
)

if opts.DryRun {

Question. Other providers' DryRun paths emit a status update (e.g. hetzner Destroy sends "Dry run: would destroy ..."). Should existing's Deploy DryRun similarly emit something visible like "Dry run: would install Longhorn on existing cluster" rather than returning silently?
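
A hedged sketch of the suggested dry-run message, following the status.Send pattern quoted earlier in this review (the exact resource and action strings are assumptions):

if opts.DryRun {
	status.Send(ctx, status.NewUpdate(status.LevelInfo,
		"Dry run: would install Longhorn on existing cluster").
		WithResource("longhorn").
		WithAction("dry-run"))
	return nil
}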

@@ -0,0 +1,390 @@
package longhorn

Stylistic. TestBuildHelmValuesNonDedicatedOmitsNodeSelector could be a row in the main TestBuildHelmValues table with a wantAbsent []string field, putting all the cases in one place. TestBuildHelmValuesDedicatedNodesStructure is fine standalone since the type assertions don't fit the table cleanly.
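
A hedged sketch of the wantAbsent shape, assuming buildHelmValues returns a map keyed by dotted value paths (the real signature and existing table fields may differ):

func TestBuildHelmValues(t *testing.T) {
	tests := []struct {
		name       string
		cfg        *Config
		wantAbsent []string // dotted value paths that must not be set
	}{
		{
			name:       "non-dedicated omits node selector",
			cfg:        &Config{},
			wantAbsent: []string{"longhornManager.nodeSelector"}, // assumption: key name
		},
		// ... existing rows ...
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			values := buildHelmValues(tt.cfg)
			for _, key := range tt.wantAbsent {
				if _, ok := values[key]; ok {
					t.Errorf("expected %q to be absent from helm values", key)
				}
			}
		})
	}
}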

// kubernetes.Interface so unit tests can inject a fake client.
func ensureISCSIWithClient(ctx context.Context, client kubernetes.Interface) error {
tracer := otel.Tracer("nebari-infrastructure-core")
ctx, span := tracer.Start(ctx, "longhorn.ensureISCSIWithClient")

Stylistic. Both ensureISCSI (line 89) and ensureISCSIWithClient start spans for the same logical operation. The nested span will appear in traces as a child, which inflates timelines slightly. Either drop the inner span or make it clear the outer one is just a thin wrapper for the kubeconfig to client conversion.

// LonghornConfig is kept as a package-local alias so existing yaml under
// `longhorn:` still unmarshals into the same shape; the underlying type now
// lives in pkg/storage/longhorn so non-AWS providers can share install logic.
type LonghornConfig = longhorn.Config

Stylistic (re: LonghornReplicaCount below, line 63 - outside diff hunk). That method is called when LonghornEnabled() is true, which returns true even when c.Longhorn == nil. Replicas() is nil-safe so this works correctly, but it's non-obvious. A one-line comment on LonghornReplicaCount noting that Replicas() handles the nil case would save the next reader from adding a redundant nil check.
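
A hedged sketch of that one-line comment (the aws receiver type is an assumption):

// LonghornReplicaCount returns the replica count for the Longhorn install.
// Replicas() is nil-receiver-safe, so this stays correct even though
// LonghornEnabled() treats a nil Longhorn block as enabled.
func (c *Config) LonghornReplicaCount() int {
	return c.Longhorn.Replicas()
}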

@tylerpotts (Contributor) commented:

Ran into a Longhorn replica-scheduling issue while testing this PR on a fresh AWS cluster (m7i.xlarge, default 20 GB EBS root). Posting in case it informs the defaults.

What happened

Single-node cluster, no longhorn: block in user config → falls back to defaultReplicaCount = 2 (pkg/storage/longhorn/config.go:26). Longhorn defaults replica-soft-anti-affinity to false, so two replicas require two distinct nodes. Result on a 1-node cluster:

  • 3 of 5 PVCs became robustness: degraded, state: attached (1 of 2 replicas scheduled — works, but no redundancy).
  • 2 of 5 became robustness: faulted, state: detached — replica records existed but had no nodeID and no dataPath (never scheduled at all). Hub pod stuck in ContainerCreating indefinitely with AttachVolume.Attach failed ... volume is not ready for workloads.

Scaling the node group to 2 healed the 3 degraded volumes (rebuilt onto the new node). The 2 faulted volumes still couldn't schedule, this time with Scheduled=False reason=ReplicaSchedulingFailure msg=insufficient storage — at 14 GB scheduled / 21 GB total per node, the 25% min-available reserve leaves ~2 GB free, which can't fit the bigger replicas.

Suggestions

A few directions worth considering (not all need to land in this PR):

  1. Reconsider defaultReplicaCount = 2. It silently breaks single-node clusters (kind/k3d dev, minimal Hetzner, AWS with min_nodes: 1). Either default to 1 with a doc note on durability, or auto-pick based on observable node count at install time.
  2. Expose replica-soft-anti-affinity via Config. Lets single-node clusters keep replicaCount: 2 co-located on one node — weaker durability but at least the volume attaches. Today this is hardcoded off via Longhorn defaults.
  3. Document the AWS EBS root-disk size assumption. 20 GB is too small for any non-trivial Longhorn workload. A separate dedicated EBS volume mounted at /var/lib/longhorn is the standard practice; the Config.DedicatedNodes path could grow a dedicated_disk companion.
  4. Surface a helpful error. When cfg.Replicas() > observedNodes at install time, log a warning. The current failure mode is silent until a Pod tries to attach, which is a long time later.
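
For suggestion 4, a hedged sketch of what an install-time check could look like, reusing the package's fake-client-friendly kubernetes.Interface style (the function name, the LevelWarning constant, and the wiring into Install are assumptions):

// warnIfReplicasExceedNodes emits a warning when the configured replica
// count cannot be satisfied by the current node count (with Longhorn's
// default hard anti-affinity, each replica needs its own node).
func warnIfReplicasExceedNodes(ctx context.Context, client kubernetes.Interface, replicas int) error {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return fmt.Errorf("failed to list nodes: %w", err)
	}
	if replicas > len(nodes.Items) {
		status.Send(ctx, status.NewUpdate(status.LevelWarning,
			fmt.Sprintf("Longhorn replica count %d exceeds node count %d; volumes will stay degraded or unschedulable until more nodes join",
				replicas, len(nodes.Items))).
			WithResource("longhorn").
			WithAction("validating"))
	}
	return nil
}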

Happy to send a follow-up PR for any of these if useful — let me know which direction you'd want.
