Add: Checkpoint/restore for pods #1399

jsun-m · 2025-08-27T19:13:34Z

Summary by cubic

Adds CRIU-based checkpoint/restore for pods with an optional user-defined checkpoint condition. This lets pods start from a pre-warmed state to reduce cold starts.

New Features
- SDK: Pod supports checkpoint_enabled and checkpoint_condition; a lightweight runner (beta9.runner.checkpoint) waits for the condition and coordinates checkpoint across workers.
- Gateway/Proto: GetOrCreateStubRequest adds checkpoint_condition; server passes CHECKPOINT_ENABLED and CHECKPOINT_CONDITION env vars to containers.
- Worker: Integrates checkpoint/restore with improved error handling and only exposes ports after checkpoint or restore; falls back to a normal run if no checkpoint is found; uses a stronger stop signal for reliability.
- CLI: Deployment now cleans up artifacts even on failures.

Migration
- To enable: Pod(checkpoint_enabled=True, checkpoint_condition=my_func). The function should return True when the pod is ready to be checkpointed.
- No changes needed for existing deployments (feature is off by default).
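
A checkpoint condition could look like the sketch below. It assumes the condition is a zero-argument function that returns True once the pod's server accepts local connections. The names `port_is_open` and `server_ready`, the port 8000, and the TCP-probe approach are illustrative assumptions, not part of this PR.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # The connection is closed again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def server_ready() -> bool:
    """Hypothetical checkpoint condition: ready once the app listens on 8000."""
    return port_is_open("127.0.0.1", 8000)


# Hypothetical usage, following the migration note above:
# pod = Pod(checkpoint_enabled=True, checkpoint_condition=server_ready)
```

Any cheap readiness probe would work in place of the TCP check, for example testing for a marker file the application writes after loading its model.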

jsun-m · 2025-08-27T19:41:17Z

pkg/worker/criu.go

-	})
-
-	return exitCode, request.ContainerId, err
+	return -1, "", fmt.Errorf("checkpoint not found")


The parent function already falls back to running the container without a checkpoint, so this branch only duplicated that logic.

jsun-m · 2025-08-27T19:42:53Z

pkg/worker/criu.go

 			log.Error().Str("container_id", request.ContainerId).Msgf("failed to update checkpoint state: %v", updateStateErr)
 		}
+
+		err = exposeNetwork()


When we are checkpointing, we do not want requests to taint the checkpoint. The checkpoint should capture only the ready state of the pod, not any in-flight requests. For an API server, an in-flight request means an open connection.

jsun-m · 2025-08-27T19:44:28Z

pkg/worker/lifecycle.go

-			return
-		}
-	}
+	// for idx, bindPort := range opts.BindPorts {


@luke-lombardi Would it be a problem if we delayed exposing the ports until right before the container starts instead?

jsun-m · 2025-08-27T19:47:13Z

sdk/src/beta9/abstractions/pod.py


+        if self.checkpoint_enabled and self.checkpoint_condition is not None:
+            self.entrypoint = [
+                "(python -m beta9.runner.checkpoint &) &&",


Given the entrypoint code we currently have, this is the best place to inject the script.

We could also move this to the backend and pass USER_CODE_DIR:

["sh", "-c", f"cd {USER_CODE_DIR} && {' '.join(self.entrypoint)}"]
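
One way the wrapped entrypoint could be built is sketched below. It assumes the runner is started in a background subshell and the user's command then runs from the user code directory under `sh -c`. The `wrap_entrypoint` helper and the example directory are hypothetical; only the `python -m beta9.runner.checkpoint` invocation and the `sh -c` form come from this thread.

```python
import shlex
from typing import List

CHECKPOINT_RUNNER = "python -m beta9.runner.checkpoint"


def wrap_entrypoint(entrypoint: List[str], user_code_dir: str) -> List[str]:
    """Build an `sh -c` command that starts the checkpoint runner in the
    background, then runs the user's entrypoint from user_code_dir."""
    # shlex.join quotes each argument so spaces survive the shell.
    user_cmd = shlex.join(entrypoint)
    script = (
        f"cd {shlex.quote(user_code_dir)} && "
        f"({CHECKPOINT_RUNNER} &) && {user_cmd}"
    )
    return ["sh", "-c", script]
```

Unlike the plain `' '.join(...)` above, `shlex.join` quotes every argument, so this variant only suits entrypoints that are argument lists rather than shell fragments.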

jsun-m · 2025-08-27T19:47:48Z

sdk/src/beta9/cli/deployment.py

-
-        if not module and hasattr(user_obj, "cleanup_deployment_artifacts"):
-            user_obj.cleanup_deployment_artifacts()
+        finally:


gRPC errors left pod artifacts in the repo, and they were not cleaned up properly.

jsun-m · 2025-08-27T19:55:45Z

pkg/worker/criu.go

-			updateStateErr := s.updateCheckpointState(request, types.CheckpointStatusRestoreFailed)
-			if updateStateErr != nil {
-				log.Error().Str("container_id", request.ContainerId).Msgf("failed to update checkpoint state: %v", updateStateErr)
+			var e *runc.ExitError


[TODO]: This logic is not complete. @luke-lombardi I want to chat with you about stop and error conditions for a running RestoreCheckpoint process.

A SIGKILL from a user-initiated container stop would put the checkpoint into a restore_failed state, which is not what we want. However, a SIGKILL can also result from other factors that do indicate a failed restore. We need to hash out better indicators of failure in the restore checkpoint process.

cubic-dev-ai

15 issues found across 16 files

cubic-dev-ai · 2025-08-29T15:51:21Z

sdk/src/beta9/cli/deployment.py

-            user_obj.cleanup_deployment_artifacts()
+        finally:
+            if not module and hasattr(user_obj, "cleanup_deployment_artifacts"):
+                user_obj.cleanup_deployment_artifacts()


Exceptions from cleanup_deployment_artifacts inside finally can mask the original failure or exit; wrap cleanup in its own try/except to avoid overshadowing errors.

Prompt for AI agents

Address the following comment on sdk/src/beta9/cli/deployment.py at line 197: <comment>Exceptions from cleanup_deployment_artifacts inside finally can mask the original failure or exit; wrap cleanup in its own try/except to avoid overshadowing errors.</comment> <file context> @@ -180,17 +180,21 @@ def create_deployment( - user_obj.cleanup_deployment_artifacts() + finally: + if not module and hasattr(user_obj, "cleanup_deployment_artifacts"): + user_obj.cleanup_deployment_artifacts() if capture_logs.capture_logs: </file context>
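
The suggestion above can be sketched as follows. This is a minimal illustration, not the CLI's actual code: `deploy_with_cleanup` and its two callable parameters are hypothetical stand-ins for the deployment call and `cleanup_deployment_artifacts`.

```python
import logging

logger = logging.getLogger(__name__)


def deploy_with_cleanup(deploy, cleanup):
    """Run deploy(); always attempt cleanup() without masking deploy's error."""
    try:
        return deploy()
    finally:
        try:
            cleanup()
        except Exception:
            # Log and swallow, so the original exception (if any) propagates
            # instead of being replaced by the cleanup failure.
            logger.exception("failed to clean up deployment artifacts")
```

With this shape, a gRPC error raised by the deployment still reaches the caller even when cleanup itself fails.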

cubic-dev-ai · 2025-08-29T15:51:21Z

sdk/src/beta9/abstractions/pod.py


+        if self.checkpoint_enabled and self.checkpoint_condition is not None:
+            self.entrypoint = [
+                "(python -m beta9.runner.checkpoint &) &&",


Shell snippet is injected into entrypoint without a shell wrapper for custom images, causing exec failure and overriding the image's default ENTRYPOINT.

Prompt for AI agents

Address the following comment on sdk/src/beta9/abstractions/pod.py at line 310: <comment>Shell snippet is injected into entrypoint without a shell wrapper for custom images, causing exec failure and overriding the image's default ENTRYPOINT.</comment> <file context> @@ -281,6 +305,12 @@ def deploy( + if self.checkpoint_enabled and self.checkpoint_condition is not None: + self.entrypoint = [ + "(python -m beta9.runner.checkpoint &) &&", + *self.entrypoint, + ] </file context>

pkg/gateway/services/stub.go

cubic-dev-ai · 2025-08-29T15:51:21Z

pkg/abstractions/pod/pod.go

 		fmt.Sprintf("STUB_ID=%s", stub.ExternalId),
 		fmt.Sprintf("STUB_TYPE=%s", stub.Type),
 		fmt.Sprintf("KEEP_WARM_SECONDS=%d", stubConfig.KeepWarmSeconds),
+		fmt.Sprintf("CHECKPOINT_ENABLED=%t", stubConfig.CheckpointEnabled),


CHECKPOINT_ENABLED env var reflects stubConfig instead of the computed effective value, causing mismatches when checkpointing is disabled (e.g., multi-GPU).

Prompt for AI agents

Address the following comment on pkg/abstractions/pod/pod.go at line 285: <comment>CHECKPOINT_ENABLED env var reflects stubConfig instead of the computed effective value, causing mismatches when checkpointing is disabled (e.g., multi-GPU).</comment> <file context> @@ -282,6 +282,8 @@ func (s *GenericPodService) run(ctx context.Context, authInfo *auth.AuthInfo, st fmt.Sprintf("STUB_ID=%s", stub.ExternalId), fmt.Sprintf("STUB_TYPE=%s", stub.Type), fmt.Sprintf("KEEP_WARM_SECONDS=%d", stubConfig.KeepWarmSeconds), + fmt.Sprintf("CHECKPOINT_ENABLED=%t", stubConfig.CheckpointEnabled), + fmt.Sprintf("CHECKPOINT_CONDITION=%s", stubConfig.CheckpointCondition), }...) </file context>

cubic-dev-ai · 2025-08-29T15:51:21Z

sdk/src/beta9/runner/checkpoint.py

+        workers_ready.value += 1
+
+    if workers_ready.value == config.workers:
+        Path(CHECKPOINT_SIGNAL_FILE).touch(exist_ok=True)


Creating the checkpoint signal without waiting for all workers can race with worker coordination; delegate to wait_for_checkpoint() to ensure readiness across workers.

Prompt for AI agents

Address the following comment on sdk/src/beta9/runner/checkpoint.py at line 30: <comment>Creating the checkpoint signal without waiting for all workers can race with worker coordination; delegate to wait_for_checkpoint() to ensure readiness across workers.</comment> <file context> @@ -0,0 +1,71 @@ + workers_ready.value += 1 + + if workers_ready.value == config.workers: + Path(CHECKPOINT_SIGNAL_FILE).touch(exist_ok=True) + return _reload_config() + </file context>
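
One way to avoid the race is to make the increment and the comparison a single atomic step, as in the sketch below. It assumes `workers_ready` behaves like a `multiprocessing.Value`, exposing `.value` and `.get_lock()`. The `mark_worker_ready` helper is hypothetical and is not the `wait_for_checkpoint()` function mentioned in the comment.

```python
from pathlib import Path


def mark_worker_ready(workers_ready, total_workers: int, signal_file: Path) -> bool:
    """Record that one worker is ready and create the signal file only when
    the last worker reports in. Returns True if this call saw all workers ready.

    workers_ready is assumed to be something like multiprocessing.Value("i", 0).
    """
    with workers_ready.get_lock():
        # Increment and compare under the same lock so that exactly one
        # caller observes the count reaching total_workers.
        workers_ready.value += 1
        all_ready = workers_ready.value == total_workers
    if all_ready:
        signal_file.touch(exist_ok=True)
    return all_ready
```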

cubic-dev-ai · 2025-08-29T15:51:22Z

pkg/worker/criu.go

-	})
-
-	return exitCode, request.ContainerId, err
+	return -1, "", fmt.Errorf("checkpoint not found")


Missing fallback to normal run when checkpoint is not found; this now returns an error instead of starting the container normally.

(Based on the PR description that the worker falls back to a normal run if no checkpoint is found.)

Prompt for AI agents

Address the following comment on pkg/worker/criu.go at line 150: <comment>Missing fallback to normal run when checkpoint is not found; this now returns an error instead of starting the container normally. (Based on the PR description that the worker falls back to a normal run if no checkpoint is found.)</comment> <file context> @@ -123,31 +129,30 @@ func (s *Worker) attemptCheckpointOrRestore(ctx context.Context, request *types. - }) - - return exitCode, request.ContainerId, err + return -1, "", fmt.Errorf("checkpoint not found") } </file context>
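
If the parent function does handle the fallback, as the author notes in the earlier comment on this line, the division of responsibility might look like the sketch below. It is written in Python for illustration rather than the worker's Go, and every name in it (`CheckpointNotFound`, `start_container`, the two callables) is hypothetical.

```python
class CheckpointNotFound(Exception):
    """Hypothetical error raised when no checkpoint exists for a stub."""


def start_container(restore, run_normally):
    """Try to restore from a checkpoint; fall back to a normal run only when
    the checkpoint is missing. Any other error propagates to the caller."""
    try:
        return restore()
    except CheckpointNotFound:
        return run_normally()
```

Keeping the fallback in one place means the restore path only has to report that no checkpoint was found, which matches returning an error from the inner function.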

cubic-dev-ai · 2025-08-29T15:51:22Z

pkg/worker/criu.go

-			if updateStateErr != nil {
-				log.Error().Str("container_id", request.ContainerId).Msgf("failed to update checkpoint state: %v", updateStateErr)
+			var e *runc.ExitError
+			if errors.As(err, &e) {


Restore failures that are not runc.ExitError won't update checkpoint state or be logged, leaving stale/incorrect state.

Prompt for AI agents

Address the following comment on pkg/worker/criu.go at line 133: <comment>Restore failures that are not runc.ExitError won't update checkpoint state or be logged, leaving stale/incorrect state.</comment> <file context> @@ -123,31 +129,30 @@ func (s *Worker) attemptCheckpointOrRestore(ctx context.Context, request *types. - if updateStateErr != nil { - log.Error().Str("container_id", request.ContainerId).Msgf("failed to update checkpoint state: %v", updateStateErr) + var e *runc.ExitError + if errors.As(err, &e) { + code := e.Status + </file context>

cubic-dev-ai · 2025-08-29T15:51:22Z

pkg/worker/criu.go

 		}
 		defer f.Close()

+		err = exposeNetwork()


Network is exposed before a successful restore, potentially allowing premature connections; expose after a successful restore instead.

(Based on the PR description stating the worker only exposes ports after checkpoint or restore.)

Prompt for AI agents

Address the following comment on pkg/worker/criu.go at line 117: <comment>Network is exposed before a successful restore, potentially allowing premature connections; expose after a successful restore instead. (Based on the PR description stating the worker only exposes ports after checkpoint or restore.)</comment> <file context> @@ -113,6 +114,11 @@ func (s *Worker) attemptCheckpointOrRestore(ctx context.Context, request *types. } defer f.Close() + err = exposeNetwork() + if err != nil { + return -1, "", fmt.Errorf("failed to expose network: %v", err) </file context>

cubic-dev-ai · 2025-08-29T15:51:22Z

pkg/abstractions/pod/instance.go

 		fmt.Sprintf("STUB_ID=%s", i.Stub.ExternalId),
 		fmt.Sprintf("STUB_TYPE=%s", i.Stub.Type),
 		fmt.Sprintf("KEEP_WARM_SECONDS=%d", i.StubConfig.KeepWarmSeconds),
+		fmt.Sprintf("CHECKPOINT_ENABLED=%t", i.StubConfig.CheckpointEnabled),


Adding CHECKPOINT_ENABLED to request.Env duplicates the worker-controlled flag and can override it due to merge order; it also ignores the computed effective value (GPU>1), risking inconsistent checkpoint behavior.

Prompt for AI agents

Address the following comment on pkg/abstractions/pod/instance.go at line 36: <comment>Adding CHECKPOINT_ENABLED to request.Env duplicates the worker-controlled flag and can override it due to merge order; it also ignores the computed effective value (GPU>1), risking inconsistent checkpoint behavior.</comment> <file context> @@ -33,6 +33,8 @@ func (i *podInstance) startContainers(containersToRun int) error { fmt.Sprintf("STUB_ID=%s", i.Stub.ExternalId), fmt.Sprintf("STUB_TYPE=%s", i.Stub.Type), fmt.Sprintf("KEEP_WARM_SECONDS=%d", i.StubConfig.KeepWarmSeconds), + fmt.Sprintf("CHECKPOINT_ENABLED=%t", i.StubConfig.CheckpointEnabled), + fmt.Sprintf("CHECKPOINT_CONDITION=%s", i.StubConfig.CheckpointCondition), }...) </file context>
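
The idea of deriving the env var from the effective value can be sketched as below, in Python for illustration rather than Go. The rule that checkpointing is disabled when more than one GPU is requested is taken from the comment above. The helper names and the dictionary-based env are assumptions.

```python
from typing import Dict


def effective_checkpoint_enabled(requested: bool, gpu_count: int) -> bool:
    """Assumed rule from the review comment: checkpointing is off when GPU > 1."""
    return requested and gpu_count <= 1


def build_env(base: Dict[str, str], requested: bool, gpu_count: int,
              condition: str) -> Dict[str, str]:
    """Set CHECKPOINT_ENABLED once, from the effective value, after copying
    the base env, so the raw config flag cannot override it."""
    env = dict(base)
    enabled = effective_checkpoint_enabled(requested, gpu_count)
    env["CHECKPOINT_ENABLED"] = "true" if enabled else "false"
    env["CHECKPOINT_CONDITION"] = condition
    return env
```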

cubic-dev-ai · 2025-08-29T15:51:22Z

sdk/src/beta9/runner/checkpoint.py

+        return
+
+    module, func = config.checkpoint_condition.split(":")
+    target_module = importlib.import_module(module)


Rule violated: Prevent Redundant Code Duplication

Repeated dynamic import/lookup logic; extract a shared utility to load a callable from "module:function" to prevent code duplication across modules.

Prompt for AI agents

Address the following comment on sdk/src/beta9/runner/checkpoint.py at line 48: <comment>Repeated dynamic import/lookup logic; extract a shared utility to load a callable from "module:function" to prevent code duplication across modules.</comment> <file context> @@ -0,0 +1,71 @@ + return + + module, func = config.checkpoint_condition.split(":") + target_module = importlib.import_module(module) + method = getattr(target_module, func) + </file context>
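
A shared utility along the lines the comment suggests might look like this sketch. The name `load_callable` and its error handling are assumptions. Only the "module:function" format and the `importlib.import_module` plus `getattr` lookup come from the snippet above.

```python
import importlib
from typing import Any, Callable


def load_callable(spec: str) -> Callable[..., Any]:
    """Load a callable from a "module:function" string."""
    module_name, sep, func_name = spec.partition(":")
    if not sep or not module_name or not func_name:
        raise ValueError(f"expected 'module:function', got {spec!r}")
    target = getattr(importlib.import_module(module_name), func_name)
    if not callable(target):
        raise TypeError(f"{spec!r} does not refer to a callable")
    return target
```

For example, `load_callable("math:sqrt")` returns the standard library's `math.sqrt`.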

Add: Checkpoint/restore for pods

14947d3

jsun-m commented Aug 27, 2025

Cleanup

10644d9

jsun-m commented Aug 27, 2025

jsun-m requested a review from luke-lombardi August 27, 2025 19:55

update checkpoint condition logs

37b3b00

jsun-m marked this pull request as ready for review August 29, 2025 15:40

cubic-dev-ai bot reviewed Aug 29, 2025
