Skip to content

Propagate CA bundle to sandbox trust store on init#2325

Open
sitole wants to merge 14 commits intomainfrom
feat/envd-ca-certificates
Open

Propagate CA bundle to sandbox trust store on init#2325
sitole wants to merge 14 commits intomainfrom
feat/envd-ca-certificates

Conversation

@sitole
Copy link
Copy Markdown
Member

@sitole sitole commented Apr 8, 2026

Problem

Sandboxes have no mechanism to receive a custom CA certificate and
install it into the system trust store at startup. This is required
for scenarios where outbound traffic needs to trust a certificate that
differs per sandbox.

Solution

The orchestrator sends a CA bundle (combined PEM string) to envd via
the existing /init call on sandbox create and resume. The orchestrator
plumbing is a no-op in the open-source version; the real certificate
injection is handled in the EE version.

On the envd side, CACertInstaller appends the received cert to
/etc/ssl/certs/ca-certificates.crt on the critical path, then
removes the previous cert in a background goroutine. Resume with the
same cert is a no-op (in-memory comparison). A state file at
/var/run/e2b/ca-cert.pem ensures the previous cert is correctly
removed even after a process restart.

At startup, BindMountCABundle copies the system bundle to tmpfs and
bind-mounts it back over the original path. All processes continue
using the standard path transparently; writes bypass the NBD-backed
filesystem entirely. CA cert injection is now essentially instant.

Testing

Unit tests in packages/envd/internal/host/cacerts_test.go cover
first install, resume hot path, cert rotation, concurrent resume,
and restart scenarios (same cert and different cert after OOM).

Measured append_duration with the bind-mount in place: ~0.04 ms
(down from ~1.8 ms warm NBD / ~460 ms cold GCS).

sitole added 3 commits April 7, 2026 14:19
EgressProxy gains a CACertificates() method so the orchestrator can
retrieve the per-proxy CA bundle and forward it to the sandbox on
startup and resume. NoopEgressProxy and tcpfirewall.Proxy return nil.

Factory now takes an EgressProxy and stores it; CreateSandbox and
ResumeSandbox copy CACertificates() into the Sandbox struct. The init
body sent to envd gains a CaCert field (combined PEM string built by
combineCACertificates) so envd can install the cert on each /init call.
Each sandbox may receive a CA certificate via the /init caBundle field.
CACertInstaller (packages/envd/internal/host) appends the certificate
to /etc/ssl/certs/ca-certificates.crt on the critical path (~2 ms),
then removes the previous cert in a background goroutine to keep the
bundle clean across rotations.

Two layers of state track what is currently installed:
- lastCACert (memory): makes resume a zero-I/O hit when the same cert
  arrives again after pause/resume.
- /var/run/e2b/ca-cert.pem (disk): survives OOM kills and restarts so
  the background goroutine can identify and remove the previous cert
  even when lastCACert is empty after a process restart.
@cursor
Copy link
Copy Markdown

cursor bot commented Apr 8, 2026

PR Summary

Medium Risk
Touches VM trust store handling and modifies the sandbox boot/init path; failures could break outbound TLS or sandbox startup. Concurrency/atomic file update logic and background cleanup introduce some operational edge cases (e.g., duplicates after process restart).

Overview
Propagates a PEM CA bundle from orchestrator to sandboxes via the existing POST /init call and installs it into the guest trust store. This extends the envd OpenAPI schema with caBundle, wires the value through sandbox creation/resume (via a new EgressProxy.CABundle() hook), and adds an envd-side CACertInstaller that appends the cert to /etc/ssl/certs/ca-certificates.crt and asynchronously rotates out the previously installed cert while persisting the active cert for update-ca-certificates. The template boot flow now bind-mounts /etc/ssl/certs onto tmpfs for fast/atomic updates, and adds unit/integration tests covering injection, rotation, restart, and concurrency scenarios.

Reviewed by Cursor Bugbot for commit c80e77c. Bugbot is set up for automated code reviews on this repo. Configure here.

@sitole sitole marked this pull request as draft April 8, 2026 10:21
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3ebbe78bc1

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Copy link
Copy Markdown

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test review body - please ignore

Copy link
Copy Markdown

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Additional findings (outside current diff — PR may have been updated during review):

  • 🔴 tests/integration/internal/envd/generated.go:190-196 — The integration test generated client sends the wrong JSON field for CA certificates: it defines CaCertificates *[]CACertificate (json:"caCertificates") — an array of {cert, name} objects — while the actual envd API added CaBundle *string (json:"caBundle"). Any integration test that populates CaCertificates will have those certs silently ignored by envd, which reads caBundle and finds it absent. The fix is to regenerate tests/integration/internal/envd/generated.go from the current packages/envd/spec/envd.yaml.

    Extended reasoning...

    What the bug is and how it manifests

    This PR adds CA certificate propagation to the envd /init endpoint. The authoritative API schema (packages/envd/spec/envd.yaml) and the generated server code (packages/envd/internal/api/api.gen.go) correctly define a single string field: CaBundle *string with JSON tag "caBundle,omitempty". The orchestrator's own generated client (packages/orchestrator/pkg/sandbox/envd/envd.gen.go) also correctly reflects CaBundle string with JSON tag "caBundle,omitempty".

    However, the integration test generated client (tests/integration/internal/envd/generated.go) defines a completely different schema: a new struct CACertificate with Cert string and Name string fields, and attaches them as CaCertificates *[]CACertificate (JSON tag "caCertificates,omitempty") to PostInitJSONBody. This is a stale artifact from an earlier design iteration where certificates were individual named objects rather than a single concatenated PEM bundle string.

    The specific code path that triggers it

    An integration test that wants to install a CA cert would set body.CaCertificates = &[]CACertificate{{Cert: pemData, Name: "my-ca"}} and call PostInit. The JSON marshalled over the wire would contain "caCertificates":[...] — a field envd never reads. The field envd actually reads ("caBundle") would be absent/null, so envd's if data.CaBundle != nil && *data.CaBundle != "" guard in SetData() would skip cert installation entirely.

    Why existing code does not prevent it

    No current integration tests exercise the CA cert path (confirmed by grepping test files — neither CaCertificates nor CaBundle is set in any test body). The Go type system cannot catch it because both structs compile independently; the discrepancy only surfaces at HTTP serialisation time when the wrong field name hits the wire.

    Impact

    Any future integration test for the CA cert feature written against this client would silently pass (envd returns 204 regardless) while the cert is never actually installed. This makes the CA cert feature effectively untestable via the integration client, directly undermining the purpose of this PR.

    How to fix it

    Re-run oapi-codegen against packages/envd/spec/envd.yaml targeting tests/integration/internal/envd/generated.go. This will remove the CACertificate struct and replace CaCertificates *[]CACertificate with CaBundle *string (json:"caBundle,omitempty"), matching the actual API.

    Step-by-step proof

    1. PR adds caBundle: {type: string} to packages/envd/spec/envd.yaml.
    2. packages/envd/internal/api/api.gen.go (regenerated correctly): CaBundle *string \json:"caBundle,omitempty"``
    3. envd SetData() handler reads: if data.CaBundle != nil && *data.CaBundle != ""
    4. tests/integration/internal/envd/generated.go (NOT regenerated): CaCertificates *[]CACertificate \json:"caCertificates,omitempty"``
    5. An integration test sets CaCertificates -> JSON on wire: {"caCertificates":[{"cert":"...","name":"..."}]}
    6. envd deserialises the body: CaBundle is nil (field absent); caCertificates is unknown and silently discarded by Go's JSON unmarshaller.
    7. data.CaBundle != nil is false -> cert installation skipped -> test passes but cert was never installed.

sitole added 2 commits April 8, 2026 13:59
c.lastCACert was read outside mu before being compared against the
incoming cert, while the write happens under mu. This is a data race
per Go's memory model; the race detector flags it on concurrent calls.
Move the comparison inside the lock.

When two rotations happen in quick succession (install A then install B),
two background goroutines are queued. If the goroutine for A runs after
the goroutine for B, it reads the state file — which B has already
updated to reflect the new cert — and interprets B's cert as the
previous one to remove, corrupting the bundle. Fix by bailing out of
the goroutine early when lastCACert no longer matches the cert it was
spawned for, meaning a newer install has taken over.

Also regenerate tests/integration/internal/envd/generated.go to align
the integration client with the current envd spec (caBundle string
instead of the stale caCertificates array).
The CA bundle (/etc/ssl/certs/ca-certificates.crt) lives on the
NBD-backed root filesystem. Measured syscall breakdown on the append:

  open:  1.83 ms warm (overlayfs copy-up + NBD round-trip)
         394 ms cold (blocks not yet in orchestrator's local GCS cache)
  write: 0.52 ms (one NBD round-trip regardless of data size)

At startup, BindMountCABundle() copies the bundle to /run/e2b/ (tmpfs)
and bind-mounts it back over the original path. All processes continue
using /etc/ssl/certs/ca-certificates.crt unchanged; the underlying
file is now in RAM. Measured result after the change: 0.037 ms total
(both cold and warm, no GCS or NBD involved).

Also make removeCertFromBundle write atomically via a temp file and
rename so the bundle is never empty from a concurrent reader's
perspective during the rewrite.
os.CreateTemp in the same directory as the bind-mounted bundle path
creates the temp file on the NBD-backed parent filesystem, while the
rename target lives on the tmpfs backing the bind mount. os.Rename
across different devices fails with EXDEV.

Pass filepath.Dir(statePath) as the temp directory instead. The state
file is in E2BRunDir (the same tmpfs as the bind mount source), so the
temp file and the rename target are always on the same device.

Also fix the permissions: os.CreateTemp creates with 0600. After the
atomic rename replaces the bundle, non-root processes could no longer
read it. Call Chmod(0o644) on the temp file before writing.
On restart the bind mount already exists, so CaBundlePath and
caBundleTmpfsPath are the same inode. The previous code opened
CaBundlePath for reading, then opened caBundleTmpfsPath with O_TRUNC —
truncating the same file — then tried to copy from the now-empty source.
The EBUSY guard came after the damage, leaving an empty CA bundle and
breaking all TLS connections in the sandbox after a restart.

Replace os.Open + io.Copy with os.ReadFile to load the full content into
memory before any write. os.WriteFile then truncates and rewrites the
same file, but the content is already safe in the buffer.
@sitole sitole marked this pull request as ready for review April 8, 2026 13:31
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1dd8a54d61

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@sitole sitole requested a review from arkamar April 9, 2026 09:01
Copy link
Copy Markdown
Contributor

@arkamar arkamar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a brief discussion I believe we should not bind-mount /etc/ssl/certs/ca-certificates.cert file only but copy of whole /etc/ssl/certs directory instead. This should make original workflow safe and fast, plus common tools should work as well.

Cert injection and rotation should be covered with integration tests, they would catch problem with rename() on different partitions.

sitole added 4 commits April 14, 2026 09:52
Solves issue with running file rename during already resumed sandboxes
as source and target used different filesystem as we only mounted
`ca-certificates.crt` file. Now `/etc/ssl/certs` is mounted during envd
service startup.

This change allows Ubuntu `apt-get install --reinstall ca-certificates`
to succeed and allow sandbox users to add and update certs store during
sandbox template build / sandbox runtime.
Writing the cert only to the tmpfs bundle means a subsequent
apt-get install --reinstall ca-certificates or update-ca-certificates
overwrites the bundle without the injected cert.

The background goroutine now also writes the cert to
/usr/local/share/ca-certificates/e2b-ca.crt (NBD-backed). That
directory is read by update-ca-certificates as a source, so the
injected cert survives a bundle rebuild.

The state file is removed: on OOM restart lastCACert is empty, so
the goroutine skips cleanup of the previous cert. The stale entry
remains in the bundle until the next rotation or until
update-ca-certificates rebuilds it — both cases correctly include
only the current cert via caExtraPath.

Also fixes a bug where the error check around Install() was missing,
causing SetData to always return an error when a CA bundle is present.
Tests call /init directly with a synthetic self-signed CA cert to
exercise the injection path independently of the orchestrator, which
returns an empty bundle in the OSS implementation.

TestCACertBundleOnTmpfs — verifies /etc/ssl/certs is bind-mounted from
tmpfs as set up by the envd.service ExecStartPre script.

TestCACertInjection — injects a cert via caBundle and asserts it
appears in /etc/ssl/certs/ca-certificates.crt.

TestCACertRotationOnResume — calls /init twice with different certs,
simulating a sandbox resume with a new proxy cert. Polls until the
background goroutine removes the old cert and the new cert is present.

TestCACertPersistsThroughUpdateCACertificates — injects a cert, runs
update-ca-certificates inside the sandbox, and verifies the cert
survives the rebuild via /usr/local/share/ca-certificates/e2b-ca.crt.
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Backup creates nested directory when extending templates
    • The provisioning step now removes /etc/ssl/certs.orig before copying so backups are always refreshed as a flat, current directory.
  • ✅ Fixed: Test calls ignore install error return values
    • All c.install(...) test calls now assert require.NoError, including concurrent calls via an error channel checked after WaitGroup completion.

Create PR

Or push these changes by commenting:

@cursor push 922e73f5b4
Preview (922e73f5b4)
diff --git a/packages/envd/internal/host/cacerts_test.go b/packages/envd/internal/host/cacerts_test.go
--- a/packages/envd/internal/host/cacerts_test.go
+++ b/packages/envd/internal/host/cacerts_test.go
@@ -92,7 +92,7 @@
 	bundlePath, extraPath := testPaths(t)
 	c := newTestInstaller(t)
 
-	c.install(context.Background(), certA, bundlePath, extraPath)
+	require.NoError(t, c.install(context.Background(), certA, bundlePath, extraPath))
 	waitForFile(t, extraPath)
 
 	bundle, err := os.ReadFile(bundlePath)
@@ -109,8 +109,8 @@
 	bundlePath, extraPath := testPaths(t)
 	c := newTestInstaller(t)
 
-	c.install(context.Background(), certA, bundlePath, extraPath)
-	c.install(context.Background(), certA, bundlePath, extraPath) // resume — hot path hit
+	require.NoError(t, c.install(context.Background(), certA, bundlePath, extraPath))
+	require.NoError(t, c.install(context.Background(), certA, bundlePath, extraPath)) // resume — hot path hit
 
 	bundle, err := os.ReadFile(bundlePath)
 	require.NoError(t, err)
@@ -124,9 +124,9 @@
 	bundlePath, extraPath := testPaths(t)
 	c := newTestInstaller(t)
 
-	c.install(context.Background(), certA, bundlePath, extraPath)
+	require.NoError(t, c.install(context.Background(), certA, bundlePath, extraPath))
 
-	c.install(context.Background(), certB, bundlePath, extraPath)
+	require.NoError(t, c.install(context.Background(), certB, bundlePath, extraPath))
 
 	normalizedA := strings.TrimRight(certA, "\n") + "\n"
 	normalizedB := strings.TrimRight(certB, "\n") + "\n"
@@ -144,7 +144,7 @@
 	bundlePath, extraPath := testPaths(t)
 	c := newTestInstaller(t)
 
-	c.install(context.Background(), "", bundlePath, extraPath)
+	require.NoError(t, c.install(context.Background(), "", bundlePath, extraPath))
 
 	bundle, err := os.ReadFile(bundlePath)
 	require.NoError(t, err)
@@ -167,7 +167,7 @@
 	// State of the VM after a previous envd run.
 	require.NoError(t, os.WriteFile(bundlePath, []byte(baseBundle+normalizedA), 0o644))
 
-	c.install(context.Background(), certA, bundlePath, extraPath)
+	require.NoError(t, c.install(context.Background(), certA, bundlePath, extraPath))
 	waitForFile(t, extraPath)
 
 	bundle, err := os.ReadFile(bundlePath)
@@ -191,7 +191,7 @@
 	// State of the VM after a previous envd run that installed certA.
 	require.NoError(t, os.WriteFile(bundlePath, []byte(baseBundle+normalizedA), 0o644))
 
-	c.install(context.Background(), certB, bundlePath, extraPath)
+	require.NoError(t, c.install(context.Background(), certB, bundlePath, extraPath))
 	waitForFile(t, extraPath)
 
 	bundle, err := os.ReadFile(bundlePath)
@@ -208,17 +208,22 @@
 	bundlePath, extraPath := testPaths(t)
 	c := newTestInstaller(t)
 
-	c.install(context.Background(), certA, bundlePath, extraPath)
+	require.NoError(t, c.install(context.Background(), certA, bundlePath, extraPath))
 
 	var wg sync.WaitGroup
+	errCh := make(chan error, 10)
 
 	for range 10 {
 		wg.Go(func() {
-			c.install(context.Background(), certA, bundlePath, extraPath)
+			errCh <- c.install(context.Background(), certA, bundlePath, extraPath)
 		})
 	}
 
 	wg.Wait()
+	close(errCh)
+	for err := range errCh {
+		require.NoError(t, err)
+	}
 
 	// Acquire mu to drain any in-flight background goroutines.
 	c.mu.Lock()

diff --git a/packages/orchestrator/pkg/template/build/phases/base/provision.sh b/packages/orchestrator/pkg/template/build/phases/base/provision.sh
--- a/packages/orchestrator/pkg/template/build/phases/base/provision.sh
+++ b/packages/orchestrator/pkg/template/build/phases/base/provision.sh
@@ -88,6 +88,7 @@
 echo "Backing up CA certs dir for clean source on system boot"
 # Backing up CA certs dir so envd ExecStartPre can restore it from a clean source on each system boot.
 # This is usefull when extending existing template that already mounted certs directory as tmpfs.
+rm -rf /etc/ssl/certs.orig
 cp -a /etc/ssl/certs /etc/ssl/certs.orig
 
 echo "Linking systemd to init"

This Bugbot Autofix run was free. To enable autofix for future PRs, go to the Cursor dashboard.

Reviewed by Cursor Bugbot for commit c80e77c. Configure here.

@sitole
Copy link
Copy Markdown
Member Author

sitole commented Apr 14, 2026

After a brief discussion I believe we should not bind-mount /etc/ssl/certs/ca-certificates.cert file only but copy of whole /etc/ssl/certs directory instead. This should make original workflow safe and fast, plus common tools should work as well.

Commit c0890e3 fixed this. /etc/ssl/certs directory is now tmpfs backed.
Additionally e2b ca bundle is stored on filesystem to survive calls regenerating ca bundle.

Cert injection and rotation should be covered with integration tests, they would catch problem with rename() on different partitions.

Integration tests added in c80e77c.

@sitole sitole requested a review from arkamar April 14, 2026 13:21
Copy link
Copy Markdown
Contributor

@arkamar arkamar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All previous points are resolved to my linking, but I noticed one more problem, which could be tricky, but partially solvable. See my comment.

Group=root
Environment=GOTRACEBACK=all
LimitCORE=infinity
ExecStartPre=/bin/sh -c 'mountpoint -q /etc/ssl/certs || (mkdir -p /run/e2b/certs && cp -a /etc/ssl/certs.orig/. /run/e2b/certs/ && mount --bind /run/e2b/certs /etc/ssl/certs)'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line caught my eye. When a derived template adds certs during a RUN step (via update-ca-certificates for example), those certs end up on the tmpfs but certs.orig stays clean. On the next fresh boot, cp -a /etc/ssl/certs.orig. ... overwrites them.

I am not sure how big issue this could be, but maybe we can call update-ca-certificates between cp and mount to pick up the drop-ins from /usr/local/share/ca-certificates/, which should be the standard way for enterprise environments, see https://ubuntu.com/server/docs/how-to/security/install-a-root-ca-certificate-in-the-trust-store/. It does not solve the non-standard edits, but we can point users to this documentation in the worst case if we would not be able to come up with better solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants