Propagate CA bundle to sandbox trust store on init#2325
Propagate CA bundle to sandbox trust store on init#2325
Conversation
EgressProxy gains a CACertificates() method so the orchestrator can retrieve the per-proxy CA bundle and forward it to the sandbox on startup and resume. NoopEgressProxy and tcpfirewall.Proxy return nil. Factory now takes an EgressProxy and stores it; CreateSandbox and ResumeSandbox copy CACertificates() into the Sandbox struct. The init body sent to envd gains a CaCert field (combined PEM string built by combineCACertificates) so envd can install the cert on each /init call.
Each sandbox may receive a CA certificate via the /init caBundle field. CACertInstaller (packages/envd/internal/host) appends the certificate to /etc/ssl/certs/ca-certificates.crt on the critical path (~2 ms), then removes the previous cert in a background goroutine to keep the bundle clean across rotations. Two layers of state track what is currently installed: - lastCACert (memory): makes resume a zero-I/O hit when the same cert arrives again after pause/resume. - /var/run/e2b/ca-cert.pem (disk): survives OOM kills and restarts so the background goroutine can identify and remove the previous cert even when lastCACert is empty after a process restart.
PR SummaryMedium Risk Overview Reviewed by Cursor Bugbot for commit c80e77c. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3ebbe78bc1
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
Additional findings (outside current diff — PR may have been updated during review):
-
🔴
tests/integration/internal/envd/generated.go:190-196— The integration test generated client sends the wrong JSON field for CA certificates: it definesCaCertificates *[]CACertificate(json:"caCertificates") — an array of {cert, name} objects — while the actual envd API addedCaBundle *string(json:"caBundle"). Any integration test that populatesCaCertificateswill have those certs silently ignored by envd, which readscaBundleand finds it absent. The fix is to regeneratetests/integration/internal/envd/generated.gofrom the currentpackages/envd/spec/envd.yaml.Extended reasoning...
What the bug is and how it manifests
This PR adds CA certificate propagation to the envd
/initendpoint. The authoritative API schema (packages/envd/spec/envd.yaml) and the generated server code (packages/envd/internal/api/api.gen.go) correctly define a single string field:CaBundle *stringwith JSON tag"caBundle,omitempty". The orchestrator's own generated client (packages/orchestrator/pkg/sandbox/envd/envd.gen.go) also correctly reflectsCaBundle stringwith JSON tag"caBundle,omitempty".However, the integration test generated client (
tests/integration/internal/envd/generated.go) defines a completely different schema: a new structCACertificatewithCert stringandName stringfields, and attaches them asCaCertificates *[]CACertificate(JSON tag"caCertificates,omitempty") toPostInitJSONBody. This is a stale artifact from an earlier design iteration where certificates were individual named objects rather than a single concatenated PEM bundle string.The specific code path that triggers it
An integration test that wants to install a CA cert would set
body.CaCertificates = &[]CACertificate{{Cert: pemData, Name: "my-ca"}}and callPostInit. The JSON marshalled over the wire would contain"caCertificates":[...]— a field envd never reads. The field envd actually reads ("caBundle") would be absent/null, so envd'sif data.CaBundle != nil && *data.CaBundle != ""guard inSetData()would skip cert installation entirely.Why existing code does not prevent it
No current integration tests exercise the CA cert path (confirmed by grepping test files — neither
CaCertificatesnorCaBundleis set in any test body). The Go type system cannot catch it because both structs compile independently; the discrepancy only surfaces at HTTP serialisation time when the wrong field name hits the wire.Impact
Any future integration test for the CA cert feature written against this client would silently pass (envd returns 204 regardless) while the cert is never actually installed. This makes the CA cert feature effectively untestable via the integration client, directly undermining the purpose of this PR.
How to fix it
Re-run
oapi-codegenagainstpackages/envd/spec/envd.yamltargetingtests/integration/internal/envd/generated.go. This will remove theCACertificatestruct and replaceCaCertificates *[]CACertificatewithCaBundle *string(json:"caBundle,omitempty"), matching the actual API.Step-by-step proof
- PR adds
caBundle: {type: string}topackages/envd/spec/envd.yaml. packages/envd/internal/api/api.gen.go(regenerated correctly):CaBundle *string \json:"caBundle,omitempty"``- envd
SetData()handler reads:if data.CaBundle != nil && *data.CaBundle != "" tests/integration/internal/envd/generated.go(NOT regenerated):CaCertificates *[]CACertificate \json:"caCertificates,omitempty"``- An integration test sets
CaCertificates-> JSON on wire:{"caCertificates":[{"cert":"...","name":"..."}]} - envd deserialises the body:
CaBundleis nil (field absent);caCertificatesis unknown and silently discarded by Go's JSON unmarshaller. data.CaBundle != nilis false -> cert installation skipped -> test passes but cert was never installed.
- PR adds
c.lastCACert was read outside mu before being compared against the incoming cert, while the write happens under mu. This is a data race per Go's memory model; the race detector flags it on concurrent calls. Move the comparison inside the lock. When two rotations happen in quick succession (install A then install B), two background goroutines are queued. If the goroutine for A runs after the goroutine for B, it reads the state file — which B has already updated to reflect the new cert — and interprets B's cert as the previous one to remove, corrupting the bundle. Fix by bailing out of the goroutine early when lastCACert no longer matches the cert it was spawned for, meaning a newer install has taken over. Also regenerate tests/integration/internal/envd/generated.go to align the integration client with the current envd spec (caBundle string instead of the stale caCertificates array).
The CA bundle (/etc/ssl/certs/ca-certificates.crt) lives on the
NBD-backed root filesystem. Measured syscall breakdown on the append:
open: 1.83 ms warm (overlayfs copy-up + NBD round-trip)
394 ms cold (blocks not yet in orchestrator's local GCS cache)
write: 0.52 ms (one NBD round-trip regardless of data size)
At startup, BindMountCABundle() copies the bundle to /run/e2b/ (tmpfs)
and bind-mounts it back over the original path. All processes continue
using /etc/ssl/certs/ca-certificates.crt unchanged; the underlying
file is now in RAM. Measured result after the change: 0.037 ms total
(both cold and warm, no GCS or NBD involved).
Also make removeCertFromBundle write atomically via a temp file and
rename so the bundle is never empty from a concurrent reader's
perspective during the rewrite.
os.CreateTemp in the same directory as the bind-mounted bundle path creates the temp file on the NBD-backed parent filesystem, while the rename target lives on the tmpfs backing the bind mount. os.Rename across different devices fails with EXDEV. Pass filepath.Dir(statePath) as the temp directory instead. The state file is in E2BRunDir (the same tmpfs as the bind mount source), so the temp file and the rename target are always on the same device. Also fix the permissions: os.CreateTemp creates with 0600. After the atomic rename replaces the bundle, non-root processes could no longer read it. Call Chmod(0o644) on the temp file before writing.
On restart the bind mount already exists, so CaBundlePath and caBundleTmpfsPath are the same inode. The previous code opened CaBundlePath for reading, then opened caBundleTmpfsPath with O_TRUNC — truncating the same file — then tried to copy from the now-empty source. The EBUSY guard came after the damage, leaving an empty CA bundle and breaking all TLS connections in the sandbox after a restart. Replace os.Open + io.Copy with os.ReadFile to load the full content into memory before any write. os.WriteFile then truncates and rewrites the same file, but the content is already safe in the buffer.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1dd8a54d61
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
arkamar
left a comment
There was a problem hiding this comment.
After a brief discussion I believe we should not bind-mount /etc/ssl/certs/ca-certificates.cert file only but copy of whole /etc/ssl/certs directory instead. This should make original workflow safe and fast, plus common tools should work as well.
Cert injection and rotation should be covered with integration tests, they would catch problem with rename() on different partitions.
Solves issue with running file rename during already resumed sandboxes as source and target used different filesystem as we only mounted `ca-certificates.crt` file. Now `/etc/ssl/certs` is mounted during envd service startup. This change allows Ubuntu `apt-get install --reinstall ca-certificates` to succeed and allow sandbox users to add and update certs store during sandbox template build / sandbox runtime.
Writing the cert only to the tmpfs bundle means a subsequent apt-get install --reinstall ca-certificates or update-ca-certificates overwrites the bundle without the injected cert. The background goroutine now also writes the cert to /usr/local/share/ca-certificates/e2b-ca.crt (NBD-backed). That directory is read by update-ca-certificates as a source, so the injected cert survives a bundle rebuild. The state file is removed: on OOM restart lastCACert is empty, so the goroutine skips cleanup of the previous cert. The stale entry remains in the bundle until the next rotation or until update-ca-certificates rebuilds it — both cases correctly include only the current cert via caExtraPath. Also fixes a bug where the error check around Install() was missing, causing SetData to always return an error when a CA bundle is present.
Tests call /init directly with a synthetic self-signed CA cert to exercise the injection path independently of the orchestrator, which returns an empty bundle in the OSS implementation. TestCACertBundleOnTmpfs — verifies /etc/ssl/certs is bind-mounted from tmpfs as set up by the envd.service ExecStartPre script. TestCACertInjection — injects a cert via caBundle and asserts it appears in /etc/ssl/certs/ca-certificates.crt. TestCACertRotationOnResume — calls /init twice with different certs, simulating a sandbox resume with a new proxy cert. Polls until the background goroutine removes the old cert and the new cert is present. TestCACertPersistsThroughUpdateCACertificates — injects a cert, runs update-ca-certificates inside the sandbox, and verifies the cert survives the rebuild via /usr/local/share/ca-certificates/e2b-ca.crt.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix prepared fixes for both issues found in the latest run.
- ✅ Fixed: Backup creates nested directory when extending templates
- The provisioning step now removes
/etc/ssl/certs.origbefore copying so backups are always refreshed as a flat, current directory.
- The provisioning step now removes
- ✅ Fixed: Test calls ignore install error return values
- All
c.install(...)test calls now assertrequire.NoError, including concurrent calls via an error channel checked afterWaitGroupcompletion.
- All
Or push these changes by commenting:
@cursor push 922e73f5b4
Preview (922e73f5b4)
diff --git a/packages/envd/internal/host/cacerts_test.go b/packages/envd/internal/host/cacerts_test.go
--- a/packages/envd/internal/host/cacerts_test.go
+++ b/packages/envd/internal/host/cacerts_test.go
@@ -92,7 +92,7 @@
bundlePath, extraPath := testPaths(t)
c := newTestInstaller(t)
- c.install(context.Background(), certA, bundlePath, extraPath)
+ require.NoError(t, c.install(context.Background(), certA, bundlePath, extraPath))
waitForFile(t, extraPath)
bundle, err := os.ReadFile(bundlePath)
@@ -109,8 +109,8 @@
bundlePath, extraPath := testPaths(t)
c := newTestInstaller(t)
- c.install(context.Background(), certA, bundlePath, extraPath)
- c.install(context.Background(), certA, bundlePath, extraPath) // resume — hot path hit
+ require.NoError(t, c.install(context.Background(), certA, bundlePath, extraPath))
+ require.NoError(t, c.install(context.Background(), certA, bundlePath, extraPath)) // resume — hot path hit
bundle, err := os.ReadFile(bundlePath)
require.NoError(t, err)
@@ -124,9 +124,9 @@
bundlePath, extraPath := testPaths(t)
c := newTestInstaller(t)
- c.install(context.Background(), certA, bundlePath, extraPath)
+ require.NoError(t, c.install(context.Background(), certA, bundlePath, extraPath))
- c.install(context.Background(), certB, bundlePath, extraPath)
+ require.NoError(t, c.install(context.Background(), certB, bundlePath, extraPath))
normalizedA := strings.TrimRight(certA, "\n") + "\n"
normalizedB := strings.TrimRight(certB, "\n") + "\n"
@@ -144,7 +144,7 @@
bundlePath, extraPath := testPaths(t)
c := newTestInstaller(t)
- c.install(context.Background(), "", bundlePath, extraPath)
+ require.NoError(t, c.install(context.Background(), "", bundlePath, extraPath))
bundle, err := os.ReadFile(bundlePath)
require.NoError(t, err)
@@ -167,7 +167,7 @@
// State of the VM after a previous envd run.
require.NoError(t, os.WriteFile(bundlePath, []byte(baseBundle+normalizedA), 0o644))
- c.install(context.Background(), certA, bundlePath, extraPath)
+ require.NoError(t, c.install(context.Background(), certA, bundlePath, extraPath))
waitForFile(t, extraPath)
bundle, err := os.ReadFile(bundlePath)
@@ -191,7 +191,7 @@
// State of the VM after a previous envd run that installed certA.
require.NoError(t, os.WriteFile(bundlePath, []byte(baseBundle+normalizedA), 0o644))
- c.install(context.Background(), certB, bundlePath, extraPath)
+ require.NoError(t, c.install(context.Background(), certB, bundlePath, extraPath))
waitForFile(t, extraPath)
bundle, err := os.ReadFile(bundlePath)
@@ -208,17 +208,22 @@
bundlePath, extraPath := testPaths(t)
c := newTestInstaller(t)
- c.install(context.Background(), certA, bundlePath, extraPath)
+ require.NoError(t, c.install(context.Background(), certA, bundlePath, extraPath))
var wg sync.WaitGroup
+ errCh := make(chan error, 10)
for range 10 {
wg.Go(func() {
- c.install(context.Background(), certA, bundlePath, extraPath)
+ errCh <- c.install(context.Background(), certA, bundlePath, extraPath)
})
}
wg.Wait()
+ close(errCh)
+ for err := range errCh {
+ require.NoError(t, err)
+ }
// Acquire mu to drain any in-flight background goroutines.
c.mu.Lock()
diff --git a/packages/orchestrator/pkg/template/build/phases/base/provision.sh b/packages/orchestrator/pkg/template/build/phases/base/provision.sh
--- a/packages/orchestrator/pkg/template/build/phases/base/provision.sh
+++ b/packages/orchestrator/pkg/template/build/phases/base/provision.sh
@@ -88,6 +88,7 @@
echo "Backing up CA certs dir for clean source on system boot"
# Backing up CA certs dir so envd ExecStartPre can restore it from a clean source on each system boot.
# This is usefull when extending existing template that already mounted certs directory as tmpfs.
+rm -rf /etc/ssl/certs.orig
cp -a /etc/ssl/certs /etc/ssl/certs.orig
echo "Linking systemd to init"This Bugbot Autofix run was free. To enable autofix for future PRs, go to the Cursor dashboard.
Reviewed by Cursor Bugbot for commit c80e77c. Configure here.
Commit c0890e3 fixed this.
Integration tests added in c80e77c. |
arkamar
left a comment
There was a problem hiding this comment.
All previous points are resolved to my linking, but I noticed one more problem, which could be tricky, but partially solvable. See my comment.
| Group=root | ||
| Environment=GOTRACEBACK=all | ||
| LimitCORE=infinity | ||
| ExecStartPre=/bin/sh -c 'mountpoint -q /etc/ssl/certs || (mkdir -p /run/e2b/certs && cp -a /etc/ssl/certs.orig/. /run/e2b/certs/ && mount --bind /run/e2b/certs /etc/ssl/certs)' |
There was a problem hiding this comment.
This line caught my eye. When a derived template adds certs during a RUN step (via update-ca-certificates for example), those certs end up on the tmpfs but certs.orig stays clean. On the next fresh boot, cp -a /etc/ssl/certs.orig. ... overwrites them.
I am not sure how big issue this could be, but maybe we can call update-ca-certificates between cp and mount to pick up the drop-ins from /usr/local/share/ca-certificates/, which should be the standard way for enterprise environments, see https://ubuntu.com/server/docs/how-to/security/install-a-root-ca-certificate-in-the-trust-store/. It does not solve the non-standard edits, but we can point users to this documentation in the worst case if we would not be able to come up with better solution.


Problem
Sandboxes have no mechanism to receive a custom CA certificate and
install it into the system trust store at startup. This is required
for scenarios where outbound traffic needs to trust a certificate that
differs per sandbox.
Solution
The orchestrator sends a CA bundle (combined PEM string) to envd via
the existing
/initcall on sandbox create and resume. The orchestratorplumbing is a no-op in the open-source version; the real certificate
injection is handled in the EE version.
On the envd side,
CACertInstallerappends the received cert to/etc/ssl/certs/ca-certificates.crton the critical path, thenremoves the previous cert in a background goroutine. Resume with the
same cert is a no-op (in-memory comparison). A state file at
/var/run/e2b/ca-cert.pemensures the previous cert is correctlyremoved even after a process restart.
At startup,
BindMountCABundlecopies the system bundle to tmpfs andbind-mounts it back over the original path. All processes continue
using the standard path transparently; writes bypass the NBD-backed
filesystem entirely. CA cert injection is now essentially instant.
Testing
Unit tests in
packages/envd/internal/host/cacerts_test.gocoverfirst install, resume hot path, cert rotation, concurrent resume,
and restart scenarios (same cert and different cert after OOM).
Measured
append_durationwith the bind-mount in place: ~0.04 ms(down from ~1.8 ms warm NBD / ~460 ms cold GCS).