feat(security): schedule the behavioral detection scan cron #1444
chong-techops wants to merge 2 commits
PR Environment Deployed
Your PR environment has been deployed. The environment will be automatically cleaned up when this PR is closed or merged.
6f1fa02 to eab2874
PR-environment validation
Validated the scheduled scan end-to-end in the deployed PR environment.
CronJobs deployed: …
Forced run against the deployed image: …
This exercises the full chain: CronJob -> …
Note on early failed runs: the first few auto-runs (during initial pod startup) show …
CI: …
Wire the security-behavioral-scan endpoint to a Kubernetes CronJob in both prod and staging (every 5 minutes), mirroring the existing reaper job and reusing deploy/scripts/reaper.sh.

Auth: the endpoint validates via authenticateInternalService (async) scoped to caller "scheduler" -- the legacy-bearer path resolves the CronJob's X-Service-Key (SCHEDULER_SERVICE_API_KEY) to that caller, and least-privilege rejects the other internal callers.

Widen EXECUTION_LOOKBACK_MS to 10 minutes so consecutive 5-minute scans overlap and scheduler jitter cannot drop a new-account-first-workflow event; matched to the 10-minute window of the downstream Loki alert.

The scan body is wrapped so a failure of the scan itself (e.g. a DB error) emits a self-guarded security.behavioral.scan_error signal and a 500 instead of silently dropping detection to zero behind a green CronJob (reaper.sh reports a non-2xx as a successful job).
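The failure handling described in this commit message could be sketched roughly as below. This is a minimal illustration under assumptions, not the code in route.ts: `handleScan`, `runScan`, and `emitSignal` are hypothetical names, and the handler returns a plain object instead of a framework response. Only the signal name `security.behavioral.scan_error` and the 500 status come from the commit message.

```typescript
// Hypothetical shapes for illustration; not the PR's actual types.
type ScanResult = { detected: number };
type HandlerResponse = { status: number; body: Record<string, unknown> };

async function handleScan(
  runScan: () => Promise<ScanResult>,
  emitSignal: (name: string, detail: Record<string, unknown>) => void,
): Promise<HandlerResponse> {
  try {
    const result = await runScan();
    return { status: 200, body: { detected: result.detected } };
  } catch (err) {
    // Self-guarded emit: a failure while emitting must not mask the 500.
    try {
      emitSignal("security.behavioral.scan_error", {
        message: err instanceof Error ? err.message : String(err),
      });
    } catch {
      // Swallow emit failures so the caller still receives the 500.
    }
    return { status: 500, body: { error: "scan_failed" } };
  }
}
```

The inner try/catch is what "self-guarded" is taken to mean here: even if the signal emitter itself throws, the handler still reports the scan failure rather than an unrelated error.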
Add the security-behavioral-scan CronJob to the PR-environment Helm template (every minute, faster than prod's 5-minute cadence) so a PR deployment exercises the scheduled scan end-to-end.
eab2874 to 409b485
What
Schedules the existing security-behavioral-scan endpoint (app/api/cron/security-behavioral-scan/route.ts) so it actually runs. Previously the endpoint existed but nothing invoked it.
Changes
- Kubernetes CronJob in prod and staging (deploy/keeperhub/{prod,staging}/values.yaml), every 5 minutes, mirroring the existing reaper job and reusing deploy/scripts/reaper.sh. Restores prod reaper's backoffLimit: 2 (it was untouched on staging).
- Auth via authenticateInternalService scoped to the scheduler service (X-Service-Key + SCHEDULER_SERVICE_API_KEY), reusing the existing scheduler SSM key instead of provisioning a new cron secret. Least-privilege: the mcp/events/hub keys that also satisfy internal-service auth are rejected.
- Widens EXECUTION_LOOKBACK_MS to 10 minutes so consecutive 5-minute scans overlap and scheduler jitter cannot drop a new-account-first-workflow event. Matched to the 10-minute window of the downstream Loki alert so duplicate emissions dedupe to a single page.
Why
The behavioral detection layer surfaces signals (new account triggering a workflow minutes after signup) that the metrics pipeline can't compute because the substrate lives in Postgres. Without a scheduler the endpoint never ran, so the Loki alert had no data.
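To illustrate why the lookback window is wider than the schedule interval, here is a small sketch of the coverage reasoning. The helper `isCovered` and the sample timestamps are hypothetical. The 5-minute lookback used for comparison is an assumption, since the PR does not state the previous value of EXECUTION_LOOKBACK_MS.

```typescript
const MINUTE_MS = 60 * 1000;

// True if some scan's window [scanTime - lookbackMs, scanTime] contains the event.
function isCovered(eventMs: number, scanTimesMs: number[], lookbackMs: number): boolean {
  return scanTimesMs.some((t) => eventMs <= t && eventMs >= t - lookbackMs);
}

// Nominal 5-minute schedule where the second run starts 20 seconds late.
const scanTimes = [5 * MINUTE_MS, 10 * MINUTE_MS + 20 * 1000];

// The event lands 10 seconds after the first scan has already run.
const eventAt = 5 * MINUTE_MS + 10 * 1000;

// Lookback equal to the interval (assumed comparison value): the late run misses the event.
console.log(isCovered(eventAt, scanTimes, 5 * MINUTE_MS)); // false

// Lookback of 10 minutes: the late run's window still reaches back to the event.
console.log(isCovered(eventAt, scanTimes, 10 * MINUTE_MS)); // true
```

The cost of the overlap is that the same event can be seen by two consecutive scans, which is why the description relies on the downstream alert window to collapse duplicates.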
Test plan
tests/integration/security-behavioral-scan.test.ts (6 cases) covers fail-closed auth, scheduler-only scoping (non-scheduler keys rejected), the 200 path, and the structured-log + Sentry emit per detected row.
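The scheduler-only scoping that these tests exercise might look something like the sketch below. It is not the implementation of authenticateInternalService: `KeyTable`, `resolveCaller`, and `isAllowed` are hypothetical names, the key values are placeholders, and the real function is async and resolves keys from service configuration.

```typescript
type Caller = "scheduler" | "mcp" | "events" | "hub";

// Hypothetical key table standing in for the configured service keys.
type KeyTable = Partial<Record<Caller, string>>;

function resolveCaller(key: string | undefined, keys: KeyTable): Caller | null {
  if (!key) return null;
  for (const caller of Object.keys(keys) as Caller[]) {
    const configured = keys[caller];
    // An unset or empty configured key never matches (fail closed).
    // Plain equality keeps the sketch short; production code would
    // normally use a constant-time comparison.
    if (configured && configured === key) return caller;
  }
  return null;
}

// Least-privilege check: a valid key for a different caller is still rejected.
function isAllowed(key: string | undefined, keys: KeyTable, allowed: Caller): boolean {
  return resolveCaller(key, keys) === allowed;
}

const keys: KeyTable = { scheduler: "sched-key", mcp: "mcp-key" };
console.log(isAllowed("sched-key", keys, "scheduler")); // true
console.log(isAllowed("mcp-key", keys, "scheduler"));   // false
console.log(isAllowed(undefined, keys, "scheduler"));   // false
```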