
feat(security): schedule the behavioral detection scan cron#1444

Open
chong-techops wants to merge 2 commits into staging from feature/TECH-25-schedule-security-behavioral-scan

@chong-techops

What

Schedules the existing security-behavioral-scan endpoint (app/api/cron/security-behavioral-scan/route.ts) so it actually runs. Previously the endpoint existed but nothing invoked it.

Changes

  • Kubernetes CronJob added to prod and staging Helm values (deploy/keeperhub/{prod,staging}/values.yaml), every 5 minutes, mirroring the existing reaper job and reusing deploy/scripts/reaper.sh. Restores prod reaper's backoffLimit: 2 (it was untouched on staging).
  • Auth: the endpoint now validates via authenticateInternalService scoped to the scheduler service (X-Service-Key + SCHEDULER_SERVICE_API_KEY), reusing the existing scheduler SSM key instead of provisioning a new cron secret. Least-privilege: the mcp/events/hub keys that also satisfy internal-service auth are rejected.
  • Detection window: widened EXECUTION_LOOKBACK_MS to 10 minutes so consecutive 5-minute scans overlap and scheduler jitter cannot drop a new-account-first-workflow event. Matched to the 10-minute window of the downstream Loki alert so duplicate emissions dedupe to a single page.
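The overlap argument in the detection-window bullet can be checked with a little arithmetic. The following is a minimal sketch, not the endpoint's code: `isCovered` and `coverageCount` are hypothetical helpers, the constants only mirror the values named in this PR, and the half-open window `(scan - lookback, scan]` is an assumption about how the real query bounds its range.

```typescript
// Hypothetical constants mirroring the values named in this PR.
const SCAN_INTERVAL_MS = 5 * 60 * 1000; // CronJob cadence: every 5 minutes
const EXECUTION_LOOKBACK_MS = 10 * 60 * 1000; // widened lookback window

// Number of scans whose lookback window contains the event.
// Assumption: each scan looks at the half-open range (scan - lookback, scan].
function coverageCount(eventMs: number, scanTimesMs: number[]): number {
  return scanTimesMs.filter(
    (scanMs) => eventMs <= scanMs && eventMs > scanMs - EXECUTION_LOOKBACK_MS
  ).length;
}

// True if at least one scan that actually ran would see the event.
function isCovered(eventMs: number, scanTimesMs: number[]): boolean {
  return coverageCount(eventMs, scanTimesMs) > 0;
}

const minutes = (m: number): number => m * 60 * 1000;

// Scans at 0, 5, 10 and 15 minutes; an event at minute 7.
const allRuns = [0, 1, 2, 3].map((i) => i * SCAN_INTERVAL_MS);
console.log(coverageCount(minutes(7), allRuns)); // 2 (seen by the 10- and 15-minute scans)

// The 10-minute scan is skipped: the 15-minute scan still sees the event.
console.log(isCovered(minutes(7), [0, 5, 15].map(minutes))); // true
```

Under these assumptions every event is seen by two consecutive scans, which is why the duplicate emissions have to be deduplicated downstream.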

Why

The behavioral detection layer surfaces signals (for example, a new account triggering a workflow minutes after signup) that the metrics pipeline can't compute because the underlying data lives in Postgres. Without a scheduler the endpoint never ran, so the Loki alert had no data.

Test plan

  • Unit/integration: tests/integration/security-behavioral-scan.test.ts (6 cases) covers fail-closed auth, scheduler-only scoping (non-scheduler keys rejected), the 200 path, and the structured-log + Sentry emit per detected row.
  • Lint + full type-check green.
  • PR-environment validation: deploy this PR and confirm the CronJob is created, runs on schedule, and the endpoint returns 200 to the scheduler key. (Adding the job to the PR-env Helm template in a follow-up commit on this branch so the PR deployment exercises it.)
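For readers unfamiliar with the auth cases listed in the test plan, here is a rough sketch of what fail-closed, scheduler-only scoping means. It is not the real authenticateInternalService: `authorizeScoped`, the `Caller` union, the `reason` values, and the key map are all hypothetical names chosen for illustration.

```typescript
// Hypothetical caller names; the PR mentions scheduler, mcp, events and hub keys.
type Caller = "scheduler" | "mcp" | "events" | "hub";

type AuthResult =
  | { ok: true; caller: Caller }
  | { ok: false; reason: "missing" | "unknown" | "wrong-caller" };

// Length-checked comparison that does not exit early on the first mismatch.
function safeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  return diff === 0;
}

// Resolve the presented key to a caller, then accept only allowed callers.
// Fail-closed: a missing header or an unconfigured key never authenticates.
function authorizeScoped(
  serviceKey: string | null,
  configuredKeys: Partial<Record<Caller, string>>,
  allowed: Caller[]
): AuthResult {
  if (!serviceKey) return { ok: false, reason: "missing" };
  const entries = Object.entries(configuredKeys) as [Caller, string | undefined][];
  for (const [caller, key] of entries) {
    if (key && safeEqual(serviceKey, key)) {
      return allowed.includes(caller)
        ? { ok: true, caller }
        : { ok: false, reason: "wrong-caller" };
    }
  }
  return { ok: false, reason: "unknown" };
}
```

With `allowed` set to `["scheduler"]`, a valid mcp, events, or hub key is recognized but still rejected, which is the least-privilege behavior the tests assert.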

@github-actions

github-actions Bot commented Jun 3, 2026

PR Environment Deployed

Your PR environment has been deployed!

Environment Details:

Components:

  • Keeperhub Application
  • PostgreSQL Database (isolated instance)
  • LocalStack (SQS emulation)
  • Redis (isolated instance)
  • Schedule Dispatcher (staging image)
  • Block Dispatcher (staging image)
  • Event Tracker (staging image)

The environment will be automatically cleaned up when this PR is closed or merged.

@chong-techops
Author

PR-environment validation

Validated the scheduled scan end-to-end in the deployed PR environment (pr-1444), after rebasing onto current staging.

CronJobs deployed (common chart):

keeperhub-pr-1444-common-reaper                     * * * * *
keeperhub-pr-1444-common-security-behavioral-scan   * * * * *

Forced run against the deployed image (kubectl create job --from=cronjob/...security-behavioral-scan):

Environment variables are ready
{"newAccountFirstWorkflowEvents":0,"durationMs":3}
{"http_code":200,"time_total":0.106875}

This exercises the full chain: CronJob -> reaper.sh -> curl with X-Service-Key=SCHEDULER_SERVICE_API_KEY -> /api/cron/security-behavioral-scan -> authenticateInternalService resolves caller scheduler (legacy-bearer) -> 200 + scan body. Steady-state auto-runs are Complete.

Note on early failed runs: the first few auto-runs (during initial pod startup) show Failed. reaper.sh uses curl -sS (no -f), so an HTTP 4xx/5xx still exits 0 (job Complete); a job Fails only on a transport/connection error. Those failures were therefore the app being unreachable during rollout, not auth or endpoint errors, and they self-healed once the pod was ready. In prod, the 5-minute cadence plus the 10-minute EXECUTION_LOOKBACK_MS overlap means the next scan re-covers the window of a single run skipped during a rollout, so no detection event is dropped unless two or more consecutive runs are missed.
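The exit-code behavior in the note above can be modeled as a small pure function. This is an illustrative sketch only: `CurlResult` and both functions are hypothetical, and it assumes the script's exit code is curl's exit code and ignores backoffLimit retries. The `--fail` variant is included purely for contrast; it is not what reaper.sh does.

```typescript
// Hypothetical model of what a single curl invocation can produce.
type CurlResult =
  | { kind: "http"; status: number } // any HTTP response was received
  | { kind: "transport-error"; message: string }; // e.g. connection refused

type JobStatus = "Complete" | "Failed";

// curl -sS without -f: any HTTP response, including 4xx/5xx, exits 0.
function jobStatusWithoutFail(result: CurlResult): JobStatus {
  return result.kind === "http" ? "Complete" : "Failed";
}

// For contrast: with -f/--fail, curl exits non-zero on HTTP status >= 400.
function jobStatusWithFail(result: CurlResult): JobStatus {
  return result.kind === "http" && result.status < 400 ? "Complete" : "Failed";
}

const serverError: CurlResult = { kind: "http", status: 500 };
const unreachable: CurlResult = { kind: "transport-error", message: "connection refused" };

console.log(jobStatusWithoutFail(serverError)); // Complete
console.log(jobStatusWithoutFail(unreachable)); // Failed
console.log(jobStatusWithFail(serverError)); // Failed
```

Because a 500 still yields a Complete job in the first model, a green CronJob on its own says nothing about whether the scan succeeded.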

CI: build, typecheck, lint, test-unit, test-integration, test-unit-sandbox-remote, and migrate-check are all green. The earlier build/typecheck failures came from the branch being cut from a stale staging before authenticateInternalService became async; fixed by rebasing and porting the endpoint and test to the async, caller-scoped API.

Wire the security-behavioral-scan endpoint to a Kubernetes CronJob in
both prod and staging (every 5 minutes), mirroring the existing reaper
job and reusing deploy/scripts/reaper.sh.

Auth: the endpoint validates via authenticateInternalService (async)
scoped to caller "scheduler" -- the legacy-bearer path resolves the
CronJob's X-Service-Key (SCHEDULER_SERVICE_API_KEY) to that caller, and
least-privilege rejects the other internal callers.

Widen EXECUTION_LOOKBACK_MS to 10 minutes so consecutive 5-minute scans
overlap and scheduler jitter cannot drop a new-account-first-workflow
event; matched to the 10-minute window of the downstream Loki alert.

The scan body is wrapped so a failure of the scan itself (e.g. a DB
error) emits a self-guarded security.behavioral.scan_error signal and a
500 instead of silently dropping detection to zero behind a green
CronJob (reaper.sh reports a non-2xx as a successful job).

Add the security-behavioral-scan CronJob to the PR-environment Helm
template (every minute, faster than prod's 5-minute cadence) so a PR
deployment exercises the scheduled scan end-to-end.
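
The scan-error wrapping described in the first commit message above could look roughly like the sketch below. It is not the code in route.ts: `handleScan`, `runScan`, `emitSignal`, and the 500 body are hypothetical, and "self-guarded" is read here as "a failure while emitting the signal must not mask the 500".

```typescript
// Shape of the scan body shown in the PR-environment validation output.
type ScanResult = { newAccountFirstWorkflowEvents: number; durationMs: number };

// Simplified stand-in for an HTTP response (hypothetical).
type HandlerResponse = { status: number; body: unknown };

async function handleScan(
  runScan: () => Promise<ScanResult>,
  emitSignal: (name: string, detail: Record<string, unknown>) => void
): Promise<HandlerResponse> {
  try {
    return { status: 200, body: await runScan() };
  } catch (error) {
    try {
      // Surface the failure instead of letting detection silently drop to zero.
      emitSignal("security.behavioral.scan_error", {
        message: error instanceof Error ? error.message : String(error),
      });
    } catch {
      // Self-guard: an emit failure must not hide the original scan error.
    }
    return { status: 500, body: { error: "scan_failed" } };
  }
}
```

Since reaper.sh treats a non-2xx response as a successful job, the emitted signal rather than the job status is what makes a failing scan visible.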
@github-actions

github-actions Bot commented Jun 4, 2026

PR Environment Deployed

Your PR environment has been deployed!

Environment Details:

Components:

  • Keeperhub Application
  • PostgreSQL Database (isolated instance)
  • LocalStack (SQS emulation)
  • Redis (isolated instance)
  • Schedule Dispatcher (staging image)
  • Block Dispatcher (staging image)
  • Event Tracker (staging image)

The environment will be automatically cleaned up when this PR is closed or merged.

@chong-techops chong-techops requested review from a team, OleksandrUA and eskp and removed request for a team June 4, 2026 04:21