feat(audit): agentic cert-lifecycle audit trail — attribution + tamper-evidence#313

Merged
fabriziosalmi merged 8 commits into
main from
feat/agentic-audit-trail
Jun 15, 2026
@fabriziosalmi

Owner

Closes the gap raised on the v2.15.0 LinkedIn thread: when an MCP/AI agent renews or replaces certificates on a schedule, "it ran" was not an audit trail. CertMate could not say what changed, when, and on whose authority in a form a third party could check.

This is the non-risky scope of that work — attribution + tamper-evidence + the operator surface to use them. It deliberately stops short of signing/anchoring (see Out of scope).

What the analysis found

The starting state was worse than assumed: the success paths emitted no audit record at all — successful create / renew / deploy / auto-renew, and every scheduled (unattended) renewal, were invisible; only denials were logged. The convenience helpers log_certificate_created/renewed/... were dead code. And the log had no integrity (a line could be edited/deleted/reordered with no trace).

What this adds

Attribution (74149cb)

  • Every audit entry gains an additive, structured actor {kind, id, label, token_prefix, agent_session} and trigger {cause, job_id}. Old call sites and readers keep working (a system actor is synthesised when none is passed).
  • actor.kind is derived only from the authenticated identity: an is_agent-flagged scoped key → agent; a non-agent key / legacy bearer → api_token; a session → user; the scheduler → scheduler. The client-supplied X-CertMate-Agent-Session header is recorded as an informational claim and can never promote a caller to agent.
  • Emission now fires on the previously-silent paths: cert_service.issue_create/renew/reissue (covering API sync, async executor, and web), the scheduled renewals (actor.kind='scheduler' + job_id), and the auto-renew / manual-deploy endpoints.
  • The MCP server sends a per-process agent session (override CERTMATE_AGENT_SESSION).
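
The kind-derivation rule above can be sketched as follows. This is a minimal illustration, not CertMate's resolver: the function name `derive_actor` and the shape of the `identity` dict (its `type` values in particular) are assumptions made for the example. The point it shows is that `kind` is chosen from the authenticated identity alone, while the session header is copied through as a claim.

```python
from typing import Optional


def derive_actor(identity: Optional[dict], headers: dict) -> dict:
    """Build an actor record from the authenticated identity only.

    `identity` is a hypothetical dict such as
    {"type": "api_key", "id": "k1", "is_agent": True, "token_prefix": "cm_ab12"}.
    """
    identity = identity or {}
    id_type = identity.get("type")

    if id_type == "api_key":
        # Only the server-side is_agent flag can yield kind="agent".
        kind = "agent" if identity.get("is_agent") else "api_token"
    elif id_type == "legacy_bearer":
        kind = "api_token"
    elif id_type == "session":
        kind = "user"
    elif id_type == "scheduler":
        kind = "scheduler"
    else:
        kind = "system"  # fallback when no actor is known

    return {
        "kind": kind,
        "id": identity.get("id"),
        "label": identity.get("label"),
        "token_prefix": identity.get("token_prefix"),
        # Client-supplied claim: recorded, never used to choose `kind`.
        "agent_session": headers.get("X-CertMate-Agent-Session"),
    }
```

A browser session that sends the header still resolves to `user`; the header value only appears in `agent_session`.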

Tamper-evidence (17c77b7)

  • An append-only SHA-256 hash chain at data/audit/certificate_audit.chain.jsonl: {seq, entry, prev_hash, hash}, gap-free seq (a missing seq proves a deletion), byte-stable canon (IDN-safe). Single-writer under a lock, fsync per line, state advanced only after a durable write, recovered on restart, truncated-tail tolerated. No new dependencies (json + hashlib).
  • A standalone verifier (python -m modules.core.audit_verify, stdlib only, runs without CertMate) reports PASS or the exact seq + reason of the first break.
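
A minimal sketch of the record format and a verifier pass, using only `json` and `hashlib`. The helper names, the all-zero genesis `prev_hash`, and `seq` starting at 0 are assumptions for illustration; the PR text fixes only the field set, the canonical form, and the "first break" reporting.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed prev_hash for the first record


def canon(obj) -> bytes:
    # Sorted keys, no whitespace, non-ASCII preserved, UTF-8 encoded.
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def make_record(seq: int, entry: dict, prev_hash: str) -> dict:
    body = {"seq": seq, "entry": entry, "prev_hash": prev_hash}
    return {**body, "hash": hashlib.sha256(canon(body)).hexdigest()}


def verify(records: list) -> tuple:
    """Return (True, None, None) or (False, seq, reason) for the first break."""
    prev = GENESIS
    for expected_seq, rec in enumerate(records):
        if rec.get("seq") != expected_seq:
            return (False, expected_seq, "seq gap or reorder")
        if rec.get("prev_hash") != prev:
            return (False, expected_seq, "prev_hash mismatch")
        body = {
            "seq": rec.get("seq"),
            "entry": rec.get("entry"),
            "prev_hash": rec.get("prev_hash"),
        }
        if hashlib.sha256(canon(body)).hexdigest() != rec.get("hash"):
            return (False, expected_seq, "hash mismatch")
        prev = rec["hash"]
    return (True, None, None)


def build_chain(entries: list) -> list:
    chain, prev = [], GENESIS
    for seq, entry in enumerate(entries):
        rec = make_record(seq, entry, prev)
        chain.append(rec)
        prev = rec["hash"]
    return chain
```

Because each hash commits to `seq`, `entry`, and `prev_hash`, editing an interior entry breaks that record's own hash, and removing or swapping records breaks the expected `seq` order.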

Operator surface (e2bd4a1)

  • GET /api/audit/verify (admin, read-only): runs the verifier, returns 200 intact / 409 broken.
  • Settings → API Keys: an "AI agent key" checkbox (+ an agent badge in the list) that sets is_agent, so the MCP server can use a key whose actions are attributed as agent.
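
A monitoring probe against the endpoint might look like the sketch below. The `Authorization: Bearer` scheme and the function names are assumptions for the example; only the path and the 200/409 meanings come from the description above.

```python
import urllib.error
import urllib.request


def interpret_status(status: int) -> str:
    # 200 and 409 are the two outcomes described for the endpoint.
    if status == 200:
        return "intact"
    if status == 409:
        return "broken"
    return "unknown"


def check_audit_chain(base_url: str, token: str) -> str:
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/audit/verify",
        headers={"Authorization": f"Bearer {token}"},  # auth scheme assumed
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return interpret_status(resp.status)
    except urllib.error.HTTPError as exc:
        # urllib raises for 4xx/5xx responses, so a 409 arrives here.
        return interpret_status(exc.code)
```

Treating anything other than 200 or 409 as `unknown` keeps the probe from reporting a broken chain when the real problem is, say, an expired admin token.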

Docs (e1e2307)

  • api.md: corrects the audit-log section (it wrongly claimed pure JSON, hiding the logging prefix and the local-vs-UTC time bases) and documents the new fields, the chain, the endpoint, and the verifier.
  • mcp.md: how to give an agent a dedicated is_agent key.
  • compliance.md (new): an honest operator-enablement mapping to NIS2 (strongest fit), EU AI Act Art. 50 (transparency spirit only), and ISO 42001 (records), with explicit non-claims.

Hardening from adversarial review (12da43e)

  • An independent review found a blocker: a non-object JSON line in the chain file could raise AttributeError out of AuditLogger.__init__ and abort app startup (taking the scheduler down). Fixed so that a corrupt line is reported and never crashes, and recovery can never abort __init__. The same commit adds lock-on-verify, and documents and tests the tail-truncation limitation.

Threat-model honesty (stated in code and docs)

The chain detects interior modification / deletion / reorder by anyone without the writer's running state. It does not detect tail truncation without an external head anchor, and it does not bind the operator (who holds the file and could rewrite the whole chain). Both require external anchoring of signed checkpoints — Phase 3, deliberately not in this PR.
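
The tail-truncation limitation can be demonstrated in a few lines. This is an illustrative sketch with hypothetical helper names and an assumed all-zero genesis hash; the "head seen elsewhere" variable merely stands in for the external anchor that this PR does not implement.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting prev_hash


def _digest(seq, entry, prev_hash):
    body = json.dumps(
        {"seq": seq, "entry": entry, "prev_hash": prev_hash},
        sort_keys=True, separators=(",", ":"), ensure_ascii=False,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def build(entries):
    chain, prev = [], GENESIS
    for seq, entry in enumerate(entries):
        h = _digest(seq, entry, prev)
        chain.append({"seq": seq, "entry": entry, "prev_hash": prev, "hash": h})
        prev = h
    return chain


def intact(chain):
    prev = GENESIS
    for seq, rec in enumerate(chain):
        if rec["seq"] != seq or rec["prev_hash"] != prev:
            return False
        if _digest(rec["seq"], rec["entry"], rec["prev_hash"]) != rec["hash"]:
            return False
        prev = rec["hash"]
    return True


chain = build([{"op": "create"}, {"op": "renew"}, {"op": "deploy"}])
head_seen_elsewhere = chain[-1]["hash"]  # what an external anchor would hold

truncated = chain[:-1]                   # the newest record is dropped
tail_cut_passes = intact(truncated)      # the remaining prefix is self-consistent
anchor_catches_it = truncated[-1]["hash"] != head_seen_elsewhere
interior_cut_passes = intact(chain[:1] + chain[2:])
```

Dropping the last record leaves a valid shorter chain, so only a head hash recorded somewhere the attacker cannot reach reveals the cut. Dropping an interior record fails on its own.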

Out of scope (by design)

  • Ed25519 signed export bundle + external anchoring (Phase 3) — introduces signing-key lifecycle (loss/rotation) and touches SMTP/S3; a separate decision.
  • Including the chain in the unified backup zip — touches the backup contract.

Tests / safety

  • +60 tests (attribution, kind derivation, header-is-a-claim, success/failure emission across create/renew, scheduler attribution, chain write/verify, modify/delete/reorder detection, restart recovery, non-object-line safety, the verify endpoint). Full suite 1530 passed, 17 skipped. No emoji. Template theme-token gate passes.
  • Audit emission is isolated best-effort everywhere (try/except), so it cannot break or alter certificate issuance/renewal.
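
The isolation rule in the last bullet amounts to the pattern below. This is a sketch under assumed names (`emit_best_effort`, `renew_certificate`, and the `emit` callable are all hypothetical), showing that a failing audit sink cannot change the result of the audited operation.

```python
import logging

logger = logging.getLogger("audit_sketch")


def emit_best_effort(emit, **fields) -> bool:
    """Call the audit sink; swallow and log any failure."""
    try:
        emit(**fields)
        return True
    except Exception:
        # Log only the operation type, not the attribution context.
        logger.debug("audit emission failed for %s", fields.get("operation"))
        return False


def renew_certificate(domain: str, emit) -> str:
    result = f"renewed:{domain}"  # stand-in for the real renewal work
    emit_best_effort(emit, operation="renew", domain=domain, status="success")
    return result
```

The trade-off is that a broken sink loses records silently unless the debug log is watched, which is one reason a separate integrity check on the chain is useful.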

Note before merge: Phase 1 wraps the issuance choke point (issue_create/renew/reissue) additively, and the renewal tests pass. Since that path was touched, though, a real-cert smoke test (issue + renew against a test subdomain) is the prudent final check.

🤖 Generated with Claude Code

fabriziosalmi and others added 6 commits June 15, 2026 13:02
…mit on the silent paths

When an MCP/AI agent renews or replaces certificates on a schedule, the audit
log could not say what changed, when, and on whose authority. Two gaps caused
this: the success paths that matter emitted no audit record at all, and the
records that did exist carried only a coarse `user` string.

Attribution (Phase 1 of l0 #408):

- audit entries gain an additive, structured `actor` ({kind, id, label,
  token_prefix, agent_session}) and `trigger` ({cause, job_id}). log_operation
  synthesises a {kind:'system'} actor when none is passed, so every existing
  call site and old reader keeps working.
- new audit_context resolver maps the AUTHENTICATED identity into (actor,
  trigger): a scoped key flagged is_agent -> kind='agent', a non-agent scoped
  key / legacy bearer -> 'api_token', a session -> 'user'. The client-supplied
  X-CertMate-Agent-Session / -Agent-Id headers are recorded as an informational
  claim only and never promote a caller to 'agent'.
- auth threads the stable api_key_id + token_prefix + is_agent into
  request.current_user; create_api_key accepts and persists is_agent (exposed
  on POST /api/keys) so an operator can dedicate an agent-flagged key.

Emission on the previously-silent paths:

- cert_service issue_create / issue_renew / issue_reissue now emit an attributed
  success or failure record at the choke point all of API, async executor, and
  web routes funnel through (the async path captures the context synchronously).
- scheduled renewals (CertificateManager + ClientCertificateManager
  check_renewals) emit per-domain records with actor.kind='scheduler' and
  trigger.job_id — the headline unattended-agent case, previously invisible.
- the auto-renew toggle and manual deploy endpoints emit attributed records.

- the MCP server sends a per-process X-CertMate-Agent-Session (override via
  CERTMATE_AGENT_SESSION) so an agent session's actions can be grouped.
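
On the client side, a per-process session with an environment override could be as small as the sketch below. The function names, the use of `uuid4`, and the `Authorization: Bearer` header are assumptions; only the header name and the `CERTMATE_AGENT_SESSION` variable come from the commit message.

```python
import os
import uuid


def agent_session_id(environ=os.environ) -> str:
    # An explicit override wins; otherwise generate a random per-process id.
    override = environ.get("CERTMATE_AGENT_SESSION")
    return override if override else uuid.uuid4().hex


SESSION = agent_session_id()  # computed once, at import time


def outgoing_headers(token: str) -> dict:
    return {
        "Authorization": f"Bearer {token}",  # auth scheme assumed
        "X-CertMate-Agent-Session": SESSION,
    }
```

Since the server treats this header as a claim, the value is useful for grouping an agent's actions, not for proving who sent them.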

16 new tests cover kind derivation, the header-is-a-claim rule, success/failure
emission across create/renew, and scheduler attribution.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The audit log was a plain append-only text file with no integrity: a single
line could be edited, deleted, or reordered and nothing would detect it (the
readers even skip malformed lines silently, so a deletion left no trace).

This adds a parallel, append-only certificate_audit.chain.jsonl that records
every audit entry inside a SHA-256 hash chain:

- each line is canonical JSON of {seq, entry, prev_hash, hash}, where hash
  commits to seq + entry + prev_hash, and prev_hash links to the previous
  line's hash. seq is a gap-free monotonic counter, so a missing seq proves a
  deletion. canon uses sorted keys / no whitespace / UTF-8 with non-ASCII
  preserved, so an IDN domain hashes identically on writer and verifier.
- the writer keeps single-writer next-seq/last-hash state under a lock (Flask
  request threads and the APScheduler renewal thread share one AuditLogger),
  fsyncs each line, and only advances state after a durable write — a chain
  failure never breaks audit logging or the audited operation, and never
  fabricates a phantom gap. State is recovered from the last complete line on
  restart; a truncated trailing line (interrupted write) is tolerated.
- the chain is written under the persistent data/ tree (data/audit), not the
  ephemeral logs/ tree, so it is the durable verifiable artifact. Disable with
  CERTMATE_AUDIT_CHAIN=0.
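
The writer discipline described above (lock, fsync per line, advance state only after a durable write, recover from the last complete line) can be sketched as follows. `ChainWriter` is a hypothetical class, not CertMate's `AuditLogger`, and the all-zero genesis hash is assumed.

```python
import hashlib
import json
import os
import threading


class ChainWriter:
    """Single-writer append-only chain (illustrative sketch)."""

    def __init__(self, path: str):
        self.path = path
        self._lock = threading.Lock()
        self._next_seq, self._last_hash = 0, "0" * 64  # genesis value assumed
        self._recover()

    def _recover(self) -> None:
        try:
            with open(self.path, "r", encoding="utf-8") as fh:
                for line in fh:
                    try:
                        rec = json.loads(line)
                    except ValueError:
                        continue  # e.g. a truncated trailing line
                    if (isinstance(rec, dict) and isinstance(rec.get("seq"), int)
                            and isinstance(rec.get("hash"), str)):
                        self._next_seq, self._last_hash = rec["seq"] + 1, rec["hash"]
        except (OSError, ValueError):
            pass  # no readable chain yet: keep the genesis state

    def append(self, entry: dict) -> bool:
        with self._lock:
            body = {"seq": self._next_seq, "entry": entry, "prev_hash": self._last_hash}
            canon = json.dumps(body, sort_keys=True, separators=(",", ":"),
                               ensure_ascii=False)
            digest = hashlib.sha256(canon.encode("utf-8")).hexdigest()
            line = json.dumps({**body, "hash": digest}, sort_keys=True,
                              separators=(",", ":"), ensure_ascii=False)
            try:
                with open(self.path, "a", encoding="utf-8") as fh:
                    fh.write(line + "\n")
                    fh.flush()
                    os.fsync(fh.fileno())
            except OSError:
                return False  # state untouched, so no phantom gap
            # Advance only after the line is durably on disk.
            self._next_seq, self._last_hash = body["seq"] + 1, digest
            return True
```

The sketch tolerates a partial trailing line when recovering but does not repair it before the next append; a production writer would need to handle that case as well.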

A standalone verifier (python -m modules.core.audit_verify [chain.jsonl],
stdlib only, no CertMate import needed) recomputes the chain and reports PASS
or the exact seq and reason of the first break (modification / deletion /
reorder / truncation). Exit 0 intact, 1 broken, 2 missing/IO.
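
A cron job or CI step could wrap the CLI and act on those exit codes. In this sketch the wrapper names are hypothetical, and it assumes the command is run from the repository root so that `modules.core.audit_verify` is importable.

```python
import subprocess
import sys

EXIT_MEANING = {0: "intact", 1: "broken", 2: "missing or I/O error"}


def describe_exit(code: int) -> str:
    return EXIT_MEANING.get(code, f"unexpected exit code {code}")


def run_verifier(chain_path: str) -> str:
    # Invokes the standalone verifier with the chain file as its argument.
    proc = subprocess.run(
        [sys.executable, "-m", "modules.core.audit_verify", chain_path],
        capture_output=True,
        text=True,
    )
    return describe_exit(proc.returncode)
```

Keeping "broken" (1) distinct from "missing or I/O error" (2) lets an alert separate evidence of tampering from a misconfigured path.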

Honest threat model, stated in the module: the chain detects tampering by
anyone WITHOUT the writer's running state, but does not bind the operator, who
holds the file and can recompute the whole chain. Constraining the operator
needs external anchoring of signed checkpoints (Phase 3), deliberately not
implemented here. No new dependencies (json + hashlib).

18 new tests: canon determinism incl. non-ASCII, write+verify, modify/delete/
reorder detection, restart recovery, truncated-tail tolerance, kill switch,
separate chain dir, and the CLI exit codes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…I toggle

Make the Phase 1/2 audit trail usable without the shell:

- GET /api/audit/verify (admin, read-only) runs the hash-chain verifier and
  returns its result, 200 when intact and 409 when broken, so an operator or a
  monitoring probe can confirm integrity (or alert) over the API.
- Settings -> API Keys gains an "AI agent key" checkbox (and the list shows an
  "agent" badge). It sets is_agent on the key so the agent's actions are
  attributed as actor.kind='agent' — previously settable only via the API.

3 endpoint tests (intact -> 200, tamper -> 409, no audit -> 503).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…compliance

- api.md: correct the audit-log section (each line is a logging-prefixed JSON
  message split on " - INFO - ", not pure JSON; local vs UTC time bases), and
  document the actor/trigger fields, the tamper-evident chain, GET
  /api/audit/verify, and the standalone verifier CLI.
- mcp.md: how to give an agent a dedicated is_agent-flagged key so its actions
  are attributed as actor.kind='agent', plus the CERTMATE_AGENT_SESSION /
  CERTMATE_AGENT_ID env vars.
- compliance.md (new): an honest operator-enablement mapping to NIS2 (strongest
  fit), EU AI Act Art. 50 (transparency spirit only), and ISO 42001 (records),
  with explicit non-claims and the threat-model limits (the local chain does not
  bind the operator; off-box anchoring is not implemented).
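
A reader for the corrected log format might split on the separator like this. The sample line is an assumption for illustration: the timestamp prefix and the `operation` field are hypothetical, and only the `" - INFO - "` separator and the JSON payload come from the note above.

```python
import json
from typing import Optional

SEPARATOR = " - INFO - "


def parse_audit_line(line: str) -> Optional[dict]:
    """Return the JSON payload of one log line, or None if it does not parse."""
    _prefix, sep, payload = line.partition(SEPARATOR)
    if not sep:
        return None  # no logging prefix found
    try:
        entry = json.loads(payload)
    except ValueError:
        return None
    return entry if isinstance(entry, dict) else None


# Hypothetical line; the real prefix layout may differ.
sample = (
    '2026-06-15 13:02:11,123 - INFO - '
    '{"operation": "renew", "actor": {"kind": "scheduler"}}'
)
entry = parse_audit_line(sample)
```

`str.partition` splits on the first occurrence only, so a payload that happens to contain the separator text still parses as a whole.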

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lock-on-verify, tail-truncation docs

An adversarial review of the branch found one blocker and two refinements:

- BLOCKER: verify_chain and _recover_chain_state assumed every parsed JSON
  line was an object. A valid-JSON-but-non-object line ([1,2,3], 42, null)
  raised AttributeError. Because _recover_chain_state runs in
  AuditLogger.__init__ (called unguarded by the factory) under an except that
  only caught OSError, a single malformed line in the chain file could abort
  app startup — taking the renewal scheduler down with it. Now: verify_chain
  reports a non-object line as malformed (never raises), and recovery skips
  non-object lines and catches any exception, disabling the chain rather than
  ever aborting __init__.
- In-process AuditLogger.verify_chain() now takes the append lock so a verify
  racing an in-flight append cannot observe a half-written final line and
  report a spurious truncation (the standalone CLI verifier still runs lock-free
  by design).
- Documented the inherent limitation that tail truncation (removing entries
  from the end) is not detected without an external head anchor (Phase 3), in
  api.md, compliance.md, and a pinning regression test.
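
The blocker comes down to checking the parsed type before treating a line as a record. A minimal sketch of that defensive step, with a hypothetical function name and return shape:

```python
import json

REQUIRED = {"seq", "entry", "prev_hash", "hash"}


def classify_line(raw: str) -> tuple:
    """Return ("ok", record) or ("malformed", reason); never raises on str input."""
    try:
        parsed = json.loads(raw)
    except ValueError:
        return ("malformed", "not valid JSON")
    if not isinstance(parsed, dict):
        # [1,2,3], 42 and null are valid JSON but have no .get().
        return ("malformed", f"JSON {type(parsed).__name__}, expected object")
    missing = REQUIRED - parsed.keys()
    if missing:
        return ("malformed", "missing fields: " + ", ".join(sorted(missing)))
    return ("ok", parsed)
```

Returning a reason instead of raising is what lets recovery skip the line and the verifier report it, rather than either one propagating an exception into startup.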

7 new tests: 5 non-object-line cases, recovery surviving a non-object line, and
the tail-truncation limitation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 'AI agent key' badge introduced teal utility classes
(bg-teal-100/text-teal-800 + dark variants) not previously present in the
purged bundle. Rebuild static/css/tailwind.min.css so the frontend-css CI gate
(npm run css:build + git diff --exit-code) stays green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread modules/core/cert_service.py Fixed
fabriziosalmi and others added 2 commits June 15, 2026 13:57
Minor: new capability and API surface (not a bugfix).

- Attribution (actor/trigger) on every certificate-lifecycle audit entry, and
  emission on the previously-silent success and scheduled-renewal paths.
- Tamper-evident SHA-256 hash chain + standalone verifier.
- New GET /api/audit/verify (admin) and an "AI agent key" (is_agent) toggle.
- Docs: audit/attribution, the verifier, and an honest compliance mapping.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-text FP)

CodeQL flagged `_scrub_log(domain)` in the _audit_emit failure-path debug log
as "clear-text logging of sensitive information (password)". It is a false
positive — the value is a certificate domain name, already scrubbed — caused by
field-insensitive taint on the `prepared` dict (which holds both the domain and
the attribution context incl. token_prefix). The domain added no diagnostic
value to that line, so drop it: the operation type is enough, and the scanner
stays green without a manual dismissal.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@fabriziosalmi fabriziosalmi merged commit 9763077 into main Jun 15, 2026
6 of 7 checks passed
@fabriziosalmi fabriziosalmi deleted the feat/agentic-audit-trail branch June 15, 2026 12:21