feat(audit): agentic cert-lifecycle audit trail — attribution + tamper-evidence#313
Merged
…mit on the silent paths
When an MCP/AI agent renews or replaces certificates on a schedule, the audit
log could not say what changed, when, and on whose authority. Two gaps caused
this: the success paths that matter emitted no audit record at all, and the
records that did exist carried only a coarse `user` string.
Attribution (Phase 1 of l0 #408):
- audit entries gain an additive, structured `actor` ({kind, id, label,
token_prefix, agent_session}) and `trigger` ({cause, job_id}). log_operation
synthesises a {kind:'system'} actor when none is passed, so every existing
call site and old reader keeps working.
- new audit_context resolver maps the AUTHENTICATED identity into (actor,
trigger): a scoped key flagged is_agent -> kind='agent', a non-agent scoped
key / legacy bearer -> 'api_token', a session -> 'user'. The client-supplied
X-CertMate-Agent-Session / -Agent-Id headers are recorded as an informational
claim only and never promote a caller to 'agent'.
- auth threads the stable api_key_id + token_prefix + is_agent into
request.current_user; create_api_key accepts and persists is_agent (exposed
on POST /api/keys) so an operator can dedicate an agent-flagged key.
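A minimal sketch of the kind-derivation rule described above, not the actual audit_context code. The `Identity` dataclass, its `source` values, and the `resolve_actor` name are hypothetical stand-ins for whatever request.current_user carries; only the mapping itself and the header name come from this commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Identity:
    """Hypothetical stand-in for the authenticated identity on a request."""
    source: str                      # assumed values: 'scoped_key', 'legacy_bearer', 'session'
    subject_id: str
    is_agent: bool = False
    token_prefix: Optional[str] = None


def resolve_actor(identity: Optional[Identity], headers: dict) -> dict:
    """Derive actor.kind from the authenticated identity only."""
    if identity is None:
        # Mirrors the synthesised fallback actor for unattributed call sites.
        return {"kind": "system", "id": None, "label": None,
                "token_prefix": None, "agent_session": None}
    if identity.source == "scoped_key" and identity.is_agent:
        kind = "agent"
    elif identity.source in ("scoped_key", "legacy_bearer"):
        kind = "api_token"
    elif identity.source == "session":
        kind = "user"
    else:
        kind = "system"
    return {
        "kind": kind,
        "id": identity.subject_id,
        "label": identity.subject_id,   # assumption: no separate display label here
        "token_prefix": identity.token_prefix,
        # The header is stored as a claim; it never feeds into `kind`.
        "agent_session": headers.get("X-CertMate-Agent-Session"),
    }
```

A browser session that sends the agent-session header is still recorded as kind='user'; the header value only appears in agent_session.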
Emission on the previously-silent paths:
- cert_service issue_create / issue_renew / issue_reissue now emit an attributed
success or failure record at the choke point all of API, async executor, and
web routes funnel through (the async path captures the context synchronously).
- scheduled renewals (CertificateManager + ClientCertificateManager
check_renewals) emit per-domain records with actor.kind='scheduler' and
trigger.job_id — the headline unattended-agent case, previously invisible.
- the auto-renew toggle and manual deploy endpoints emit attributed records.
- the MCP server sends a per-process X-CertMate-Agent-Session (override via
CERTMATE_AGENT_SESSION) so an agent session's actions can be grouped.
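One way the per-process session value could be produced on the client side, as a sketch only. The function names and the choice of a random UUID are assumptions; the header name and the CERTMATE_AGENT_SESSION override are the only parts taken from this commit message.

```python
import os
import uuid


def agent_session_id(environ=None) -> str:
    """Return the operator override if set, else a fresh random id."""
    env = os.environ if environ is None else environ
    override = env.get("CERTMATE_AGENT_SESSION", "").strip()
    return override or uuid.uuid4().hex


# Computed once at import time, so every request from this process shares it.
_SESSION = agent_session_id()


def agent_headers(session: str = _SESSION) -> dict:
    """Headers a client would attach to each API call (a claim, not a credential)."""
    return {"X-CertMate-Agent-Session": session}
```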
16 new tests cover kind derivation, the header-is-a-claim rule, success/failure
emission across create/renew, and scheduler attribution.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The audit log was a plain append-only text file with no integrity: a single
line could be edited, deleted, or reordered and nothing would detect it (the
readers even skip malformed lines silently, so a deletion left no trace).
This adds a parallel, append-only certificate_audit.chain.jsonl that records
every audit entry inside a SHA-256 hash chain:
- each line is canonical JSON of {seq, entry, prev_hash, hash}, where hash
commits to seq + entry + prev_hash, and prev_hash links to the previous
line's hash. seq is a gap-free monotonic counter, so a missing seq proves a
deletion. canon uses sorted keys / no whitespace / UTF-8 with non-ASCII
preserved, so an IDN domain hashes identically on writer and verifier.
- the writer keeps single-writer next-seq/last-hash state under a lock (Flask
request threads and the APScheduler renewal thread share one AuditLogger),
fsyncs each line, and only advances state after a durable write — a chain
failure never breaks audit logging or the audited operation, and never
fabricates a phantom gap. State is recovered from the last complete line on
restart; a truncated trailing line (interrupted write) is tolerated.
- the chain is written under the persistent data/ tree (data/audit), not the
ephemeral logs/ tree, so it is the durable verifiable artifact. Disable with
CERTMATE_AUDIT_CHAIN=0.
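The writer described above can be sketched roughly as follows. This is an illustration under assumptions, not the AuditLogger code: the class and function names, the all-zero genesis prev_hash, seq starting at 0, and hashing the canonical JSON of {seq, entry, prev_hash} are all guesses at details this message does not spell out. Restart recovery is left out.

```python
import hashlib
import json
import os
import threading

GENESIS = "0" * 64  # assumed prev_hash for the first record


def canon(obj) -> bytes:
    """Canonical JSON: sorted keys, no whitespace, UTF-8, non-ASCII preserved."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def link_hash(seq: int, entry: dict, prev_hash: str) -> str:
    """Hash that commits to seq + entry + prev_hash (assumed construction)."""
    body = canon({"seq": seq, "entry": entry, "prev_hash": prev_hash})
    return hashlib.sha256(body).hexdigest()


class ChainWriter:
    def __init__(self, path: str):
        self.path = path
        self._lock = threading.Lock()   # request threads and scheduler share one writer
        self._next_seq = 0
        self._last_hash = GENESIS

    def append(self, entry: dict) -> bool:
        """Append one record; return False instead of raising on I/O failure."""
        with self._lock:
            seq, prev = self._next_seq, self._last_hash
            digest = link_hash(seq, entry, prev)
            line = canon({"seq": seq, "entry": entry,
                          "prev_hash": prev, "hash": digest}) + b"\n"
            try:
                with open(self.path, "ab") as fh:
                    fh.write(line)
                    fh.flush()
                    os.fsync(fh.fileno())   # durable before state moves on
            except OSError:
                return False                # state untouched, so no phantom gap
            self._next_seq, self._last_hash = seq + 1, digest
            return True
```

Because state only advances after the fsync succeeds, a failed append reuses the same seq on the next attempt rather than skipping a number.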
A standalone verifier (python -m modules.core.audit_verify [chain.jsonl],
stdlib only, no CertMate import needed) recomputes the chain and reports PASS
or the exact seq and reason of the first break (modification / deletion /
reorder / truncation). Exit 0 intact, 1 broken, 2 missing/IO.
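A stdlib-only verifier along these lines might look like the sketch below. It is not modules.core.audit_verify: the record hashing, the genesis value, seq starting at 0, and the reason strings are assumptions chosen to match the writer sketch, and only the exit-code convention (0 intact, 1 broken, 2 missing/IO) is taken from this message.

```python
import hashlib
import json
import sys

GENESIS = "0" * 64  # assumed prev_hash of the first record


def canon(obj) -> bytes:
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def link_hash(seq, entry, prev_hash) -> str:
    body = canon({"seq": seq, "entry": entry, "prev_hash": prev_hash})
    return hashlib.sha256(body).hexdigest()


def verify_chain(path):
    """Return (ok, seq_of_first_break, reason); raises OSError if unreadable."""
    with open(path, "rb") as fh:
        lines = fh.read().split(b"\n")
    tail = lines.pop()              # b"" when the file ends with a newline
    expected, prev_hash = 0, GENESIS
    for raw in lines:
        try:
            rec = json.loads(raw)
        except ValueError:          # covers bad JSON and bad UTF-8
            return False, expected, "malformed line"
        if not isinstance(rec, dict):
            return False, expected, "malformed line (not a JSON object)"
        if rec.get("seq") != expected:
            return False, expected, "deletion or reorder (unexpected seq)"
        if rec.get("prev_hash") != prev_hash:
            return False, expected, "broken link (prev_hash mismatch)"
        if rec.get("hash") != link_hash(expected, rec.get("entry"), prev_hash):
            return False, expected, "modification (hash mismatch)"
        expected, prev_hash = expected + 1, rec["hash"]
    if tail:
        return False, expected, "truncation (incomplete final line)"
    return True, None, "PASS"


def main(argv) -> int:
    """Map the result to exit codes: 0 intact, 1 broken, 2 missing/IO."""
    try:
        ok, seq, reason = verify_chain(argv[1])
    except (IndexError, OSError):
        print("chain file missing or unreadable", file=sys.stderr)
        return 2
    print(reason if ok else f"FAIL at seq {seq}: {reason}")
    return 0 if ok else 1
```

In a script, the entry point would be `sys.exit(main(sys.argv))`. The sketch walks the file once and stops at the first break, which is why it can name a single seq.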
Honest threat model, stated in the module: the chain detects tampering by
anyone WITHOUT the writer's running state, but does not bind the operator, who
holds the file and can recompute the whole chain. Constraining the operator
needs external anchoring of signed checkpoints (Phase 3), deliberately not
implemented here. No new dependencies (json + hashlib).
18 new tests: canon determinism incl. non-ASCII, write+verify, modify/delete/
reorder detection, restart recovery, truncated-tail tolerance, kill switch,
separate chain dir, and the CLI exit codes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…I toggle
Make the Phase 1/2 audit trail usable without the shell:
- GET /api/audit/verify (admin, read-only) runs the hash-chain verifier and
returns its result, 200 when intact and 409 when broken, so an operator or a
monitoring probe can confirm integrity (or alert) over the API.
- Settings -> API Keys gains an "AI agent key" checkbox (and the list shows an
"agent" badge). It sets is_agent on the key so the agent's actions are
attributed as actor.kind='agent'; this was previously settable only via the API.
3 endpoint tests (intact -> 200, tamper -> 409, no audit -> 503).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
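A monitoring probe against this endpoint could be sketched as below. The status codes (200 intact, 409 broken, 503 when no audit chain is available) come from this commit message; the Bearer Authorization header, the function names, and the returned labels are assumptions.

```python
import urllib.error
import urllib.request


def classify(status: int) -> str:
    """Translate the endpoint's HTTP status into a probe result."""
    return {200: "intact", 409: "broken", 503: "unavailable"}.get(status, "error")


def probe(base_url: str, token: str, timeout: float = 10.0) -> str:
    """Call GET /api/audit/verify with an admin token (auth scheme assumed)."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/audit/verify",
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as exc:
        # urllib raises on 4xx/5xx, so 409 and 503 arrive here.
        return classify(exc.code)
```

Connection failures (urllib.error.URLError) are deliberately left to propagate so that an unreachable server is not mistaken for a broken chain.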
…compliance
- api.md: correct the audit-log section (each line is a logging-prefixed JSON
message split on " - INFO - ", not pure JSON; local vs UTC time bases), and
document the actor/trigger fields, the tamper-evident chain,
GET /api/audit/verify, and the standalone verifier CLI.
- mcp.md: how to give an agent a dedicated is_agent-flagged key so its actions
are attributed as actor.kind='agent', plus the CERTMATE_AGENT_SESSION /
CERTMATE_AGENT_ID env vars.
- compliance.md (new): an honest operator-enablement mapping to NIS2 (strongest
fit), EU AI Act Art. 50 (transparency spirit only), and ISO 42001 (records),
with explicit non-claims and the threat-model limits (the local chain does not
bind the operator; off-box anchoring is not implemented).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lock-on-verify, tail-truncation docs
An adversarial review of the branch found one blocker and two refinements:
- BLOCKER: verify_chain and _recover_chain_state assumed every parsed JSON line
was an object. A valid-JSON-but-non-object line ([1,2,3], 42, null) raised
AttributeError. Because _recover_chain_state runs in AuditLogger.__init__
(called unguarded by the factory) under an except that only caught OSError, a
single malformed line in the chain file could abort app startup, taking the
renewal scheduler down with it. Now: verify_chain reports a non-object line as
malformed (never raises), and recovery skips non-object lines and catches any
exception, disabling the chain rather than ever aborting __init__.
- In-process AuditLogger.verify_chain() now takes the append lock so a verify
racing an in-flight append cannot observe a half-written final line and report
a spurious truncation (the standalone CLI verifier still runs lock-free by
design).
- Documented the inherent limitation that tail truncation (removing entries
from the end) is not detected without an external head anchor (Phase 3), in
api.md, compliance.md, and a pinning regression test.
7 new tests: 5 non-object-line cases, recovery surviving a non-object line, and
the tail-truncation limitation.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
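The recovery hardening described in this commit can be illustrated with a sketch like the following. It is not the real _recover_chain_state: the function name, the return convention (None meaning "disable the chain"), and the fresh-start state are assumptions. The point it shows is the isinstance check on every parsed line and the catch-all that keeps startup from ever aborting.

```python
import json

GENESIS = "0" * 64  # assumed prev_hash for an empty chain


def recover_state(path: str):
    """Return (next_seq, last_hash), or None to signal 'disable the chain'."""
    try:
        try:
            with open(path, "rb") as fh:
                data = fh.read()
        except FileNotFoundError:
            return 0, GENESIS                 # no chain yet: start fresh
        # Everything after the last newline is an incomplete tail; ignore it.
        complete = data.split(b"\n")[:-1]
        for raw in reversed(complete):
            try:
                rec = json.loads(raw)
            except ValueError:
                continue                      # malformed line: keep looking
            if not isinstance(rec, dict):
                continue                      # [1,2,3], 42, null parse fine but are not records
            seq, digest = rec.get("seq"), rec.get("hash")
            if isinstance(seq, int) and not isinstance(seq, bool) and isinstance(digest, str):
                return seq + 1, digest
        return 0, GENESIS                     # assumption: nothing usable means fresh start
    except Exception:
        # Recovery must never take the application down with it.
        return None
```

A caller would treat None as "chain disabled for this process" and continue with plain audit logging.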
The 'AI agent key' badge introduced teal utility classes
(bg-teal-100/text-teal-800 + dark variants) not previously present in the
purged bundle. Rebuild static/css/tailwind.min.css so the frontend-css CI gate
(npm run css:build + git diff --exit-code) stays green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Minor: new capability and API surface (not a bugfix).
- Attribution (actor/trigger) on every certificate-lifecycle audit entry, and
emission on the previously-silent success and scheduled-renewal paths.
- Tamper-evident SHA-256 hash chain + standalone verifier.
- New GET /api/audit/verify (admin) and an "AI agent key" (is_agent) toggle.
- Docs: audit/attribution, the verifier, and an honest compliance mapping.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-text FP)
CodeQL flagged `_scrub_log(domain)` in the _audit_emit failure-path debug log
as "clear-text logging of sensitive information (password)". It is a false
positive: the value is a certificate domain name, already scrubbed. The cause
is field-insensitive taint on the `prepared` dict, which holds both the domain
and the attribution context incl. token_prefix. The domain added no diagnostic
value to that line, so drop it: the operation type is enough, and the scanner
stays green without a manual dismissal.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Closes the gap raised on the v2.15.0 LinkedIn thread: when an MCP/AI agent renews or replaces certificates on a schedule, "it ran" was not an audit trail. CertMate could not say what changed, when, and on whose authority in a form a third party could check.
This is the non-risky scope of that work — attribution + tamper-evidence + the operator surface to use them. It deliberately stops short of signing/anchoring (see Out of scope).
What the analysis found
The starting state was worse than assumed: the success paths emitted no audit record at all. Successful create / renew / deploy / auto-renew, and every scheduled (unattended) renewal, were invisible; only denials were logged. The convenience helpers `log_certificate_created/renewed/...` were dead code. And the log had no integrity (a line could be edited, deleted, or reordered with no trace).
What this adds
Attribution (74149cb)
- Audit entries gain an additive, structured `actor {kind, id, label, token_prefix, agent_session}` and `trigger {cause, job_id}`. Old call sites and readers keep working (a `system` actor is synthesised when none is passed).
- `actor.kind` is derived only from the authenticated identity: an `is_agent`-flagged scoped key → `agent`; a non-agent key / legacy bearer → `api_token`; a session → `user`; the scheduler → `scheduler`. The client-supplied `X-CertMate-Agent-Session` header is recorded as an informational claim and can never promote a caller to `agent`.
- Emission on the previously-silent paths: `cert_service.issue_create/renew/reissue` (covering API sync, async executor, and web), the scheduled renewals (`actor.kind='scheduler'` + `job_id`), and the auto-renew / manual-deploy endpoints.
- The MCP server sends a per-process `X-CertMate-Agent-Session` (override via `CERTMATE_AGENT_SESSION`).
Tamper-evidence (17c77b7)
- A parallel, append-only SHA-256 hash chain at `data/audit/certificate_audit.chain.jsonl`: `{seq, entry, prev_hash, hash}`, gap-free `seq` (a missing seq proves a deletion), byte-stable canon (IDN-safe). Single-writer under a lock, `fsync` per line, state advanced only after a durable write, recovered on restart, truncated-tail tolerated. No new dependencies (json + hashlib).
- A standalone verifier (`python -m modules.core.audit_verify`, stdlib only, runs without CertMate) reports PASS or the exact `seq` + reason of the first break.
Operator surface (e2bd4a1)
- `GET /api/audit/verify` (admin, read-only): runs the verifier, returns `200` intact / `409` broken.
- Settings -> API Keys gains an "AI agent key" checkbox (with an `agent` badge in the list) that sets `is_agent`, so the MCP server can use a key whose actions are attributed as `agent`.
Docs (e1e2307)
- `api.md`: corrects the audit-log section (it wrongly claimed pure JSON, hiding the logging prefix and the local-vs-UTC time bases) and documents the new fields, the chain, the endpoint, and the verifier.
- `mcp.md`: how to give an agent a dedicated `is_agent` key.
- `compliance.md` (new): an honest operator-enablement mapping to NIS2 (strongest fit), EU AI Act Art. 50 (transparency spirit only), and ISO 42001 (records), with explicit non-claims.
Hardening from adversarial review (12da43e)
- A valid-JSON-but-non-object line in the chain file could raise `AttributeError` out of `AuditLogger.__init__` and abort app startup (taking the scheduler down). Fixed so a corrupt line is reported, never crashes; recovery can never abort `__init__`. Plus lock-on-verify and the tail-truncation limitation documented + tested.
Threat-model honesty (stated in code and docs)
The chain detects interior modification / deletion / reorder by anyone without the writer's running state. It does not detect tail truncation without an external head anchor, and it does not bind the operator (who holds the file and could rewrite the whole chain). Both require external anchoring of signed checkpoints — Phase 3, deliberately not in this PR.
Out of scope (by design)
Tests / safety
🤖 Generated with Claude Code