Add baseline integrity (M2): provenance, audit log, verify command #89
Open
pengfei-threemoonslab wants to merge 1 commit into
Bumps baseline schema 0.4 → 0.5 and closes the silent-baseline-edit
trust hole identified in the M2 plan. Casual / accidental edits to
`baseline.json` are now observably wrong in code review.
Schema
- Adds optional `BaselineProvenance` to each `BaselineFinding`
(scanner_version, run_id, recorded_at, reason, expires). Older
v0.2/v0.3/v0.4 baselines still load with `provenance=None`;
re-running `baseline save` upgrades them.
- Re-saves preserve existing provenance for fingerprints already in
the baseline. Only newly-added entries get a fresh stamp, so
reviewer-set `reason` / `expires` survive subsequent saves.
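The preservation rule above can be sketched as follows. This is a minimal illustration, not the code in this PR: `Provenance` and `merge_provenance` are hypothetical names, the field types are guesses, and only the field names come from the description.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Provenance:
    # Field names follow the description above; the types are assumptions.
    scanner_version: str
    run_id: str
    recorded_at: str
    reason: Optional[str] = None
    expires: Optional[str] = None


def merge_provenance(
    prior: Dict[str, Optional[Provenance]],
    current_fingerprints: Iterable[str],
    scanner_version: str,
    run_id: str,
) -> Dict[str, Provenance]:
    """Keep existing stamps; stamp only entries that have none yet."""
    now = datetime.now(timezone.utc).isoformat()
    merged: Dict[str, Provenance] = {}
    for fingerprint in current_fingerprints:
        existing = prior.get(fingerprint)
        if existing is not None:
            # Already stamped: reviewer-set reason/expires survive the re-save.
            merged[fingerprint] = existing
        else:
            # New fingerprint, or a legacy entry loaded with provenance=None.
            merged[fingerprint] = Provenance(scanner_version, run_id, now)
    return merged
```

Fingerprints that are absent from the current run simply drop out of the result, which is where the removed-fingerprint delta would come from in this sketch.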
Audit log
- New `core/baseline_audit.py`. `baseline save` appends one JSON line
to `<baseline-dir>/baseline-audit.log` per save: timestamp, run_id,
scanner_version, hash_before, hash_after, added/removed fingerprint
deltas. Append-only on the writer side.
- Honest threat model: tamper-evident, not tamper-proof. An adversary
who rewrites both files atomically defeats `verify`; the goal is
making casual edits visible in `git log`.
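As a rough sketch of the mechanism, assuming one JSON object per line and that verification compares the current file hash with the last recorded `hash_after`: the helper names and the exact JSON key names below are hypothetical, and the real wire shape is whatever `core/baseline_audit.py` and STABILITY.md define.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Optional


def sha256_of(path: Path) -> Optional[str]:
    """Hex SHA-256 of a file, or None if the file does not exist yet."""
    if not path.exists():
        return None
    return hashlib.sha256(path.read_bytes()).hexdigest()


def append_audit_entry(
    log_path: Path,
    *,
    run_id: str,
    scanner_version: str,
    hash_before: Optional[str],
    hash_after: Optional[str],
    added: List[str],
    removed: List[str],
) -> dict:
    """Append one JSON line per save; earlier lines are never rewritten."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "run_id": run_id,
        "scanner_version": scanner_version,
        "hash_before": hash_before,
        "hash_after": hash_after,
        "added": sorted(added),      # key name is an assumption
        "removed": sorted(removed),  # key name is an assumption
    }
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry


def matches_last_entry(baseline_path: Path, log_path: Path) -> bool:
    """True when the baseline's current hash equals the last hash_after."""
    if not log_path.exists():
        return False  # a missing audit log counts as a mismatch
    lines = [ln for ln in log_path.read_text(encoding="utf-8").splitlines() if ln.strip()]
    if not lines:
        return False
    return json.loads(lines[-1])["hash_after"] == sha256_of(baseline_path)
```

A hand edit changes the file hash without adding a log line, so the comparison fails. Rewriting both files consistently passes it, which is the limit the threat model states.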
Manifest
- `baseline.integrity_mode: off | warn | strict` (default warn).
`off` is the back-compat escape hatch; `strict` makes
SHIP-BASELINE-INTEGRITY-MISMATCH set `blocks_release=true`.
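One way to read the three modes, as a hedged sketch: `mismatch_outcome` is a hypothetical helper, and the assumption that `off` emits no integrity finding at all is not stated in this description.

```python
VALID_INTEGRITY_MODES = ("off", "warn", "strict")


def mismatch_outcome(integrity_mode: str = "warn") -> dict:
    """Describe what an integrity mismatch does under each mode."""
    if integrity_mode not in VALID_INTEGRITY_MODES:
        raise ValueError(f"unknown integrity_mode: {integrity_mode!r}")
    return {
        # Assumption: "off" skips the integrity checks entirely.
        "emit_finding": integrity_mode != "off",
        # Per the description, only "strict" sets blocks_release=true.
        "blocks_release": integrity_mode == "strict",
    }
```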
Checks
- Three new check IDs in `category: baseline`:
- SHIP-BASELINE-INTEGRITY-MISMATCH (critical) — hash mismatch,
missing audit log, entry references unknown run_id, or entry
pre-dates v0.5 and lacks provenance.
- SHIP-BASELINE-ENTRY-EXPIRED (high) — `provenance.expires` past.
- SHIP-BASELINE-ENTRY-STALE (low) — deprecated check ID or
resolved-not-pruned (scan-aware).
- Integrity findings bypass `checks.ignore` and
`checks.severity_overrides` — silencing tamper detection would
defeat the trust property the audit log defends.
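The bypass can be pictured as a filter that passes integrity findings through untouched. This is an illustrative sketch only: `Finding` and `apply_check_config` are hypothetical, and treating all three `category: baseline` IDs as integrity findings is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterable, List, Set

# Assumption: every new check ID in `category: baseline` is exempt.
INTEGRITY_CHECK_IDS = frozenset(
    {
        "SHIP-BASELINE-INTEGRITY-MISMATCH",
        "SHIP-BASELINE-ENTRY-EXPIRED",
        "SHIP-BASELINE-ENTRY-STALE",
    }
)


@dataclass(frozen=True)
class Finding:
    check_id: str
    severity: str


def apply_check_config(
    findings: Iterable[Finding],
    ignore: Set[str],
    severity_overrides: Dict[str, str],
) -> List[Finding]:
    """Apply ignore and severity overrides to everything except integrity findings."""
    result: List[Finding] = []
    for finding in findings:
        if finding.check_id in INTEGRITY_CHECK_IDS:
            result.append(finding)  # never dropped, never re-graded
            continue
        if finding.check_id in ignore:
            continue
        override = severity_overrides.get(finding.check_id)
        if override is not None:
            finding = replace(finding, severity=override)
        result.append(finding)
    return result
```

With this shape, listing `SHIP-BASELINE-INTEGRITY-MISMATCH` under `checks.ignore` has no effect, so a manifest edit cannot switch off tamper detection.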
CLI
- New `agents-shipgate baseline verify` command with `--strict`,
`--json`, `--audit-log` flags.
- New exit code 6 (`baseline_integrity_failure`) emitted only by
`baseline verify --strict` on hash mismatch. `scan` continues to
use 20 for gate failure regardless of integrity_mode.
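The exit-code split can be sketched like this. The function names are hypothetical, and returning 0 from a non-strict `verify` on mismatch is an assumption; only the values 6 and 20 come from the description.

```python
EXIT_OK = 0  # assumption: the usual success code
EXIT_BASELINE_INTEGRITY_FAILURE = 6
EXIT_GATE_FAILURE = 20


def verify_exit_code(*, hash_mismatch: bool, strict: bool) -> int:
    """Exit code 6 is reserved for `baseline verify --strict` on hash mismatch."""
    if strict and hash_mismatch:
        return EXIT_BASELINE_INTEGRITY_FAILURE
    return EXIT_OK


def scan_exit_code(*, gate_failed: bool, integrity_mode: str) -> int:
    """`scan` keeps using 20 for gate failure, whatever integrity_mode is set."""
    # integrity_mode may influence whether the gate fails upstream (strict sets
    # blocks_release), but it never changes which code `scan` exits with.
    return EXIT_GATE_FAILURE if gate_failed else EXIT_OK
```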
Tests
- 28 new tests in `tests/test_baseline_integrity.py` covering audit-
log primitives, provenance behavior in `baseline_from_report`,
every `verify_baseline` issue kind, the scan-pipeline integration
in warn/strict/off modes, and the CLI `baseline verify` surface.
- 1119 total tests pass; ruff clean; schemas regenerated and
`generate_schemas.py --check` is clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Bumps the baseline schema 0.4 → 0.5: adds optional `BaselineProvenance` (`scanner_version` / `run_id` / `recorded_at` / `reason` / `expires`) on every entry; legacy 0.2/0.3/0.4 baselines still load.
- Adds an append-only audit log at `<baseline-dir>/baseline-audit.log`, recording every `baseline save` with SHA-256 hashes before/after plus added/removed fingerprint deltas. The wire shape is documented as stable in STABILITY.md.
- Adds `manifest.baseline.integrity_mode: off | warn | strict` (default `warn`) and three new check IDs: `SHIP-BASELINE-INTEGRITY-MISMATCH` (critical), `SHIP-BASELINE-ENTRY-EXPIRED` (high), `SHIP-BASELINE-ENTRY-STALE` (low).
- Adds `agents-shipgate baseline verify [--strict] [--json] [--audit-log]` and reserves new exit code 6 (`baseline_integrity_failure`) for `verify --strict`; `scan` continues to use 20 for gate failure regardless of `integrity_mode`.
Type
What this fixes (why)
Closes the silent-baseline-edit trust hole identified in the M2 plan. Today a developer can hand-edit `baseline.json` to add a fingerprint and the scanner accepts it: a `release_decision.decision = blocked` becomes `review_required` without an audit trail. For a tool that pitches itself as a release gate, this is a credibility hole. M2 makes casual / accidental edits observably wrong in code review.
Honest threat model (written into both `STABILITY.md` and the check rationale): the audit log is tamper-evident, not tamper-proof. A well-resourced adversary who atomically rewrites both `baseline.json` and the audit log defeats `verify`. The goal is for `git log .agents-shipgate/baseline-audit.log` to expose casual edits, not to defeat a determined attacker.
Design notes for reviewers
- Integrity findings bypass `checks.ignore` and `checks.severity_overrides`. Silencing tamper detection would defeat the property the audit log defends. Severity-tuning is M1's job, not M2's. See `src/agents_shipgate/checks/baseline_integrity.py`.
- Provenance is preserved across re-saves: reviewer-set `reason` and `expires` survive subsequent `baseline save` runs; only newly-added fingerprints get a fresh stamp. Test `test_baseline_from_report_preserves_prior_provenance` enforces this; losing it would break the audit-log linkage.
- `integrity_mode: off` is the v0.17 escape hatch. The default is `warn` so adopters don't break on upgrade; the suggested v0.18 default is `strict`.
- `SHIP-BASELINE-ENTRY-STALE` splits cleanly: `deprecated_check_id` is detected from the baseline alone (fires under both `verify` and `scan`); `resolved_not_pruned` is scan-aware (only under `scan --baseline X`). The `verify` command intentionally does not implement the scan-aware path.
- Integrity findings go through `assign_finding_ids` and `annotate_remediation` again, so they participate fully in the report (fingerprints, `agent_action` enum, `autofix_safe`) but skip dedupe (they were never duplicates) and skip suppressions (intentional).
Verification
CI is authoritative for `python -m ruff check .`, `python -m compileall -q src tests`, and `python -m pytest`.
Additional local checks run:
- `pytest`: 1119 passed, 3 skipped (28 new tests in `tests/test_baseline_integrity.py` covering audit-log primitives, `baseline_from_report` provenance behavior across new / resave / legacy paths, every `verify_baseline` issue kind, the scan-pipeline integration in warn/strict/off modes, and the CLI `baseline verify` surface).
- `ruff check .`: clean.
- `python scripts/generate_schemas.py --check`: clean (committed `docs/checks.json`, `docs/manifest-v0.1.json`, and regenerated `llms-full.txt`).
- `agents-shipgate baseline verify --help` shows the new command; ran a full save/verify/tamper/verify cycle on `samples/support_refund_agent`.
Release-readiness notes
- `docs/checks.md`
- `STABILITY.md`
What's NOT in this PR (intentional, queued for follow-up)
- `severity_overrides` to tune integrity-finding severities: see "Integrity findings bypass …" above; M1 is the right venue.
- `report.json`; rendering tweaks deferred.
🤖 Generated with Claude Code