Commit 03bf20c
[autorevert] implement actions layer and logging (#7169)
This pull request:

* introduces a final "Signal Actions" layer (responsible for executing side effects of processed Signals, like restarts and reverts)
* changes the main entry point for the PyTorch auto-revert Lambda to use the new signals-based autorevert flow by default
* for observability, two CH tables are added:
  * `autorevert_events_v2`
  * `autorevert_state`

See [the spec](https://github.com/pytorch/test-infra/blob/ff2645443aafb0209d7f546302a5c09d8243cb31/aws/lambda/pytorch-auto-revert/SIGNAL_ACTIONS.md) for more details.

### Testing

Tested locally (only restart & state logging):

```
HOURS=18 WORKFLOWS=Lint,trunk,pull,inductor python -m pytorch_auto_revert
INFO:root:[v2] Start: workflows=Lint,trunk,pull,inductor hours=18 repo=pytorch/pytorch dry_run=False
INFO:root:[v2] Run timestamp (CH log ts) = 2025-09-16T15:51:18.656175
INFO:pytorch_auto_revert.signal_extraction_datasource:[extract] Fetching jobs: repo=pytorch/pytorch workflows=Lint,trunk,pull,inductor lookback=18h
INFO:pytorch_auto_revert.signal_extraction_datasource:[extract] Jobs fetched: 6738 rows in 45.70s
INFO:pytorch_auto_revert.signal_extraction_datasource:[extract] Fetching tests for 414 job_ids (20 failed jobs) in batches
INFO:pytorch_auto_revert.signal_extraction_datasource:[extract] Test batch 1/1 (size=414)
INFO:pytorch_auto_revert.signal_extraction_datasource:[extract] Tests fetched: 231 rows for 414 job_ids in 1.95s
INFO:root:[v2] Extracted 19 signals
INFO:root:[v2][signal] wf=trunk key=inductor/test_cudagraph_trees.py::test_graph_partition outcome=Ineligible(reason=<IneligibleReason.FIXED: 'fixed'>, message='signal appears recovered at head')
INFO:root:[v2][signal] wf=trunk key=test_transformers.py::test_fused_sdp_priority_order_use_compile_False_cuda outcome=Ineligible(reason=<IneligibleReason.NO_SUCCESSES: 'no_successes'>, message='no successful commits present in window')
INFO:root:[v2][signal] wf=trunk key=export/test_hop.py::test_retrace_export_local_map_hop_simple_cuda_float32 outcome=Ineligible(reason=<IneligibleReason.NO_SUCCESSES: 'no_successes'>, message='no successful commits present in window')
INFO:root:[v2][signal] wf=trunk key=inductor/test_cudagraph_trees_expandable_segments.py::test_forward_backward_not_called_backend_inductor outcome=Ineligible(reason=<IneligibleReason.FIXED: 'fixed'>, message='signal appears recovered at head')
INFO:root:[v2][signal] wf=trunk key=export/test_hop.py::test_pre_dispatch_export_local_map_hop_simple_cuda_float32 outcome=Ineligible(reason=<IneligibleReason.NO_SUCCESSES: 'no_successes'>, message='no successful commits present in window')
INFO:root:[v2][signal] wf=trunk key=export/test_hop.py::test_serialize_export_local_map_hop_simple_cuda_float32 outcome=Ineligible(reason=<IneligibleReason.NO_SUCCESSES: 'no_successes'>, message='no successful commits present in window')
INFO:root:[v2][signal] wf=trunk key=export/test_hop.py::test_aot_export_local_map_hop_simple_cuda_float32 outcome=Ineligible(reason=<IneligibleReason.NO_SUCCESSES: 'no_successes'>, message='no successful commits present in window')
INFO:root:[v2][signal] wf=trunk key=inductor/test_cudagraph_trees_expandable_segments.py::test_graph_partition outcome=Ineligible(reason=<IneligibleReason.FIXED: 'fixed'>, message='signal appears recovered at head')
INFO:root:[v2][signal] wf=trunk key=inductor/test_cudagraph_trees.py::test_forward_backward_not_called_backend_inductor outcome=Ineligible(reason=<IneligibleReason.FIXED: 'fixed'>, message='signal appears recovered at head')
INFO:root:[v2][signal] wf=trunk key=distributed/tensor/debug/test_debug_mode.py::test_debug_mode_backward outcome=Ineligible(reason=<IneligibleReason.FIXED: 'fixed'>, message='signal appears recovered at head')
INFO:root:[v2][signal] wf=Lint key=lintrunner-noclang / linux-job outcome=Ineligible(reason=<IneligibleReason.FIXED: 'fixed'>, message='signal appears recovered at head')
INFO:root:[v2][signal] wf=pull key=linux-jammy-py3.10-clang12 / test outcome=Ineligible(reason=<IneligibleReason.FLAKY: 'flaky'>, message='signal is flaky (mixed outcomes on same commit)')
INFO:root:[v2][signal] wf=trunk key=win-vs2022-cpu-py3 / test outcome=Ineligible(reason=<IneligibleReason.FLAKY: 'flaky'>, message='signal is flaky (mixed outcomes on same commit)')
INFO:root:[v2][signal] wf=inductor key=unit-test / inductor-test / test outcome=Ineligible(reason=<IneligibleReason.FIXED: 'fixed'>, message='signal appears recovered at head')
INFO:root:[v2][signal] wf=trunk key=win-vs2022-cpu-py3 / build outcome=RestartCommits(commit_shas={'814338826e0b5cd065f8278c4b9487f13e16a5c7'})
INFO:root:[v2][signal] wf=inductor key=inductor-cpu-test / test outcome=Ineligible(reason=<IneligibleReason.FIXED: 'fixed'>, message='signal appears recovered at head')
INFO:root:[v2][signal] wf=trunk key=win-vs2022-cuda12.6-py3 / build outcome=RestartCommits(commit_shas={'814338826e0b5cd065f8278c4b9487f13e16a5c7'})
INFO:root:[v2][signal] wf=inductor key=unit-test / inductor-cpu-build / build outcome=Ineligible(reason=<IneligibleReason.FIXED: 'fixed'>, message='signal appears recovered at head')
INFO:root:[v2][signal] wf=pull key=linux-jammy-py3.13-clang12 / test outcome=Ineligible(reason=<IneligibleReason.FIXED: 'fixed'>, message='signal appears recovered at head')
INFO:root:[v2] Candidate action groups: 1
INFO:root:[v2][action] preparing to execute ActionGroup(type='restart', commit_sha='814338826e0b5cd065f8278c4b9487f13e16a5c7', workflow_target='trunk', sources=[SignalMetadata(workflow_name='trunk', key='win-vs2022-cpu-py3 / build'), SignalMetadata(workflow_name='trunk', key='win-vs2022-cuda12.6-py3 / build')])
INFO:root:[v2][action] restart: skipping pacing (delta_sec=-24852)
INFO:root:[v2] Executed action groups: 0
INFO:root:[v2] State logged
```
1 parent e181662 commit 03bf20c

File tree

10 files changed: 806 additions & 8 deletions
aws/lambda/pytorch-auto-revert/SIGNAL_ACTIONS.md

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
# Signal Actions Layer

This document specifies the Actions layer that consumes extracted Signals with their processing outcomes, decides what to do (restart/revert/none), executes allowed actions, and logs actions. Logging the full state of the run is handled by a separate run-state logger module (see Interfaces).

## Overview

- Inputs (provided by integration code):
  - Run parameters: `repo_full_name`, `workflows`, `lookback_hours`, `dry_run`.
  - A list of pairs: `List[Tuple[Signal, SignalProcOutcome]]`, where `SignalProcOutcome = Union[AutorevertPattern, RestartCommits, Ineligible]`.
- Decisions: per-signal outcome mapped to a concrete action (see the sketch after this list):
  - `AutorevertPattern` → record a global revert intent for the suspected commit
  - `RestartCommits` → restart workflow(s) for specific `(workflow, commit)` pairs
  - `Ineligible` → no action
- Side effects: workflow restarts (non-dry-run only); append-only logging of actions in ClickHouse.
- Idempotence and dedup: enforced via ClickHouse lookups before acting and logging.
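For illustration, a minimal Python sketch of this outcome-to-action mapping. The stub dataclasses stand in for the real outcome types in the signal module, and `decide_action` with its string return values is hypothetical:

```python
# Sketch only: the stubs below stand in for the real AutorevertPattern /
# RestartCommits / Ineligible types from pytorch_auto_revert.signal.
from dataclasses import dataclass, field
from typing import Set, Union


@dataclass
class AutorevertPattern:  # stub; the real type carries more fields
    suspected_commit: str


@dataclass
class RestartCommits:  # stub
    commit_shas: Set[str] = field(default_factory=set)


@dataclass
class Ineligible:  # stub
    reason: str = ""
    message: str = ""


SignalProcOutcome = Union[AutorevertPattern, RestartCommits, Ineligible]


def decide_action(outcome: SignalProcOutcome) -> str:
    """Map a processing outcome to the action kind named above."""
    if isinstance(outcome, AutorevertPattern):
        return "revert"   # record-only revert intent for the suspected commit
    if isinstance(outcome, RestartCommits):
        return "restart"  # restart workflow(s) for the listed commits
    return "none"         # Ineligible: nothing to execute or log


assert decide_action(Ineligible(reason="fixed")) == "none"
```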
## Run Context

Immutable run-scoped metadata shared by all actions in the same run (a dataclass sketch follows this list):

- `ts`: DateTime captured at run start; identical across all rows inserted for the run
- `repo_full_name`: e.g., `pytorch/pytorch`
- `workflows`: list of workflow display names
- `lookback_hours`: window used for extraction
- `dry_run`: bool
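A plausible shape for this context as a frozen dataclass. Field names follow the list above, but the real `RunContext` lives in `signal_extraction_types` and may differ in detail:

```python
# Assumed shape only; the shipped RunContext is defined in
# pytorch_auto_revert.signal_extraction_types.
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class RunContext:
    ts: datetime            # captured once at run start; shared by all rows
    repo_full_name: str     # e.g. "pytorch/pytorch"
    workflows: List[str]    # workflow display names
    lookback_hours: int     # extraction window
    dry_run: bool


ctx = RunContext(
    ts=datetime.now(timezone.utc),
    repo_full_name="pytorch/pytorch",
    workflows=["Lint", "trunk", "pull", "inductor"],
    lookback_hours=18,
    dry_run=True,
)
```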
## Action Semantics

- `revert` (record-only):
  - Scope: global per `commit_sha` across all workflows and signals
  - Dedup: if a non-dry-run `revert` exists for the same `repo` and `commit_sha`, do not log another

- `restart` (execute + log):
  - Scope: per `(workflow, commit_sha)` pair
  - Caps: up to 2 non-dry-run restarts total for the pair
  - Pacing: skip if the most recent non-dry-run restart was within 15 minutes before `ts` (caps and pacing are sketched after this list)
  - No extra GitHub-side “already restarted” guard; rely on ClickHouse logs for dedup/caps

- `none`:
  - Not logged in `autorevert_events_v2` (only actions taken are logged)

- Multiple signals targeting the same workflow/commit are coalesced in-memory, then deduped again via ClickHouse checks.
- Dry-run behavior:
  - Simulate restarts (no dispatch), log actions with `dry_run=1`
  - Dry-run rows do not count toward caps/pacing or revert dedup criteria
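A minimal sketch of the caps/pacing gate for restarts described above; the function and its inputs are illustrative, not the shipped implementation:

```python
# Gate a restart for one (workflow, commit_sha) pair, per the rules above:
# cap of 2 non-dry-run restarts, and 15-minute pacing against the run ts.
from datetime import datetime, timedelta
from typing import List

RESTART_CAP = 2
PACING = timedelta(minutes=15)


def restart_allowed(prior_restart_ts: List[datetime], ts: datetime) -> bool:
    """prior_restart_ts: timestamps of earlier non-dry-run restarts for the pair."""
    if len(prior_restart_ts) >= RESTART_CAP:
        return False  # cap reached for this (workflow, commit_sha)
    if prior_restart_ts and ts - max(prior_restart_ts) < PACING:
        return False  # paced out: most recent restart too close to ts
    return True
```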
## ClickHouse Logging

Two tables, sharing the same `ts` per CLI/lambda run.

### `autorevert_events_v2`

- Purpose: record actions taken during a run & dedup/cap against prior actions (an illustrative lookup follows this section)
- Notable columns:
  - `ts` DateTime — run timestamp
  - `workflows` Array(String) — workflows involved in this action
    - restart: a single-element array with the target workflow
    - revert: one or more workflows whose signals contributed
  - `source_signal_keys` Array(String) — signal keys that contributed to this action
  - `failed` UInt8 DEFAULT 0 — marks a failed attempt (e.g., restart dispatch failed)
  - `notes` String DEFAULT '' — optional free-form metadata

### `autorevert_state` (separate module)

- Purpose: persist the HUD-like state for the whole run for auditability
- Notable columns:
  - `ts` DateTime — run timestamp (matches `autorevert_events_v2.ts`)
  - `state` String — JSON-encoded model of the HUD grid and outcomes
  - `params` String DEFAULT '' — optional, free-form
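For illustration, how a dedup/cap lookup against `autorevert_events_v2` might look with `clickhouse_connect` via the project's `CHCliFactory`. Note the `action`, `commit_sha`, and `dry_run` column names, and the `misc` database prefix for this table, are assumptions; only the columns listed above are confirmed by this spec:

```python
# Illustrative only: assumes autorevert_events_v2 lives in `misc` and also
# has `action`, `commit_sha`, and `dry_run` columns (not listed in this spec).
from pytorch_auto_revert.clickhouse_client_helper import CHCliFactory


def prior_restart_count(workflow: str, commit_sha: str) -> int:
    """Count prior non-dry-run restarts for a (workflow, commit) pair."""
    res = CHCliFactory().client.query(
        """
        SELECT count()
        FROM misc.autorevert_events_v2
        WHERE action = 'restart'
          AND dry_run = 0
          AND has(workflows, {workflow:String})
          AND commit_sha = {sha:String}
        """,
        parameters={"workflow": workflow, "sha": commit_sha},
    )
    return int(res.result_rows[0][0])
```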
## Processing Flow

1. Create `RunContext` and capture `ts` at start (integration).
2. Provide the Actions layer with: `(run params, List[Tuple[Signal, SignalProcOutcome]])`.
3. Transform and group the list into coalesced action groups (reusable method; sketched after this list):
   - Revert groups: `(action=revert, commit_sha, sources: List[SignalMetadata(workflow, key)])`
   - Restart groups: `(action=restart, commit_sha, workflow_target, sources: List[SignalMetadata(workflow, key)])`
4. For each group, consult `autorevert_events_v2` (non-dry-run rows) to enforce dedup rules:
   - Reverts: skip if any prior recorded `revert` exists for `commit_sha`
   - Restarts: skip if ≥2 prior restarts exist for `(workflow_target, commit_sha)`; skip if the latest is within 15 minutes of `ts`
5. Execute eligible actions:
   - Restart: if not `dry_run`, dispatch and capture success/failure in `notes`
   - Revert: record only
6. Insert one `autorevert_events_v2` row per executed group with aggregated `workflows` and `source_signal_keys` (dry-run rows use `dry_run=1`).
7. Separately (integration), build the full run state and call the run-state logger to write a single `autorevert_state` row with the same `ts`.
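A sketch of the grouping in step 3. The `ActionGroup`/`SignalMetadata` shapes follow the `[v2][action]` log line in the commit message, while the grouping helper itself is illustrative:

```python
# Coalesce restart requests so one ActionGroup covers all signals that
# point at the same (workflow, commit_sha) pair.
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SignalMetadata:
    workflow_name: str
    key: str


@dataclass
class ActionGroup:
    type: str                       # "revert" or "restart"
    commit_sha: str
    sources: List[SignalMetadata] = field(default_factory=list)
    workflow_target: str = ""       # set for restarts only


def group_restarts(
    wanted: List[Tuple[str, str, SignalMetadata]],  # (workflow, sha, source)
) -> List[ActionGroup]:
    buckets: Dict[Tuple[str, str], List[SignalMetadata]] = defaultdict(list)
    for workflow, sha, source in wanted:
        buckets[(workflow, sha)].append(source)
    return [
        ActionGroup("restart", sha, sources, workflow_target=wf)
        for (wf, sha), sources in buckets.items()
    ]
```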

aws/lambda/pytorch-auto-revert/pytorch_auto_revert/__main__.py

Lines changed: 41 additions & 7 deletions
```diff
@@ -10,6 +10,7 @@
 from .clickhouse_client_helper import CHCliFactory
 from .github_client_helper import GHClientFactory
 from .testers.autorevert import autorevert_checker
+from .testers.autorevert_v2 import autorevert_v2
 from .testers.hud import run_hud
 from .testers.restart_checker import workflow_restart_checker
 
@@ -71,9 +72,10 @@ def get_opts() -> argparse.Namespace:
     # no subcommand runs the lambda flow
     subparsers = parser.add_subparsers(dest="subcommand")
 
-    # autorevert-checker subcommand
+    # autorevert-checker subcommand (new default; legacy behind a flag)
     workflow_parser = subparsers.add_parser(
-        "autorevert-checker", help="Analyze workflows looking for autorevert patterns"
+        "autorevert-checker",
+        help="Analyze workflows for autorevert using Signals (default), or legacy via flag",
     )
     workflow_parser.add_argument(
         "workflows",
@@ -84,6 +86,11 @@ def get_opts() -> argparse.Namespace:
     workflow_parser.add_argument(
         "--hours", type=int, default=48, help="Lookback window in hours (default: 48)"
     )
+    workflow_parser.add_argument(
+        "--repo-full-name",
+        default=os.environ.get("REPO_FULL_NAME", "pytorch/pytorch"),
+        help="Full repo name to filter by (owner/repo).",
+    )
     workflow_parser.add_argument(
         "--verbose",
         "-v",
@@ -105,6 +112,11 @@ def get_opts() -> argparse.Namespace:
         action="store_true",
         help="Ignore common errors in autorevert patterns (e.g., 'No tests found')",
     )
+    workflow_parser.add_argument(
+        "--legacy-autorevert",
+        action="store_true",
+        help="Run the legacy autorevert behavior instead of the new Signals-based flow",
+    )
 
     # workflow-restart-checker subcommand
     workflow_restart_parser = subparsers.add_parser(
@@ -189,15 +201,37 @@ def main(*args, **kwargs) -> None:
     )
 
     if opts.subcommand is None:
-        autorevert_checker(
+        # New default without subcommand: run v2 using env defaults
+        autorevert_v2(
             os.environ.get("WORKFLOWS", "Lint,trunk,pull,inductor").split(","),
-            do_restart=True,
-            do_revert=True,
             hours=int(os.environ.get("HOURS", 16)),
-            verbose=True,
+            repo_full_name=os.environ.get("REPO_FULL_NAME", "pytorch/pytorch"),
             dry_run=opts.dry_run,
-            ignore_common_errors=True,
+            do_restart=True,
+            do_revert=True,
         )
+    elif opts.subcommand == "autorevert-checker":
+        if getattr(opts, "legacy_autorevert", False):
+            # Legacy behavior behind flag
+            autorevert_checker(
+                opts.workflows,
+                do_restart=opts.do_restart,
+                do_revert=opts.do_revert,
+                hours=opts.hours,
+                verbose=opts.verbose,
+                dry_run=opts.dry_run,
+                ignore_common_errors=opts.ignore_common_errors,
+            )
+        else:
+            # New default behavior under the same subcommand
+            autorevert_v2(
+                opts.workflows,
+                hours=opts.hours,
+                repo_full_name=opts.repo_full_name,
+                dry_run=opts.dry_run,
+                do_restart=opts.do_restart,
+                do_revert=opts.do_revert,
+            )
     elif opts.subcommand == "autorevert-checker":
         autorevert_checker(
             opts.workflows,
```
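The new default path boils down to a single call. Mirroring it directly, with the signature exactly as used in this diff and example argument values (the import path assumes the package is importable as `pytorch_auto_revert`):

```python
# Mirrors what main() does when no subcommand is given; dry_run=True makes
# this safe to experiment with (no workflow dispatches are executed).
from pytorch_auto_revert.testers.autorevert_v2 import autorevert_v2

autorevert_v2(
    ["Lint", "trunk", "pull", "inductor"],
    hours=18,
    repo_full_name="pytorch/pytorch",
    dry_run=True,
    do_restart=True,
    do_revert=True,
)
```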

aws/lambda/pytorch-auto-revert/pytorch_auto_revert/clickhouse_client_helper.py

Lines changed: 15 additions & 0 deletions
```diff
@@ -1,5 +1,6 @@
 import logging
 import threading
+from datetime import datetime, timezone
 
 import clickhouse_connect
 
@@ -85,3 +86,17 @@ def connection_test(self) -> bool:
         except Exception as e:
             self._logger.warning(f"Connection test failed: {e}")
             return False
+
+
+# ---- Utilities ----
+def ensure_utc_datetime(dt: datetime) -> datetime:
+    """Coerce a datetime to timezone-aware UTC.
+
+    - If naive, assume it represents UTC and set tzinfo accordingly
+    - If aware, convert to UTC
+    """
+    if not isinstance(dt, datetime):
+        return dt  # type: ignore[return-value]
+    if dt.tzinfo is None:
+        return dt.replace(tzinfo=timezone.utc)
+    return dt.astimezone(timezone.utc)
```
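A quick check of the helper's two branches (import path per this file; the timestamps are example values):

```python
from datetime import datetime, timedelta, timezone

from pytorch_auto_revert.clickhouse_client_helper import ensure_utc_datetime

# Naive input: assumed to already represent UTC, tzinfo is attached.
print(ensure_utc_datetime(datetime(2025, 9, 16, 15, 51, 18)))
# -> 2025-09-16 15:51:18+00:00

# Aware input: converted to UTC.
est = datetime(2025, 9, 16, 10, 51, 18, tzinfo=timezone(timedelta(hours=-5)))
print(ensure_utc_datetime(est))
# -> 2025-09-16 15:51:18+00:00
```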
Lines changed: 162 additions & 0 deletions
@@ -0,0 +1,162 @@
```python
from __future__ import annotations

import json
from typing import Dict, Iterable, List, Tuple, Union

from .clickhouse_client_helper import CHCliFactory
from .signal import AutorevertPattern, Ineligible, RestartCommits, Signal
from .signal_extraction_types import RunContext


SignalProcOutcome = Union[AutorevertPattern, RestartCommits, Ineligible]


class RunStateLogger:
    """Serialize the run’s HUD-like state and insert a single row into misc.autorevert_state.

    The state JSON captures:
    - commits (newest→older) and minimal started_at timestamps per commit
    - per-signal columns with outcome, human notes, ineligible details, and per-commit events
    - run metadata (repo, workflows, lookback_hours, ts, dry_run)
    """

    def _build_state_json(
        self,
        *,
        repo: str,
        ctx: RunContext,
        pairs: Iterable[Tuple[Signal, SignalProcOutcome]],
    ) -> str:
        """Build a compact JSON string describing the run’s HUD-like grid and outcomes."""
        pairs_list = list(pairs)
        signals: List[Signal] = [s for s, _ in pairs_list]

        # Collect commit order (newest → older) across signals
        commits: List[str] = []
        seen = set()
        for s in signals:
            for c in s.commits:
                if c.head_sha not in seen:
                    seen.add(c.head_sha)
                    commits.append(c.head_sha)

        # Compute minimal started_at per commit (for timestamp context)
        commit_times: Dict[str, str] = {}
        for sha in commits:
            tmin_iso: str | None = None
            for s in signals:
                # find commit in this signal
                sc = next((cc for cc in s.commits if cc.head_sha == sha), None)
                if not sc or not sc.events:
                    continue
                # events are sorted oldest first
                t = sc.events[0].started_at
                ts_iso = t.isoformat()
                if tmin_iso is None or ts_iso < tmin_iso:
                    tmin_iso = ts_iso
            if tmin_iso is not None:
                commit_times[sha] = tmin_iso

        # Build columns with outcomes, notes, and per-commit events
        cols = []
        for sig, outcome in pairs_list:
            if isinstance(outcome, AutorevertPattern):
                oc = "revert"
                note = (
                    f"Pattern: newer fail {len(outcome.newer_failing_commits)}; "
                    f"suspect {outcome.suspected_commit[:7]} vs baseline {outcome.older_successful_commit[:7]}"
                )
                ineligible = None
            elif isinstance(outcome, RestartCommits):
                oc = "restart"
                if outcome.commit_shas:
                    short = ", ".join(sorted(s[:7] for s in outcome.commit_shas))
                    note = f"Suggest restart: {short}"
                else:
                    note = "Suggest restart: <none>"
                ineligible = None
            else:
                oc = "ineligible"
                note = f"Ineligible: {outcome.reason.value}"
                if outcome.message:
                    note += f" — {outcome.message}"
                ineligible = {
                    "reason": outcome.reason.value,
                    "message": outcome.message,
                }

            # Per-commit events for this signal
            cells: Dict[str, List[Dict]] = {}
            for c in sig.commits:
                evs = []
                for e in c.events:
                    ev = {
                        "status": e.status.value,
                        "started_at": e.started_at.isoformat(),
                        "name": e.name,
                    }
                    if e.ended_at:
                        ev["ended_at"] = e.ended_at.isoformat()
                    evs.append(ev)
                if evs:
                    cells[c.head_sha] = evs

            col = {
                "workflow": sig.workflow_name,
                "key": sig.key,
                "outcome": oc,
                "note": note,
                "cells": cells,
            }
            if ineligible is not None:
                col["ineligible"] = ineligible
            cols.append(col)

        doc = {
            "commits": commits,
            "commit_times": commit_times,
            "columns": cols,
            "meta": {
                "repo": repo,
                "workflows": ctx.workflows,
                "lookback_hours": ctx.lookback_hours,
                "ts": ctx.ts.isoformat(),
                "dry_run": ctx.dry_run,
            },
        }
        return json.dumps(doc, separators=(",", ":"))

    def insert_state(
        self,
        *,
        ctx: RunContext,
        pairs: Iterable[Tuple[Signal, SignalProcOutcome]],
        params: str = "",
    ) -> None:
        """Insert one state row into misc.autorevert_state for this run context."""
        state_json = self._build_state_json(
            repo=ctx.repo_full_name, ctx=ctx, pairs=list(pairs)
        )
        cols = [
            "ts",
            "repo",
            "state",
            "dry_run",
            "workflows",
            "lookback_hours",
            "params",
        ]
        data = [
            [
                ctx.ts,
                ctx.repo_full_name,
                state_json,
                1 if ctx.dry_run else 0,
                ctx.workflows,
                int(ctx.lookback_hours),
                params or "",
            ]
        ]
        CHCliFactory().client.insert(
            table="autorevert_state", data=data, column_names=cols, database="misc"
        )
```
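For reference, the shape of the `state` document this produces. The keys match `_build_state_json` above, but the values here are invented for illustration (abridged SHAs, example timestamps, and an assumed `"failure"` status value):

```python
# Abridged example of the serialized run state; values are illustrative only.
state = {
    "commits": ["8143388...", "e181662..."],  # newest -> older, across signals
    "commit_times": {"8143388...": "2025-09-16T09:58:00"},
    "columns": [
        {
            "workflow": "trunk",
            "key": "win-vs2022-cpu-py3 / build",
            "outcome": "restart",  # or "revert" / "ineligible"
            "note": "Suggest restart: 8143388",
            "cells": {
                "8143388...": [
                    {
                        "status": "failure",  # e.status.value; value assumed
                        "started_at": "2025-09-16T09:58:00",
                        "name": "build",
                    }
                ]
            },
            # "ineligible": {"reason": ..., "message": ...}  # ineligible only
        }
    ],
    "meta": {
        "repo": "pytorch/pytorch",
        "workflows": ["Lint", "trunk", "pull", "inductor"],
        "lookback_hours": 18,
        "ts": "2025-09-16T15:51:18",
        "dry_run": False,
    },
}
```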
