Skip to content

Commit d8451ae

Browse files
[autorevert] Breaks down dry_run logic for revert and restart (#7179)
*PLEASE NOTE* this PR is intended to be on top of #7179 maybe merge that one before reviewing this one to make reviewing easier Breaking down the dry_run logic for revert and restart is required so we can continue to safely work towards improving the autorevert safely. Not adding the logic is not the best option, as I hope we'll be able to run it locally a few times and iterate on that before publishing. And rely on code commenting and other not so great approaches is not ideal. --------- Co-authored-by: Ivan Zaitsev <[email protected]>
1 parent c06e904 commit d8451ae

File tree

9 files changed

+165
-72
lines changed

9 files changed

+165
-72
lines changed

aws/lambda/pytorch-auto-revert/Makefile

Lines changed: 1 addition & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -21,13 +21,9 @@ venv/bin/lintrunner: venv/bin/python
2121
run-local: venv/bin/python
2222
venv/bin/python -m pytorch_auto_revert
2323

24-
.PHONY: run-local-dry
25-
run-local-dry: venv/bin/python
26-
venv/bin/python -m pytorch_auto_revert --dry-run
27-
2824
.PHONY: run-local-workflows
2925
run-local-workflows: venv/bin/python
30-
venv/bin/python -m pytorch_auto_revert autorevert-checker Lint trunk pull inductor linux-binary-manywheel --hours 4380 --ignore-common-errors --verbose
26+
venv/bin/python -m pytorch_auto_revert autorevert-checker Lint trunk pull inductor linux-binary-manywheel --hours 8 --revert-action log
3127

3228
deployment.zip:
3329
mkdir -p deployment

aws/lambda/pytorch-auto-revert/SIGNAL_ACTIONS.md

Lines changed: 11 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@ This document specifies the Actions layer that consumes extracted Signals with t
55
## Overview
66

77
- Inputs (provided by integration code):
8-
- Run parameters: `repo_full_name`, `workflows`, `lookback_hours`, `dry_run`.
8+
- Run parameters: `repo_full_name`, `workflows`, `lookback_hours`, `restart_action`, `revert_action`.
99
- A list of pairs: `List[Tuple[Signal, SignalProcOutcome]]`, where `SignalProcOutcome = Union[AutorevertPattern, RestartCommits, Ineligible]`.
1010
- Decisions: per-signal outcome mapped to a concrete action:
1111
- `AutorevertPattern` → record a global revert intent for the suspected commit
@@ -22,7 +22,7 @@ Immutable run-scoped metadata shared by all actions in the same run:
2222
- `repo_full_name`: e.g., `pytorch/pytorch`
2323
- `workflows`: list of workflow display names
2424
- `lookback_hours`: window used for extraction
25-
- `dry_run`: bool
25+
- `restart_action`, `revert_action`: independent enums controlling behavior
2626

2727
## Action Semantics
2828

@@ -40,9 +40,7 @@ Immutable run-scoped metadata shared by all actions in the same run:
4040
- Not logged in `autorevert_events_v2` (only actions taken are logged)
4141

4242
- Multiple signals targeting same workflow/commit are coalesced in-memory, then deduped again via ClickHouse checks.
43-
- Dry-run behavior:
44-
- Simulate restarts (no dispatch), log actions with `dry_run=1`
45-
- Dry-run rows do not count toward caps/pacing or revert dedup criteria
43+
- Event `dry_run` is per-action and derived from the mode’s side effects: `dry_run = 1` means “no side effects”.
4644

4745
## ClickHouse Logging
4846

@@ -59,6 +57,7 @@ Two tables, sharing the same `ts` per CLI/lambda run.
5957
- `source_signal_keys` Array(String) — signal keys that contributed to this action
6058
- `failed` UInt8 DEFAULT 0 — marks a failed attempt (e.g., restart dispatch failed)
6159
- `notes` String DEFAULT '' — optional free-form metadata
60+
- `dry_run` UInt8 — per-event; 1 means “no side effects” for this logged action
6261

6362
### `autorevert_state` (separate module)
6463

@@ -67,6 +66,9 @@ Two tables, sharing the same `ts` per CLI/lambda run.
6766
- `ts` DateTime — run timestamp (matches `autorevert_events_v2.ts`)
6867
- `state` String — JSON-encoded model of the HUD grid and outcomes
6968
- `params` String DEFAULT '' — optional, free-form
69+
- `dry_run` UInt8 — run-level convenience flag; 1 when the run performed no side effects
70+
71+
State JSON `meta` contains: `repo`, `workflows`, `lookback_hours`, `ts`, `restart_action`, `revert_action`.
7072

7173
## Processing Flow
7274

@@ -78,8 +80,8 @@ Two tables, sharing the same `ts` per CLI/lambda run.
7880
4. For each group, consult `autorevert_events_v2` (non-dry-run rows) to enforce dedup rules:
7981
- Reverts: skip if any prior recorded `revert` exists for `commit_sha`
8082
- Restarts: skip if ≥2 prior restarts exist for `(workflow_target, commit_sha)`; skip if the latest is within 15 minutes of `ts`
81-
5. Execute eligible actions:
82-
- Restart: if not `dry_run`, dispatch and capture success/failure in `notes`
83-
- Revert: record only
84-
6. Insert one `autorevert_events_v2` row per executed group with aggregated `workflows` and `source_signal_keys` (dry-run rows use `dry_run=1`).
83+
5. Execute eligible actions using the per-action mode:
84+
- Restart: `run`dispatch and log; `log` → log only; `skip` → no logging
85+
- Revert: currently record intent only; `run-notify`/`run-revert` modes are allowed but still log intent (no GH revert yet)
86+
6. Insert one `autorevert_events_v2` row per executed group with aggregated `workflows` and `source_signal_keys`.
8587
7. Separately (integration), build the full run state and call the run‑state logger to write a single `autorevert_state` row with the same `ts`.

aws/lambda/pytorch-auto-revert/pytorch_auto_revert/__main__.py

Lines changed: 24 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,10 @@
1212
from .testers.autorevert_v2 import autorevert_v2
1313
from .testers.hud import run_hud
1414
from .testers.restart_checker import workflow_restart_checker
15+
from .utils import RestartAction, RevertAction
16+
17+
18+
DEFAULT_WORKFLOWS = ["Lint", "trunk", "pull", "inductor"]
1519

1620

1721
def setup_logging(log_level: str) -> None:
@@ -71,14 +75,15 @@ def get_opts() -> argparse.Namespace:
7175
# no subcommand runs the lambda flow
7276
subparsers = parser.add_subparsers(dest="subcommand")
7377

74-
# autorevert-checker subcommand (new default; legacy behind a flag)
78+
# autorevert-checker subcommand
7579
workflow_parser = subparsers.add_parser(
7680
"autorevert-checker",
7781
help="Analyze workflows for autorevert using Signals (default), or legacy via flag",
7882
)
7983
workflow_parser.add_argument(
8084
"workflows",
8185
nargs="+",
86+
default=DEFAULT_WORKFLOWS,
8287
help="Workflow name(s) to analyze - single name or comma/space separated"
8388
+ ' list (e.g., "pull" or "pull,trunk,inductor")',
8489
)
@@ -91,25 +96,22 @@ def get_opts() -> argparse.Namespace:
9196
help="Full repo name to filter by (owner/repo).",
9297
)
9398
workflow_parser.add_argument(
94-
"--verbose",
95-
"-v",
96-
action="store_true",
97-
help="Show detailed output including commit summaries",
98-
)
99-
workflow_parser.add_argument(
100-
"--do-restart",
101-
action="store_true",
102-
help="Actually restart workflows for detected autorevert patterns",
103-
)
104-
workflow_parser.add_argument(
105-
"--do-revert",
106-
action="store_true",
107-
help="When restarts complete and secondary pattern matches, log REVERT",
99+
"--restart-action",
100+
type=RestartAction,
101+
default=RestartAction.RUN,
102+
choices=list(RestartAction),
103+
help=(
104+
"Restart mode: skip (no logging), log (no side effects), or run (dispatch)."
105+
),
108106
)
109107
workflow_parser.add_argument(
110-
"--ignore-common-errors",
111-
action="store_true",
112-
help="Ignore common errors in autorevert patterns (e.g., 'No tests found')",
108+
"--revert-action",
109+
type=RevertAction,
110+
default=RevertAction.LOG,
111+
choices=list(RevertAction),
112+
help=(
113+
"Revert mode: skip, log (no side effects), run-notify (side effect), or run-revert (side effect)."
114+
),
113115
)
114116

115117
# workflow-restart-checker subcommand
@@ -200,19 +202,17 @@ def main(*args, **kwargs) -> None:
200202
os.environ.get("WORKFLOWS", "Lint,trunk,pull,inductor").split(","),
201203
hours=int(os.environ.get("HOURS", 16)),
202204
repo_full_name=os.environ.get("REPO_FULL_NAME", "pytorch/pytorch"),
203-
dry_run=opts.dry_run,
204-
do_restart=True,
205-
do_revert=True,
205+
restart_action=(RestartAction.LOG if opts.dry_run else RestartAction.RUN),
206+
revert_action=RevertAction.LOG,
206207
)
207208
elif opts.subcommand == "autorevert-checker":
208209
# New default behavior under the same subcommand
209210
autorevert_v2(
210211
opts.workflows,
211212
hours=opts.hours,
212213
repo_full_name=opts.repo_full_name,
213-
dry_run=opts.dry_run,
214-
do_restart=opts.do_restart,
215-
do_revert=opts.do_revert,
214+
restart_action=(RestartAction.LOG if opts.dry_run else opts.restart_action),
215+
revert_action=(RevertAction.LOG if opts.dry_run else opts.revert_action),
216216
)
217217
elif opts.subcommand == "workflow-restart-checker":
218218
workflow_restart_checker(opts.workflow, commit=opts.commit, days=opts.days)

aws/lambda/pytorch-auto-revert/pytorch_auto_revert/run_state_logger.py

Lines changed: 7 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -121,7 +121,8 @@ def _build_state_json(
121121
"workflows": ctx.workflows,
122122
"lookback_hours": ctx.lookback_hours,
123123
"ts": ctx.ts.isoformat(),
124-
"dry_run": ctx.dry_run,
124+
"restart_action": str(ctx.restart_action),
125+
"revert_action": str(ctx.revert_action),
125126
},
126127
}
127128
return json.dumps(doc, separators=(",", ":"))
@@ -151,7 +152,11 @@ def insert_state(
151152
ctx.ts,
152153
ctx.repo_full_name,
153154
state_json,
154-
1 if ctx.dry_run else 0,
155+
1
156+
if not (
157+
ctx.restart_action.side_effects or ctx.revert_action.side_effects
158+
)
159+
else 0,
155160
ctx.workflows,
156161
int(ctx.lookback_hours),
157162
params or "",

aws/lambda/pytorch-auto-revert/pytorch_auto_revert/signal_actions.py

Lines changed: 45 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@
88
from .clickhouse_client_helper import CHCliFactory, ensure_utc_datetime
99
from .signal import AutorevertPattern, Ineligible, RestartCommits, Signal
1010
from .signal_extraction_types import RunContext
11+
from .utils import RestartAction, RevertAction
1112
from .workflow_checker import WorkflowRestartChecker
1213

1314

@@ -199,10 +200,20 @@ def execute_revert(
199200
self, *, commit_sha: str, sources: List[SignalMetadata], ctx: RunContext
200201
) -> bool:
201202
"""Record a revert intent if not previously logged for the commit."""
203+
if ctx.revert_action == RevertAction.SKIP:
204+
logging.debug(
205+
"[v2][action] revert for sha %s: skipping (ignored)", commit_sha[:8]
206+
)
207+
return False
208+
209+
dry_run = not ctx.revert_action.side_effects
210+
202211
if self._logger.prior_revert_exists(
203212
repo=ctx.repo_full_name, commit_sha=commit_sha
204213
):
205-
logging.info("[v2][action] revert: skipping existing")
214+
logging.info(
215+
"[v2][action] revert for sha %s: skipping existing", commit_sha[:8]
216+
)
206217
return False
207218
self._logger.insert_event(
208219
repo=ctx.repo_full_name,
@@ -211,12 +222,14 @@ def execute_revert(
211222
commit_sha=commit_sha,
212223
workflows=sorted({s.workflow_name for s in sources}),
213224
source_signal_keys=[s.key for s in sources],
214-
dry_run=ctx.dry_run,
225+
dry_run=dry_run,
215226
failed=False,
216227
notes="",
217228
)
218229
logging.info(
219-
"[v2][action] revert: logged%s", " (dry_run)" if ctx.dry_run else ""
230+
"[v2][action] revert for sha %s: logged%s",
231+
commit_sha[:8],
232+
" (dry_run)" if dry_run else "",
220233
)
221234
return True
222235

@@ -229,22 +242,36 @@ def execute_restart(
229242
ctx: RunContext,
230243
) -> bool:
231244
"""Dispatch a workflow restart if under cap and outside pacing window; always logs the event."""
245+
if ctx.restart_action == RestartAction.SKIP:
246+
logging.info(
247+
"[v2][action] restart for sha %s: skipping (ignored)", commit_sha[:8]
248+
)
249+
return False
250+
251+
dry_run = not ctx.restart_action.side_effects
252+
232253
recent = self._logger.recent_restarts(
233254
repo=ctx.repo_full_name, workflow=workflow_target, commit_sha=commit_sha
234255
)
235256
if len(recent) >= 2:
236-
logging.info("[v2][action] restart: skipping cap (recent=%d)", len(recent))
257+
logging.info(
258+
"[v2][action] restart for sha %s: skipping cap (recent=%d)",
259+
commit_sha[:8],
260+
len(recent),
261+
)
237262
return False
238263
if recent and (ctx.ts - recent[0]) < timedelta(minutes=15):
239264
delta = (ctx.ts - recent[0]).total_seconds()
240265
logging.info(
241-
"[v2][action] restart: skipping pacing (delta_sec=%d)", int(delta)
266+
"[v2][action] restart for sha %s: skipping pacing (delta_sec=%d)",
267+
commit_sha[:8],
268+
int(delta),
242269
)
243270
return False
244271

245272
notes = ""
246273
ok = True
247-
if not ctx.dry_run:
274+
if not dry_run:
248275
ok = self._restart.restart_workflow(workflow_target, commit_sha)
249276
if not ok:
250277
notes = "dispatch_failed"
@@ -255,14 +282,20 @@ def execute_restart(
255282
commit_sha=commit_sha,
256283
workflows=[workflow_target],
257284
source_signal_keys=[s.key for s in sources],
258-
dry_run=ctx.dry_run,
285+
dry_run=dry_run,
259286
failed=not ok,
260287
notes=notes,
261288
)
262-
if not ctx.dry_run and notes == "":
263-
logging.info("[v2][action] restart: dispatched")
264-
elif notes:
265-
logging.info("[v2][action] restart: dispatch_failed: %s", notes)
289+
if not dry_run and ok:
290+
logging.info("[v2][action] restart for sha %s: dispatched", commit_sha[:8])
291+
elif not ok:
292+
logging.info(
293+
"[v2][action] restart for sha %s: dispatch_failed: %s",
294+
commit_sha[:8],
295+
notes,
296+
)
266297
else:
267-
logging.info("[v2][action] restart: logged (dry_run)")
298+
logging.info(
299+
"[v2][action] restart for sha %s: logged (dry_run)", commit_sha[:8]
300+
)
268301
return True

aws/lambda/pytorch-auto-revert/pytorch_auto_revert/signal_extraction_types.py

Lines changed: 6 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,8 @@
66
from functools import cached_property
77
from typing import List, NewType, Set
88

9+
from .utils import RestartAction, RevertAction
10+
911

1012
# Default classification rules that indicate test failures.
1113
DEFAULT_TEST_RULES: Set[str] = {
@@ -29,11 +31,12 @@
2931

3032
@dataclass(frozen=True)
3133
class RunContext:
32-
ts: datetime
34+
lookback_hours: int
3335
repo_full_name: str
36+
restart_action: RestartAction
37+
revert_action: RevertAction
38+
ts: datetime
3439
workflows: List[str]
35-
lookback_hours: int
36-
dry_run: bool = False
3740

3841

3942
# Represents a job row from the jobs table in ClickHouse

aws/lambda/pytorch-auto-revert/pytorch_auto_revert/testers/autorevert_v2.py

Lines changed: 10 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -7,16 +7,16 @@
77
from ..signal_actions import SignalActionProcessor, SignalProcOutcome
88
from ..signal_extraction import SignalExtractor
99
from ..signal_extraction_types import RunContext
10+
from ..utils import RestartAction, RevertAction
1011

1112

1213
def autorevert_v2(
1314
workflows: Iterable[str],
1415
*,
1516
hours: int = 24,
1617
repo_full_name: str = "pytorch/pytorch",
17-
dry_run: bool = False,
18-
do_restart: bool = True,
19-
do_revert: bool = True,
18+
restart_action: RestartAction = RestartAction.RUN,
19+
revert_action: RevertAction = RevertAction.LOG,
2020
) -> Tuple[List[Signal], List[Tuple[Signal, SignalProcOutcome]]]:
2121
"""Run the Signals-based autorevert flow end-to-end.
2222
@@ -32,11 +32,12 @@ def autorevert_v2(
3232
ts = datetime.now(timezone.utc)
3333

3434
logging.info(
35-
"[v2] Start: workflows=%s hours=%s repo=%s dry_run=%s",
35+
"[v2] Start: workflows=%s hours=%s repo=%s restart_action=%s revert_action=%s",
3636
",".join(workflows),
3737
hours,
3838
repo_full_name,
39-
dry_run,
39+
restart_action,
40+
revert_action,
4041
)
4142
logging.info("[v2] Run timestamp (CH log ts) = %s", ts.isoformat())
4243

@@ -57,24 +58,19 @@ def autorevert_v2(
5758

5859
# Build run context
5960
run_ctx = RunContext(
60-
ts=ts,
61+
lookback_hours=hours,
6162
repo_full_name=repo_full_name,
63+
restart_action=restart_action,
64+
revert_action=revert_action,
65+
ts=ts,
6266
workflows=workflows,
63-
lookback_hours=hours,
64-
dry_run=dry_run,
6567
)
6668

6769
# Group and execute actions
6870
proc = SignalActionProcessor()
6971
groups = proc.group_actions(pairs)
7072
logging.info("[v2] Candidate action groups: %d", len(groups))
7173

72-
# Support toggling specific kinds of actions via flags
73-
if not do_revert:
74-
groups = [g for g in groups if g.type != "revert"]
75-
if not do_restart:
76-
groups = [g for g in groups if g.type != "restart"]
77-
7874
executed_count = sum(1 for g in groups if proc.execute(g, run_ctx))
7975
logging.info("[v2] Executed action groups: %d", executed_count)
8076

0 commit comments

Comments
 (0)