
Releases: pytorch/test-infra

v20251002-162215

02 Oct 16:24
5903a75


[autorevert] Inject synthetic PENDING events for pending workflows in…

v20251002-150750

02 Oct 15:09
e8dfdb6


[autorevert] Fix pacing query logic (#7274)

`any` has unexpected semantics in ClickHouse (CH): it returns the [first
value](https://clickhouse.com/docs/sql-reference/aggregate-functions/reference/any)
it encounters rather than a logical OR over the rows. The correct way to
check whether any value is true is to use `countIf`.

The effect of this bug was that pacing did not work in some rare cases
where a commit had multiple events and only some of them matched the
condition.

Basically, when the first event goes out of the window and a second event
is added, we get two rows, 0 and 1, and `any` returns either one
depending on the (arbitrary) row order.

One correct approach (among several) is to use `countIf` instead.
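
To make the difference concrete, here is a small Python sketch that mimics the two aggregates over the 0/1 rows described above. The helpers `ch_any` and `ch_count_if` are illustrative stand-ins written for this note, not ClickHouse APIs; `ch_any` simply returns the first row it sees, which is how the linked documentation describes `any`.

```python
from typing import Sequence


def ch_any(values: Sequence[int]) -> int:
    """Stand-in for ClickHouse `any`: return the first value encountered."""
    return values[0]


def ch_count_if(values: Sequence[int]) -> int:
    """Stand-in for ClickHouse `countIf(cond)`: count rows where cond is true."""
    return sum(1 for v in values if v)


# Two events for the same commit: one outside the window (0), one inside (1).
# Row order is not guaranteed, so both orderings have to be considered.
for rows in ([0, 1], [1, 0]):
    old = ch_any(rows)                # depends on row order
    new = int(ch_count_if(rows) > 0)  # independent of row order
    print(rows, "any ->", old, "countIf > 0 ->", new)
```

With the `[0, 1]` ordering the `any`-style check reports 0 even though a matching row exists, while the count-based check reports 1 for both orderings.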

Testing:

```
SELECT
  (countIf(failed = 0 AND ts > now() - toIntervalSecond(5200)) > 0) AS has_success_within_window,
  any(failed = 0 AND ts > now() - toIntervalSecond(5200)) AS has_success_within_window_old
FROM misc.autorevert_events_v2
WHERE repo = 'pytorch/pytorch'
  AND action = 'restart'
  AND dry_run = 0
  AND commit_sha = 'b5c4f46bb9ede8dc6adf11975c93b9f285d9ed67'
```

Result:
```
"has_success_within_window","has_success_within_window_old"
"1","0"
```



more testing:

```
python -m pytorch_auto_revert --dry-run autorevert-checker Lint trunk pull inductor rocm rocm-mi300 --hours 18 --hud-html
```

v20251001-182920

01 Oct 18:31
2315118


[autorevert] Add 'linux-aarch64' to default workflows (#7268)

see the list of viable strict workflows:
https://github.com/pytorch/pytorch/pull/164374/files

testing:

```
HOURS=18 python -m pytorch_auto_revert --dry-run
2025-10-01 11:19:05,293 INFO [root] [v2] Start: workflows=Lint,trunk,pull,inductor,linux-aarch64 hours=18 repo=pytorch/pytorch restart_action=log revert_action=log notify_issue_number=163650 bisection=unlimited
2025-10-01 11:19:05,293 INFO [root] [v2] Run timestamp (CH log ts) = 2025-10-01T18:19:05.293306+00:00
2025-10-01 11:19:05,294 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Fetching commits in time range: repo=pytorch/pytorch lookback=18h
2025-10-01 11:19:06,055 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Commits fetched: 47 commits in 0.76s
2025-10-01 11:19:06,055 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Fetching jobs: repo=pytorch/pytorch workflows=Lint,trunk,pull,inductor,linux-aarch64 commits=47 lookback=18h
2025-10-01 11:20:14,477 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Jobs fetched: 7058 rows in 68.42s
2025-10-01 11:20:14,539 INFO [root] [v2] Extracted 1 signals
2025-10-01 11:20:14,539 INFO [root] [v2][signal] wf=inductor key=inductor-test / test outcome=Ineligible(reason=<IneligibleReason.FLAKY: 'flaky'>, message='signal is flaky (mixed outcomes on same commit)')
2025-10-01 11:20:14,539 INFO [root] [v2] Candidate action groups: 0
2025-10-01 11:20:14,539 INFO [root] [v2] Executed action groups: 0
2025-10-01 11:20:15,101 INFO [root] [v2] State logged
```

v20251001-181055

01 Oct 18:12
c7f01a8


[autorevert] Implement autobisect functionality (#7238)

Testing on the periodic workflow (on top of
https://github.com/pytorch/test-infra/pull/7248):

```
python -m pytorch_auto_revert autorevert-checker periodic --hours 128 --bisection-limit 2 --hud-html
python -m pytorch_auto_revert --dry-run autorevert-checker periodic --hours 256 --bisection-limit 2 --hud-html
```


[2025-09-29T22-00-27.941916-00-00.html](https://github.com/user-attachments/files/22607006/2025-09-29T22-00-27.941916-00-00.html)


[2025-09-29T22-03-58.012711-00-00.html](https://github.com/user-attachments/files/22607013/2025-09-29T22-03-58.012711-00-00.html)




----


Algorithm:

- Goal: Cover the “unknown” span between failure and success partitions by scheduling at most N new restarts, sampling widely via iterative bisection.
- Intuition: Always split the largest unknown gap; choose its midpoint; repeat until the budget is exhausted.

Inputs/Output

- Input `covered`: boolean list over the unknown region
  - True = already covered/separator (e.g., pending), False = uncovered candidate.
- Input `limit`: optional int; total target coverage for this run.
  - Budget `allowed` = max(0, limit − sum(covered)); None = unlimited.
- Output: boolean list of equal length; True marks indices to newly cover (schedule now).

Procedure

- If limit is None: return NOT covered (select all uncovered).
- Else:
  - Build contiguous uncovered gaps (sequences of False) separated by True entries.
  - Push each gap into a max-heap keyed by (-length, lo, hi) using Gap(lo, hi):
    - length = hi − lo + 1
    - heap_key = (-length, lo, hi) for deterministic tie-breaking.
  - While allowed > 0 and heap not empty:
    - Pop largest gap g; pick mid = floor((g.lo + g.hi)/2); select mid; allowed -= 1.
    - Push back sub-gaps [g.lo, mid-1] and [mid+1, g.hi] if non-empty.
  - Return the selection mask.

Properties

- Deterministic ties (equal-length gaps) prefer lower lo.
- Already-covered (pending) entries both reduce the budget and split gaps, pacing new work naturally.
- If limit ≤ current_covered → allowed = 0 → no new selections.
- Complexity: O(A log G), where A = number of picks (≤ allowed), G = initial number of gaps.
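
The procedure and properties above can be sketched in a few lines of Python. This is a minimal illustration written for these notes, not the code from the PR: the function name `plan_gap_cover` is hypothetical, and plain `(-length, lo, hi)` tuples stand in for the `Gap(lo, hi)` type.

```python
import heapq
from typing import List, Optional


def plan_gap_cover(covered: List[bool], limit: Optional[int]) -> List[bool]:
    """Return a mask of indices to newly cover, splitting the largest gap first."""
    n = len(covered)
    if limit is None:
        # Unlimited budget: select every uncovered index.
        return [not c for c in covered]

    allowed = max(0, limit - sum(covered))
    selected = [False] * n

    # Collect contiguous runs of False as inclusive (lo, hi) gaps.
    # heapq is a min-heap, so the negated length puts the largest gap on top.
    heap = []
    lo = None
    for i, c in enumerate(covered):
        if not c and lo is None:
            lo = i
        elif c and lo is not None:
            heapq.heappush(heap, (-(i - lo), lo, i - 1))
            lo = None
    if lo is not None:
        heapq.heappush(heap, (-(n - lo), lo, n - 1))

    while allowed > 0 and heap:
        _, g_lo, g_hi = heapq.heappop(heap)
        mid = (g_lo + g_hi) // 2
        selected[mid] = True
        allowed -= 1
        # Push back the non-empty halves on either side of the midpoint.
        if g_lo <= mid - 1:
            heapq.heappush(heap, (-(mid - g_lo), g_lo, mid - 1))
        if mid + 1 <= g_hi:
            heapq.heappush(heap, (-(g_hi - mid), mid + 1, g_hi))
    return selected


print(plan_gap_cover([False] * 7, limit=3))
# [False, True, False, True, False, True, False]
```

With seven uncovered commits and a limit of 3, the sketch picks index 3 first, then the midpoints of the two remaining halves (1 and 5). A pending entry counts against the limit, so `plan_gap_cover([False, False, True, False, False, False, False], limit=2)` selects only index 4.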

Integration in signal processing

- PartitionedCommits.cover_gap_unknown_commits:
  - Builds covered mask for the unknown partition: pending=True (separator), missing=False (candidate).
  - Calls the planner; maps selected indices back to commit SHAs to restart.
- process_valid_autorevert_pattern(bisection_limit=...):
  - Applies gap-cover selections, then independently applies failure-/success-side restarts based on infra and threshold heuristics.
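
As a rough illustration of the mask-building and index-mapping steps, the sketch below uses made-up SHAs and a simple `(sha, status)` tuple in place of the real commit objects. All helper names are hypothetical, and the stand-in planner only covers the unlimited case (limit is None).

```python
from typing import List, Tuple

# Hypothetical (sha, status) pairs for the unknown partition;
# status is "pending" (jobs already running) or "missing" (no jobs yet).
unknown: List[Tuple[str, str]] = [
    ("aaa111", "missing"),
    ("bbb222", "pending"),
    ("ccc333", "missing"),
]


def build_covered_mask(commits: List[Tuple[str, str]]) -> List[bool]:
    """pending -> True (separator), missing -> False (candidate)."""
    return [status == "pending" for _, status in commits]


def select_all_uncovered(covered: List[bool]) -> List[bool]:
    """Stand-in planner for the unlimited case: return NOT covered."""
    return [not c for c in covered]


def shas_to_restart(commits: List[Tuple[str, str]], selected: List[bool]) -> List[str]:
    """Map selected indices back to commit SHAs."""
    return [sha for (sha, _), pick in zip(commits, selected) if pick]


covered = build_covered_mask(unknown)
restarts = shas_to_restart(unknown, select_all_uncovered(covered))
print(restarts)  # ['aaa111', 'ccc333']
```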

---------

Co-authored-by: Copilot <[email protected]>

v20251001-180704

01 Oct 18:08
3e1acbd


[AUTOREVERT] Makefile targets pointing to canary (#7267)

Setting the makefile targets to point to `pytorch/pytorch-canary` as an
example.

v20251001-163637

01 Oct 16:38
277f605


[autorevert]  add job & hud links to the autorevert message and debug…

v20250930-222836

30 Sep 22:30
e936529


[autorevert] exclude unstable jobs (#7260)

v20250930-134331

30 Sep 13:45
99554ad


[AUTOREVERT] [BUGFIX] fixing typo in variable name preventing revert …

v20250930-125800

30 Sep 12:59
53c6bdf


[autorevert] correctly fetch and build the gaps in the signal (#7248)

1. Fixed commits-without-jobs issue

   - Problem: Commits with no workflow jobs (e.g., in the periodic workflow) were excluded from signal extraction.
   - Solution:
     - Added fetch_commits_in_time_range() to query the push table directly.
     - Modified the job query to filter by an explicit list of head_shas instead of a JOIN.
     - Changed ORDER BY to use the sha dimension first (this preserves grouping; the actual order doesn't matter because the extractors now iterate over the explicitly passed list of commits).

2. Added mandatory timestamp field to SignalCommit

   - Changes:
     - SignalCommit.__init__(head_sha, timestamp, events) - timestamp is now mandatory.
     - Signal extraction populates timestamps from the push table.
     - HUD state logger uses the commit timestamp instead of computing it from event times.
     - Updated 36 test constructor calls.
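
The idea behind both changes can be sketched as follows. This is an illustration under assumptions, not the PR's code: `SignalCommitSketch` and `build_commits` are hypothetical names, and the push and job rows are simplified to tuples and dicts. Iterating over the explicit commit list, rather than over the job rows, is what keeps commits without jobs in the result.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Tuple


@dataclass
class SignalCommitSketch:
    """Illustrative stand-in for SignalCommit; all three fields are required."""
    head_sha: str
    timestamp: datetime
    events: List[dict]


def build_commits(
    pushes: List[Tuple[str, datetime]], job_rows: List[dict]
) -> List[SignalCommitSketch]:
    """Walk the explicit commit list so commits without jobs are kept."""
    by_sha: Dict[str, List[dict]] = {}
    for row in job_rows:
        by_sha.setdefault(row["head_sha"], []).append(row)
    # Commits with no job rows get an empty event list instead of vanishing.
    return [SignalCommitSketch(sha, ts, by_sha.get(sha, [])) for sha, ts in pushes]


ts = datetime(2025, 9, 29, tzinfo=timezone.utc)
commits = build_commits(
    [("sha_with_jobs", ts), ("sha_without_jobs", ts)],
    [{"head_sha": "sha_with_jobs", "name": "periodic / test"}],
)
print([(c.head_sha, len(c.events)) for c in commits])
# [('sha_with_jobs', 1), ('sha_without_jobs', 0)]
```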

### Testing

Before:

[2025-09-29T19-29-47.670686-00-00.html](https://github.com/user-attachments/files/22606856/2025-09-29T19-29-47.670686-00-00.html)


After:

[2025-09-29T21-38-10.190584-00-00.html](https://github.com/user-attachments/files/22606859/2025-09-29T21-38-10.190584-00-00.html)

v20250929-230908

29 Sep 23:10
44b32da


[autorevert] fix indentation in `fetch_tests_for_job_ids` (#7250)

While testing locally with large lookback windows, I happened to notice
another bug introduced by
https://github.com/pytorch/test-infra/pull/7241:

```
python -m pytorch_auto_revert --dry-run autorevert-checker periodic --hours 256 --bisection-limit 2   --hud-html
2025-09-29 15:56:16,356 INFO [root] [v2] Start: workflows=periodic hours=256 repo=pytorch/pytorch restart_action=log revert_action=log notify_issue_number=163650
2025-09-29 15:56:16,356 INFO [root] [v2] Run timestamp (CH log ts) = 2025-09-29T22:56:16.356213+00:00
2025-09-29 15:56:16,356 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Fetching commits in time range: repo=pytorch/pytorch lookback=256h
2025-09-29 15:56:16,909 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Commits fetched: 419 commits in 0.55s
2025-09-29 15:56:16,909 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Fetching jobs: repo=pytorch/pytorch workflows=periodic commits=419 lookback=256h
2025-09-29 15:56:56,850 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Jobs fetched: 2848 rows in 39.94s
2025-09-29 15:56:56,859 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Fetching tests for 1077 job_ids (453 failed jobs) in batches
2025-09-29 15:56:56,859 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Test batch 1/2 (size=1024)
2025-09-29 15:56:56,859 INFO [pytorch_auto_revert.signal_extraction_datasource] existing rows: 0
2025-09-29 15:56:56,859 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Test batch 2/2 (size=53)
2025-09-29 15:56:56,859 INFO [pytorch_auto_revert.signal_extraction_datasource] existing rows: 0
2025-09-29 15:56:57,718 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Tests fetched: 265 rows for 1077 job_ids in 0.86s
```

Notice that no tests are read in the first batch!
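
The batch sizes in the log (1024 and 53 for 1077 job IDs) are consistent with fixed-size batching. The sketch below shows a generic batching loop of that shape; it is not the code of `fetch_tests_for_job_ids`, and `fetch_in_batches`, `fetch_batch`, and the default `batch_size` are assumptions made for illustration. In a loop like this, indentation decides which statements run once per batch, which is one way an indentation slip can cause rows from an earlier batch to be lost.

```python
from typing import Callable, List, Sequence


def fetch_in_batches(
    job_ids: Sequence[int],
    fetch_batch: Callable[[Sequence[int]], List[dict]],
    batch_size: int = 1024,
) -> List[dict]:
    """Fetch rows for all job_ids, one fixed-size batch at a time."""
    rows: List[dict] = []
    for start in range(0, len(job_ids), batch_size):
        batch = job_ids[start:start + batch_size]
        # The fetch and the extend both belong inside the loop body;
        # dedenting them would process only the final batch.
        rows.extend(fetch_batch(batch))
    return rows


sizes: List[int] = []


def fake_fetch(batch: Sequence[int]) -> List[dict]:
    """Stub data source: record the batch size and return one row per job id."""
    sizes.append(len(batch))
    return [{"job_id": j} for j in batch]


rows = fetch_in_batches(list(range(1077)), fake_fetch)
print(sizes, len(rows))  # [1024, 53] 1077
```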


after this fix:
```
python -m pytorch_auto_revert --dry-run autorevert-checker periodic --hours 256   --hud-html
2025-09-29 16:03:06,896 INFO [root] [v2] Start: workflows=periodic hours=256 repo=pytorch/pytorch restart_action=log revert_action=log notify_issue_number=163650
2025-09-29 16:03:06,896 INFO [root] [v2] Run timestamp (CH log ts) = 2025-09-29T23:03:06.896595+00:00
2025-09-29 16:03:06,897 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Fetching jobs: repo=pytorch/pytorch workflows=periodic lookback=256h
2025-09-29 16:03:49,456 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Jobs fetched: 2887 rows in 42.56s
2025-09-29 16:03:49,466 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Fetching tests for 1113 job_ids (454 failed jobs) in batches
2025-09-29 16:03:49,466 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Test batch 1/2 (size=1024)
2025-09-29 16:03:51,753 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Test batch 2/2 (size=89)
2025-09-29 16:03:53,056 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Tests fetched: 5002 rows for 1113 job_ids in 3.59s
2025-09-29 16:03:53,122 INFO [root] [v2] Extracted 144 signals
```