doc: update README and include protocol for handling reliability issues (#678)
joyeecheung authored Sep 28, 2023
1 parent 0eea23d commit f8cd32e
Showing 1 changed file with 105 additions and 105 deletions.
210 changes: 105 additions & 105 deletions README.md
@@ -5,19 +5,17 @@ This repo is used for tracking flaky tests on the Node.js CI and fixing them.
**Current status**: work in progress. Please go to the issue tracker to discuss!

<!-- TOC -->

- [Updating this repo](#updating-this-repo)
- [The Goal](#the-goal)
- [Node.js Core CI Reliability](#nodejs-core-ci-reliability)
- [Updating this repo](#updating-this-repo)
- [The Goal](#the-goal)
- [The Definition of Green](#the-definition-of-green)
- [CI Health History](#ci-health-history)
- [Handling Failed CI runs](#handling-failed-ci-runs)
- [Flaky Tests](#flaky-tests)
- [Identifying Flaky Tests](#identifying-flaky-tests)
- [When Discovering a Potential New Flake on the CI](#when-discovering-a-potential-new-flake-on-the-ci)
- [Infrastructure failures](#infrastructure-failures)
- [Build File Failures](#build-file-failures)
- [TODO](#todo)

- [CI Health History](#ci-health-history)
- [Protocols in improving CI reliability](#protocols-in-improving-ci-reliability)
- [Identifying flaky JS tests](#identifying-flaky-js-tests)
- [Handling flaky JS tests](#handling-flaky-js-tests)
- [Identifying infrastructure issues](#identifying-infrastructure-issues)
- [Handling infrastructure issues](#handling-infrastructure-issues)
- [TODO](#todo)
<!-- /TOC -->

## Updating this repo
@@ -50,104 +48,106 @@ Make the CI green again.

## CI Health History

See https://nodejs-ci-health.mmarchini.me/#/job-summary

| UTC Time | RUNNING | SUCCESS | UNSTABLE | ABORTED | FAILURE | Green Rate |
| ---------------- | ------- | ------- | -------- | ------- | ------- | ---------- |
| 2018-06-01 20:00 | 1 | 1 | 15 | 11 | 72 | 1.13% |
| 2018-06-03 11:36 | 3 | 6 | 21 | 10 | 60 | 6.89% |
| 2018-06-04 15:00 | 0 | 9 | 26 | 10 | 55 | 10.00% |
| 2018-06-15 17:42 | 1 | 27 | 4 | 17 | 51 | 32.93% |
| 2018-06-24 18:11 | 0 | 27 | 2 | 8 | 63 | 29.35% |
| 2018-07-08 19:40 | 1 | 35 | 2 | 4 | 58 | 36.84% |
| 2018-07-18 20:46 | 2 | 38 | 4 | 5 | 51 | 40.86% |
| 2018-07-24 22:30 | 2 | 46 | 3 | 4 | 45 | 48.94% |
| 2018-08-01 19:11 | 4 | 17 | 2 | 2 | 75 | 18.09% |
| 2018-08-14 15:42 | 5 | 22 | 0 | 14 | 59 | 27.16% |
| 2018-08-22 13:22 | 2 | 29 | 4 | 9 | 56 | 32.58% |
| 2018-10-31 13:28 | 0 | 40 | 13 | 4 | 43 | 41.67% |
| 2018-11-19 10:32 | 0 | 48 | 8 | 5 | 39 | 50.53% |
| 2018-12-08 20:37 | 2 | 18 | 4 | 3 | 73 | 18.95% |
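
The table does not state how the Green Rate column is derived. As a rough sketch, the published figures appear to agree (to within about 0.01 percentage points) with `SUCCESS / (SUCCESS + UNSTABLE + FAILURE)`, leaving out RUNNING and ABORTED runs. That formula is an inference from the numbers above, not something the README confirms, and the function name is hypothetical.

```python
def green_rate(success: int, unstable: int, failure: int) -> float:
    """Percentage of finished, non-aborted runs that succeeded.

    Assumption: RUNNING and ABORTED runs are excluded from the denominator.
    This is inferred from the table, not documented by the source.
    """
    finished = success + unstable + failure
    if finished == 0:
        return 0.0
    return 100.0 * success / finished


# Two rows copied from the table: (UTC time, SUCCESS, UNSTABLE, FAILURE, published rate)
rows = [
    ("2018-06-04 15:00", 9, 26, 55, 10.00),
    ("2018-11-19 10:32", 48, 8, 39, 50.53),
]

for when, success, unstable, failure, published in rows:
    computed = green_rate(success, unstable, failure)
    print(f"{when}: computed {computed:.2f}%, published {published:.2f}%")
```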

## Handling Failed CI runs

### Flaky Tests

TODO: automate all of this in ncu-ci

#### Identifying Flaky Tests

When checking the CI results of a PR, if there are one or more failed tests (with
`not ok` as the TAP result):

1. If the failed test is not related to the PR (does not touch the modified
code path), search the test name in the issue tracker of this repo. If there
is an existing issue, add a reply there using the [reproduction template](./templates/repro.txt),
and open a pull request updating `flakes.json`.
2. If there are no existing issues about the test, run the CI again. If the
failure disappears in the next run, then it is a potential flake. See
[When discovering a potential new flake on the CI](#when-discovering-a-potential-new-flake-on-the-ci)
on what to do for a new flake.
3. If the failure reproduces in the next run, it is likely that the failure is
related to the PR. Do not re-run CI without code changes in the next 24
hours; try to debug the failure instead.
4. If the cause of the failure still cannot be identified 24 hours later, and
the code has not been changed, start a CI run and see if the failure
disappears. Go back to step 3 if the failure still reproduces, and go to
step 2 if the failure disappears.

#### When Discovering a Potential New Flake on the CI

1. Open an issue in this repo using [the flake issue template](./templates/flake.txt):
[A GitHub workflow](.github/workflows/reliability_report.yml) runs every day
to produce reliability reports of the `node-test-pull-request` CI and post
them to [the issue tracker](https://github.com/nodejs/reliability/issues).

## Protocols in improving CI reliability

Most work starts with opening the issue tracker of this repository and
reading the latest report. If the report is missing, see
[the actions page](https://github.com/nodejs/reliability/actions) for
details. GitHub's API restricts the length of issue messages, so
the workflow can fail to post the issue when the report is too long.
Even then, it should still leave a summary on the actions page.

### Identifying flaky JS tests

1. Check out the `JSTest Failure` section of the latest reliability report.
It contains information about the JS tests that failed in more than one pull
request in the last 100 `node-test-pull-request` CI runs. The more
pull requests a test fails in, the higher it is ranked, and the more
likely it is to be a flake.
2. Search the name of the test in [the Node.js issue tracker](https://github.com/nodejs/node/issues)
and see if there is already an issue about it. If there is already
an issue, check if the failures are similar. Comment with updates
if necessary.
3. If the flake isn't already tracked by an issue, continue to look into
it. In the report of a JS test, check out the pull requests that it
fails in and see if there is a connection. If the pull requests appear to
be unrelated, it is more likely that the test is a flake.
4. Search the historical reliability reports with the name of the test in
the reliability issue tracker, and see how long the flake has been showing
up. Gather information from the historical reports, and
[open an issue](https://github.com/nodejs/node/issues/new?assignees=&labels=flaky-test&projects=&template=4-report-a-flaky-test.yml)
in the Node.js issue tracker to track the flake.
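
The ranking described in step 1 can be sketched as follows. This is only an illustration of the idea, not the code the reliability workflow actually runs: the input shape (pairs of pull request number and test name), the function name, and the sample data are all hypothetical.

```python
from collections import defaultdict


def rank_suspected_flakes(failures, min_prs=2):
    """Rank tests by the number of distinct pull requests they failed in.

    `failures` is an iterable of (pr_number, test_name) pairs, a hypothetical
    shape chosen for this sketch. Tests that failed in fewer than `min_prs`
    distinct pull requests are dropped, mirroring "more than one pull request".
    """
    prs_by_test = defaultdict(set)
    for pr_number, test_name in failures:
        prs_by_test[test_name].add(pr_number)

    ranked = [
        (test_name, len(prs))
        for test_name, prs in prs_by_test.items()
        if len(prs) >= min_prs
    ]
    # Most widely failing tests first; ties are broken by name for stable output.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Made-up sample data: PR numbers and test names are placeholders.
sample = [
    (101, "parallel/test-a"),
    (102, "parallel/test-a"),
    (102, "parallel/test-a"),  # repeated failure in the same PR counts once
    (103, "parallel/test-a"),
    (101, "sequential/test-b"),
    (104, "sequential/test-b"),
    (105, "parallel/test-c"),  # failed in a single PR only, so it is dropped
]

for test_name, pr_count in rank_suspected_flakes(sample):
    print(f"{test_name}: failed in {pr_count} pull requests")
```

A test that fails in many unrelated pull requests ends up at the top, which is where step 3 suggests looking first.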

### Handling flaky JS tests

1. If the flake only started to show up within the last month, check the
historical reports to see precisely when it started to show up. Look at
commits landing on the target branch around the same time using
`https://github.com/nodejs/node/commits?since=YYYY-MM-DD`
and see if there is any pull request that looks related. If one or
more related pull requests can be found, ping the author or the
reviewer of the pull request, or the team in charge of the
related subsystem, in the tracking issue or in private to see if
they can come up with a fix to deflake the test.
2. If the test has been flaky for more than a month and no one is actively
working on it, it is unlikely to go away on its own, and it's time
to mark it as flaky. For example, if `parallel/some-flaky-test.js`
has been flaky on Windows in the CI, after making sure that there is an
issue tracking it, open a pull request to add the following entry to
[`test/parallel/parallel.status`](https://github.com/nodejs/node/tree/main/test/parallel/parallel.status):

```
[$system==win32]
# https://github.com/nodejs/node/issues/<TRACKING_ISSUE_ID>
some-flaky-test: PASS,FLAKY
```
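
For step 1 above, the commit listing URL can be built from the date of the first report that mentions the flake. The sketch below is a small convenience, not part of any Node.js tooling: the function name and the `days_before` margin are assumptions, and only the `since=YYYY-MM-DD` query parameter comes from the text above.

```python
from datetime import date, timedelta


def commits_since_url(first_seen: date, days_before: int = 3) -> str:
    """Build the commit listing URL for the days leading up to a flake.

    `first_seen` is the date of the earliest report showing the flake.
    `days_before` is a hypothetical safety margin, since the offending
    commit may have landed shortly before the first recorded failure.
    """
    since = first_seen - timedelta(days=days_before)
    return f"https://github.com/nodejs/node/commits?since={since.isoformat()}"


# Example with a placeholder date.
print(commits_since_url(date(2023, 9, 28)))
```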

### Identifying infrastructure issues

In the reliability reports, `Jenkins Failure`, `Git Failure` and
`Build Failure` are generally infrastructure issues and can be
handled by the `nodejs/build` team. Typical infrastructure
issues include:

- Title should be `Investigate path/under/the/test/directory/without/extension`,
for example `Investigate async-hooks/test-zlib.zlib-binding.deflate`.

2. Add the `Flaky Test` label and relevant subsystem labels (TODO: create
useful labels).

3. Open a pull request updating `flakes.json`.

4. Notify the subsystem team related to the flake.

### Infrastructure failures

When the CI run fails because:

- There are network connection issues
- Tests fail with `ENOSPC` (No space left on device)
- The CI machine has trouble pulling source code from the repository

Do the following:

1. Search in this repo with the error message and see if there is any open
issue about this.
2. If there is an existing issue, wait until the problem gets fixed.
3. If there are no similar issues, open a new one with
[the build infra issue template](./templates/infra.txt).
4. Add label `Build Infra`.
5. Notify the `@nodejs/build-infra` team in the issue.

### Build File Failures

When the CI run of a PR that does not touch the build files ends with build
failures (e.g. the run ends before the test runner has a chance to run):

1. Search in this repo with the error message that contains keywords like
`fatal`, `error`, etc.
2. If there is a similar issue, add a reply there using the
[reproduction template](./templates/build-file-repro.txt).
3. If there are no similar issues, open a new one with
[the build file issue template](./templates/build-file.txt).
4. Add label `Build Files`.
5. Notify the `@nodejs/build-files` team in the issue.
- The CI machine has trouble communicating with the Jenkins server
- The build times out
- The parent job fails to trigger sub builds

Sometimes infrastructure issues can show up in the tests too. For
example, tests can fail with `ENOSPC` (No space left on device), in which
case the machine needs to be cleaned up to release disk space.

Some infrastructure issues can go away on their own, but if the same kind
of infrastructure issue has been failing multiple pull requests and
persists for more than a day, it's time to take action.

### Handling infrastructure issues

Check out the [Node.js build issue tracker](https://github.com/nodejs/build/issues)
to see if there is any open issue about this. If there isn't,
open a new issue about it or ask around in the `#nodejs-build` channel
in the OpenJS Slack.

When reporting infrastructure issues, it's important to include
information about the particular machines where the issues happen.
On the Jenkins job page of the failed CI build where the infrastructure
issue is reported in the logs (not to be confused with the parent build that
triggers the sub build that has the issue), in the top-right
corner, there is normally a line similar to
`Took 16 sec on test-equinix-ubuntu2004_container-armv7l-1`.
In this case, `test-equinix-ubuntu2004_container-armv7l-1`
is the machine having infrastructure issues, and it's important
to include this information in the report.
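
When collecting machine names from several failed builds, the `Took ... on ...` line can be parsed with a small regular expression. This is a minimal sketch that assumes the line always follows the shape of the example above; the pattern and function name are hypothetical and not part of Jenkins or any Node.js tooling.

```python
import re

# Matches lines such as "Took 16 sec on <machine-name>".
# Assumption: the machine name is the first whitespace-free token after " on ".
TOOK_LINE = re.compile(r"Took\s+.+?\s+on\s+(\S+)")


def machine_from_took_line(text: str):
    """Return the machine name from a Jenkins "Took ... on ..." line, or None."""
    match = TOOK_LINE.search(text)
    return match.group(1) if match else None


print(machine_from_took_line(
    "Took 16 sec on test-equinix-ubuntu2004_container-armv7l-1"
))
```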

## TODO

- [ ] Settle on the flake database schema
- [ ] Read the flake database in ncu-ci so people can quickly tell if
a failure is a flake
- [ ] Automate the report process in ncu-ci
- [ ] Migrate existing issues in nodejs/node and nodejs/build, close outdated
ones.
- [ ] Automate CI health history tracking
