optimize: Introduce automated flaky test tracking like OpenSearch #7545
base: 2.x
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@             Coverage Diff              @@
##                2.x     #7545     +/-   ##
============================================
+ Coverage     60.63%    60.65%    +0.01%
  Complexity      658       658
============================================
  Files          1308      1308
  Lines         49446     49446
  Branches       5811      5811
============================================
+ Hits          29983     29992        +9
+ Misses        16801     16796        -5
+ Partials       2662      2658        -4
Could you please explain how it works?
OK, I will do it later. This PR is temporarily closed.
@YongGoose this is my fork repo, see the 2.x branch.
Click here to see the file changes, @YongGoose.
It would be great if we could see a bit more information in the issue.
In a workflow, what types of runs are retried when they fail? Additionally, I think it would be nice to have a label for the issue.
Also, would it be possible to share this PR on DingTalk?
Given the flaky tests, I want to know how to find the PRs in which they occurred. Would we use a web crawler for that?
Instead of the PR, the URL of the action where the issue occurred would also be fine.
It parses the surefire-reports XML files, currently extracting only class names.
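For context, a minimal sketch of the kind of parsing described here, assuming Maven Surefire's standard report layout (TEST-*.xml files containing <testcase> elements); the directory path and function name below are illustrative and not taken from the PR itself:

```python
# Illustrative sketch only; not the actual parse_failed_tests.py from this PR.
import xml.etree.ElementTree as ET
from pathlib import Path

def parse_surefire_reports(report_dir):
    """Collect per-test status from Maven Surefire XML reports.

    Returns a dict mapping "ClassName.testMethod" -> "passed" | "failed" | "skipped".
    """
    results = {}
    for xml_file in Path(report_dir).glob("**/TEST-*.xml"):
        tree = ET.parse(xml_file)
        for testcase in tree.getroot().iter("testcase"):
            test_id = f'{testcase.get("classname")}.{testcase.get("name")}'
            if testcase.find("failure") is not None or testcase.find("error") is not None:
                results[test_id] = "failed"
            elif testcase.find("skipped") is not None:
                results[test_id] = "skipped"
            else:
                results[test_id] = "passed"
    return results

if __name__ == "__main__":
    # Example usage with a hypothetical local report directory.
    print(parse_surefire_reports("target/surefire-reports"))
```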
To start with, it would be great if we could just output the class names. For a smoother review process, it would also be helpful if you could clean up the code and resolve the CI failures.
cc @YongGoose
I only changed changes.md, but the CI failed. The previous commit was still successful.
I reran the test.
I'd appreciate it if you could create some sub-issues outlining the planned next steps after the PR gets merged.
Pull Request Overview
This PR introduces automated flaky test detection to the CI/CD pipeline, similar to OpenSearch's approach. The system automatically triggers after a build fails initially but succeeds on rerun, identifying tests that exhibit flaky behavior.
- Adds workflow to detect flaky tests by comparing test reports from failed and successful build attempts
- Integrates test report uploading in the main build workflow to capture surefire/failsafe reports
- Creates automated GitHub issues when flaky tests are identified
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
File | Description
---|---
.github/workflows/detect-flaky-test.yml | New workflow that downloads test reports from both build attempts and identifies flaky tests
.github/workflows/build.yml | Modified to upload test reports as artifacts for flaky test analysis
.github/scripts/parse_failed_tests.py | Python script to parse XML test reports and identify tests that failed in the first run but passed in the second
changes/en-us/2.x.md | Added changelog entry for the flaky test detection feature
changes/zh-cn/2.x.md | Added Chinese changelog entry for the flaky test detection feature
Comments suppressed due to low confidence (3)
.github/workflows/detect-flaky-test.yml:71
- The actions/setup-python@v2 action is deprecated. Use actions/setup-python@v4 or later for better security and performance.
uses: actions/setup-python@v2
.github/workflows/detect-flaky-test.yml:82
- The actions/github-script@v6 action is outdated. Use actions/github-script@v7 or later for improved functionality and security.
uses: actions/github-script@v6
.github/workflows/detect-flaky-test.yml:69
- The environment variable I_RUN_ATTEMPT is set but never used in this workflow. This appears to be copied from the build workflow but serves no purpose here.
# step 3
flaky_tests = []
for test_id, status_1 in results_1.items():
    status_2 = results_2.get(test_id, "passed")
Copilot AI (Aug 6, 2025)
Assuming a test is 'passed' when it's not found in the second run may be incorrect. A missing test could indicate it was skipped or not executed, which should be handled differently than a passed test.
Suggested change:
-    status_2 = results_2.get(test_id, "passed")
+    if test_id not in results_2:
+        # Test missing in second run; cannot determine if flaky, skip
+        continue
+    status_2 = results_2[test_id]
Ⅰ. Describe what this PR did
1. Automatically trigger detect-flaky-test.yml after the "build" fails and the "Rerun build" succeeds.
2. Download the test reports from the first and second builds: run-1-surefire-reports-${{ matrix.java }} and run-2-surefire-reports-${{ matrix.java }}.
3. Run the Python script parse_failed_tests.py (see the sketch after this list).
4. If flaky tests are found, automatically create an issue listing the unstable test names (format: ClassName.testMethod).
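A simplified sketch of the detection step, building on the results_1/results_2 comparison quoted in the review above and adopting the reviewer's suggestion to skip tests that are missing from the second run; the function names and sample data are illustrative assumptions, not the exact contents of parse_failed_tests.py:

```python
# Illustrative sketch of the flaky-test comparison; not the exact parse_failed_tests.py.

def find_flaky_tests(results_1, results_2):
    """Return tests that failed in the first run but passed in the rerun."""
    flaky_tests = []
    for test_id, status_1 in results_1.items():
        if test_id not in results_2:
            # Test missing in the second run; cannot tell whether it is flaky.
            continue
        if status_1 == "failed" and results_2[test_id] == "passed":
            flaky_tests.append(test_id)
    return flaky_tests

def format_issue_body(flaky_tests):
    """Render the list as Markdown for an automatically created issue."""
    lines = ["The following tests failed on the first run but passed on rerun:", ""]
    lines += [f"- {test_id}" for test_id in flaky_tests]  # ClassName.testMethod entries
    return "\n".join(lines)

if __name__ == "__main__":
    # Hypothetical sample data in the "ClassName.testMethod" format.
    run_1 = {"com.example.FooTest.testBar": "failed", "com.example.BazTest.testQux": "passed"}
    run_2 = {"com.example.FooTest.testBar": "passed", "com.example.BazTest.testQux": "passed"}
    print(format_issue_body(find_flaky_tests(run_1, run_2)))
```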
Ⅱ. Does this pull request fix one issue?
fixes #7448
Ⅲ. Why don't you add test cases (unit test/integration test)?
Ⅳ. Describe how to verify it
Ⅴ. Special notes for reviews