Intermittent CI failures #1786

Closed
andygrove opened this issue May 24, 2025 · 6 comments · Fixed by #1802 or #1806
Labels
bug Something isn't working

Comments

@andygrove
Member

andygrove commented May 24, 2025

Describe the bug

Since changing the DataFusion dependency to a git dependency on a pinned revision in #1710, we have been experiencing regular but intermittent CI failures in PR builds. We have seen the issue consistently when testing with Spark 4.0 and not with other versions, and the issues have been specific to two unit tests: array_repeat and columnar shuffle with map.

In #1779, the array_repeat test was updated to be less memory intensive, and the columnar shuffle with map test was refactored to split it into 13 separate tests (one per datatype being tested). Since then the issue has happened much less frequently, but we saw it happen again in #1773 (comment).

No tests fail and we do not see any errors; the process appears to have been killed. unit-tests.log is empty, so we have no information so far to help figure out the root cause.

andygrove added the bug label May 24, 2025
@alamb
Contributor

alamb commented May 25, 2025

Here is an example CI Failure:

* https://github.com/apache/datafusion-comet/actions/runs/15201820724/job/42757252219

@andygrove can you help me understand what the symptom is? I am not familiar with Spark tests and I couldn't figure out how to see what the actual error is. All I found was:

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.comet.parquet.TestFileReader
[INFO] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.709 s -- in org.apache.comet.parquet.TestFileReader
[INFO] Running org.apache.comet.parquet.TestColumnReader
25/05/23 03:52:58 INFO core/src/lib.rs: Comet native library version 0.9.0 initialized
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.118 s -- in org.apache.comet.parquet.TestColumnReader
[INFO] Running org.apache.comet.parquet.TestCometInputFile
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 s -- in org.apache.comet.parquet.TestCometInputFile
[INFO] 
[INFO] Results:

@andygrove
Member Author

Thanks for looking at this @alamb. The symptom is that the tests appear to start slowing down massively (tests that should take 3 seconds take 9 minutes) and then GitHub kills the process because it is unable to communicate with it.

Here is example output:

Sat, 24 May 2025 02:45:30 GMT - columnar shuffle on map [float] (3 seconds, 751 milliseconds)
Sat, 24 May 2025 02:45:34 GMT - columnar shuffle on map [double] (3 seconds, 776 milliseconds)
Sat, 24 May 2025 02:45:42 GMT - columnar shuffle on map [date] (7 seconds, 957 milliseconds)
Sat, 24 May 2025 02:54:54 GMT - columnar shuffle on map [timestamp] (9 minutes, 11 seconds)
Sat, 24 May 2025 03:05:03 GMT [INFO] ------------------------------------------------------------------------
Sat, 24 May 2025 03:05:03 GMT [INFO] Reactor Summary for Comet Project Parent POM 0.9.0-SNAPSHOT:

These tests usually take 3 seconds each. Currently the issue consistently happens during these particular tests, but sometimes they run without issue.
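
For anyone triaging similar logs, the slowdown can be spotted mechanically. The following is a minimal Python sketch, not part of the Comet build. It assumes ScalaTest result lines end in a parenthesized duration like the ones above, and that only the units seen in this thread (milliseconds, seconds, minutes) occur. The function names and the 60-second threshold are hypothetical.

```python
import re

# Matches result lines such as
#   "... GMT - columnar shuffle on map [date] (7 seconds, 957 milliseconds)"
# The optional timestamp prefix before "- " is ignored.
LINE_RE = re.compile(r"- (?P<name>.+) \((?P<duration>[^()]+)\)\s*$")

# Only the units that appear in the logs in this thread (assumption).
UNIT_MS = {"millisecond": 1, "second": 1000, "minute": 60_000}


def parse_duration_ms(text):
    """Convert a duration like '9 minutes, 11 seconds' to milliseconds."""
    total = 0
    for part in text.split(","):
        value, unit = part.strip().split()
        total += int(value) * UNIT_MS[unit.rstrip("s")]
    return total


def slow_tests(log_lines, threshold_ms=60_000):
    """Return (test name, duration in ms) for tests at or above the threshold."""
    slow = []
    for line in log_lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a test result line
        try:
            duration_ms = parse_duration_ms(match.group("duration"))
        except (ValueError, KeyError):
            continue  # parenthesized text that is not a recognized duration
        if duration_ms >= threshold_ms:
            slow.append((match.group("name"), duration_ms))
    return slow
```

Run against the four result lines above, this flags only the timestamp variant, at 551000 ms, while the other three stay under the threshold.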

@andygrove
Member Author

@alamb more specifically, the failing build that you linked to:

CometShuffle4_0Suite:
- Fallback to Spark when shuffling on struct with duplicate field name (175 milliseconds)
- Unsupported types for SinglePartition should fallback to Spark (90 milliseconds)
- Fallback to Spark for unsupported input besides ordering (116 milliseconds)
- columnar shuffle on nested struct including nulls (2 seconds, 673 milliseconds)
- columnar shuffle on struct including nulls (2 seconds, 309 milliseconds)
- columnar shuffle on array/struct map key/value (18 seconds, 549 milliseconds)
- columnar shuffle on map array element (5 seconds, 313 milliseconds)
- RoundRobinPartitioning is supported by columnar shuffle (320 milliseconds)
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Comet Project Parent POM 0.9.0-SNAPSHOT:
[INFO] 
[INFO] Comet Project Parent POM ........................... SUCCESS [  8.579 s]
[INFO] comet-common ....................................... SUCCESS [01:59 min]
[INFO] comet-spark ........................................ FAILURE [  01:26 h]
[INFO] comet-spark-integration ............................ SKIPPED
[INFO] comet-fuzz ......................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:29 h
[INFO] Finished at: 2025-05-23T05:21:24Z
[INFO] ------------------------------------------------------------------------
Error:  Failed to execute goal org.scalatest:scalatest-maven-plugin:2.2.0:test (test) on project comet-spark-spark4.0_2.13: There are test failures -> [Help 1]

CometShuffle4_0Suite runs in a forked process, and it terminated after the "RoundRobinPartitioning is supported by columnar shuffle" test completed. The next test in the suite is the columnar shuffle on map test, so we can infer that the process was killed before that test completed.
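
The same inference can be written down as a small Python sketch. It is illustrative only: the function name is hypothetical, the expected test order has to be supplied by hand (here it is taken from the description above, not read from the suite), and it assumes result lines start with "- " and end with a parenthesized duration, as in the output above.

```python
def first_unfinished_test(expected_order, log_lines):
    """Return the first expected test with no completion line in the log."""
    completed = set()
    for line in log_lines:
        stripped = line.strip()
        if stripped.startswith("- ") and stripped.endswith(")"):
            # Drop the leading "- " and the trailing " (duration)".
            completed.add(stripped[2:].rsplit(" (", 1)[0])
    for name in expected_order:
        if name not in completed:
            return name
    return None  # every expected test completed


# Last two completed tests from the log above.
log = [
    "- columnar shuffle on map array element (5 seconds, 313 milliseconds)",
    "- RoundRobinPartitioning is supported by columnar shuffle (320 milliseconds)",
]
# Hand-written order (assumption), ending with the test named above as next.
order = [
    "columnar shuffle on map array element",
    "RoundRobinPartitioning is supported by columnar shuffle",
    "columnar shuffle on map",
]
in_flight = first_unfinished_test(order, log)
```

With these inputs, in_flight is "columnar shuffle on map", the test that was presumably running when the process was killed.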

@andygrove
Member Author

The failure happened at a different point in this run:

https://github.com/apache/datafusion-comet/actions/runs/15240457481/job/42860102625?pr=1792

The failure occurred shortly after the columnar shuffle tests.

@rluvaton
Contributor

Is it possible that the failures are caused by using the DataFusion filter when not using the DataFusion scan?

I saw there is a bug in the test runs and fixed it in #1793.

@andygrove
Member Author

I was able to prove that this issue is NOT a result of upgrading to DataFusion 48.0.0, because I still see the issue after reverting the PR that performed this upgrade: #1795

@alamb fyi
