Intermittent CI failures #1786

andygrove · 2025-05-24T14:48:54Z

Describe the bug

Since changing the DataFusion dependency to a git dependency on a pinned revision of DataFusion in #1710 we have been experiencing regular but intermittent issues with CI in PR builds. We have seen the issue consistently when testing with Spark 4.0 and not other versions, and the issues have been specific to two unit tests - array_repeat and columnar shuffle with map.

In #1779, the array_repeat test was updated to be less memory intensive, and the columnar shuffle with map test was refactored to split it into 13 separate tests (one per datatype being tested). Since then the issue has happened much less frequently, but we saw it happen again in #1773 (comment).

There are no failing tests and we do not see any errors. The process appears to be killed. unit-tests.log is empty, so we have no information so far to help figure out the root cause.

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response


alamb · 2025-05-25T10:28:54Z

Here is an example CI Failure:

https://github.com/apache/datafusion-comet/actions/runs/15201820724/job/42757252219

@andygrove can you help me understand what the symptom is? I am not familiar with the Spark tests and I couldn't figure out how to see what the actual error is. All I found was:

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.comet.parquet.TestFileReader
[INFO] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.709 s -- in org.apache.comet.parquet.TestFileReader
[INFO] Running org.apache.comet.parquet.TestColumnReader
25/05/23 03:52:58 INFO core/src/lib.rs: Comet native library version 0.9.0 initialized
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.118 s -- in org.apache.comet.parquet.TestColumnReader
[INFO] Running org.apache.comet.parquet.TestCometInputFile
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 s -- in org.apache.comet.parquet.TestCometInputFile
[INFO] 
[INFO] Results:

andygrove · 2025-05-25T16:38:20Z

Thanks for looking at this @alamb. The symptom is that the tests appear to start slowing down massively (tests that should take 3 seconds take 9 minutes) and then GitHub kills the process because it is unable to communicate with it.

Here is example output:

Sat, 24 May 2025 02:45:30 GMT - columnar shuffle on map [float] (3 seconds, 751 milliseconds)
Sat, 24 May 2025 02:45:34 GMT - columnar shuffle on map [double] (3 seconds, 776 milliseconds)
Sat, 24 May 2025 02:45:42 GMT - columnar shuffle on map [date] (7 seconds, 957 milliseconds)
Sat, 24 May 2025 02:54:54 GMT - columnar shuffle on map [timestamp] (9 minutes, 11 seconds)
Sat, 24 May 2025 03:05:03 GMT [INFO] ------------------------------------------------------------------------
Sat, 24 May 2025 03:05:03 GMT [INFO] Reactor Summary for Comet Project Parent POM 0.9.0-SNAPSHOT:

These tests usually take 3 seconds each. Currently the issue consistently happens during these particular tests, but sometimes they run without issue.
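To spot this kind of slowdown in a saved CI log, the per-test durations can be parsed and compared against the typical duration. The following is a minimal sketch, not part of the Comet build: the function names, the regular expressions, and the "10x the median" threshold are all assumptions chosen for illustration, and the line format is inferred only from the log excerpts quoted in this issue.

```python
import re
from statistics import median

# Matches result lines in the format seen in this issue, e.g.
#   "- columnar shuffle on map [date] (7 seconds, 957 milliseconds)"
# An optional timestamp prefix before the "- " is tolerated.
LINE_RE = re.compile(r"- (?P<name>.+) \((?P<dur>[^()]*)\)\s*$")
PART_RE = re.compile(r"(\d+) (hour|minute|second|millisecond)s?")
UNIT_SECONDS = {"hour": 3600.0, "minute": 60.0, "second": 1.0, "millisecond": 0.001}


def parse_duration(text):
    """Convert a duration such as '9 minutes, 11 seconds' into seconds."""
    return sum(int(n) * UNIT_SECONDS[unit] for n, unit in PART_RE.findall(text))


def parse_results(log_lines):
    """Return (test name, seconds) pairs for every result line found."""
    results = []
    for line in log_lines:
        match = LINE_RE.search(line)
        if match:
            results.append((match.group("name"), parse_duration(match.group("dur"))))
    return results


def flag_slow(results, factor=10.0):
    """Return tests that took more than `factor` times the median duration.

    The factor is an arbitrary illustrative threshold, not a project setting.
    """
    if not results:
        return []
    typical = median(seconds for _, seconds in results)
    return [(name, seconds) for name, seconds in results if seconds > factor * typical]


if __name__ == "__main__":
    sample = [
        "Sat, 24 May 2025 02:45:30 GMT - columnar shuffle on map [float] (3 seconds, 751 milliseconds)",
        "Sat, 24 May 2025 02:45:34 GMT - columnar shuffle on map [double] (3 seconds, 776 milliseconds)",
        "Sat, 24 May 2025 02:45:42 GMT - columnar shuffle on map [date] (7 seconds, 957 milliseconds)",
        "Sat, 24 May 2025 02:54:54 GMT - columnar shuffle on map [timestamp] (9 minutes, 11 seconds)",
    ]
    for name, seconds in flag_slow(parse_results(sample)):
        print(f"slow test: {name} ({seconds:.0f} s)")
```

Run against the four lines quoted above, only the `[timestamp]` test exceeds the threshold. A median is used rather than a mean so that a single 9-minute outlier does not raise the baseline it is compared against.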

andygrove · 2025-05-25T16:57:48Z

@alamb more specifically, here is the output from the failing build that you linked to:

CometShuffle4_0Suite:
- Fallback to Spark when shuffling on struct with duplicate field name (175 milliseconds)
- Unsupported types for SinglePartition should fallback to Spark (90 milliseconds)
- Fallback to Spark for unsupported input besides ordering (116 milliseconds)
- columnar shuffle on nested struct including nulls (2 seconds, 673 milliseconds)
- columnar shuffle on struct including nulls (2 seconds, 309 milliseconds)
- columnar shuffle on array/struct map key/value (18 seconds, 549 milliseconds)
- columnar shuffle on map array element (5 seconds, 313 milliseconds)
- RoundRobinPartitioning is supported by columnar shuffle (320 milliseconds)
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Comet Project Parent POM 0.9.0-SNAPSHOT:
[INFO] 
[INFO] Comet Project Parent POM ........................... SUCCESS [  8.579 s]
[INFO] comet-common ....................................... SUCCESS [01:59 min]
[INFO] comet-spark ........................................ FAILURE [  01:26 h]
[INFO] comet-spark-integration ............................ SKIPPED
[INFO] comet-fuzz ......................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:29 h
[INFO] Finished at: 2025-05-23T05:21:24Z
[INFO] ------------------------------------------------------------------------
Error:  Failed to execute goal org.scalatest:scalatest-maven-plugin:2.2.0:test (test) on project comet-spark-spark4.0_2.13: There are test failures -> [Help 1]

CometShuffle4_0Suite runs in a forked process, and that process terminated after the "RoundRobinPartitioning is supported by columnar shuffle" test completed. The next test in the suite is the "columnar shuffle on map" test, so we can infer that the process was killed before that test completed.
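The same inference can be made mechanically: given the order in which a suite runs its tests and the names that were reported as completed, the first test with no result line is the one most likely in flight when the process died. This is a minimal sketch under stated assumptions: the helper name `first_unfinished` is hypothetical, the suite order has to be supplied by hand (here it is taken from the log above plus the next test named in this comment), and tests are assumed to run sequentially in that order.

```python
def first_unfinished(suite_order, completed):
    """Return the first test in `suite_order` that has no completion record.

    Returns None when every test in the suite was reported as completed.
    """
    done = set(completed)
    for name in suite_order:
        if name not in done:
            return name
    return None


if __name__ == "__main__":
    # Tail of the suite order: the last three tests seen in the log,
    # followed by the next test named in the comment above.
    suite_order = [
        "columnar shuffle on array/struct map key/value",
        "columnar shuffle on map array element",
        "RoundRobinPartitioning is supported by columnar shuffle",
        "columnar shuffle on map",
    ]
    completed = suite_order[:3]  # everything that produced a result line
    print(first_unfinished(suite_order, completed))
```

This only identifies a suspect. A process killed from outside leaves no record of what it was doing, so the result is an inference from ordering, not evidence of which test caused the failure.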

andygrove · 2025-05-25T20:53:54Z

The failure happened at a different point in this run:

https://github.com/apache/datafusion-comet/actions/runs/15240457481/job/42860102625?pr=1792

The failure occurred shortly after the columnar shuffle tests.

rluvaton · 2025-05-27T14:59:16Z

Is it possible that the failures are caused by using the DataFusion filter when not using the DataFusion scan?

I saw there is a bug in the test runs and fixed it in #1793.

andygrove · 2025-05-27T15:44:22Z

I was able to prove that this issue is NOT a result of upgrading to DataFusion 48.0.0, because I still see the issue after reverting the PR that performed the upgrade: #1795

@alamb fyi

andygrove added the "bug" (Something isn't working) label May 24, 2025

andygrove mentioned this issue May 24, 2025

Release DataFusion 48.0.0 (June 2025) apache/datafusion#15771

Open

23 tasks

andygrove mentioned this issue May 25, 2025

[experiment] Run Comet tests in Docker #1790

Closed

This was referenced May 25, 2025

Chore: Moved strings expressions to separate file #1792

Merged

[experiment] Revert upgrade to DataFusion 48 #1795

Closed

rluvaton mentioned this issue May 27, 2025

chore: add assertion that not using comet scan but using native scan #1793

Open

andygrove mentioned this issue May 27, 2025

build: Stop running Comet's Spark 4 tests on Linux for PR builds #1802

Merged

andygrove closed this as completed in #1802 May 27, 2025

andygrove mentioned this issue May 27, 2025

fix: Re-enable Spark 4 tests on Linux #1806

Merged
