Intermittent CI failures #1786
Comments
Here is an example CI failure: @andygrove can you help me understand what the symptom is? I am not familiar with the Spark tests and I couldn't figure out how to see what the actual error is. All I found was
Thanks for looking at this @alamb. The symptom is that the tests appear to start slowing down massively (tests that should take 3 seconds take 9 minutes) and then GitHub kills the process because it is unable to communicate with it. Here is example output:
These tests usually take 3 seconds each. Currently the issue consistently happens during these particular tests, but sometimes they run without issue.
@alamb more specifically, the failing build that you linked to:
The failure happened at a different point in this run: https://github.com/apache/datafusion-comet/actions/runs/15240457481/job/42860102625?pr=1792
The failure was shortly after the column shuffle tests.
Is it possible that the failures are caused by using the DataFusion filter when not using the DataFusion scan? I saw there was a bug in the test runs and fixed it in #1793.
Describe the bug
Since changing the DataFusion dependency to a git dependency on a pinned revision of DataFusion in #1710, we have been experiencing regular but intermittent issues with CI in PR builds. We have seen the issue consistently when testing with Spark 4.0 and not other versions, and the issues have been specific to two unit tests: array_repeat and "columnar shuffle with map".
In #1779, the array_repeat test was updated to be less memory intensive, and the "columnar shuffle with map" test was refactored to split it into 13 separate tests (one per datatype being tested). Since then the issue has happened much less frequently, but we saw it happen again in #1773 (comment).
There are no failing tests and we do not see any errors. The process appears to be killed. unit-tests.log is empty, so we have no information so far to help figure out the root cause.
Steps to reproduce
No response
Expected behavior
No response
Additional context
No response