flaky test in debci #1076
I've been informed that the debci logs are being recycled.
So, I've been staring at the failing tests. All of them fail in the same way: in every failing case we observe that the assertion fails with the random string (generated at line 660 in 4d896f6).
The assertion is not looking at the output directly. That's hard to do in bash. Instead, the assertion is reading from a temporary file (cf. line 663 in 4d896f6).
So how can this be? Evidently the file got created when starting the first of the nc processes (lines 647 to 648 in 4d896f6).
My theory/thought is that the listening nc process has not yet written its received data to the temporary file when the assertion reads it. If this theory is correct, then the issue can be solved by ensuring that the "job" started in line 647 terminates before the temporary file is read in line 660. Maybe moving up the closing of the file descriptor (lines 666 to 667 in 4d896f6) would cause the nc invocation that listens on the port to exit, making sure it gets the chance to write out and close the temporary file properly before the test reads it in line 663.
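The race and the proposed fix can be sketched in plain bash (a minimal simulation, not the actual netavark test code; the delayed background write stands in for the listening nc process):

```shell
#!/usr/bin/env bash
set -eu

tmpfile=$(mktemp)
msg="random-test-string"

# Stand-in for the listening nc job: a background process that writes
# its received data to the temp file only after some delay.
( sleep 0.2; printf '%s' "$msg" > "$tmpfile" ) &
listener_pid=$!

# The client send would happen here.

# Key point: wait for the background job to terminate before reading
# the temp file; without this, the read below races the write above.
wait "$listener_pid"

received=$(cat "$tmpfile")
[ "$received" = "$msg" ] && echo "assertion passed"
rm -f "$tmpfile"
```

Without the `wait`, the `cat` can run while the temp file is still empty, which is exactly the observed failure mode.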
This theory could explain why this happens on test VMs more often than on developer laptops: those worker VMs tend to have a lot of load on their I/O systems, leading to buffers being flushed more slowly than on unloaded developer systems. Moreover, I suspect that those test VMs don't run on SSDs but on virtualized storage, which may not even be on the local machine: more opportunity for the I/O buffer to see delays before its contents become visible to other processes.
Yeah, all the ncat usage in bash is really a PITA to get synced up; the port-listening part alone is already complex, and then we kept getting broken ncat updates that broke our tests. First: is /tmp a tmpfs? We use a tmpfs and do not see such flakes here, so if I/O is the problem then I think this should fix it. I doubt you need to actually close/sync the file on tmpfs.
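Whether /tmp is a tmpfs on the debci VM is easy to check (a sketch assuming GNU coreutils stat; findmnt -no FSTYPE /tmp would work too):

```shell
# Print the filesystem type backing /tmp; "tmpfs" means RAM-backed,
# so writes never touch the (possibly slow, virtualized) disk.
fstype=$(stat -f -c %T /tmp)
echo "/tmp is on: $fstype"
```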
I think the race might be even worse: just because we write to the socket on the client doesn't mean the server side has read it already. At least in the UDP case the client exits before getting any confirmation from the server, since this is not a stream protocol; we just send one packet of data and then exit. So by the time cat reads the file, the server may not yet have read the data or written it to the file. So yeah, I think you are right.
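The UDP fire-and-forget behaviour is easy to see with bash's /dev/udp pseudo-device (a sketch; port 9, the discard port, is an arbitrary choice and nothing needs to be listening there):

```shell
# Sending a single datagram succeeds regardless of whether any server
# ever reads it: UDP gives the client no delivery confirmation, so the
# client can exit long before the server has processed the data.
if echo "probe" > /dev/udp/127.0.0.1/9; then
    echo "datagram sent, no acknowledgement expected"
fi
```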
Closing alone would not help; we would close and then "wait" for the background job to finish. Another way is to make the server bidirectional and send the message back, so we can compare the stdout of the ncat client. The nice thing about that is we actually ensure bidirectional traffic works. Long term I was thinking of rewriting the tests in some higher-level tool or language (e.g. Python or Go), because really we just need to set up a listener in a namespace, and the whole "ncat check if port is bound in background" dance is just fragile. And the whole check-json-output approach isn't nice to use either.
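The bidirectional-echo idea can be sketched without nc using a pair of FIFOs (a simulation of the pattern, not the real test code; over a real socket the "server" would be an nc listener that echoes its input back):

```shell
set -eu
fifo_in=$(mktemp -u); fifo_out=$(mktemp -u)
mkfifo "$fifo_in" "$fifo_out"

# "Server": read one message and echo it straight back.
( read -r line < "$fifo_in"; printf '%s\n' "$line" > "$fifo_out" ) &

msg="hello-$RANDOM"
printf '%s\n' "$msg" > "$fifo_in"   # client send
reply=$(cat "$fifo_out")            # client reads the echo

# The assertion now compares client-side output, with no temp file
# involved, and a match proves traffic flowed in both directions.
[ "$reply" = "$msg" ] && echo "bidirectional ok"
wait
rm -f "$fifo_in" "$fifo_out"
```

Because the client blocks until the reply arrives, a successful comparison implies the server has already received the data, which removes the flush-timing race entirely.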
Just thinking out loud here: instead of completely rewriting the tests in Python or Go, maybe a more specialized tool that can run the sender and receiver in two different network namespaces given on the command line could be plugged into the existing test suite.
Sure, but this is the least of my concerns; the JSON string matching we do in bash is just not very sane, and most of the stuff we do would just be much better in a sane language where you can actually manipulate JSON in a decent way.
The background process may not yet have written the output to the file, so we need to make sure it does first. Fixes containers#1076 Signed-off-by: Paul Holzinger <[email protected]>
I'm afraid that this is very similar to #433, but I can see it on amd64, more specifically on Debian's CI system, debci. Here, the tests are running in a qemu VM with 4G of RAM and 2 cores.
It seems to break quite reliably, but all failures are fairly inconsistent. Example runs I've observed so far:
I've managed to trigger it once on my laptop by running the tests on four qemu machines in parallel. I kept going, running 5 machines in parallel, but could not reproduce it anymore.
Any ideas/suggestions? Maybe I should increase the sleep statements in helpers.bash?
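Rather than enlarging fixed sleeps, a poll-with-deadline helper is usually more robust on loaded CI machines (a sketch only; wait_for_file is a hypothetical name, not an existing helpers.bash function):

```shell
# Poll until a file exists and is non-empty, or the deadline expires.
# Fast machines proceed immediately; slow VMs get up to the full timeout.
wait_for_file() {
    local file=$1 timeout_secs=${2:-5} tries=0
    until [ -s "$file" ]; do
        sleep 0.1
        tries=$((tries + 1))
        [ "$tries" -ge $((timeout_secs * 10)) ] && return 1
    done
}

tmp=$(mktemp)
( sleep 0.3; echo "data" > "$tmp" ) &
wait_for_file "$tmp" 5 && echo "ready: $(cat "$tmp")"
wait
rm -f "$tmp"
```

A fixed sleep has to be tuned for the slowest machine and still wastes that time everywhere else; polling adapts to whatever load the debci worker happens to be under.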