-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
XdsSecurityClientServerTest.tlsClientServer_useSystemRootCerts_validationContext is flaky #11678
Comments
@ejona86 how these runs are executed (how to reproduce it locally)? |
This run was in Github Actions, which runs Testing with blaze for the --runs_per_test, I saw a few different flakes with a total flake rate around 1%, all with (While it is a 1% flake rate with blaze, those are different machines than Github Actions, so it may be a much higher flake rate on Github Actions.) http://sponge2/45c92135-104b-49fd-92cb-02899959ef7b I'm suspected --- grpc/xds/src/test/java/io/grpc/xds/XdsSecurityClientServerTest.java 2024-10-29 09:51:25.000000000 -0700
+++ grpc/xds/src/test/java/io/grpc/xds/XdsSecurityClientServerTest.java 2024-11-09 06:59:07.000000000 -0800
@@ -96,12 +96,13 @@
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;
+import org.junit.runners.Parameterized;
/**
* Unit tests for {@link XdsChannelCredentials} and {@link XdsServerBuilder} for plaintext/TLS/mTLS
* modes.
*/
-@RunWith(JUnit4.class)
+@RunWith(Parameterized.class)
public class XdsSecurityClientServerTest {
@Rule public final GrpcCleanupRule cleanupRule = new GrpcCleanupRule();
@@ -115,6 +116,14 @@
private FakeXdsClientPoolFactory fakePoolFactory = new FakeXdsClientPoolFactory(xdsClient);
private static final String OVERRIDE_AUTHORITY = "foo.test.google.fr";
+ @Parameterized.Parameter
+ public boolean bool;
+
+ @Parameterized.Parameters(name = "thebool={0}")
+ public static List<Boolean> data() {
+ return Arrays.asList(new Boolean[] {true, false});
+ }
+
@After
public void tearDown() throws IOException {
if (fakeNameResolverFactory != null) { That gave me a .5% flake rate in 1000 runs. http://sponge2/3a854fc8-f335-4388-ac75-d1e7d9138cb4 . So it's possible to reproduce without the spiffe code. |
Hmmm, if it's some side effect of Trying to reproduce... |
Ok caught it using @Parameters(name = "enableSpiffe={0}")
public static Collection<Boolean> data() {
return IntStream.range(0, 1000)
.mapToObj(i -> i%2 == 0)
.collect(toList());
} trick |
Ok, so the exception actually happens if we don't set trustStore System Property (just by commenting out https://github.com/grpc/grpc-java/blob/master/xds/src/test/java/io/grpc/xds/XdsSecurityClientServerTest.java#L226) I see 3 possible options:
@ejona86 WDYT? |
It shouldn't be concurrent tests. Lots would break if that was happening. I think it is more likely a resource is left around (not shutdown quickly enough) after some tests, as running that one test by itself didn't flake. |
@rockspore is taking a look at this. |
I have to admit I don't have much expertise here. It seems the flakiness goes away if we comment out the following two lines in all 3 relevant tests.:
Interestingly, both had to be commented out, as either could not remove the flakiness, during my local test runs. So I was also suspecting something related to SSL wasn't shutting down soon enough before the properties or file is erased, but haven't found any clue. |
In that case, the following addition (imho very ugly) to the failing test works: setTrustStoreSystemProperties(trustStoreFilePath.toAbsolutePath().toString());
for (int i = 0; i < 5; i++) {
if (System.getProperty("javax.net.ssl.trustStore") == null) {
Thread.sleep(200L);
setTrustStoreSystemProperties(trustStoreFilePath.toAbsolutePath().toString());
} else {
break;
}
} The number 5 here comes from 3 (tests manipulating I guess that making this test class execute the test methods in the particular order would also work - Junit4 supports it (but very limited) - WDYT? |
@erm-g Did you confirm that |
Didn't really get your q.
Yeah, the same loop can also be created there. |
Thanks for the info. My original question was basically if it's all synchronized, why did you have to create such a loop to check it was not null after you had set it at the top? I don't think by default we have those test cases run in parallel. In fact, I still saw the same failures with this retry logic added inside |
Only calls to |
I removed the I then discovered that with this setup, I ran with But why are multiple threads accessing that? Looking at the stack traces, both the reader and writer have diff --git a/xds/src/test/java/io/grpc/xds/XdsSecurityClientServerTest.java b/xds/src/test/java/io/grpc/xds/XdsSecurityClientServerTest.java
index 8e1220fe9..95574130e 100644
--- a/xds/src/test/java/io/grpc/xds/XdsSecurityClientServerTest.java
+++ b/xds/src/test/java/io/grpc/xds/XdsSecurityClientServerTest.java
@@ -90,7 +90,6 @@ import java.security.cert.CertificateException;
import java.security.cert.CertificateFactory;
import java.util.ArrayList;
import java.util.Arrays;
-import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
@@ -102,20 +101,16 @@ import org.junit.Before;
import org.junit.Rule;
import org.junit.Test;
import org.junit.runner.RunWith;
-import org.junit.runners.Parameterized;
-import org.junit.runners.Parameterized.Parameter;
-import org.junit.runners.Parameterized.Parameters;
+import org.junit.runners.JUnit4;
/**
* Unit tests for {@link XdsChannelCredentials} and {@link XdsServerBuilder} for plaintext/TLS/mTLS
* modes.
*/
-@RunWith(Parameterized.class)
+@RunWith(JUnit4.class)
public class XdsSecurityClientServerTest {
- @Parameter
- public Boolean enableSpiffe;
- private Boolean originalEnableSpiffe;
+ private boolean originalEnableSpiffe;
@Rule public final GrpcCleanupRule cleanupRule = new GrpcCleanupRule();
private int port;
@@ -128,15 +123,9 @@ public class XdsSecurityClientServerTest {
private FakeXdsClientPoolFactory fakePoolFactory = new FakeXdsClientPoolFactory(xdsClient);
private static final String OVERRIDE_AUTHORITY = "foo.test.google.fr";
- @Parameters(name = "enableSpiffe={0}")
- public static Collection<Boolean> data() {
- return ImmutableList.of(true, false);
- }
-
@Before
public void setUp() throws IOException {
saveEnvironment();
- FileWatcherCertificateProviderProvider.enableSpiffe = enableSpiffe;
}
private void saveEnvironment() {
@@ -285,7 +274,18 @@ public class XdsSecurityClientServerTest {
}
@Test
- public void tlsClientServer_Spiffe_noClientAuthentication() throws Exception {
+ public void tlsClientServer_Spiffe_noClientAuthentication_spiffeEnabled() throws Exception {
+ FileWatcherCertificateProviderProvider.enableSpiffe = true;
+ tlsClientServer_Spiffe_noClientAuthentication();
+ }
+
+ @Test
+ public void tlsClientServer_Spiffe_noClientAuthentication_spiffeDisabled() throws Exception {
+ FileWatcherCertificateProviderProvider.enableSpiffe = false;
+ tlsClientServer_Spiffe_noClientAuthentication();
+ }
+
+ private void tlsClientServer_Spiffe_noClientAuthentication() throws Exception {
DownstreamTlsContext downstreamTlsContext =
setBootstrapInfoAndBuildDownstreamTlsContext(SERVER_1_SPIFFE_PEM_FILE, null, null, null,
null, null, false, false);
@@ -302,9 +302,7 @@ public class XdsSecurityClientServerTest {
@Test
public void tlsClientServer_Spiffe_noClientAuthentication_wrongServerCert() throws Exception {
- if (!enableSpiffe) {
- return;
- }
+ FileWatcherCertificateProviderProvider.enableSpiffe = true;
DownstreamTlsContext downstreamTlsContext =
setBootstrapInfoAndBuildDownstreamTlsContext(SERVER_1_PEM_FILE, null, null, null, null,
null, false, false);
@@ -389,10 +387,23 @@ public class XdsSecurityClientServerTest {
}
}
+ @Test
+ public void mtlsClientServer_Spiffe_withClientAuthentication_withXdsChannelCreds_spiffeEnabled()
+ throws Exception {
+ FileWatcherCertificateProviderProvider.enableSpiffe = true;
+ mtlsClientServer_Spiffe_withClientAuthentication_withXdsChannelCreds();
+ }
+
+ @Test
+ public void mtlsClientServer_Spiffe_withClientAuthentication_withXdsChannelCreds_spiffeDisabled()
+ throws Exception {
+ FileWatcherCertificateProviderProvider.enableSpiffe = false;
+ mtlsClientServer_Spiffe_withClientAuthentication_withXdsChannelCreds();
+ }
+
/** mTLS with Spiffe Trust Bundle - client auth enabled - using {@link XdsChannelCredentials}
* API. */
- @Test
- public void mtlsClientServer_Spiffe_withClientAuthentication_withXdsChannelCreds()
+ private void mtlsClientServer_Spiffe_withClientAuthentication_withXdsChannelCreds()
throws Exception {
DownstreamTlsContext downstreamTlsContext =
setBootstrapInfoAndBuildDownstreamTlsContext(SERVER_1_SPIFFE_PEM_FILE, null, null, null,
@@ -408,8 +419,20 @@ public class XdsSecurityClientServerTest {
}
@Test
- public void mtlsClientServer_Spiffe_badClientCert_expectException()
+ public void mtlsClientServer_Spiffe_badClientCert_expectException_spiffeEnabled()
throws Exception {
+ FileWatcherCertificateProviderProvider.enableSpiffe = true;
+ mtlsClientServer_Spiffe_badClientCert_expectException();
+ }
+
+ @Test
+ public void mtlsClientServer_Spiffe_badClientCert_expectException_spiffeDisabled()
+ throws Exception {
+ FileWatcherCertificateProviderProvider.enableSpiffe = false;
+ mtlsClientServer_Spiffe_badClientCert_expectException();
+ }
+
+ private void mtlsClientServer_Spiffe_badClientCert_expectException() throws Exception {
DownstreamTlsContext downstreamTlsContext =
setBootstrapInfoAndBuildDownstreamTlsContext(SERVER_1_SPIFFE_PEM_FILE, null, null, null,
null, SPIFFE_TRUST_MAP_1_FILE, true, true); |
When spiffe support was added it caused tlsClientServer_useSystemRootCerts_validationContext to become flaky. This is because test execution order was important for whether the race would occur. Fixes grpc#11678
Split out the certificate provider leak into #11692, to keep this issue just for the flake. |
@ejona86 You're absolutely right about 2 instances of
I'd expect it to still be flaky because of the next "neighboring" test with 2 FWCPs One practical question: we never expect the client and server to be executed under the same JVM in the real world, correct? Because if they were, we might need to create a similar fix for the codebase itself, not just for the test. |
When spiffe support was added it caused tlsClientServer_useSystemRootCerts_validationContext to become flaky. This is because test execution order was important for whether the race would occur. Fixes #11678
You're assuming a particular execution order. They aren't executed in a defined order. (Often the order is repeatable, but it is "random" based on hashing and such.)
The main issue was the test was changing the system properties which caused the cache to become invalidated. If there is no cache, then the code works fine. |
Got it - thank you! |
@erm-g
https://github.com/grpc/grpc-java/actions/runs/11745673719/job/32723647320?pr=11673
Given this code just went in, seems highly flaky.
The text was updated successfully, but these errors were encountered: