Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XdsSecurityClientServerTest.tlsClientServer_useSystemRootCerts_validationContext is flaky #11678

Closed
ejona86 opened this issue Nov 8, 2024 · 19 comments · Fixed by #11691
Closed
Assignees
Labels
highly flaky Issue is for a test that is crying wolf
Milestone

Comments

@ejona86
Copy link
Member

ejona86 commented Nov 8, 2024

@erm-g
https://github.com/grpc/grpc-java/actions/runs/11745673719/job/32723647320?pr=11673

XdsSecurityClientServerTest > tlsClientServer_useSystemRootCerts_validationContext[enableSpiffe=true] FAILED
    io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
    Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
        at app//io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:268)
        at app//io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:249)
        at app//io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:167)
        at app//io.grpc.testing.protobuf.SimpleServiceGrpc$SimpleServiceBlockingStub.unaryRpc(SimpleServiceGrpc.java:340)
        at app//io.grpc.xds.XdsSecurityClientServerTest.unaryRpc(XdsSecurityClientServerTest.java:668)
        at app//io.grpc.xds.XdsSecurityClientServerTest.tlsClientServer_useSystemRootCerts_validationContext(XdsSecurityClientServerTest.java:238)

        Caused by:
        javax.net.ssl.SSLHandshakeException: General OpenSslEngine problem
            at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.handshakeException(ReferenceCountedOpenSslEngine.java:1927)
            at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.wrap(ReferenceCountedOpenSslEngine.java:848)
            at java.base/javax.net.ssl.SSLEngine.wrap(SSLEngine.java:564)
            at io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:1131)
            at io.netty.handler.ssl.SslHandler.wrapNonAppData(SslHandler.java:975)
            at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1511)
            at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1338)
            at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1387)
            at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:530)
            at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:469)
            at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
            at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
            at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1407)
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
            at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:918)
            at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:799)
            at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:501)
            at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:399)
            at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994)
            at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
            at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
            at java.base/java.lang.Thread.run(Thread.java:840)

            Caused by:
            sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
                at java.base/sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:439)
                at java.base/sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:306)
                at java.base/sun.security.validator.Validator.validate(Validator.java:264)
                at java.base/sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:285)
                at java.base/sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:144)
                at io.netty.handler.ssl.EnhancingX509ExtendedTrustManager.checkServerTrusted(EnhancingX509ExtendedTrustManager.java:69)
                at io.netty.handler.ssl.ReferenceCountedOpenSslClientContext$ExtendedTrustManagerVerifyCallback.verify(ReferenceCountedOpenSslClientContext.java:235)
                at io.netty.handler.ssl.ReferenceCountedOpenSslContext$AbstractCertificateVerifier.verify(ReferenceCountedOpenSslContext.java:797)
                at io.netty.internal.tcnative.CertificateVerifierTask.runTask(CertificateVerifierTask.java:36)
                at io.netty.internal.tcnative.SSLTask.run(SSLTask.java:48)
                at io.netty.internal.tcnative.SSLTask.run(SSLTask.java:42)
                at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.runAndResetNeedTask(ReferenceCountedOpenSslEngine.java:1533)
                at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.access$700(ReferenceCountedOpenSslEngine.java:94)
                at io.netty.handler.ssl.ReferenceCountedOpenSslEngine$TaskDecorator.run(ReferenceCountedOpenSslEngine.java:1505)
                at io.netty.handler.ssl.SslHandler.runDelegatedTasks(SslHandler.java:1651)
                at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1497)
                ... 19 more

                Caused by:
                sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
                    at java.base/sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:148)
                    at java.base/sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:129)
                    at java.base/java.security.cert.CertPathBuilder.build(CertPathBuilder.java:297)
                    at java.base/sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:434)

3079 tests completed, 1 failed, 7 skipped
                    ... 34 more

Given this code just went in, seems highly flaky.

@ejona86 ejona86 added the highly flaky Issue is for a test that is crying wolf label Nov 8, 2024
@erm-g
Copy link
Contributor

erm-g commented Nov 9, 2024

@ejona86 how these runs are executed (how to reproduce it locally)?
I just ran the whole test class and then the failing test itself locally (cloudtop) few thousand times and didn't observe any issues.

@ejona86
Copy link
Member Author

ejona86 commented Nov 9, 2024

This run was in Github Actions, which runs gradle build.

Testing with blaze for the --runs_per_test, I saw a few different flakes with a total flake rate around 1%, all with useSystemRootCerts but a mix of enableSpiffe of true and false. Importantly, with 1000 runs I didn't see any flakes without #11627 nor any flakes with --test_filter='XdsSecurityClientServerTest.tlsClientServer_useSystemRootCerts_validationContext'. So somehow the tests are interacting with each other or causing a different execution speed.

(While it is a 1% flake rate with blaze, those are different machines than Github Actions, so it may be a much higher flake rate on Github Actions.)

http://sponge2/45c92135-104b-49fd-92cb-02899959ef7b
http://sponge2/510a9a83-1283-47a4-adda-06ae17ce6918

I'm suspected @Parameterized is the trigger more than the new spiffe code. To confirm, I added parameterized to the test without PR #11627:

--- grpc/xds/src/test/java/io/grpc/xds/XdsSecurityClientServerTest.java	2024-10-29 09:51:25.000000000 -0700
+++ grpc/xds/src/test/java/io/grpc/xds/XdsSecurityClientServerTest.java	2024-11-09 06:59:07.000000000 -0800
@@ -96,12 +96,13 @@
 import org.junit.Test;
 import org.junit.runner.RunWith;
 import org.junit.runners.JUnit4;
+import org.junit.runners.Parameterized;
 
 /**
  * Unit tests for {@link XdsChannelCredentials} and {@link XdsServerBuilder} for plaintext/TLS/mTLS
  * modes.
  */
-@RunWith(JUnit4.class)
+@RunWith(Parameterized.class)
 public class XdsSecurityClientServerTest {
 
   @Rule public final GrpcCleanupRule cleanupRule = new GrpcCleanupRule();
@@ -115,6 +116,14 @@
   private FakeXdsClientPoolFactory fakePoolFactory = new FakeXdsClientPoolFactory(xdsClient);
   private static final String OVERRIDE_AUTHORITY = "foo.test.google.fr";
 
+  @Parameterized.Parameter
+  public boolean bool;
+
+  @Parameterized.Parameters(name = "thebool={0}")
+  public static List<Boolean> data() {
+    return Arrays.asList(new Boolean[] {true, false});
+  }
+
   @After
   public void tearDown() throws IOException {
     if (fakeNameResolverFactory != null) {

That gave me a .5% flake rate in 1000 runs. http://sponge2/3a854fc8-f335-4388-ac75-d1e7d9138cb4 . So it's possible to reproduce without the spiffe code.

@kannanjgithub

@erm-g
Copy link
Contributor

erm-g commented Nov 9, 2024

Hmmm, if it's some side effect of @Parameterized I can try to isolate the new tests using subclasses like I did here - https://github.com/grpc/grpc-java/blob/master/core/src/test/java/io/grpc/internal/SpiffeUtilTest.java#L48

Trying to reproduce...

@erm-g
Copy link
Contributor

erm-g commented Nov 9, 2024

Ok caught it using
./gradlew build :grpc-xds:jacocoTestReport -PtargetArch=x86_64 -Pcheckstyle.ignoreFailures=false -PfailOnWarnings=false -PerrorProne=true -Dorg.gradle.parallel=true -PskipAndroid=true -PskipCodegen=true
and

@Parameters(name = "enableSpiffe={0}")
public static Collection<Boolean> data() {
    return IntStream.range(0, 1000)
        .mapToObj(i -> i%2 == 0)
        .collect(toList());
  }

trick

@erm-g
Copy link
Contributor

erm-g commented Nov 10, 2024

Ok, so the exception actually happens if we don't set trustStore System Property (just by commenting out https://github.com/grpc/grpc-java/blob/master/xds/src/test/java/io/grpc/xds/XdsSecurityClientServerTest.java#L226)
So I suspect that the rootcause is concurrent execution of the tests where one thread calls clearTrustStoreSystemProperties immediately after another calls setTrustStoreSystemProperties. However I don't see any gradle properties enabling parallel execution of test methods.

I see 3 possible options:

  1. Add synchronization technique for the test. I used trustStoreFilePath as an object to sync on for setTrustStoreSystemProperties, clearTrustStoreSystemProperties, and tlsClientServer_useSystemRootCerts_validationContext - 1000 runs didn't produce the exception, 5k sometimes fail with non-related to this issue timeouts and OOM. So I can create a PR with that.
  2. I introduced Parameterized here to avoid directl manipulation with System properties (GRPC_EXPERIMENTAL_SPIFFE_TRUST_BUNDLE_MAP flag), but since we do it anyway in this test, I can drop Parameterized in favor of a similar manipulation
  3. Mark the test as 'known flaky' (aka do nothing) - once the flag is removed, we can remove Parameterized and the test should be fine.

@ejona86 WDYT?

@ejona86
Copy link
Member Author

ejona86 commented Nov 11, 2024

It shouldn't be concurrent tests. Lots would break if that was happening. I think it is more likely a resource is left around (not shutdown quickly enough) after some tests, as running that one test by itself didn't flake.

@ejona86
Copy link
Member Author

ejona86 commented Nov 11, 2024

@ejona86 ejona86 added this to the 1.69 milestone Nov 12, 2024
@matthewstevenson88
Copy link
Contributor

@rockspore is taking a look at this.

@rockspore
Copy link
Collaborator

I have to admit I don't have much expertise here. It seems the flakiness goes away if we comment out the following two lines in all 3 relevant tests.:

      Files.deleteIfExists(trustStoreFilePath.toAbsolutePath());
      clearTrustStoreSystemProperties();

Interestingly, both had to be commented out, as either could not remove the flakiness, during my local test runs.

So I was also suspecting something related to SSL wasn't shutting down soon enough before the properties or file is erased, but haven't found any clue.

@erm-g
Copy link
Contributor

erm-g commented Nov 13, 2024

It shouldn't be concurrent tests. Lots would break if that was happening. I think it is more likely a resource is left around (not shutdown quickly enough) after some tests, as running that one test by itself didn't flake.

In that case, the following addition (imho very ugly) to the failing test works:

      setTrustStoreSystemProperties(trustStoreFilePath.toAbsolutePath().toString());
      for (int i = 0; i < 5; i++) {
        if (System.getProperty("javax.net.ssl.trustStore") == null) {
          Thread.sleep(200L);
          setTrustStoreSystemProperties(trustStoreFilePath.toAbsolutePath().toString());
        } else {
          break;
        }
      }

The number 5 here comes from 3 (tests manipulating System.properties) * 2 (spiffe Param values) - 1 (current test), and 200ms is max time per test I observed locally.
Also 0 failures on 1000 local runs, some OOMs on 5000
MB just single sleep for 200ms * total number of tests in this class would also work.

I guess that making this test class execute the test methods in the particular order would also work - Junit4 supports it (but very limited) - FixMethodOrder

WDYT?

@rockspore
Copy link
Collaborator

@erm-g Did you confirm that System.setProperty() indeed didn't take effect soon enough such that System.getProperty() still returned null, thus entering your retry loop? Because if this is the case, maybe this can just be added inside setTrustStoreSystemProperties().

@erm-g
Copy link
Contributor

erm-g commented Nov 13, 2024

@erm-g Did you confirm that System.setProperty() indeed didn't take effect soon enough such that System.getProperty() still returned null, thus entering your retry loop?`.

Didn't really get your q. soon enough for set/get Property is not applicable since it's just a Map with synchronized access - https://docs.oracle.com/javase/7/docs/api/java/util/Properties.html

Because if this is the case, maybe this can just be added inside `setTrustStoreSystemProperties()

Yeah, the same loop can also be created there.

@rockspore
Copy link
Collaborator

@erm-g Did you confirm that System.setProperty() indeed didn't take effect soon enough such that System.getProperty() still returned null, thus entering your retry loop?`.

Didn't really get your q. soon enough for set/get Property is not applicable since it's just a Map with synchronized access - https://docs.oracle.com/javase/7/docs/api/java/util/Properties.html

Because if this is the case, maybe this can just be added inside `setTrustStoreSystemProperties()

Yeah, the same loop can also be created there.

Thanks for the info. My original question was basically if it's all synchronized, why did you have to create such a loop to check it was not null after you had set it at the top? I don't think by default we have those test cases run in parallel.

In fact, I still saw the same failures with this retry logic added inside setTrustStoreSystemProperties()

@erm-g
Copy link
Contributor

erm-g commented Nov 14, 2024

My original question was basically if it's all synchronized, why did you have to create such a loop to check it was not null after you had set it at the top?

Only calls to setProperty() and getProperty() are synchronized, but the behavior we observe here tells us that needed properties are cleared after https://github.com/grpc/grpc-java/blob/master/xds/src/test/java/io/grpc/xds/XdsSecurityClientServerTest.java#L226 after the lock is released.
Eric believes that it might be a resource not shutdown quickly enough after another test, so my loop just waits for some empiric time.

@ejona86
Copy link
Member Author

ejona86 commented Nov 14, 2024

I removed the @Parameterized (and confirmed line and condition test coverage did not change), and surprisingly it still failed.

I then discovered that with this setup, tlsClientServer_useSystemRootCerts_validationContext fails only when tlsClientServer_Spiffe_noClientAuthentication_wrongServerCert is run. If I put org.junit.Assume.assumeTrue(false); as the first line of tlsClientServer_Spiffe_noClientAuthentication_wrongServerCert, then tlsClientServer_useSystemRootCerts_validationContext stops flaking. So what's special about that test?

I ran with --config=tsan, and only in scenarios it was flaky do I see a race in the JDK (http://sponge2/7d32cc24-5556-4d59-92b4-a2382cb59687). There is an unsynchronized read of descriptor, while the write is within the lock. In the Java memory model, that doesn't cause a tear by itself because TrustStoreDescriptor only has final fields. So the only risk is a thread could see the old value. But csRef is also mutated and read without synchronization. Those two fields could be out-of-sync.

But why are multiple threads accessing that? Looking at the stack traces, both the reader and writer have FileWatcherCertificateProvider.checkAndReloadCertificates(). There's probably two concurrent because one for both client and server. To see how many exist at once I added println to FileWatcherCertificateProvider.start and close and looked at successful runs. I see half the providers appear not to be shut down, and with a pattern of two created, then one shut down. I expect either the client or server is leaking.

diff --git a/xds/src/test/java/io/grpc/xds/XdsSecurityClientServerTest.java b/xds/src/test/java/io/grpc/xds/XdsSecurityClientServerTest.java
index 8e1220fe9..95574130e 100644
--- a/xds/src/test/java/io/grpc/xds/XdsSecurityClientServerTest.java
+++ b/xds/src/test/java/io/grpc/xds/XdsSecurityClientServerTest.java
@@ -90,7 +90,6 @@ import java.security.cert.CertificateException;
 import java.security.cert.CertificateFactory;
 import java.util.ArrayList;
 import java.util.Arrays;
-import java.util.Collection;
 import java.util.Collections;
 import java.util.List;
 import java.util.concurrent.Executors;
@@ -102,20 +101,16 @@ import org.junit.Before;
 import org.junit.Rule;
 import org.junit.Test;
 import org.junit.runner.RunWith;
-import org.junit.runners.Parameterized;
-import org.junit.runners.Parameterized.Parameter;
-import org.junit.runners.Parameterized.Parameters;
+import org.junit.runners.JUnit4;
 
 /**
  * Unit tests for {@link XdsChannelCredentials} and {@link XdsServerBuilder} for plaintext/TLS/mTLS
  * modes.
  */
-@RunWith(Parameterized.class)
+@RunWith(JUnit4.class)
 public class XdsSecurityClientServerTest {
 
-  @Parameter
-  public Boolean enableSpiffe;
-  private Boolean originalEnableSpiffe;
+  private boolean originalEnableSpiffe;
 
   @Rule public final GrpcCleanupRule cleanupRule = new GrpcCleanupRule();
   private int port;
@@ -128,15 +123,9 @@ public class XdsSecurityClientServerTest {
   private FakeXdsClientPoolFactory fakePoolFactory = new FakeXdsClientPoolFactory(xdsClient);
   private static final String OVERRIDE_AUTHORITY = "foo.test.google.fr";
 
-  @Parameters(name = "enableSpiffe={0}")
-  public static Collection<Boolean> data() {
-    return ImmutableList.of(true, false);
-  }
-
   @Before
   public void setUp() throws IOException {
     saveEnvironment();
-    FileWatcherCertificateProviderProvider.enableSpiffe = enableSpiffe;
   }
 
   private void saveEnvironment() {
@@ -285,7 +274,18 @@ public class XdsSecurityClientServerTest {
   }
 
   @Test
-  public void tlsClientServer_Spiffe_noClientAuthentication() throws Exception {
+  public void tlsClientServer_Spiffe_noClientAuthentication_spiffeEnabled() throws Exception {
+    FileWatcherCertificateProviderProvider.enableSpiffe = true;
+    tlsClientServer_Spiffe_noClientAuthentication();
+  }
+
+  @Test
+  public void tlsClientServer_Spiffe_noClientAuthentication_spiffeDisabled() throws Exception {
+    FileWatcherCertificateProviderProvider.enableSpiffe = false;
+    tlsClientServer_Spiffe_noClientAuthentication();
+  }
+
+  private void tlsClientServer_Spiffe_noClientAuthentication() throws Exception {
     DownstreamTlsContext downstreamTlsContext =
         setBootstrapInfoAndBuildDownstreamTlsContext(SERVER_1_SPIFFE_PEM_FILE, null, null, null,
             null, null, false, false);
@@ -302,9 +302,7 @@ public class XdsSecurityClientServerTest {
 
   @Test
   public void tlsClientServer_Spiffe_noClientAuthentication_wrongServerCert() throws Exception {
-    if (!enableSpiffe) {
-      return;
-    }
+    FileWatcherCertificateProviderProvider.enableSpiffe = true;
     DownstreamTlsContext downstreamTlsContext =
         setBootstrapInfoAndBuildDownstreamTlsContext(SERVER_1_PEM_FILE, null, null, null, null,
             null, false, false);
@@ -389,10 +387,23 @@ public class XdsSecurityClientServerTest {
     }
   }
 
+  @Test
+  public void mtlsClientServer_Spiffe_withClientAuthentication_withXdsChannelCreds_spiffeEnabled()
+      throws Exception {
+    FileWatcherCertificateProviderProvider.enableSpiffe = true;
+    mtlsClientServer_Spiffe_withClientAuthentication_withXdsChannelCreds();
+  }
+
+  @Test
+  public void mtlsClientServer_Spiffe_withClientAuthentication_withXdsChannelCreds_spiffeDisabled()
+      throws Exception {
+    FileWatcherCertificateProviderProvider.enableSpiffe = false;
+    mtlsClientServer_Spiffe_withClientAuthentication_withXdsChannelCreds();
+  }
+
   /** mTLS with Spiffe Trust Bundle - client auth enabled - using {@link XdsChannelCredentials}
    * API. */
-  @Test
-  public void mtlsClientServer_Spiffe_withClientAuthentication_withXdsChannelCreds()
+  private void mtlsClientServer_Spiffe_withClientAuthentication_withXdsChannelCreds()
       throws Exception {
     DownstreamTlsContext downstreamTlsContext =
         setBootstrapInfoAndBuildDownstreamTlsContext(SERVER_1_SPIFFE_PEM_FILE, null, null, null,
@@ -408,8 +419,20 @@ public class XdsSecurityClientServerTest {
   }
 
   @Test
-  public void mtlsClientServer_Spiffe_badClientCert_expectException()
+  public void mtlsClientServer_Spiffe_badClientCert_expectException_spiffeEnabled()
       throws Exception {
+    FileWatcherCertificateProviderProvider.enableSpiffe = true;
+    mtlsClientServer_Spiffe_badClientCert_expectException();
+  }
+
+  @Test
+  public void mtlsClientServer_Spiffe_badClientCert_expectException_spiffeDisabled()
+      throws Exception {
+    FileWatcherCertificateProviderProvider.enableSpiffe = false;
+    mtlsClientServer_Spiffe_badClientCert_expectException();
+  }
+
+  private void mtlsClientServer_Spiffe_badClientCert_expectException() throws Exception {
     DownstreamTlsContext downstreamTlsContext =
         setBootstrapInfoAndBuildDownstreamTlsContext(SERVER_1_SPIFFE_PEM_FILE, null, null, null,
             null, SPIFFE_TRUST_MAP_1_FILE, true, true);

@ejona86 ejona86 self-assigned this Nov 14, 2024
ejona86 added a commit to ejona86/grpc-java that referenced this issue Nov 14, 2024
When spiffe support was added it caused
tlsClientServer_useSystemRootCerts_validationContext to become flaky.
This is because test execution order was important for whether the race
would occur.

Fixes grpc#11678
@ejona86
Copy link
Member Author

ejona86 commented Nov 14, 2024

Split out the certificate provider leak into #11692, to keep this issue just for the flake.

@erm-g
Copy link
Contributor

erm-g commented Nov 15, 2024

@ejona86 You're absolutely right about 2 instances of FileWatcherCertificateProvider - they're started for most of the tests here (one is for client and another for server). That's why I'm confused about

If I put org.junit.Assume.assumeTrue(false); as the first line of tlsClientServer_Spiffe_noClientAuthentication_wrongServerCert, then tlsClientServer_useSystemRootCerts_validationContext stops flaking.

I'd expect it to still be flaky because of the next "neighboring" test with 2 FWCPs

One practical question: we never expect the client and server to be executed under the same JVM in the real world, correct? Because if they were, we might need to create a similar fix for the codebase itself, not just for the test.

ejona86 added a commit that referenced this issue Nov 15, 2024
When spiffe support was added it caused
tlsClientServer_useSystemRootCerts_validationContext to become flaky.
This is because test execution order was important for whether the race
would occur.

Fixes #11678
@ejona86
Copy link
Member Author

ejona86 commented Nov 15, 2024

I'd expect it to still be flaky because of the next "neighboring" test with 2 FWCPs

You're assuming a particular execution order. They aren't executed in a defined order. (Often the order is repeatable, but it is "random" based on hashing and such.)

we never expect the client and server to be executed under the same JVM in the real world, correct?

The main issue was the test was changing the system properties which caused the cache to become invalidated. If there is no cache, then the code works fine.

@erm-g
Copy link
Contributor

erm-g commented Nov 15, 2024

I'd expect it to still be flaky because of the next "neighboring" test with 2 FWCPs

You're assuming a particular execution order. They aren't executed in a defined order. (Often the order is repeatable, but it is "random" based on hashing and such.)

we never expect the client and server to be executed under the same JVM in the real world, correct?

The main issue was the test was changing the system properties which caused the cache to become invalidated. If there is no cache, then the code works fine.

Got it - thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
highly flaky Issue is for a test that is crying wolf
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants