
Deadlock fix minus refactor #73


Closed

Conversation

@ashleycoxley commented Jun 14, 2022

Adds the Shyiko-authored deadlock fix to the Osheroff branch, minus some refactoring. These commits have been submitted to the Osheroff fork twice before, and rejected because they contained too much refactoring:
#22
#24

This is my attempt to submit just the deadlock fix without the refactoring. I've tested that without these changes, the deadlock situation brought up here is still a problem in the current Osheroff fork, and that with these changes, that deadlock situation does not come up.

@fivetran-mattalexander left a comment

Looks reasonable to me

@ashleycoxley (Author)

@osheroff could you take a look at this when you have time? Our application that uses the binary log client is currently forked from the Shyiko fork. We'd love to rebase to this fork, but the deadlock fix is a must-have for us. We can maintain it in our own fork if needed, but since it's been requested here a few times, and would make our rebase easier, I thought it would be a good idea to merge it here.

@osheroff (Owner) commented Jun 17, 2022 via email

@ashleycoxley (Author)

Thanks for looking, @osheroff! Sorry for the late reply here - I've been traveling the last week.

We haven't been running what's in this PR, but we have been running a fork containing the 2 original Shyiko commits, because we saw this deadlock issue in production. Here I attempted to undo just some of the simpler refactors that were included in the original "deadlock fix" release. I'm hesitant to do more because I'm not entirely sure what's essential; it looks to be the addition of the countdown latch, passing a local channel around, and some re-ordering of the connect method's control flow.
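To illustrate the "passing a local channel around" piece mentioned above, here is a minimal hypothetical sketch (not the actual Shyiko diff; all names are invented): connect() builds the channel in a local variable and only publishes it once fully set up, re-checking the shutdown flag after publishing, so a concurrent disconnect() never leaves a half-initialized channel open.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the "local channel" idea; not the library's code.
public class LocalChannelSketch {
    static final class Channel {
        volatile boolean open = true;
        void close() { open = false; }
    }

    private final AtomicReference<Channel> channel = new AtomicReference<>();
    private volatile boolean shutdownRequested = false;

    public boolean connect() {
        Channel local = new Channel();   // work on a local reference
        // ... handshake, registration, etc. would happen here ...
        if (shutdownRequested) {         // lost the race with disconnect()
            local.close();
            return false;
        }
        channel.set(local);              // publish only when complete
        if (shutdownRequested) {         // lost a late race: clean up
            local.close();
            return false;
        }
        return true;
    }

    public void disconnect() {
        shutdownRequested = true;        // flag first, then tear down
        Channel c = channel.get();
        if (c != null) c.close();
    }
}
```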

Can't say I'm confident enough to re-write a simpler fix, so I was hoping a slightly simpler version of those original commits might fly - no good?

@osheroff (Owner)

I mean... I'm just hesitant to commit code that neither the committer nor the maintainer really understands, even if it did come straight from Stanley Shyiko's mouth, ya know? I'll see if I can find time to repro the issue and maybe craft a fix I understand...

@ashleycoxley (Author)

I understand.

I can reproduce it (inconsistently) with the parameters noted here (+ keepalive set to true): shyiko#321 (comment)

@osheroff (Owner)

also happy to have you take a stab and collaborate with you, but no worries if you're not wanting to get deep into the muck and mire here

@osheroff (Owner)

(too many metaphors)

@ashleycoxley (Author)

Let me check in with my team on this and see what we have time and space for. Thanks a bunch.

@ashleycoxley (Author)

@osheroff Do you have time to meet and talk over how to move forward?

@ashleycoxley (Author)

So, I've tried undoing more changes (the local channel changes, etc.) and haven't been able to find a more minimal version (of these changes, at least) that still prevents a deadlock when setting the keepalive interval to 2 and the heartbeat interval to 10. I wonder what specifically is screaming "window dressing" - maybe the diff looks more significant than it is because of the addition of new try blocks, etc.?

@osheroff I would love it if you could attempt to address this in a way that works for you, because the deadlocking causes significant issues for us, and we want to base our fork off this one in order to keep current with other fixes and improvements. Do you have a rough ETA on when you could have this? Thanks!

@osheroff (Owner)

I'm looking now.

@ashleycoxley are your stack traces the same as reported in the original issue? The deadlock occurs in the disconnect code path?

@osheroff (Owner)

like, I think the solution might be as simple as signalling to the keepalive thread that it shouldn't attempt reconnect because the process is shutting down; but I'd like to see:

  • any code you have that might repro the issue (I can't seem to repro it naively/locally)
  • stack traces from the deadlock
  • your current codebase so I can line up the deadlock stack traces with actual lines of code
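The "signal the keepalive thread" idea mentioned above might be sketched like this (a hypothetical minimal sketch, not the library's actual code; all names are invented): the keepalive tick checks a volatile shutdown flag before attempting reconnect, so during shutdown it skips the reconnect instead of contending for the connect lock.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: keepalive skips reconnect once shutdown is requested.
public class KeepaliveSketch {
    private volatile boolean shutdownRequested = false;
    private final AtomicInteger reconnectAttempts = new AtomicInteger();

    public void keepaliveTick(boolean connectionLooksDead) {
        if (shutdownRequested) {
            return;                           // shutting down: do not reconnect
        }
        if (connectionLooksDead) {
            reconnectAttempts.incrementAndGet();  // stand-in for reconnect()
        }
    }

    public void disconnect() {
        shutdownRequested = true;             // signal first, then tear down
        // ... actual socket teardown would happen here ...
    }

    public int attempts() { return reconnectAttempts.get(); }

    public static void main(String[] args) {
        KeepaliveSketch s = new KeepaliveSketch();
        s.keepaliveTick(true);   // before shutdown: reconnect attempted
        s.disconnect();
        s.keepaliveTick(true);   // after shutdown: skipped
        System.out.println(s.attempts()); // prints 1
    }
}
```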

@ashleycoxley (Author) commented Jul 14, 2022

> are your stack traces the same as reported in the original issue? The deadlock occurs in the disconnect code path?

Yes. I think the original issue details our own experience with it pretty well and answers your questions about our code (let me know if that's not the case).

As that issue lays out, the deadlock can be reproduced by setting keepalive to true, the heartbeat interval to 10, and the keepalive interval to 2. I reproduce it using those parameters while running BinaryLogClientIntegrationTest. (Setting the keepalive interval to 1 was suggested in the original ticket, but that causes the tests to fail for other reasons - 2 reproduces the issue.) Edited to add: it doesn't actually fail every time, but it's fairly consistent - one out of every 2 or 3 runs for me.
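For reference, the repro configuration described above might look roughly like this (a sketch assuming the `BinaryLogClient` setters from this library, with interval units in milliseconds; the host, port, and credentials are placeholders, and a running MySQL server is required):

```java
import com.github.shyiko.mysql.binlog.BinaryLogClient;

// Sketch of the repro settings described in the comment above.
BinaryLogClient client = new BinaryLogClient("127.0.0.1", 3306, "user", "password");
client.setHeartbeatInterval(10); // pathologically low, as in the repro
client.setKeepAlive(true);
client.setKeepAliveInterval(2);  // pathologically low, as in the repro
client.connect();
```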

Here's a thread dump taken after that test ran into a deadlock using those parameters:

"main" #1 prio=5 os_prio=31 cpu=770.40ms elapsed=108.83s tid=0x00007f92b0009000 nid=0x2803 waiting on condition  [0x000070000b725000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        - parking to wait for  <0x00000006160f4380> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt([email protected]/AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued([email protected]/AbstractQueuedSynchronizer.java:917)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:1240)
        at java.util.concurrent.locks.ReentrantLock.lock([email protected]/ReentrantLock.java:267)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnect(BinaryLogClient.java:1189)
        at com.github.shyiko.mysql.binlog.BinaryLogClientIntegrationTest.tearDown(BinaryLogClientIntegrationTest.java:1194)
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0([email protected]/Native Method)
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke([email protected]/NativeMethodAccessorImpl.java:62)
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke([email protected]/DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke([email protected]/Method.java:566)
        at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:84)
        at org.testng.internal.Invoker.invokeConfigurationMethod(Invoker.java:564)
        at org.testng.internal.Invoker.invokeConfigurations(Invoker.java:213)
        at org.testng.internal.Invoker.invokeConfigurations(Invoker.java:138)
        at org.testng.internal.TestMethodWorker.invokeAfterClassMethods(TestMethodWorker.java:225)
        at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:114)
        at org.testng.TestRunner.privateRun(TestRunner.java:767)
        at org.testng.TestRunner.run(TestRunner.java:617)
        at org.testng.SuiteRunner.runTest(SuiteRunner.java:334)
        at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:329)
        at org.testng.SuiteRunner.privateRun(SuiteRunner.java:291)
        at org.testng.SuiteRunner.run(SuiteRunner.java:240)
        at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
        at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86)
        at org.testng.TestNG.runSuitesSequentially(TestNG.java:1224)
        at org.testng.TestNG.runSuitesLocally(TestNG.java:1149)
        at org.testng.TestNG.run(TestNG.java:1057)
        at org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:69)
        at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeSingleClass(TestNGDirectoryTestSuite.java:120)
        at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.execute(TestNGDirectoryTestSuite.java:104)
        at org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:113)
        at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
        at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
        at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)

   Locked ownable synchronizers:
        - None

"Reference Handler" #2 daemon prio=10 os_prio=31 cpu=0.52ms elapsed=108.81s tid=0x00007f92b7070000 nid=0x4303 waiting on condition  [0x000070000be3b000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.ref.Reference.waitForReferencePendingList([email protected]/Native Method)
        at java.lang.ref.Reference.processPendingReferences([email protected]/Reference.java:241)
        at java.lang.ref.Reference$ReferenceHandler.run([email protected]/Reference.java:213)

   Locked ownable synchronizers:
        - None

"Finalizer" #3 daemon prio=8 os_prio=31 cpu=0.45ms elapsed=108.81s tid=0x00007f92c680d000 nid=0x3203 in Object.wait()  [0x000070000bf3e000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait([email protected]/Native Method)
        - waiting on <0x0000000616013130> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove([email protected]/ReferenceQueue.java:155)
        - waiting to re-lock in wait() <0x0000000616013130> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove([email protected]/ReferenceQueue.java:176)
        at java.lang.ref.Finalizer$FinalizerThread.run([email protected]/Finalizer.java:170)

   Locked ownable synchronizers:
        - None

"Signal Dispatcher" #4 daemon prio=9 os_prio=31 cpu=0.26ms elapsed=108.80s tid=0x00007f92a6818800 nid=0x3903 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
        - None

"Service Thread" #5 daemon prio=9 os_prio=31 cpu=0.05ms elapsed=108.80s tid=0x00007f92a780d000 nid=0x5603 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
        - None

"C2 CompilerThread0" #6 daemon prio=9 os_prio=31 cpu=1579.70ms elapsed=108.80s tid=0x00007f927680a800 nid=0xa803 waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   No compile task

   Locked ownable synchronizers:
        - None

"C1 CompilerThread0" #14 daemon prio=9 os_prio=31 cpu=472.68ms elapsed=108.80s tid=0x00007f92c6814000 nid=0x5903 waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   No compile task

   Locked ownable synchronizers:
        - None

"Sweeper thread" #18 daemon prio=9 os_prio=31 cpu=0.07ms elapsed=108.80s tid=0x00007f92c6815000 nid=0x5b03 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
        - None

"Common-Cleaner" #19 daemon prio=8 os_prio=31 cpu=0.70ms elapsed=108.77s tid=0x00007f92a78aa000 nid=0x5e03 in Object.wait()  [0x000070000c85c000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait([email protected]/Native Method)
        - waiting on <0x0000000616055818> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove([email protected]/ReferenceQueue.java:155)
        - waiting to re-lock in wait() <0x0000000616055818> (a java.lang.ref.ReferenceQueue$Lock)
        at jdk.internal.ref.CleanerImpl.run([email protected]/CleanerImpl.java:148)
        at java.lang.Thread.run([email protected]/Thread.java:834)
        at jdk.internal.misc.InnocuousThread.run([email protected]/InnocuousThread.java:134)

   Locked ownable synchronizers:
        - None

"mysql-cj-abandoned-connection-cleanup" #23 daemon prio=5 os_prio=31 cpu=4.19ms elapsed=101.17s tid=0x00007f92a686b000 nid=0x8503 in Object.wait()  [0x000070000d7b0000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait([email protected]/Native Method)
        - waiting on <0x0000000616009bb0> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove([email protected]/ReferenceQueue.java:155)
        - waiting to re-lock in wait() <0x0000000616009bb0> (a java.lang.ref.ReferenceQueue$Lock)
        at com.mysql.cj.jdbc.AbandonedConnectionCleanupThread.run(AbandonedConnectionCleanupThread.java:85)
        at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run([email protected]/Thread.java:834)

   Locked ownable synchronizers:
        - <0x0000000616009db0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"blc-keepalive-127.0.0.1:23840" #146 prio=5 os_prio=31 cpu=19.46ms elapsed=88.64s tid=0x00007f92b0292000 nid=0x86af waiting on condition  [0x000070000c95f000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        - parking to wait for  <0x00000006160f4380> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt([email protected]/AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued([email protected]/AbstractQueuedSynchronizer.java:917)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:1240)
        at java.util.concurrent.locks.ReentrantLock.lock([email protected]/ReentrantLock.java:267)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateConnect(BinaryLogClient.java:1219)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:889)
        at com.github.shyiko.mysql.binlog.BinaryLogClient$5.run(BinaryLogClient.java:814)
        at java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Executors.java:515)
        at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
        at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run([email protected]/Thread.java:834)

   Locked ownable synchronizers:
        - <0x0000000616068c60> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"blc-127.0.0.1:23840" #211 prio=5 os_prio=31 cpu=0.31ms elapsed=88.40s tid=0x00007f92a6911800 nid=0x9f37 runnable  [0x000070000ca62000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0([email protected]/Native Method)
        at java.net.SocketInputStream.socketRead([email protected]/SocketInputStream.java:115)
        at java.net.SocketInputStream.read([email protected]/SocketInputStream.java:168)
        at java.net.SocketInputStream.read([email protected]/SocketInputStream.java:140)
        at com.github.shyiko.mysql.binlog.io.BufferedSocketInputStream.read(BufferedSocketInputStream.java:51)
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.readWithinBlockBoundaries(ByteArrayInputStream.java:226)
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.peek(ByteArrayInputStream.java:198)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:530)
        at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:862)
        at java.lang.Thread.run([email protected]/Thread.java:834)

   Locked ownable synchronizers:
        - <0x00000006160f4380> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

"Attach Listener" #212 daemon prio=9 os_prio=31 cpu=94.08ms elapsed=13.39s tid=0x00007f92c6866000 nid=0x9807 waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
        - None

"RMI TCP Accept-0" #213 daemon prio=9 os_prio=31 cpu=1.69ms elapsed=13.00s tid=0x00007f92a78b2800 nid=0x8303 runnable  [0x000070000dab9000]
   java.lang.Thread.State: RUNNABLE
        at java.net.PlainSocketImpl.socketAccept([email protected]/Native Method)
        at java.net.AbstractPlainSocketImpl.accept([email protected]/AbstractPlainSocketImpl.java:458)
        at java.net.ServerSocket.implAccept([email protected]/ServerSocket.java:565)
        at java.net.ServerSocket.accept([email protected]/ServerSocket.java:533)
        at sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept([email protected]/LocalRMIServerSocketFactory.java:52)
        at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop([email protected]/TCPTransport.java:394)
        at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run([email protected]/TCPTransport.java:366)
        at java.lang.Thread.run([email protected]/Thread.java:834)

   Locked ownable synchronizers:
        - None

"RMI TCP Connection(1)-192.168.4.208" #214 daemon prio=9 os_prio=31 cpu=116.21ms elapsed=12.95s tid=0x00007f92c6891000 nid=0x8103 runnable  [0x000070000dbba000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0([email protected]/Native Method)
        at java.net.SocketInputStream.socketRead([email protected]/SocketInputStream.java:115)
        at java.net.SocketInputStream.read([email protected]/SocketInputStream.java:168)
        at java.net.SocketInputStream.read([email protected]/SocketInputStream.java:140)
        at java.io.BufferedInputStream.fill([email protected]/BufferedInputStream.java:252)
        at java.io.BufferedInputStream.read([email protected]/BufferedInputStream.java:271)
        - locked <0x000000061b06f428> (a java.io.BufferedInputStream)
        at java.io.FilterInputStream.read([email protected]/FilterInputStream.java:83)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages([email protected]/TCPTransport.java:544)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0([email protected]/TCPTransport.java:796)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0([email protected]/TCPTransport.java:677)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$131/0x000000080026d040.run([email protected]/Unknown Source)
        at java.security.AccessController.doPrivileged([email protected]/Native Method)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run([email protected]/TCPTransport.java:676)
        at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run([email protected]/Thread.java:834)

   Locked ownable synchronizers:
        - <0x000000061b2d3778> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"RMI Scheduler(0)" #215 daemon prio=9 os_prio=31 cpu=0.23ms elapsed=12.93s tid=0x00007f92a7ad5000 nid=0x7f03 waiting on condition  [0x000070000dcbf000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        - parking to wait for  <0x000000061b2c6808> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:234)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos([email protected]/AbstractQueuedSynchronizer.java:2123)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take([email protected]/ScheduledThreadPoolExecutor.java:1182)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take([email protected]/ScheduledThreadPoolExecutor.java:899)
        at java.util.concurrent.ThreadPoolExecutor.getTask([email protected]/ThreadPoolExecutor.java:1054)
        at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1114)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run([email protected]/Thread.java:834)

   Locked ownable synchronizers:
        - None

"JMX server connection timeout 216" #216 daemon prio=9 os_prio=31 cpu=5.74ms elapsed=12.92s tid=0x00007f9276860000 nid=0x6e03 in Object.wait()  [0x000070000ddc2000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait([email protected]/Native Method)
        - waiting on <0x000000061b0d9e60> (a [I)
        at com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout.run([email protected]/ServerCommunicatorAdmin.java:171)
        - waiting to re-lock in wait() <0x000000061b0d9e60> (a [I)
        at java.lang.Thread.run([email protected]/Thread.java:834)

   Locked ownable synchronizers:
        - None



@osheroff (Owner)

great, thanks for that. i'll see if I can get it repro'ing over the weekend and find a fix.

@ashleycoxley (Author)

Awesome! Thanks.

@osheroff (Owner)

#78 isn't too bad; just ask for shutdown and then release the contended lock before going into a wait-loop. Now, having crazy low heartbeat/keepalive settings exposes other race conditions; I'm not sure if they'd be a problem with non-pathological heartbeat/keepalive settings, but at least this fixes the deadlock.
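The shape of that fix, as described, might be sketched like this (a hypothetical sketch, not the actual #78 diff; all names are invented): disconnect() sets the shutdown flag while holding the lock, releases the lock, and only then enters the wait, so a thread blocked on the same lock (like the keepalive thread in terminateConnect) can acquire it, observe the flag, and wind down.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of "ask for shutdown, release the lock, then wait".
public class DisconnectSketch {
    private final ReentrantLock connectLock = new ReentrantLock();
    private volatile boolean shutdownRequested = false;
    private final CountDownLatch connectThreadDone = new CountDownLatch(1);

    public boolean disconnect() throws InterruptedException {
        connectLock.lock();
        try {
            shutdownRequested = true;   // ask for shutdown...
        } finally {
            connectLock.unlock();       // ...then release the contended lock
        }
        // the wait happens OUTSIDE the lock, so nobody deadlocks on it
        return connectThreadDone.await(5, TimeUnit.SECONDS);
    }

    // stand-in for the connect/keepalive side: acquires the lock, sees the
    // flag, and signals that it has wound down
    public void connectSide() {
        connectLock.lock();
        try {
            if (shutdownRequested) {
                connectThreadDone.countDown();
            }
        } finally {
            connectLock.unlock();
        }
    }

    // returns true when disconnect completes without deadlocking
    public static boolean demo() {
        DisconnectSketch s = new DisconnectSketch();
        Thread t = new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException e) { return; }
            s.connectSide();
        });
        t.start();
        try {
            boolean ok = s.disconnect();
            t.join();
            return ok;
        } catch (InterruptedException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints true
    }
}
```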

@ashleycoxley (Author)

Cool, that is so much simpler.

@osheroff (Owner)

we fixed this elsewhere

@osheroff osheroff closed this Jan 19, 2023