-
Notifications
You must be signed in to change notification settings - Fork 175
Deadlock fix minus refactor #73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deadlock fix minus refactor #73
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable to me
@osheroff could you take a look at this when you have time? Our application that uses the binary log client is currently forked from the Shyiko fork. We'd love to rebase to this fork, but the deadlock fix is a must-have for us. We can maintain it in our own fork if needed, but since it's been requested here a few times, and would make our rebase easier, I thought it would be a good idea to merge it here. |
I looked at this a bit. I gotta say I’m a little mystified by these changes (which ones do something? Which are just window dressing?).
You’re running with this in production for awhile?
… On Jun 16, 2022, at 14:11, Ashley Cox ***@***.***> wrote:
@osheroff could you take a look at this when you have time? Our application that uses the binary log client is currently forked from the Shyiko fork. We'd love to rebase to this fork, but the deadlock fix is a must-have for us. We can maintain it in our own fork if needed, but since it's been requested here a few times, and would make our rebase easier, I thought it would be a good idea to merge it here.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you were mentioned.
|
Thanks for looking, @osheroff! Sorry for the late reply here - I've been traveling the last week. We haven't been running what's in this PR, but we've been running a fork containing the 2 original Shyiko commits because we saw this deadlock issue in production. I attempted here to just undo some of the simpler refactors that were included in the original "deadlock fix" release. I'm a little tentative to do more because I also am not entirely sure what's essential. It looks to be the addition of the countdown latch, passing a local channel around, and some re-ordering of the connect method control flow. Can't say I'm confident enough to re-write a simpler fix, so I was hoping a slightly simpler version of those original commits might fly - no good? |
i mean... I'm just hesitant to commit code that neither the committer nor the maintainer really understand, even if it did come straight from the stanley shyko's mouth, ya know? I'll see if I can find time repro the issue and maybe craft a fix I understand... |
I understand. I can reproduce it (inconsistently) with the parameters noted here (+ keepalive set to true): shyiko#321 (comment) |
also happy to have you take a stab and collaborate with you, but no worries if you're not wanting to get deep into the muck and mire here |
(too many metaphors) |
Let me check in with my team on this and see what we have time and space for. Thanks a bunch. |
@osheroff Do you have time to meet and talk over how to move forward? |
So, I've tried undoing more changes (local channel changes, etc) and haven't been able to find a more minimal version (of these changes, at least) that still prevents a deadlock when setting the keepalive interval to 2 and the heartbeat interval to 10. I wonder what specifically is screaming "window dressing" - maybe the diff looks more significant than it is because of the addition of new try blocks, etc? @osheroff would love if you could attempt to address this in a way that works for you, because the deadlocking causes significant issues for us, but we want to base our fork off this one in order to keep current with other fixes and improvements. Do you have a rough ETA on when you could have this? Thanks! |
I'm looking now. @ashleycoxley are your stack traces the same as reported in the original issue? The deadlock occurs in the disconnect code path? |
like, I think the solution might be as simple as signalling to the keepalive thread that it shouldn't attempt reconnect because the process is shutting down; but I'd like to see:
|
Yes. I think the original issue details our own experience with it pretty well and answers your questions about our code (let me know if that's not the case). As that issue lays out - the deadlock can be reproduced by setting keepalive to true, the heartbeat interval to 10, and the keepalive interval to 2. I reproduce it using these parameters and running BinaryLogClientIntegrationTest. (Setting keepalive interval to 1 was suggested in that original ticket, but this causes those tests to fail for other reasons - 2 reproduces the issue.) eta - It doesn't actually fail every time, but is fairly consistent - one out of every 2 or 3 runs for me. Here's a thread dump taken after that test ran into a deadlock using those parameters:
|
great, thanks for that. i'll see if I can get it repro'ing over the weekend and find a fix. |
Awesome! Thanks. |
#78 isn't too bad; just ask for shutdown and then release the contended lock before going into a wait-loop. Now, having crazy low heartbeat/keepalive settings exposes other race conditions; i'm not sure if they'd be a problem with non-pathological heartbeat/keepalive settings, but at least this fixes the deadlock. |
Cool, that is so much simpler. |
we fixed this elsewhere |
Adds the Shyiko-authored deadlock fix to the Osheroff branch, minus some refactoring. These commits have been submitted to the Osheroff fork twice before, and rejected because they contained too much refactoring:
#22
#24
This is my attempt to submit just the deadlock fix without the refactoring. I've tested that without these changes, the deadlock situation brought up here is still a problem in the current Osheroff fork, and that with these changes, that deadlock situation does not come up.