@HaraldNordgren
Contributor

@HaraldNordgren HaraldNordgren commented Oct 28, 2025

Follow-up to #402 (comment).

@HaraldNordgren HaraldNordgren force-pushed the TestSubscriptionClose_flake branch from c3c0cc0 to 50ba8c0 Compare October 28, 2025 08:48
@HaraldNordgren
Contributor Author

HaraldNordgren commented Oct 28, 2025

I repeatedly amended and pushed my commit to re-run the tests. On the 8th run the tests failed; there seems to be a deadlock in the code.

@HaraldNordgren HaraldNordgren marked this pull request as draft October 28, 2025 09:16
@HaraldNordgren HaraldNordgren force-pushed the TestSubscriptionClose_flake branch 5 times, most recently from 21e0698 to 24ed421 Compare October 28, 2025 11:10
@HaraldNordgren HaraldNordgren changed the title integration test: longer counting to avoid flakiness integration test: prevent deadlocks Oct 28, 2025
@HaraldNordgren HaraldNordgren force-pushed the TestSubscriptionClose_flake branch 2 times, most recently from 0af0f08 to 366f1f3 Compare October 28, 2025 11:12
@HaraldNordgren HaraldNordgren force-pushed the TestSubscriptionClose_flake branch from 366f1f3 to 50ba8c0 Compare October 28, 2025 13:46
@HaraldNordgren HaraldNordgren force-pushed the TestSubscriptionClose_flake branch 3 times, most recently from fa5ed52 to e4b8869 Compare October 28, 2025 15:00
@HaraldNordgren HaraldNordgren force-pushed the TestSubscriptionClose_flake branch 2 times, most recently from 034b115 to f97ba73 Compare October 28, 2025 18:18
@HaraldNordgren HaraldNordgren force-pushed the TestSubscriptionClose_flake branch from f97ba73 to 2523e2a Compare October 28, 2025 19:20
@HaraldNordgren
Contributor Author

Hi @benjaminjkraft, I'm at a bit of a loss here.

The deadlock in the tests seems to stem from forwardData still holding a lock.

However, removing that lock exposes a situation where we can end up sending on a closed channel. That causes a panic, which could be caught and ignored, but that seems like a very bad way to handle the problem.

There seems to be an inherent race condition between checking isClosing and actually using the channel. It seems tricky to make this atomic.
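One common way to sidestep this kind of check-then-send race is to never close the data channel at all and to signal shutdown through a separate channel instead. The following is only a minimal sketch of that idea, not the code in this PR: the `subscription` type and the `forward`, `close`, and `done` names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// subscription is a hypothetical stand-in, not the type used in this PR.
type subscription struct {
	data chan string   // never closed, so a send on it cannot panic
	done chan struct{} // closed exactly once to signal shutdown
	once sync.Once
}

func newSubscription() *subscription {
	return &subscription{
		data: make(chan string),
		done: make(chan struct{}),
	}
}

// forward tries to deliver msg and reports whether it was delivered.
// It takes no lock: the select either hands the message to a receiver
// or observes that done has been closed.
func (s *subscription) forward(msg string) bool {
	select {
	case s.data <- msg:
		return true
	case <-s.done:
		return false
	}
}

// close signals shutdown. sync.Once makes repeated calls safe.
func (s *subscription) close() {
	s.once.Do(func() { close(s.done) })
}

func main() {
	s := newSubscription()
	go func() { <-s.data }() // a receiver for the first message

	fmt.Println(s.forward("hello")) // delivered to the receiver
	s.close()
	fmt.Println(s.forward("late")) // dropped: no receiver, done is closed
}
```

The trade-off is that consumers can no longer rely on the data channel being closed to detect the end of the stream; they would need to watch `done` (or an equivalent signal) as well, which may or may not fit the existing API.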

Do you have any ideas?

@HaraldNordgren HaraldNordgren force-pushed the TestSubscriptionClose_flake branch 2 times, most recently from 210caa3 to a984656 Compare October 28, 2025 22:14
@HaraldNordgren HaraldNordgren force-pushed the TestSubscriptionClose_flake branch 4 times, most recently from b56cbd8 to d6b326f Compare October 28, 2025 22:41
@HaraldNordgren HaraldNordgren force-pushed the TestSubscriptionClose_flake branch from d6b326f to f9a5e5d Compare October 28, 2025 22:51
@HaraldNordgren HaraldNordgren changed the title integration test: prevent deadlocks websocket: prevent deadlocks Oct 28, 2025
@HaraldNordgren HaraldNordgren force-pushed the TestSubscriptionClose_flake branch 7 times, most recently from 2fa4559 to 01125ae Compare October 28, 2025 23:32
@HaraldNordgren HaraldNordgren force-pushed the TestSubscriptionClose_flake branch from 01125ae to 9f24ffc Compare October 28, 2025 23:43
@HaraldNordgren HaraldNordgren marked this pull request as ready for review October 28, 2025 23:52
@HaraldNordgren HaraldNordgren force-pushed the TestSubscriptionClose_flake branch 2 times, most recently from bcce435 to 9f24ffc Compare October 29, 2025 08:18
Collaborator

@benjaminjkraft benjaminjkraft left a comment

Hmm, this gets kinda complicated. I would need to think some more to see if I think it's right (for example I sure don't recall the rules for atomics). It feels like it's just sweeping problems under the rug.

I was too lazy to think so I gave this to Claude, who refactored to have just one lock (in the client); this passes the tests at least and seems less likely to have a hidden deadlock, although I'm not sure I fully understand what invariants we need to ensure. Does having two locks actually let us be meaningfully more granular? (Or, can we ensure we always acquire the two locks in a fixed order? Regardless, we should probably ensure we never hold a lock during network operations or calls to user-controlled code.)
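To make the last point concrete, here is a rough sketch of the "one lock, released before calling out" shape. It is not the refactor mentioned above: the `client` type, its `handlers` map, and the `subscribe`, `unsubscribe`, and `dispatch` methods are all hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// client is a hypothetical stand-in for the websocket client.
type client struct {
	mu       sync.Mutex
	handlers map[string]func(msg string) // guarded by mu
}

func newClient() *client {
	return &client{handlers: make(map[string]func(string))}
}

func (c *client) subscribe(id string, h func(string)) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.handlers[id] = h
}

func (c *client) unsubscribe(id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.handlers, id)
}

// dispatch looks the handler up under the lock, releases the lock, and
// only then calls the handler, so user code can call back into the
// client without deadlocking.
func (c *client) dispatch(id, msg string) bool {
	c.mu.Lock()
	h, ok := c.handlers[id]
	c.mu.Unlock()
	if !ok {
		return false
	}
	h(msg)
	return true
}

func main() {
	c := newClient()
	c.subscribe("sub-1", func(msg string) {
		fmt.Println("got", msg)
		// Re-enters the client. This would deadlock if dispatch
		// still held c.mu while calling the handler.
		c.unsubscribe("sub-1")
	})
	fmt.Println(c.dispatch("sub-1", "first"))  // handler runs, then true
	fmt.Println(c.dispatch("sub-1", "second")) // false: already unsubscribed
}
```

The cost of this shape is that a handler can still run just after a concurrent `unsubscribe`, because the lookup and the call are no longer atomic; whether that is acceptable depends on the invariants the client needs.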

@HaraldNordgren
Copy link
Contributor Author

@benjaminjkraft Makes sense, I added some comments on the PR. Thanks for the help! ❤️
