fix: read exactly M values from channel#5447
Conversation
|
I fixed this problem accidentally as well with #5449 but that solves more, because on DATA strategy early return, the subsequent RACE strategy will not deal with stale counters. the Edit: though it makes sense to give a dedicated error return at the end of the block, even if the assumption should be correct now. that return never should be reached anyway and a named error could indicate it. |
|
@nugaon I don't fully understand why the buffered channel is needed in the first place. This is an anti-pattern in go and is against our coding guidelines: channel size should be one or none. If we would respect context cancellation and not have these sort of returns that forever block on a channel (aka: have a |
|
on fan-in cases like this where we have exact number of go processes, creating the channel with the size of the routines could be considerable because it does not block at all. we could make it with waitgroups and wait for all processes to return. |
Checklist
Description
Read exactly M values from channel instead of reading "forever", which leads to deadlock in really seldom cases, which happen in mostly in CI
Open API Spec Version Changes (if applicable)
Motivation and Context (Optional)
runStrategyassumes that while consuming results fromc, one of the two exit conditions will always become true before reads are exhausted. Because of that assumption, the previous loop usedfor range c(which only terminates when c is closed).Why this can deadlock
In the DATA -> RACE fallback path, some DATA-owned shard fetches may still be finishing while RACE starts.
RACE can include data shard indices whose waits[i] is not closed yet; for those indices, fetch takes the fly=false path and waits on waits[i].
For a successful DATA fetch, the operation order is:
There is a small window between (1) and (2):
c,runStrategycan consume that last value fromc,fetchedCntmay still be stale for that check.If this happens on the last available message in
c, neither exit condition may trigger in that iteration, and the next read blocks forever (no more writers, channel not closed).Related Issue (Optional)
Screenshots (if appropriate):
AI Disclosure