feat: force snapshot under memory pressure #25726
- The core of the change is a new method, `force_flush_buffer`, on the `Wal` trait. This gives callers a handle to choose when to kick off a snapshot.
- A higher-level background loop checks the overall table buffer size every `N` seconds and calls `force_flush_buffer` if the size is greater than a threshold (`X`). Both `N` and `X` are configurable through the CLI. `N` defaults to 10s and `X` defaults to 70%.
- Some refactoring ensures that the calls made via the `Wal` trait to flush the buffer and clean up any snapshot are reused across both branches (forced snapshot and normal WAL buffer flush).

closes: #25685
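The threshold check described above can be sketched roughly as follows. This is a minimal illustration and not the PR's code: the `Wal` trait here is a hypothetical stand-in with only the one method and an assumed signature, `exceeds_threshold`, `check_once`, and `CountingWal` are made-up names, and treating `X` as a percentage of some total memory budget is an assumption, since the description does not say what the 70% is measured against. The timer that would run the check every `N` seconds is left out.

```rust
/// Hypothetical stand-in for the `Wal` trait; only the method discussed
/// in the description is modelled, and its signature is an assumption.
trait Wal {
    fn force_flush_buffer(&mut self);
}

/// Returns true when the buffer uses strictly more than
/// `threshold_percent` of `total_bytes` (assumed memory budget).
fn exceeds_threshold(buffer_bytes: u64, total_bytes: u64, threshold_percent: u64) -> bool {
    // Widen to u128 so the multiplications cannot overflow.
    (buffer_bytes as u128) * 100 > (total_bytes as u128) * (threshold_percent as u128)
}

/// One iteration of the background check: force a flush if over threshold.
/// Returns whether a flush was forced.
fn check_once<W: Wal>(
    wal: &mut W,
    buffer_bytes: u64,
    total_bytes: u64,
    threshold_percent: u64,
) -> bool {
    if exceeds_threshold(buffer_bytes, total_bytes, threshold_percent) {
        wal.force_flush_buffer();
        true
    } else {
        false
    }
}

/// Test double that only counts how often a flush was forced.
struct CountingWal {
    forced: usize,
}

impl Wal for CountingWal {
    fn force_flush_buffer(&mut self) {
        self.forced += 1;
    }
}
```

Keeping the comparison in a pure function separate from the loop makes the threshold logic testable without a timer or a real WAL.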
Left a comment about the logic; I think there's one bit to be cleaned up. Will have a quick call to walk through it.
/// Interval to check buffer size (and compare with `force_snapshot_mem_threshold`)
#[clap(
    long = "force-snapshot-interval",
    env = "INFLUXDB3_FORCE_SNAPSHOT_INTERVAL",
I don't think this needs to be a configuration option. Just have it as a constant. You can then either initialize with that or, in tests, with something smaller.
I've removed this interval in the other PR, although I was planning to use it in e2e tests.
@@ -118,6 +123,29 @@ impl SnapshotTracker {
})
}

fn snapshot_up_to_last_wal_period(&mut self) -> Option<SnapshotInfo> {
This function name is a bit off. During normal snapshot operation, we want to snapshot flush size WAL periods but leave behind flush size / 2 periods. That means we wait until we have flush size + flush size / 2 periods, then take the oldest flush size periods and leave the remaining flush size / 2.
If we can't flush the WAL and leave periods behind because the timestamps of the data are all interleaved, we flush everything except the most recent WAL period. We do this because the buffer snapshots what is in it and then puts the most recent period into the snapshot.
When forcing a snapshot, we don't need to check `should_snapshot`; we just treat it like the situation where we have 3x the flush size. So we want to snapshot everything minus the last WAL period.
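The three cases in this comment can be put side by side in a small sketch. It is only an illustration of the rule as described here, not the `SnapshotTracker` implementation: `periods_to_snapshot` is a hypothetical helper, it works on counts rather than real WAL periods, it assumes `flush_size` is at least 1, and it reads "3x the flush size" as the backlog level at which everything but the newest period is flushed.

```rust
/// Hypothetical helper: how many of the oldest WAL periods to snapshot.
/// `period_count` is the number of periods currently tracked and
/// `flush_size` is the configured snapshot size (assumed >= 1).
fn periods_to_snapshot(period_count: usize, flush_size: usize, force: bool) -> usize {
    if period_count == 0 {
        return 0;
    }
    // Forced snapshot, or a backlog of 3x the flush size: take everything
    // except the most recent period, skipping the usual readiness check.
    if force || period_count >= 3 * flush_size {
        return period_count - 1;
    }
    // Normal operation: wait for flush_size + flush_size / 2 periods,
    // then take the oldest flush_size and leave the rest behind.
    if period_count >= flush_size + flush_size / 2 {
        return flush_size;
    }
    // Not enough periods yet, so nothing is snapshotted.
    0
}
```

With a flush size of 4, six tracked periods yield a snapshot of the oldest four and leave two behind, while a forced snapshot over three periods takes two and keeps only the newest.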