Remove infinite snapshot logging #137821
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -29,8 +29,12 @@ | |
| import java.util.function.Supplier; | ||
|
|
||
| /** | ||
| * Tracks progress of shard snapshots during shutdown, on this single data node. Periodically reports progress via logging, the interval for | ||
| * which see {@link #SNAPSHOT_PROGRESS_DURING_SHUTDOWN_LOG_INTERVAL_SETTING}. | ||
| * Tracks progress of shard snapshots during shutdown, on this single data node. Periodically reports progress via logging until | ||
| * all snapshots have completed or are paused, the interval for which see {@link #SNAPSHOT_PROGRESS_DURING_SHUTDOWN_LOG_INTERVAL_SETTING}. | ||
| * <P> | ||
| * Note that this class is used even when the node isn't shutting down. When {@link SnapshotShardsService} starts a new snapshot task, | ||
| * the {@link SnapshotShutdownProgressTracker} is updated, so that if the node shuts down while the task is executing, we have an accurate | ||
| * counter for in-progress snapshots. This counter is decremented when the snapshot task finishes, either successfully or not. | ||
| */ | ||
| public class SnapshotShutdownProgressTracker { | ||
|
|
||
|
|
@@ -66,6 +70,7 @@ public class SnapshotShutdownProgressTracker { | |
|
|
||
| /** | ||
| * Tracks the number of shard snapshots that have started on the data node but not yet finished. | ||
| * If the node starts shutting down, when this reaches 0, we stop logging the periodic progress report | ||
| */ | ||
| private final AtomicLong numberOfShardSnapshotsInProgressOnDataNode = new AtomicLong(); | ||
|
|
||
|
|
@@ -133,25 +138,47 @@ private void cancelProgressLogger() { | |
| * Logs information about shard snapshot progress. | ||
| */ | ||
| private void logProgressReport() { | ||
| logger.info( | ||
| """ | ||
| Current active shard snapshot stats on data node [{}]. \ | ||
| Node shutdown cluster state update received at [{} millis]. \ | ||
| Finished signalling shard snapshots to pause at [{} millis]. \ | ||
| Number shard snapshots running [{}]. \ | ||
| Number shard snapshots waiting for master node reply to status update request [{}] \ | ||
| Shard snapshot completion stats since shutdown began: Done [{}]; Failed [{}]; Aborted [{}]; Paused [{}]\ | ||
| """, | ||
| getLocalNodeId.get(), | ||
| shutdownStartMillis, | ||
| shutdownFinishedSignallingPausingMillis, | ||
| numberOfShardSnapshotsInProgressOnDataNode.get(), | ||
| shardSnapshotRequests.size(), | ||
| doneCount.get(), | ||
| failureCount.get(), | ||
| abortedCount.get(), | ||
| pausedCount.get() | ||
| ); | ||
| // If there are no more snapshots in progress, then stop logging periodic progress reports | ||
| if (numberOfShardSnapshotsInProgressOnDataNode.get() == 0) { | ||
|
Contributor
Can we safely add asserts in this if-statement that the other stat values are zero / empty?
Contributor
Author
Hmm what would you be looking for specifically here? I don't think there's anything to assert but please correct me if I'm wrong! Running through the values outputted in the log:
|
||
| logger.info( | ||
| """ | ||
| All shard snapshots have finished or been paused on data node [{}]. \ | ||
| Node shutdown cluster state update received at [{} millis]. \ | ||
| Progress logging completed at [{} millis]. \ | ||
| Number shard snapshots waiting for master node reply to status update request [{}] \ | ||
| Shard snapshot completion stats since shutdown began: Done [{}]; Failed [{}]; Aborted [{}]; Paused [{}]\ | ||
| """, | ||
| getLocalNodeId.get(), | ||
| shutdownStartMillis, | ||
|
Contributor
This patch doesn't have your incoming change to improve the time value logging, but we'll want to use that in this log message, too.
Contributor
Author
Yeah I'll merge those changes in 👍
||
| threadPool.relativeTimeInMillis(), | ||
| shardSnapshotRequests.size(), | ||
| doneCount.get(), | ||
| failureCount.get(), | ||
| abortedCount.get(), | ||
| pausedCount.get() | ||
| ); | ||
| cancelProgressLogger(); | ||
| } else { | ||
|
Contributor
opt nit: I personally prefer an if-statement with an early return, to reduce line nesting / improve clarity, rather than if-else statements. There's not a style in the codebase for it, though, so if you personally like the if-else better, totally fine. It would look like [snippet not preserved]. It starts to get wild to follow when something like this grows [snippet not preserved].
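The early-return shape described in the comment above can be sketched roughly as follows. This is a minimal illustration, not the reviewer's original snippet: `EarlyReturnSketch`, its method names, and the log strings are hypothetical stand-ins, and a plain list takes the place of the logger.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the tracker's report method; the names and
// messages here are illustrative and do not come from the PR.
final class EarlyReturnSketch {
    final List<String> log = new ArrayList<>();
    boolean cancelled = false;

    // if-else shape: code placed after the branches runs in both cases.
    void reportWithIfElse(long inProgress) {
        if (inProgress == 0) {
            log.add("final report");
            cancelled = true;
        } else {
            log.add("progress report: " + inProgress + " running");
        }
        log.add("shard details");
    }

    // Early-return shape: the terminal case exits first, so the trailing
    // detail line is only reached while snapshots are still running.
    void reportWithEarlyReturn(long inProgress) {
        if (inProgress == 0) {
            log.add("final report");
            cancelled = true;
            return;
        }
        log.add("progress report: " + inProgress + " running");
        log.add("shard details");
    }
}
```

Note that the two shapes are not behaviorally identical in this sketch: with the early return, anything after the branch is skipped in the terminal case. That only matters if the trailing call is meant to run on the final report as well.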
||
| logger.info( | ||
| """ | ||
| Current active shard snapshot stats on data node [{}]. \ | ||
| Node shutdown cluster state update received at [{} millis]. \ | ||
| Finished signalling shard snapshots to pause at [{} millis]. \ | ||
| Number shard snapshots running [{}]. \ | ||
| Number shard snapshots waiting for master node reply to status update request [{}] \ | ||
| Shard snapshot completion stats since shutdown began: Done [{}]; Failed [{}]; Aborted [{}]; Paused [{}]\ | ||
| """, | ||
| getLocalNodeId.get(), | ||
| shutdownStartMillis, | ||
| shutdownFinishedSignallingPausingMillis, | ||
| numberOfShardSnapshotsInProgressOnDataNode.get(), | ||
| shardSnapshotRequests.size(), | ||
| doneCount.get(), | ||
| failureCount.get(), | ||
| abortedCount.get(), | ||
| pausedCount.get() | ||
| ); | ||
| } | ||
| // Use a callback to log the shard snapshot details. | ||
| logIndexShardSnapshotStatuses.accept(logger); | ||
|
Contributor
We might as well skip this additional logging (or do-nothing method call, rather) when halting because there shouldn't be any index shard snapshots in progress anymore.
||
| } | ||
|
|
||
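The behavior this diff introduces, a periodic report that cancels itself once no shard snapshots remain in progress, can be sketched in isolation as below. This is an illustration under stated assumptions rather than the PR's code: it swaps in `java.util.concurrent.ScheduledExecutorService` for Elasticsearch's thread pool, collects lines in a list instead of calling a logger, and the class name, `start`, `isStopped`, and `lines` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: a self-cancelling periodic reporter.
final class SelfCancellingLoggerSketch {
    final AtomicLong inProgress = new AtomicLong();
    private final List<String> lines = new ArrayList<>(); // guarded by "this"
    private ScheduledFuture<?> future;                    // guarded by "this"
    private boolean stopped;                              // guarded by "this"

    // Holding the monitor while scheduling ensures the first run cannot
    // observe a null future.
    synchronized void start(ScheduledExecutorService scheduler, long intervalMillis) {
        future = scheduler.scheduleWithFixedDelay(
            this::logProgressReport, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    synchronized void logProgressReport() {
        if (stopped) {
            return; // already emitted the final report
        }
        long running = inProgress.get();
        if (running == 0) {
            lines.add("all shard snapshots finished or paused");
            cancelProgressLogger();
            return;
        }
        lines.add("shard snapshots running: " + running);
    }

    synchronized void cancelProgressLogger() {
        stopped = true;
        if (future != null) {
            future.cancel(false); // do not interrupt a run in progress
        }
    }

    synchronized boolean isStopped() {
        return stopped;
    }

    synchronized List<String> lines() {
        return new ArrayList<>(lines);
    }
}
```

In use, one would pass something like `Executors.newSingleThreadScheduledExecutor()` to `start` and shut that executor down separately. The report method can also be called directly, which keeps the stop condition testable without timing.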
Could you clarify that it always tracks snapshots but only logs about them when shutdown starts?
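One way to picture the distinction raised here (the counter is always maintained, but reporting only happens once shutdown has begun) is the small sketch below. It is a hedged illustration, not the tracker's actual API: `AlwaysTrackingSketch` and all of its method names are hypothetical.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: counting is unconditional, reporting is gated on shutdown.
final class AlwaysTrackingSketch {
    private final AtomicLong inProgress = new AtomicLong();
    private volatile boolean shutdownStarted = false;

    // Called for every shard snapshot task, whether or not the node is shutting down.
    void onShardSnapshotStarted() {
        inProgress.incrementAndGet();
    }

    // Called when a task finishes, successfully or not.
    void onShardSnapshotFinished() {
        inProgress.decrementAndGet();
    }

    void onShutdownStarted() {
        shutdownStarted = true;
    }

    long inProgressCount() {
        return inProgress.get();
    }

    // Produces a report line only after shutdown has started.
    Optional<String> progressLine() {
        if (!shutdownStarted) {
            return Optional.empty();
        }
        return Optional.of("shard snapshots running: " + inProgress.get());
    }
}
```

Because the count is kept up to date before shutdown, the first report after `onShutdownStarted()` already reflects tasks that began earlier.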