log an error instead of blocking a wal truncation on checkpoint gaps. #407
A missing WAL segment is not critical enough to warrant an exit. An error log message should be enough. Signed-off-by: Krasi Georgiev <[email protected]>
Hmm, this means we could lose metrics. We need to come up with a policy about when to break and when to potentially lose metrics before we merge this in.
The other alternative is rebuilding it from the HEAD as you suggested. I looked into it, but it is not trivial.
@fabxc what is your take on this one?
From the linked issue I read that it is not clear what went wrong in that user's setup, right? The problem is that any checkpoint with missing segments since the last checkpoint will potentially, or even likely, be corrupted. It would certainly be great to understand what caused this.
Correct, I couldn't find anything wrong in the code that might cause this.
https://github.com/prometheus/tsdb/blob/d804a27062fc524a7494592b72cb23cae1f709cc/wal.go#L280 Isn't this what the code already does when there is a WAL corruption anyway?
Yeah, I agree. I will revisit this and try to get a bit more detail.
Isn't loss of data better than no startup at all, which would normally result in deleting the entire WAL folder anyway?
As @fabxc and @brian-brazil pointed out, this means that data for some series will exist over a time range while data for other series will be lost. That will produce false results for some queries, and we don't want that. If some data is missing, I think the best recourse is to delete all the segments that come after the gap, as part of the "repair" process. Maybe we log the error and delete the data? This would ensure that Prometheus is not broken and, unless Prometheus is restarted, we'll potentially not lose any data?
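To make the "log the error and delete everything after the gap" idea concrete, here is a minimal sketch in Go. It is not the tsdb implementation: `splitAtFirstGap` is a hypothetical helper, and representing WAL segments as plain integer indices is an assumption made purely for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// splitAtFirstGap is a hypothetical helper (not part of tsdb).
// Given WAL segment indices, it returns the contiguous prefix to keep
// and the segments after the first gap that a repair would delete.
func splitAtFirstGap(segments []int) (keep, drop []int) {
	// Work on a sorted copy so the caller's slice is left untouched.
	sorted := append([]int(nil), segments...)
	sort.Ints(sorted)

	for i := 1; i < len(sorted); i++ {
		// A gap exists when an index does not directly follow its predecessor.
		if sorted[i] != sorted[i-1]+1 {
			return sorted[:i], sorted[i:]
		}
	}
	// No gap found: keep everything, drop nothing.
	return sorted, nil
}

func main() {
	// Segment 3 is missing in this made-up example.
	keep, drop := splitAtFirstGap([]int{0, 1, 2, 4, 5})
	if len(drop) > 0 {
		// Log the problem instead of exiting, then discard the later segments.
		fmt.Printf("error: WAL gap after segment %d; dropping segments %v\n",
			keep[len(keep)-1], drop)
	}
	fmt.Println("kept:", keep)
}
```

The point of dropping every segment after the gap, rather than only skipping the missing one, is to avoid the partial-series situation described above, where some series have samples for a time range and others silently do not.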
Yes, I understand the logic now and agree with you. Since the original issue was never confirmed, I am closing this until we confirm the actual bug, if any.
A missing WAL segment blocks truncation and the WAL grows indefinitely.
An error log message should be enough to indicate a problem.
related to: prometheus/prometheus#4695
Signed-off-by: Krasi Georgiev [email protected]