This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

log an error instead of blocking a wal truncation on checkpoint gaps. #407

Closed

Conversation

krasi-georgiev
Contributor

@krasi-georgiev krasi-georgiev commented Oct 11, 2018

A missing WAL segment blocks truncation, and the WAL grows indefinitely.

An error log message should be enough to indicate a problem.
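Roughly, the behaviour change is along these lines (a minimal sketch only; the helper, the segment-number handling, and the logger wiring are illustrative and not the actual tsdb code):

```go
package main

import (
	"os"

	"github.com/go-kit/kit/log"
	"github.com/go-kit/kit/log/level"
)

// logSegmentGaps is a hypothetical helper: instead of returning an error
// (which blocks WAL truncation) when the segment numbering has a gap, it
// only logs the problem so truncation can proceed.
func logSegmentGaps(logger log.Logger, segments []int) {
	for i := 1; i < len(segments); i++ {
		if segments[i] != segments[i-1]+1 {
			level.Error(logger).Log(
				"msg", "gap in WAL segments, truncation continues",
				"prev", segments[i-1], "next", segments[i],
			)
		}
	}
}

func main() {
	logger := log.NewLogfmtLogger(os.Stderr)
	logSegmentGaps(logger, []int{3, 4, 7, 8}) // segments 5 and 6 are missing
}
```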

related to: prometheus/prometheus#4695

Signed-off-by: Krasi Georgiev [email protected]

A missing WAL segment is not crucial enough to cause an exit.

An error log message should be enough.

Signed-off-by: Krasi Georgiev <[email protected]>
@krasi-georgiev krasi-georgiev changed the title log an error instead of exiting on wal checkpoint gaps. log an error instead of blocking a wal truncation on checkpoint gaps. Oct 11, 2018
@gouthamve
Collaborator

Hmm, this means we could lose metrics. We need to come up with a policy about when to break and when to potentially lose metrics before we merge this in.

@krasi-georgiev
Contributor Author

krasi-georgiev commented Oct 14, 2018

The other alternative is rebuilding it from the HEAD as you suggested. I looked into it, but it is not trivial.

@krasi-georgiev
Contributor Author

@fabxc what is your take on this one?

@fabxc
Contributor

fabxc commented Oct 22, 2018

From the linked issue I read that it is not clear what went wrong in that user's setup, right? The problem is that any checkpoint with missing segments since the last checkpoint will potentially or even likely be corrupted.
The missing segments may contain series records and all subsequent samples for that series will fail. One could argue that we can't do anything else anyway and this is thus an acceptable tradeoff. But it also feels a bit wrong to deliberately continue with corrupted data.

It would certainly be great to understand what caused this.
From the user's logs, it appears that this has been an issue at startup already. Maybe a decent intermediate step would be to bail on startup if we detect this issue?

@krasi-georgiev
Contributor Author

> From the linked issue I read that it is not clear what went wrong in that user's setup, right?

Correct, I couldn't find anything wrong in the code that might cause this.

> The problem is that any checkpoint with missing segments since the last checkpoint will potentially or even likely be corrupted.
> The missing segments may contain series records and all subsequent samples for that series will fail. One could argue that we can't do anything else anyway and this is thus an acceptable tradeoff. But it also feels a bit wrong to deliberately continue with corrupted data.

https://github.com/prometheus/tsdb/blob/d804a27062fc524a7494592b72cb23cae1f709cc/wal.go#L280

Isn't this what the code is already doing when there is a WAL corruption anyway?
All samples for missing series will be ignored. Another way would be to try to reconstruct the missing WAL file from the head, but last time I looked it wouldn't be just a few lines of code.
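To illustrate what I mean by ignored samples, roughly this (types and names are made up for illustration, not the real tsdb structures):

```go
package main

import "fmt"

// sample is an illustrative stand-in for a WAL sample record.
type sample struct {
	Ref uint64  // series reference created by an earlier series record
	T   int64   // timestamp
	V   float64 // value
}

// replaySamples mimics the behaviour being discussed: samples whose series
// record lived in a missing or corrupted segment have no entry in the series
// map and are silently dropped.
func replaySamples(series map[uint64]struct{}, samples []sample) (kept, dropped int) {
	for _, s := range samples {
		if _, ok := series[s.Ref]; !ok {
			dropped++ // series record was lost with the missing segment
			continue
		}
		kept++ // the real code would append s.T/s.V to the in-memory head here
	}
	return kept, dropped
}

func main() {
	series := map[uint64]struct{}{1: {}} // only series 1 survived
	k, d := replaySamples(series, []sample{{1, 1000, 0.5}, {2, 1000, 0.7}})
	fmt.Println("kept:", k, "dropped:", d)
}
```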

> It would certainly be great to understand what caused this.

Yeah, I agree. I will revisit and try to get a bit more detail.

> From the user's logs, it appears that this has been an issue at startup already. Maybe a decent intermediate step would be to bail on startup if we detect this issue?

Isn't loss of data better than no startup at all, which would normally result in deleting the entire WAL folder anyway?
Otherwise, what do you think should be the next course of action after such a corruption?
I can probably add something to the tsdb scan command that is still in review.

@gouthamve
Collaborator

As @fabxc and @brian-brazil pointed out, this will mean that data for some series will exist over the time range while other series will be lost. That would produce false data for some queries, and we don't want that.

If some data is missing, I think the best recourse is to delete all the segments that come after it, as part of the "repair" process. Maybe we log the error and delete the data? This would ensure that Prometheus is not broken, and unless Prometheus is restarted, we would potentially not lose any data.
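Roughly a repair step like this (a sketch only; the segment numbering and the delete helper are assumptions, not the actual tsdb API):

```go
package main

import "fmt"

// repairAfterGap sketches the proposed policy: once a gap in the segment
// sequence is found, delete every segment after the gap so the remaining
// WAL is internally consistent.
func repairAfterGap(segments []int, deleteSegment func(n int) error) error {
	for i := 1; i < len(segments); i++ {
		if segments[i] != segments[i-1]+1 {
			// Everything after the gap may reference records we no longer have.
			for _, n := range segments[i:] {
				if err := deleteSegment(n); err != nil {
					return err
				}
			}
			return nil
		}
	}
	return nil
}

func main() {
	_ = repairAfterGap([]int{3, 4, 7, 8}, func(n int) error {
		fmt.Println("deleting segment", n)
		return nil
	})
}
```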

@krasi-georgiev
Contributor Author

Yes, I understand the logic now and agree with you.

Since the original issue was never confirmed, I am closing this until we confirm the actual bug, if any.

@krasi-georgiev krasi-georgiev deleted the gap-wal branch December 17, 2018 15:55