This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

log an error instead of blocking a wal truncation on checkpoint gaps. #407

Closed

Conversation

krasi-georgiev
Contributor

@krasi-georgiev krasi-georgiev commented Oct 11, 2018

A missing WAL segment blocks truncation, and the WAL grows indefinitely.

An error log message should be enough to indicate a problem.
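Roughly, the behaviour change is along these lines (a minimal sketch only; the helper, the segment-number handling, and the logger wiring are illustrative and not the actual tsdb code):

```go
package main

import (
	"os"

	"github.com/go-kit/kit/log"
	"github.com/go-kit/kit/log/level"
)

// logSegmentGaps is a hypothetical helper: instead of returning an error
// (which blocks WAL truncation) when the segment numbering has a gap, it
// only logs the problem so truncation can proceed.
func logSegmentGaps(logger log.Logger, segments []int) {
	for i := 1; i < len(segments); i++ {
		if segments[i] != segments[i-1]+1 {
			level.Error(logger).Log(
				"msg", "gap in WAL segments, truncation continues",
				"prev", segments[i-1], "next", segments[i],
			)
		}
	}
}

func main() {
	logger := log.NewLogfmtLogger(os.Stderr)
	logSegmentGaps(logger, []int{3, 4, 7, 8}) // segments 5 and 6 are missing
}
```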

related to: prometheus/prometheus#4695

Signed-off-by: Krasi Georgiev [email protected]

A missing WAL segment is not crucial enough to cause an exit.

An error log message should be enough.

Signed-off-by: Krasi Georgiev <[email protected]>
@krasi-georgiev krasi-georgiev changed the title log an error instead of exiting on wal checkpoint gaps. log an error instead of blocking a wal truncation on checkpoint gaps. Oct 11, 2018
@gouthamve
Collaborator

Hmm, this means we could lose metrics. We need to come up with a policy about when to break and when to potentially lose metrics before we merge this in.

@krasi-georgiev
Contributor Author

krasi-georgiev commented Oct 14, 2018

The other alternative is rebuilding it from the HEAD as you suggested. I looked into it, but it is not trivial.

@krasi-georgiev
Contributor Author

@fabxc what is your take on this one?

@fabxc
Contributor

fabxc commented Oct 22, 2018

From the linked issue I read that it is not clear what went wrong in that user's setup, right? The problem is that any checkpoint with missing segments since the last checkpoint will potentially or even likely be corrupted.
The missing segments may contain series records and all subsequent samples for that series will fail. One could argue that we can't do anything else anyway and this is thus an acceptable tradeoff. But it also feels a bit wrong to deliberately continue with corrupted data.

It would certainly be great to understand what caused this.
From the user's logs, it appears that this has been an issue at startup already. Maybe a decent intermediate step would be to bail on startup if we detect this issue?

@krasi-georgiev
Contributor Author

> From the linked issue I read that it is not clear what went wrong in that user's setup, right?

Correct, I couldn't find anything wrong in the code that might cause this.

> The problem is that any checkpoint with missing segments since the last checkpoint will potentially or even likely be corrupted.
> The missing segments may contain series records and all subsequent samples for that series will fail. One could argue that we can't do anything else anyway and this is thus an acceptable tradeoff. But it also feels a bit wrong to deliberately continue with corrupted data.

https://github.com/prometheus/tsdb/blob/d804a27062fc524a7494592b72cb23cae1f709cc/wal.go#L280

Isn't this what the code is already doing when there is a WAL corruption anyway?
All samples for missing series will be ignored. Another way would be to try to reconstruct the missing WAL file from the head, but last time I looked it wouldn't be just a few lines of code.
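To illustrate what I mean by ignored samples, roughly this (types and names are made up for illustration, not the real tsdb structures):

```go
package main

import "fmt"

// sample is an illustrative stand-in for a WAL sample record.
type sample struct {
	Ref uint64  // series reference created by an earlier series record
	T   int64   // timestamp
	V   float64 // value
}

// replaySamples mimics the behaviour being discussed: samples whose series
// record lived in a missing or corrupted segment have no entry in the series
// map and are silently dropped.
func replaySamples(series map[uint64]struct{}, samples []sample) (kept, dropped int) {
	for _, s := range samples {
		if _, ok := series[s.Ref]; !ok {
			dropped++ // series record was lost with the missing segment
			continue
		}
		kept++ // the real code would append s.T/s.V to the in-memory head here
	}
	return kept, dropped
}

func main() {
	series := map[uint64]struct{}{1: {}} // only series 1 survived
	k, d := replaySamples(series, []sample{{1, 1000, 0.5}, {2, 1000, 0.7}})
	fmt.Println("kept:", k, "dropped:", d)
}
```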

> It would certainly be great to understand what caused this.

Yeah, I agree. I will revisit and try to get a bit more detail.

> From the user's logs, it appears that this has been an issue at startup already. Maybe a decent intermediate step would be to bail on startup if we detect this issue?

Isn't loss of data better than no startup at all, which would normally result in deleting the entire WAL folder anyway?
Otherwise, what do you think should be the next course of action after such a corruption?
I can probably add something to the tsdb scan command that is still in review.

@gouthamve
Collaborator

As @fabxc and @brian-brazil pointed out, this will mean that data for some series will exist over the time range while other series will be lost. That would produce false data for some queries, and we don't want that.

If some data is missing, I think the best recourse is to delete all the segments that come after it, as part of the "repair" process. Maybe we log the error and delete the data? This would ensure that Prometheus is not broken, and unless Prometheus is restarted, we would potentially not lose any data.
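Roughly a repair step like this (a sketch only; the segment numbering and the delete helper are assumptions, not the actual tsdb API):

```go
package main

import "fmt"

// repairAfterGap sketches the proposed policy: once a gap in the segment
// sequence is found, delete every segment after the gap so the remaining
// WAL is internally consistent.
func repairAfterGap(segments []int, deleteSegment func(n int) error) error {
	for i := 1; i < len(segments); i++ {
		if segments[i] != segments[i-1]+1 {
			// Everything after the gap may reference records we no longer have.
			for _, n := range segments[i:] {
				if err := deleteSegment(n); err != nil {
					return err
				}
			}
			return nil
		}
	}
	return nil
}

func main() {
	_ = repairAfterGap([]int{3, 4, 7, 8}, func(n int) error {
		fmt.Println("deleting segment", n)
		return nil
	})
}
```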

@krasi-georgiev
Contributor Author

Yes, I understand the logic now and agree with you.

Since the original issue was never confirmed, I am closing this until we confirm the actual bug, if any.

@krasi-georgiev krasi-georgiev deleted the gap-wal branch December 17, 2018 15:55