Set logging to debug when health check fails #3701
Why this should be merged
The default logging level for avalanchego is `Info`, which doesn't say much about what the system is doing. Too often the system is unhealthy (the periodic health check failures can be seen in the metrics as well as in the logs), but it is not clear from the logs why it is unhealthy, because the default logging level is too succinct.
At such times, changing the logging level may be either impossible (if the admin API isn't enabled and restarting the node is not possible) or too late (restarting the node may cause the underlying problem we wish to investigate not to recur).
It is therefore beneficial if avalanchego can be configured to change its logging level to debug on its own when it detects it is unhealthy, and to revert to its previous logging level once it detects it has recovered.
Since keeping the logging level at debug for longer does not yield more information, there should be a maximum duration after which the logging level is reverted to what it was before.
How this works
The health check now notifies the logging factory whether or not at least one health check failed.
Based on the result, the logging factory either amplifies or reverts the logging levels of all loggers.
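A minimal sketch of what this notification hook could look like; the interface and type names here are illustrative assumptions, not the actual avalanchego API:

```go
package logging

// HealthListener is a hypothetical hook the health checker calls after every
// round of health checks with the overall result.
type HealthListener interface {
	// SetHealthy reports whether all health checks passed in the latest round.
	SetHealthy(healthy bool)
}

// healthChecker sketches the caller side: it holds a listener (the logging
// factory) and reports the aggregate result of each health-check round.
type healthChecker struct {
	listener HealthListener
}

func (h *healthChecker) report(allPassed bool) {
	if h.listener != nil {
		h.listener.SetHealthy(allPassed)
	}
}
```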
Each logger created by the logging factory records its desired logging level for both of its cores (display and file). Upon an explicit change of the logging levels (via the admin API), the desired logging level is recorded as well.
Upon amplification, the logging factory iterates over all loggers and over all cores of each logger and sets the logging level to DEBUG. Once the amplification duration has been exhausted, it reverts the loggers to their pre-amplification levels and marks itself to cease further amplifications until all health checks succeed again. Upon success of all health checks, the logging level of all loggers is reverted to what it was, and further amplifications are possible again in case the health checks fail once more.
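A rough sketch of the amplify/revert bookkeeping described above, using zap atomic levels for the two cores of each logger. This is written under assumptions (type names, fields, and structure are illustrative, not the actual avalanchego implementation):

```go
package logging

import (
	"sync"
	"time"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// loggerLevels tracks, for one logger, the atomic level of each core
// (display and file) together with the level the operator asked for,
// so amplification can be undone.
type loggerLevels struct {
	display, file               zap.AtomicLevel
	desiredDisplay, desiredFile zapcore.Level
}

// factory sketches the logging factory's amplification state.
type factory struct {
	lock        sync.Mutex
	loggers     map[string]*loggerLevels
	maxDuration time.Duration // e.g. the 5 minute default

	amplified   bool
	amplifiedAt time.Time
	exhausted   bool // set once maxDuration elapsed; cleared when healthy again
}

// SetHealthy is called by the health checker after every round of checks.
func (f *factory) SetHealthy(healthy bool) {
	f.lock.Lock()
	defer f.lock.Unlock()

	switch {
	case healthy:
		// All checks pass: revert and re-arm future amplifications.
		f.revertLocked()
		f.exhausted = false
	case f.amplified && time.Since(f.amplifiedAt) > f.maxDuration:
		// Amplification ran long enough: revert and stop amplifying
		// until the node is healthy again.
		f.revertLocked()
		f.exhausted = true
	case !f.amplified && !f.exhausted:
		// First failure since the last healthy state: amplify all cores.
		for _, l := range f.loggers {
			l.display.SetLevel(zapcore.DebugLevel)
			l.file.SetLevel(zapcore.DebugLevel)
		}
		f.amplified = true
		f.amplifiedAt = time.Now()
	}
}

// revertLocked restores every core to its recorded desired level.
func (f *factory) revertLocked() {
	if !f.amplified {
		return
	}
	for _, l := range f.loggers {
		l.display.SetLevel(l.desiredDisplay)
		l.file.SetLevel(l.desiredFile)
	}
	f.amplified = false
}
```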
How this was tested
Added an end-to-end test, and also performed a manual test:
Ran a node on a VM with the newly introduced flags:
Then waited for consensus to start:
Afterwards, I disconnected it from all nodes via `sudo iptables -A OUTPUT -p tcp --destination-port 9651 -j DROP`. Then, upon detecting a health check failure, it changed its logging level on its own (see below):
Since I didn't specify a `--log-auto-amplification-max-duration`, the default is 5 minutes, so I let the node log in DEBUG for 5 minutes and then confirmed it stopped logging in DEBUG and reverted its logging level back to what it was.
Afterwards I re-connected the node to the network via `sudo iptables -D OUTPUT -p tcp --destination-port 9651 -j DROP` and waited for the log level to be reverted back to Info.
I then disconnected the node from the network again, and confirmed it amplifies its logging to DEBUG once more.
Need to be documented in RELEASES.md?
Updated it.