Set logging to debug when health check fails #3701

Open

yacovm wants to merge 1 commit into master from turnOnDebug

Conversation

yacovm (Contributor) commented on Feb 5, 2025

Why this should be merged

The default logging level for avalanchego is Info, which doesn't say too much about what the system is doing.

Too often, the system is unhealthy (the periodic health check failure is visible both in the metrics and in the log), but it is not clear from the logs why it is unhealthy, because the default logging level is too succinct.

At such times, changing the logging level may be impossible (if the admin API isn't enabled and restarting the node is not an option), or simply too late (restarting the node may make the underlying problem we wish to investigate stop reproducing).

It is therefore beneficial if avalanchego can be configured to change its logging level to debug on its own when it detects it is unhealthy, and to revert back to its previous logging level when it detects it has recovered.

Since keeping the logging level at debug for longer does not yield more information, there should be a maximum duration, after which the logging level is reverted to what it was before.

How this works

The health check now notifies the logging factory whether or not at least one health check failed.

Based on the result, the logging factory either amplifies or reverts the logging levels of all loggers.

Each logger created by the logging factory records its desired logging level for both of its cores (display and file). Upon an explicit change of the logging levels (via the admin API), the desired logging level is recorded as well.

Upon amplification, the logging factory iterates through all loggers and over all cores of each logger and sets the logging level to DEBUG. Once the amplification duration has been exhausted, it reverts the loggers to what they were before the amplification, and also marks itself to cease further amplifications until all health checks succeed again. Once all health checks pass, the logging level of all loggers is reverted to what it was, and further amplifications become possible in case the health checks fail once more.
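
To make this concrete, below is a minimal, self-contained Go sketch of such an amplify/revert state machine. All names in it (Factory, loggerCore, Notify, and so on) are hypothetical and only illustrate the flow described above; they are not the PR's actual types, and edge cases (for example, the revert timer firing concurrently with recovery) are ignored here.

package main

import (
	"sync"
	"time"
)

// Level stands in for a logging level; only the values needed for the sketch exist.
type Level int

const (
	Debug Level = iota
	Info
)

// loggerCore stands in for one core (display or file) of a logger.
type loggerCore struct {
	current Level
	desired Level // level requested via config or the admin API
}

// Factory owns all logger cores and reacts to health notifications.
type Factory struct {
	mu          sync.Mutex
	cores       []*loggerCore
	amplified   bool
	suppressed  bool // set once the max duration has been exhausted
	maxDuration time.Duration
	revertTimer *time.Timer
}

// Notify is called after each health check round with whether at least one check failed.
func (f *Factory) Notify(unhealthy bool) {
	f.mu.Lock()
	defer f.mu.Unlock()

	switch {
	case unhealthy && !f.amplified && !f.suppressed:
		// Amplify: set every core to Debug; the desired level stays recorded
		// on the core so it can be restored later.
		for _, c := range f.cores {
			c.current = Debug
		}
		f.amplified = true
		// Once the maximum duration is exhausted, revert automatically and
		// suppress further amplification until all checks pass again.
		f.revertTimer = time.AfterFunc(f.maxDuration, func() {
			f.mu.Lock()
			defer f.mu.Unlock()
			f.revert()
			f.suppressed = true
		})
	case !unhealthy:
		// All checks pass: revert and allow future amplifications.
		if f.revertTimer != nil {
			f.revertTimer.Stop()
		}
		f.revert()
		f.suppressed = false
	}
}

// revert restores each core to its recorded desired level.
func (f *Factory) revert() {
	for _, c := range f.cores {
		c.current = c.desired
	}
	f.amplified = false
}

func main() {}

The key design point is that each core records its desired level separately from its current level, so reverting never has to guess what the operator originally configured.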

How this was tested

Added an end-to-end test, and also performed a manual test:

Ran a node on a VM with the newly introduced flags:

~/avalanchego$ ./build/avalanchego --network-id=fuji --log-auto-amplification true  &> out.log &
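
The maximum amplification duration (the --log-auto-amplification-max-duration flag mentioned below, defaulting to 5 minutes) can presumably also be set explicitly; a hypothetical invocation, assuming the flag accepts a Go-style duration string, might look like:

~/avalanchego$ ./build/avalanchego --network-id=fuji --log-auto-amplification true --log-auto-amplification-max-duration 10m &> out.log &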

Then waited for consensus to start:

[02-07|20:08:32.635] INFO <P Chain> snowman/engine.go:512 starting consensus {"lastAcceptedID": "msVQ8GFz8jQUNUXpxmbcBPwwQp8oyzZtq36tFbD5w4KrCaGdU", "lastAcceptedHeight": 194357}
[02-07|20:08:32.648] INFO <C Chain> snowman/engine.go:512 starting consensus {"lastAcceptedID": "gE8KDsnZmNfkTadRYBiyVwF7GZMPEiKgsKmgUQmGRuCj35ST7", "lastAcceptedHeight": 37949979}
[02-07|20:08:42.626] INFO <X Chain> snowman/engine.go:512 starting consensus {"lastAcceptedID": "2kL7dXKvwh1mFXHxVAX5w7Xpqxq9Cc9MGoBpQpWEysypTDmezN", "lastAcceptedHeight": 31425}
[02-07|20:08:55.776] INFO health/worker.go:270 check started passing {"name": "readiness", "name": "bootstrapped", "tags": ["application"]}
[02-07|20:08:55.776] INFO health/worker.go:270 check started passing {"name": "health", "name": "bootstrapped", "tags": ["application"]}

And afterwards, I disconnected it from all nodes via:

sudo iptables -A OUTPUT -p tcp --destination-port 9651 -j DROP

Then upon detecting a health check failure it changed its logging level on its own (see below):

[02-07|20:08:55.776] INFO health/worker.go:270 check started passing {"name": "readiness", "name": "bootstrapped", "tags": ["application"]}
[02-07|20:08:55.776] INFO health/worker.go:270 check started passing {"name": "health", "name": "bootstrapped", "tags": ["application"]}
[02-07|20:10:25.776] WARN health/worker.go:261 check started failing {"name": "health", "name": "P", "tags": ["11111111111111111111111111111111LpoYY"], "error": "engine: block processing too long: 37.732442896s > 30s; networking: not connected to enough stake: connected to 0.000038%; required at least 80.000000%"}
[02-07|20:10:25.776] WARN health/worker.go:261 check started failing {"name": "health", "name": "C", "tags": ["11111111111111111111111111111111LpoYY"], "error": "not connected to enough stake: connected to 0.000038%; required at least 80.000000%"}
[02-07|20:10:25.776] WARN health/worker.go:261 check started failing {"name": "health", "name": "X", "tags": ["11111111111111111111111111111111LpoYY"], "error": "not connected to enough stake: connected to 0.000038%; required at least 80.000000%"}
[02-07|20:10:25.846] DEBUG <C Chain> handler/handler.go:858 forwarding chan message to consensus {"messageOp": "gossip_request"}
[02-07|20:10:25.846] DEBUG <C Chain> snowman/engine.go:160 skipping block gossip {"reason": "blocks currently processing", "numProcessing": 3}
[02-07|20:10:25.846] DEBUG <C Chain> snowman/engine.go:927 dropped query for block {"reason": "insufficient connected stake", "blkID": "2X6ETyVyks8z9imZXa9QMEvZuJ5giuijT3ky2idpazbMeC28Da", "ratio": 0.0000003809401764048558}
[02-07|20:10:25.846] DEBUG <C Chain> snowman/engine.go:927 dropped query for block {"reason": "insufficient connected stake", "blkID": "2X6ETyVyks8z9imZXa9QMEvZuJ5giuijT3ky2idpazbMeC28Da", "ratio": 0.0000003809401764048558}
[02-07|20:10:25.846] DEBUG <C Chain> snowman/engine.go:927 dropped query for block {"reason": "insufficient connected stake", "blkID": "2X6ETyVyks8z9imZXa9QMEvZuJ5giuijT3ky2idpazbMeC28Da", "ratio": 0.0000003809401764048558}
[02-07|20:10:25.846] DEBUG <C Chain> snowman/engine.go:927 dropped query for block {"reason": "insufficient connected stake", "blkID": "2X6ETyVyks8z9imZXa9QMEvZuJ5giuijT3ky2idpazbMeC28Da", "ratio": 0.0000003809401764048558}

Since I didn't specify --log-auto-amplification-max-duration, the default of 5 minutes applied, so I let the node log in DEBUG for 5 minutes and then confirmed that it stopped logging in DEBUG and reverted its logging level back to what it was:

~/avalanchego$ grep DEBUG out.log | head -1; grep DEBUG out.log | tail -1
[02-07|20:10:25.846] DEBUG <C Chain> handler/handler.go:858 forwarding chan message to consensus {"messageOp": "gossip_request"}
[02-07|20:15:25.774] DEBUG <X Chain> handler/handler.go:455 finished handling sync message {"messageOp": "chits"}

Afterwards, I reconnected the node to the network via sudo iptables -D OUTPUT -p tcp --destination-port 9651 -j DROP and waited for the log level to be reverted back to INFO:

[02-07|20:15:25.774] DEBUG <X Chain> handler/handler.go:455 finished handling sync message {"messageOp": "chits"}
[02-07|20:17:25.775] INFO health/worker.go:270 check started passing {"name": "health", "name": "P", "tags": ["11111111111111111111111111111111LpoYY"]}
[02-07|20:17:25.775] INFO health/worker.go:270 check started passing {"name": "health", "name": "C", "tags": ["11111111111111111111111111111111LpoYY"]}
[02-07|20:17:25.775] INFO health/worker.go:270 check started passing {"name": "health", "name": "X", "tags": ["11111111111111111111111111111111LpoYY"]}
[02-07|20:18:56.680] INFO <C Chain> core/blockchain.go:1493 Resetting chain preference               number=37,950,061 hash=855ba8..28a24a drop=1 dropfrom=93524f..7975f7 add=1 addfrom=97566c..2fe828

I then disconnected the node from the network again, and confirmed that it amplified its logging to DEBUG once more:

[02-07|20:17:25.775] INFO health/worker.go:270 check started passing {"name": "health", "name": "P", "tags": ["11111111111111111111111111111111LpoYY"]}
[02-07|20:17:25.775] INFO health/worker.go:270 check started passing {"name": "health", "name": "C", "tags": ["11111111111111111111111111111111LpoYY"]}
[02-07|20:17:25.775] INFO health/worker.go:270 check started passing {"name": "health", "name": "X", "tags": ["11111111111111111111111111111111LpoYY"]}
[02-07|20:18:56.680] INFO <C Chain> core/blockchain.go:1493 Resetting chain preference               number=37,950,061 hash=855ba8..28a24a drop=1 dropfrom=93524f..7975f7 add=1 addfrom=97566c..2fe828
[02-07|20:20:25.776] WARN health/worker.go:261 check started failing {"name": "health", "name": "X", "tags": ["11111111111111111111111111111111LpoYY"], "error": "not connected to enough stake: connected to 0.000038%; required at least 80.000000%"}
[02-07|20:20:25.776] WARN health/worker.go:261 check started failing {"name": "health", "name": "P", "tags": ["11111111111111111111111111111111LpoYY"], "error": "not connected to enough stake: connected to 0.000038%; required at least 80.000000%"}
[02-07|20:20:25.776] WARN health/worker.go:261 check started failing {"name": "health", "name": "C", "tags": ["11111111111111111111111111111111LpoYY"], "error": "not connected to enough stake: connected to 0.000038%; required at least 80.000000%"}
[02-07|20:20:25.847] DEBUG <C Chain> handler/handler.go:858 forwarding chan message to consensus {"messageOp": "gossip_request"}
[02-07|20:20:25.847] DEBUG <C Chain> snowman/engine.go:160 skipping block gossip {"reason": "blocks currently processing", "numProcessing": 2}
[02-07|20:20:25.847] DEBUG <C Chain> snowman/engine.go:927 dropped query for block {"reason": "insufficient connected stake", "blkID": "2i17dfiGNwgJ3C8PuPnLoB8hB3RRGX7nBeHG55LSC5zFxvw5iZ", "ratio": 0.00000038094018547456964}
[02-07|20:20:25.847] DEBUG <C Chain> snowman/engine.go:927 dropped query for block {"reason": "insufficient connected stake", "blkID": "2i17dfiGNwgJ3C8PuPnLoB8hB3RRGX7nBeHG55LSC5zFxvw5iZ", "ratio": 0.00000038094018547456964}
[02-07|20:20:25.847] DEBUG <C Chain> snowman/engine.go:927 dropped query for block {"reason": "insufficient connected stake", "blkID": "2i17dfiGNwgJ3C8PuPnLoB8hB3RRGX7nBeHG55LSC5zFxvw5iZ", "ratio": 0.00000038094018547456964}
[02-07|20:20:25.847] DEBUG <C Chain> snowman/engine.go:927 dropped query for block {"reason": "insufficient connected stake", "blkID": "2i17dfiGNwgJ3C8PuPnLoB8hB3RRGX7nBeHG55LSC5zFxvw5iZ", "ratio": 0.00000038094018547456964}

Need to be documented in RELEASES.md?

Updated it.

yacovm marked this pull request as draft on February 5, 2025 at 22:34
yacovm force-pushed the turnOnDebug branch 16 times, most recently from 161351e to fea9528 on February 7, 2025 at 22:44
yacovm marked this pull request as ready for review on February 7, 2025 at 23:05
meaghanfitzgerald (Contributor) commented:

I'm debating whether this feature should be hidden behind a flag (--log-auto-amplification true) or enabled by default. This would likely greatly improve the node UX and help the platform team debug reported issues, since we would no longer need to ask the reporter to manually change their log level to provide more info.

yacovm (Contributor, Author) commented on Feb 10, 2025

I'm debating whether this feature should be hidden behind a flag (--log-auto-amplification true) or enabled by default. This would likely greatly improve the node UX and help the platform team debug reported issues, since we would no longer need to ask the reporter to manually change their log level to provide more info.

Right now it's hidden by default. I think that's the right way to go for new UX features. It would be straightforward to enable it by default once there is sufficient community demand.

I want to avoid a situation where people are surprised their logging level radically changed without doing anything.
