Set logging to debug when health check fails #3701
Why this should be merged
The default logging level for avalanchego is `Info`, which doesn't say much about what the system is doing. Too often the system is unhealthy (the periodic health check failures can be seen in the metrics as well as in the logs), but it is not clear from the logs why it is unhealthy, because the default logging level is too succinct.
At such times, changing the logging level may be either impossible (if the admin API isn't enabled and restarting the node is not possible) or too late (restarting the node may cause the underlying problem we wish to investigate not to recur).
It is therefore beneficial if avalanchego can be configured to change its logging level to debug on its own when it detects it is unhealthy, and to revert to its previous logging level once it detects it has recovered.
Since keeping the logging level at debug for longer does not yield more information, there should be a maximum duration after which the logging level is reverted to what it was before.
How this works
The health check now notifies the logging factory whether or not at least one health check failed.
Based on the result, the logging factory either amplifies or reverts the logging levels of all loggers.
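A minimal sketch of what this notification hook could look like; the interface and type names here are illustrative assumptions, not the actual avalanchego API:

```go
package logging

// HealthListener is a hypothetical hook the health checker calls after every
// round of health checks with the overall result.
type HealthListener interface {
	// SetHealthy reports whether all health checks passed in the latest round.
	SetHealthy(healthy bool)
}

// healthChecker sketches the caller side: it holds a listener (the logging
// factory) and reports the aggregate result of each health-check round.
type healthChecker struct {
	listener HealthListener
}

func (h *healthChecker) report(allPassed bool) {
	if h.listener != nil {
		h.listener.SetHealthy(allPassed)
	}
}
```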
Each logger created by the logging factory records its desired logging level for both of its cores (display and file). Upon an explicit change of the logging levels (via the admin API), the desired logging level is recorded as well.
Upon amplification, the logging factory iterates over all loggers and over all cores of each logger and sets the logging level to DEBUG. Once the amplification duration has been exhausted, it reverts the loggers to their pre-amplification levels and marks itself to cease further amplifications until all health checks succeed again. Upon success of all health checks, the logging level of all loggers is reverted to what it was, and further amplifications are possible again in case the health checks fail once more.
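A rough sketch of the amplify/revert bookkeeping described above, using zap atomic levels for the two cores of each logger. This is written under assumptions (type names, fields, and structure are illustrative, not the actual avalanchego implementation):

```go
package logging

import (
	"sync"
	"time"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// loggerLevels tracks, for one logger, the atomic level of each core
// (display and file) together with the level the operator asked for,
// so amplification can be undone.
type loggerLevels struct {
	display, file               zap.AtomicLevel
	desiredDisplay, desiredFile zapcore.Level
}

// factory sketches the logging factory's amplification state.
type factory struct {
	lock        sync.Mutex
	loggers     map[string]*loggerLevels
	maxDuration time.Duration // e.g. the 5 minute default

	amplified   bool
	amplifiedAt time.Time
	exhausted   bool // set once maxDuration elapsed; cleared when healthy again
}

// SetHealthy is called by the health checker after every round of checks.
func (f *factory) SetHealthy(healthy bool) {
	f.lock.Lock()
	defer f.lock.Unlock()

	switch {
	case healthy:
		// All checks pass: revert and re-arm future amplifications.
		f.revertLocked()
		f.exhausted = false
	case f.amplified && time.Since(f.amplifiedAt) > f.maxDuration:
		// Amplification ran long enough: revert and stop amplifying
		// until the node is healthy again.
		f.revertLocked()
		f.exhausted = true
	case !f.amplified && !f.exhausted:
		// First failure since the last healthy state: amplify all cores.
		for _, l := range f.loggers {
			l.display.SetLevel(zapcore.DebugLevel)
			l.file.SetLevel(zapcore.DebugLevel)
		}
		f.amplified = true
		f.amplifiedAt = time.Now()
	}
}

// revertLocked restores every core to its recorded desired level.
func (f *factory) revertLocked() {
	if !f.amplified {
		return
	}
	for _, l := range f.loggers {
		l.display.SetLevel(l.desiredDisplay)
		l.file.SetLevel(l.desiredFile)
	}
	f.amplified = false
}
```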
How this was tested
Added an end-to-end test, and also performed a manual test:
Ran a node on a VM with the newly introduced flags:
Then waited for consensus to start:
Afterwards, I disconnected it from all nodes via `sudo iptables -A OUTPUT -p tcp --destination-port 9651 -j DROP`. Then, upon detecting a health check failure, it changed its logging level on its own (see below):
Since I didn't specify a `--log-auto-amplification-max-duration`, the default is 5 minutes, so I let the node log in DEBUG for 5 minutes and then confirmed it stopped logging in DEBUG and reverted its logging level back to what it was.
Afterwards I re-connected the node to the network via `sudo iptables -D OUTPUT -p tcp --destination-port 9651 -j DROP` and waited for the log level to be reverted back to Info.
I then disconnected the node from the network again, and confirmed it amplifies its logging to DEBUG once more.
Need to be documented in RELEASES.md?
Updated it.