log: add logging to failover module #2339
There is also roles/coordinator.lua, which contains the coordinator's logic. It could log something like:
previous leader instance storage-b2(addr1.buiseness.net:3301) abcd-abcd-1234-1234 is dead, appoint new leader ...
Also add info about the previous leader for manual leader changes:
cartridge/cartridge/roles/coordinator.lua, lines 381 to 384 in 763de81
Thanks for logging improvements.
I think we should log the instance alias here too.
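A minimal sketch of how an instance could be described in a log line with its alias, URI and UUID together. The helper name describe_instance and the field names alias, uri and uuid are hypothetical and chosen for illustration; they are not taken from the cartridge code base.

```lua
-- Hypothetical helper: builds a human-readable instance description.
-- The field names (alias, uri, uuid) are assumptions for illustration.
local function describe_instance(instance)
    local alias = instance.alias or 'unknown'
    local uri = instance.uri or 'unknown'
    return string.format('%s(%s) %s', alias, uri, tostring(instance.uuid))
end

print(describe_instance({
    alias = 'storage-b2',
    uri = 'localhost:3301',
    uuid = 'abcd-abcd-1234-1234',
}))
```

Falling back to 'unknown' keeps the log line readable even when the alias is missing from the topology.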
Maybe we should also log more details about make_decision, for instance how many candidates were examined, so it is clear whether no decision was made because the whole replica set looks dead or because immunity_timeout hasn't expired yet.
We should also log that the control_loop has started and finished; this occurs when membership notifications arrive.
Could you also attach an example of what the logs will look like in the merge request?
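One way to log the start and finish of a control loop iteration is a thin wrapper around the iteration body. This is only a sketch: run_iteration and log_fn are hypothetical names, and log_fn stands in for a real logger function so that the example runs in plain Lua.

```lua
-- Hypothetical wrapper: logs the start and the outcome of one control
-- loop iteration. `log_fn` is an assumed stand-in for a real logger.
local function run_iteration(step, log_fn)
    log_fn('control loop iteration started')
    -- pcall keeps a failing iteration from hiding the "finished" state.
    local ok, err = pcall(step)
    if ok then
        log_fn('control loop iteration finished')
    else
        log_fn('control loop iteration failed: ' .. tostring(err))
    end
    return ok
end

-- Usage: print is used as the logger here.
run_iteration(function() end, print)
```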
I’ve added the requested improvements:
Example of how the logs look now:
Let me know if you’d like any further adjustments!
We also need to log the replicaset's alias. It is available in topology_cfg, so we can add it.
About coordinator:
We need to log why an appointment was made and why it wasn’t.
Log levels
- info: when an appointment is made
- warn: when an appointment is not made
When an appointment is made (info):
- The current leader is unhealthy; after examining N candidates a new one was chosen.
- This is the first appointment; the first node in the topology was selected.
When an appointment is not made (warn):
- The replica set’s immunity timeout hasn’t expired.
- The current leader is alive and electable (may be too frequent to log; absence of such logs can itself imply everything is fine).
Suggested flow:
- First check if the current leader is healthy.
  - If not, log the problem and proceed.
  - If yes, log that fact and exit.
- If the leader is unhealthy but still under an immunity timeout, log the timeout and the decision not to switch.
Logs must make it clear:
- How candidates are evaluated
- Which replica sets were considered
- Why the master was changed or kept unchanged.
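The flow and log levels described above can be sketched roughly as follows. This is not the coordinator's actual code: decide, the healthy flag, immunity_deadline, the candidates list and the logger table are all hypothetical names chosen so that the example runs in plain Lua.

```lua
-- Sketch of the suggested decision flow. Every name here is an assumption.
local function decide(replicaset, now, logger)
    local name = replicaset.alias or replicaset.uuid
    local leader = replicaset.leader

    if leader == nil then
        -- First appointment: take the first node in the topology.
        local first = replicaset.candidates[1]
        if first == nil then
            logger.warn(('Replicaset %s: no candidates to appoint'):format(name))
            return nil
        end
        logger.info(('Replicaset %s: first appointment, selected %s')
            :format(name, first.uuid))
        return first
    end

    if leader.healthy then
        -- May be too frequent to keep at this level in practice.
        logger.warn(('Replicaset %s: leader %s is healthy, keeping it')
            :format(name, leader.uuid))
        return nil
    end

    local deadline = replicaset.immunity_deadline or 0
    if now < deadline then
        logger.warn(('Replicaset %s: leader %s is unhealthy, but immunity '
            .. 'expires in %.1f s, not switching')
            :format(name, leader.uuid, deadline - now))
        return nil
    end

    local examined = 0
    for _, candidate in ipairs(replicaset.candidates) do
        examined = examined + 1
        if candidate.healthy and candidate.uuid ~= leader.uuid then
            logger.info(('Replicaset %s: leader %s is unhealthy, after '
                .. 'examining %d candidate(s) appoint %s')
                :format(name, leader.uuid, examined, candidate.uuid))
            return candidate
        end
    end

    logger.warn(('Replicaset %s: leader %s is unhealthy, examined %d '
        .. 'candidate(s), none is healthy'):format(name, leader.uuid, examined))
    return nil
end
```

Each early return carries its own log line, so a reader of the logs can tell which branch kept the leader unchanged.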
Hi,
local control_fiber = fiber.new(control_loop, session)
control_fiber:name('failover-coordinate')
Since the fiber is already named, that name will appear in the logs, so an extra prefix isn’t necessary.
The same applies to fencing_healthcheck:
vars.fencing_fiber = fiber.new(fencing_watch)
vars.fencing_fiber:name('cartridge.fencing')
The fiber name itself makes it clear where the log entry came from.
Let’s also keep the log line format:
Replicaset %s: appoint %s (%q) (manual)
so it always begins with “Replicaset …”.
For manual appointments, it would be helpful to include the replicaset alias as well.
Thanks!
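A possible shape for the appointment line with the replicaset alias included, keeping the "Replicaset ..." prefix. The function name format_appointment and the meaning assigned to each argument are assumptions for illustration; the source only fixes the format string itself.

```lua
-- Hypothetical formatter. The mapping of arguments to placeholders
-- (alias, replicaset UUID, leader UUID, leader URI) is an assumption.
local function format_appointment(rs_alias, rs_uuid, leader_uuid, leader_uri, manual)
    local line = string.format('Replicaset %s (%s): appoint %s (%q)',
        rs_alias, rs_uuid, leader_uuid, leader_uri)
    if manual then
        line = line .. ' (manual)'
    end
    return line
end

print(format_appointment('storage-b', 'rs-uuid', 'inst-uuid', 'localhost:3301', true))
```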
@Satbek I’ve updated the code as discussed:
Let me know if you’d like any further tweaks or adjustments! |
Here is an example of the log output with the current changes applied:
LGTM
Please update the changelog.
… and control loop activity
Co-authored-by: Satbek Turganbayev <[email protected]>
…eplicasets are always present
What has been done:
Added structured logging to failover.lua and coordinator.lua. Now logs cover leader switches (with aliases and URIs), quorum and health checks, immunity status, and errors during failover decisions.
Why:
Previously, the failover process lacked visibility. It was hard to trace leader election logic or investigate why a leader was changed without deep manual inspection.
What problem is being solved:
This change improves observability of failover and leader appointment workflows. It makes it easier to debug production incidents and analyze unexpected leader transitions.
I didn't forget about
Close TNTP-3341