log: add logging to failover module #2339

vakhov · 2025-06-25T05:35:14Z

What has been done:
Added structured logging to failover.lua and coordinator.lua. The logs now cover leader switches (with aliases and URIs), quorum and health checks, immunity status, and errors during failover decisions.

Why:
Previously, the failover process lacked visibility. It was hard to trace leader election logic or investigate why a leader was changed without deep manual inspection.

What problem is being solved:
This change improves observability of failover and leader appointment workflows. It makes it easier to debug production incidents and analyze unexpected leader transitions.

I didn't forget about:

- Tests
- Changelog
- Documentation

Close TNTP-3341

Satbek · 2025-06-27T09:13:04Z

There is also roles/coordinator.lua, which contains the coordinator's logic.
I think we should add the reason for the appointment to the log message.

https://github.com/tarantool/cartridge/blob/763de81461d48ac5d6aa219ca52f0f4ff8f28a6b/cartridge/roles/coordinator.lua#L106C18-L109C18

something like:

previous leader instance storage-b2(addr1.buiseness.net:3301) abcd-abcd-1234-1234 is dead, appoint new leader ...

and add info about the previous leader for manual leader changes

cartridge/cartridge/roles/coordinator.lua

Lines 381 to 384 in 763de81

    
           log.info('Replicaset %s: appoint %s (%q) (manual)', 
        
               replicaset_uuid, decision.leader, 
        
               assert(servers[decision.leader]).uri 
        
           )

Satbek

Thanks for logging improvements.

I think we should also log the instance alias here.

https://github.com/tarantool/cartridge/pull/2339/files#diff-b3b13e336d0031b78b83b2caa6fff8da5ce604376fecbe1c83723d0ab263a522L269-L274

cartridge/failover.lua

Satbek

Maybe we should also log more details about make_decision—for instance, how many candidates were examined, in case the whole replica set looks dead and that’s why no decision was made, or because immunity_timeout hasn’t expired yet.

We should also log that the control_loop has started and finished; this occurs when membership notifications arrive.

Could you also attach an example of what the logs will look like in the merge request?


vakhov · 2025-07-01T06:30:08Z

> Maybe we should also log more details about make_decision—for instance, how many candidates were examined, in case the whole replica set looks dead and that’s why no decision was made, or because immunity_timeout hasn’t expired yet.
>
> We should also log that the control_loop has started and finished; this occurs when membership notifications arrive.
>
> Could you also attach an example of what the logs will look like in the merge request?

I’ve added the requested improvements:

Logs in make_decision now include:
- how many candidates were evaluated,
- whether immunity is still active,
- a warning if no healthy candidates were found.

control_loop logs when it starts and finishes each iteration.

Example of how the logs look now:

control_loop: started
make_decision: evaluating replicaset bbbbbbbb-0000-0000-0000-000000000000 (candidates=3)
make_decision: immunity not expired for leader bbbbbbbb-bbbb-0000-0000-000000000001 (storage-1, localhost:13302) (expires in 1.0 sec)
control_loop: finished

control_loop: started
make_decision: evaluating replicaset bbbbbbbb-0000-0000-0000-000000000000 (candidates=3)
make_decision: no healthy candidates found in replicaset bbbbbbbb-0000-0000-0000-000000000000
control_loop: finished

control_loop: started
make_decision: evaluating replicaset bbbbbbbb-0000-0000-0000-000000000000 (candidates=3)
control_loop: replicaset bbbbbbbb-0000-0000-0000-000000000000: appoint new leader bbbbbbbb-bbbb-0000-0000-000000000001 (storage-1, localhost:13302), previous leader bbbbbbbb-bbbb-0000-0000-000000000002 (storage-2, localhost:13303)
control_loop: finished

Let me know if you’d like any further adjustments!

Satbek

We also need to log the replicaset's alias. It is in topology_cfg, so we can add it.

About coordinator:

We need to log why an appointment was made and why it wasn’t.

Log levels:

- info: when an appointment is made
- warn: when an appointment is not made

When an appointment is made (info):

- The current leader is unhealthy; after examining N candidates a new one was chosen.
- This is the first appointment; the first node in the topology was selected.

When an appointment is not made (warn):

- The replica set’s immunity timeout hasn’t expired.
- The current leader is alive and electable (may be too frequent to log; absence of such logs can itself imply everything is fine).

Suggested flow:

- First check if the current leader is healthy.
  - If not, log the problem and proceed.
  - If yes, log that fact and exit.
- If the leader is unhealthy but still under an immunity timeout, log the timeout and the decision not to switch.

Logs must make it clear:

- How candidates are evaluated
- Which replica sets were considered
- Why the master was changed or kept unchanged
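
To make the suggested flow concrete, here is a minimal sketch in plain Lua. It is not the actual coordinator.lua code: the table fields (leader, candidates, healthy, immunity_deadline), the reason strings first_appointment and leader_healthy, and the log_level helper are assumptions made for illustration only.

```lua
-- Hypothetical sketch of the decision flow; field names and some reason
-- strings are illustrative assumptions, not the cartridge implementation.

-- Returns: new leader (or nil), a reason string, and the number of
-- candidates that were examined.
local function make_decision(replicaset, now)
    local leader = replicaset.leader

    -- First appointment: no leader yet, take the first node in the topology.
    if leader == nil then
        return replicaset.candidates[1], 'first_appointment', 0
    end

    -- A healthy leader stays; nothing else is examined.
    if replicaset.healthy[leader] then
        return nil, 'leader_healthy', 0
    end

    -- The leader is unhealthy but still protected by the immunity timeout.
    if now < replicaset.immunity_deadline then
        return nil, 'immunity_not_expired', 0
    end

    -- Walk the candidates in topology order and count how many were checked.
    local checked = 0
    for _, candidate in ipairs(replicaset.candidates) do
        checked = checked + 1
        if replicaset.healthy[candidate] then
            return candidate, 'new_leader_selected', checked
        end
    end
    return nil, 'no_healthy_candidates', checked
end

-- info when an appointment is made, warn when it is not.
local function log_level(decision)
    return decision ~= nil and 'info' or 'warn'
end
```

Returning the reason and the checked count alongside the decision keeps the logging in one place at the call site, so every outcome, including "no appointment", produces exactly one line.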

cartridge/roles/coordinator.lua

Satbek

Hi,

local control_fiber = fiber.new(control_loop, session)
control_fiber:name('failover-coordinate')

Since the fiber is already named, that name will appear in the logs, so an extra prefix isn’t necessary.

The same applies to fencing_healthcheck:

vars.fencing_fiber = fiber.new(fencing_watch)
vars.fencing_fiber:name('cartridge.fencing')

The fiber name itself makes it clear where the log entry came from.

Let’s also keep the log line format:

Replicaset %s: appoint %s (%q) (manual)

so it always begins with “Replicaset …”.
For manual appointments, it would be helpful to include the replicaset alias as well.

Thanks!

cartridge/roles/coordinator.lua

vakhov · 2025-07-03T07:23:46Z

@Satbek I’ve updated the code as discussed:

- Removed redundant log prefixes like control_loop: (and others), since the fiber names are already visible in log entries.
- Adjusted log messages to always start with Replicaset ... for consistency.
- Cleaned up log formatting to avoid duplication and keep entries concise.

Let me know if you’d like any further tweaks or adjustments!

vakhov · 2025-07-03T09:56:52Z

Here is an example of the log output with the current changes applied:

Making decisions
Replicaset 11111111-2222-3333-4444-555555555555(storage-replicaset): no appointment made (reason=immunity_not_expired, checked=0)
Wait membership notifications

Making decisions
Replicaset 11111111-2222-3333-4444-555555555555(storage-replicaset): no appointment made (reason=no_healthy_candidates, checked=3)
Wait membership notifications

Making decisions
Replicaset 11111111-2222-3333-4444-555555555555(storage-replicaset): appoint new leader bbbbbbbb-bbbb-cccc-dddd-eeeeeeeeeeee (storage-2, "localhost:3302"), previous leader aaaaaaaa-aaaa-bbbb-cccc-dddddddddddd (storage-1, "localhost:3301") (reason=new_leader_selected, checked=2)
Wait membership notifications
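
For reference, lines in this shape can be produced with string.format alone. The sketch below is an assumption about how such formatting could look, not the code merged in this PR: the helper names describe_instance, format_appointment, and format_no_appointment are hypothetical, as are the alias and uri fields on the server tables.

```lua
-- Hypothetical formatting helpers; names and argument shapes are
-- illustrative assumptions, not the cartridge implementation.

-- Renders an instance as: <uuid> (<alias>, "<uri>")
local function describe_instance(uuid, server)
    return string.format('%s (%s, %q)', uuid, server.alias, server.uri)
end

-- Line for a successful appointment; always starts with "Replicaset ".
local function format_appointment(rs_uuid, rs_alias, new_leader, prev_leader,
                                  reason, checked)
    return string.format(
        'Replicaset %s(%s): appoint new leader %s, previous leader %s ' ..
        '(reason=%s, checked=%d)',
        rs_uuid, rs_alias, new_leader, prev_leader, reason, checked
    )
end

-- Line for the case where no appointment was made.
local function format_no_appointment(rs_uuid, rs_alias, reason, checked)
    return string.format(
        'Replicaset %s(%s): no appointment made (reason=%s, checked=%d)',
        rs_uuid, rs_alias, reason, checked
    )
end
```

In the real module the resulting string would be passed to log.info or log.warn depending on whether an appointment was made.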

Satbek

LGTM

Please update the changelog.


vakhov requested a review from Satbek June 25, 2025 05:35

vakhov self-assigned this Jun 25, 2025

vakhov force-pushed the TNTP-3341-add-failover-logging branch from ca94601 to 698af87 Compare June 25, 2025 05:40

vakhov removed their assignment Jun 26, 2025

vakhov force-pushed the TNTP-3341-add-failover-logging branch 3 times, most recently from fce6fa0 to 3f143bc Compare June 26, 2025 06:14

Satbek requested changes Jun 27, 2025


vakhov requested a review from Satbek June 30, 2025 06:40

Satbek requested changes Jun 30, 2025


vakhov force-pushed the TNTP-3341-add-failover-logging branch 6 times, most recently from bfd92d3 to de86aa1 Compare July 1, 2025 08:38

vakhov requested a review from Satbek July 1, 2025 09:23

Satbek requested changes Jul 1, 2025


vakhov force-pushed the TNTP-3341-add-failover-logging branch 4 times, most recently from c783bc2 to 473221f Compare July 1, 2025 14:51

vakhov requested a review from Satbek July 2, 2025 05:36

Satbek requested changes Jul 2, 2025


vakhov requested a review from Satbek July 3, 2025 07:23

vakhov force-pushed the TNTP-3341-add-failover-logging branch from bea0124 to ff5c712 Compare July 3, 2025 09:18

vakhov force-pushed the TNTP-3341-add-failover-logging branch 8 times, most recently from 0ef9b3c to 321bacf Compare July 3, 2025 10:21

Satbek reviewed Jul 3, 2025


Alex Vakhov and others added 12 commits July 3, 2025 20:28

log: add logging to failover module

ee6031e

log: add previous leader alias, uri, and reason to appointment logs

3c57dcd

log(failover): remove debug logs for 2.10 compatibility, keep info/warn only

a3c756f

log(coordinator): add logs for candidate evaluation, immunity status, and control loop activity

95d48b5

log(coordinator): restructure make_decision to return reason and checked count

69c5218

log(coordinator): change decision checks order to avoid excessive immunity logs

2ddca82

Update cartridge/roles/coordinator.lua

974bc70

Co-authored-by: Satbek Turganbayev <[email protected]>

Update cartridge/roles/coordinator.lua

4274db8

Co-authored-by: Satbek Turganbayev <[email protected]>

log: remove redundant log prefixes and include replicaset alias

b983b70

Update cartridge/failover.lua

c55788a

Co-authored-by: Satbek Turganbayev <[email protected]>

Update cartridge/failover.lua

9bb6d59

Co-authored-by: Satbek Turganbayev <[email protected]>

docs(changelog): document improved failover and leader election logging

02d4da8

vakhov force-pushed the TNTP-3341-add-failover-logging branch from 880b105 to 02d4da8 Compare July 3, 2025 15:30

refactor(coordinator): remove redundant assert in describe()

5c83255

vakhov force-pushed the TNTP-3341-add-failover-logging branch from 4a04a75 to 42c8a0c Compare July 4, 2025 07:45

fix(coordinator): re-add asserts to ensure topology_cfg.servers and replicasets are always present

22b3c43

vakhov force-pushed the TNTP-3341-add-failover-logging branch from 42c8a0c to 22b3c43 Compare July 4, 2025 07:46

Satbek approved these changes Jul 4, 2025


vakhov merged commit 0f99d69 into master Jul 4, 2025
22 checks passed

vakhov deleted the TNTP-3341-add-failover-logging branch July 4, 2025 08:33

Satbek mentioned this pull request Jul 8, 2025

Log additional failover information #2313

Closed
