Skip to content

Conversation

vakhov
Copy link
Contributor

@vakhov vakhov commented Jun 25, 2025

What has been done:
Added structured logging to failover.lua and coordinator.lua. Now logs cover leader switches (with aliases and URIs), quorum and health checks, immunity status, and errors during failover decisions.

Why:
Previously, the failover process lacked visibility. It was hard to trace leader election logic or investigate why a leader was changed without deep manual inspection.

What problem is being solved:
This change improves observability of failover and leader appointment workflows. It makes it easier to debug production incidents and analyze unexpected leader transitions.

I didn't forget about

  • Tests
  • Changelog
  • Documentation

Close TNTP-3341

@vakhov vakhov requested a review from Satbek June 25, 2025 05:35
@vakhov vakhov self-assigned this Jun 25, 2025
@vakhov vakhov force-pushed the TNTP-3341-add-failover-logging branch from ca94601 to 698af87 Compare June 25, 2025 05:40
@vakhov vakhov removed their assignment Jun 26, 2025
@vakhov vakhov force-pushed the TNTP-3341-add-failover-logging branch 3 times, most recently from fce6fa0 to 3f143bc Compare June 26, 2025 06:14
@Satbek
Copy link
Contributor

Satbek commented Jun 27, 2025

There is also roles/coordinator.lua where locates coordinator's logic.
I think we should add the reason for appointment in log message

https://github.com/tarantool/cartridge/blob/763de81461d48ac5d6aa219ca52f0f4ff8f28a6b/cartridge/roles/coordinator.lua#L106C18-L109C18

something like:

previous leader instance storage-b2(addr1.buiseness.net:3301) abcd-abcd-1234-1234 id dead, appoint new leader ...

and add info about previous leader for manual leader change

log.info('Replicaset %s: appoint %s (%q) (manual)',
replicaset_uuid, decision.leader,
assert(servers[decision.leader]).uri
)

Copy link
Contributor

@Satbek Satbek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vakhov vakhov requested a review from Satbek June 30, 2025 06:40
Copy link
Contributor

@Satbek Satbek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should also log more details about make_decision—for instance, how many candidates were examined, in case the whole replica set looks dead and that’s why no decision was made, or because immunity_timeout hasn’t expired yet.

We should also log that the control_loop has started and finished; this occurs when membership notifications arrive.

Could you also attach an example of what the logs will look like in the merge request?

@vakhov
Copy link
Contributor Author

vakhov commented Jul 1, 2025

Maybe we should also log more details about make_decision—for instance, how many candidates were examined, in case the whole replica set looks dead and that’s why no decision was made, or because immunity_timeout hasn’t expired yet.

We should also log that the control_loop has started and finished; this occurs when membership notifications arrive.

Could you also attach an example of what the logs will look like in the merge request?

I’ve added the requested improvements:

  • Logs in make_decision now include:

    • how many candidates were evaluated,
    • whether immunity is still active,
    • a warning if no healthy candidates were found.
  • control_loop logs when it starts and finishes each iteration.

Example of how the logs look now:

control_loop: started
make_decision: evaluating replicaset bbbbbbbb-0000-0000-0000-000000000000 (candidates=3)
make_decision: immunity not expired for leader bbbbbbbb-bbbb-0000-0000-000000000001 (storage-1, localhost:13302) (expires in 1.0 sec)
control_loop: finished

control_loop: started
make_decision: evaluating replicaset bbbbbbbb-0000-0000-0000-000000000000 (candidates=3)
make_decision: no healthy candidates found in replicaset bbbbbbbb-0000-0000-0000-000000000000
control_loop: finished

control_loop: started
make_decision: evaluating replicaset bbbbbbbb-0000-0000-0000-000000000000 (candidates=3)
control_loop: replicaset bbbbbbbb-0000-0000-0000-000000000000: appoint new leader bbbbbbbb-bbbb-0000-0000-000000000001 (storage-1, localhost:13302), previous leader bbbbbbbb-bbbb-0000-0000-000000000002 (storage-2, localhost:13303)
control_loop: finished

Let me know if you’d like any further adjustments!

@vakhov vakhov force-pushed the TNTP-3341-add-failover-logging branch 6 times, most recently from bfd92d3 to de86aa1 Compare July 1, 2025 08:38
@vakhov vakhov requested a review from Satbek July 1, 2025 09:23
Copy link
Contributor

@Satbek Satbek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need to log replicaset's alias. It is in topology_cfg so we can add it.

About coordinator:

We need to log why an appointment was made and why it wasn’t.

Log levels

  • info — when an appointment is made
  • warn — when an appointment is not made

When an appointment is made (info):

  • The current leader is unhealthy; after examining N candidates a new one was chosen.
  • This is the first appointment; the first node in the topology was selected.

When an appointment is not made (warn):

  • The replica set’s immunity timeout hasn’t expired.
  • The current leader is alive and electable (may be too frequent to log; absence of such logs can itself imply everything is fine).

Suggested flow:

  1. First check if the current leader is healthy.

    • If not, log the problem and proceed.
    • If yes, log that fact and exit.
  2. If the leader is unhealthy but still under an immunity timeout, log the timeout and the decision not to switch.

Logs must make it clear:

  • How candidates are evaluated
  • Which replica sets were considered
  • Why the master was changed or kept unchanged.

@vakhov vakhov force-pushed the TNTP-3341-add-failover-logging branch 4 times, most recently from c783bc2 to 473221f Compare July 1, 2025 14:51
@vakhov vakhov requested a review from Satbek July 2, 2025 05:36
Copy link
Contributor

@Satbek Satbek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

local control_fiber = fiber.new(control_loop, session)
control_fiber:name('failover-coordinate')

Since the fiber is already named, that name will appear in the logs, so an extra prefix isn’t necessary.

The same applies to fencing_healthcheck:

vars.fencing_fiber = fiber.new(fencing_watch)
vars.fencing_fiber:name('cartridge.fencing')

The fiber name itself makes it clear where the log entry came from.

Let’s also keep the log line format:

Replicaset %s: appoint %s (%q) (manual)

so it always begins with “Replicaset …”.
For manual appointments, it would be helpful to include the replicaset alias as well.

Thanks!

@vakhov
Copy link
Contributor Author

vakhov commented Jul 3, 2025

@Satbek I’ve updated the code as discussed:

  • Removed redundant log prefixes like control_loop: (and other) since the fiber names are already visible in log entries.
  • Adjusted log messages to always start with Replicaset ... for consistency.
  • Cleaned up log formatting to avoid duplication and keep entries concise.

Let me know if you’d like any further tweaks or adjustments!

@vakhov vakhov requested a review from Satbek July 3, 2025 07:23
@vakhov vakhov force-pushed the TNTP-3341-add-failover-logging branch from bea0124 to ff5c712 Compare July 3, 2025 09:18
@vakhov
Copy link
Contributor Author

vakhov commented Jul 3, 2025

Here is an example of the log output with the current changes applied:

Making decisions
Replicaset 11111111-2222-3333-4444-555555555555(storage-replicaset): no appointment made (reason=immunity_not_expired, checked=0)
Wait membership notifications

Making decisions
Replicaset 11111111-2222-3333-4444-555555555555(storage-replicaset): no appointment made (reason=no_healthy_candidates, checked=3)
Wait membership notifications

Making decisions
Replicaset 11111111-2222-3333-4444-555555555555(storage-replicaset): appoint new leader bbbbbbbb-bbbb-cccc-dddd-eeeeeeeeeeee (storage-2, "localhost:3302"), previous leader aaaaaaaa-aaaa-bbbb-cccc-dddddddddddd (storage-1, "localhost:3301") (reason=new_leader_selected, checked=2)
Wait membership notifications

@vakhov vakhov force-pushed the TNTP-3341-add-failover-logging branch 8 times, most recently from 0ef9b3c to 321bacf Compare July 3, 2025 10:21
Copy link
Contributor

@Satbek Satbek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

please, update changelog

@vakhov vakhov force-pushed the TNTP-3341-add-failover-logging branch from 880b105 to 02d4da8 Compare July 3, 2025 15:30
@vakhov vakhov force-pushed the TNTP-3341-add-failover-logging branch from 4a04a75 to 42c8a0c Compare July 4, 2025 07:45
@vakhov vakhov force-pushed the TNTP-3341-add-failover-logging branch from 42c8a0c to 22b3c43 Compare July 4, 2025 07:46
@vakhov vakhov merged commit 0f99d69 into master Jul 4, 2025
22 checks passed
@vakhov vakhov deleted the TNTP-3341-add-failover-logging branch July 4, 2025 08:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants