@lidezhu
Collaborator

@lidezhu lidezhu commented Dec 2, 2025

What problem does this PR solve?

Issue Number: ref #2751

What is changed and how it works?

This pull request aims to significantly enhance the monitoring and observability of the TiCDC system. It introduces new metrics to track critical event processing and dispatcher management operations, alongside comprehensive updates to the Grafana dashboard. These changes provide more granular insights into system performance, particularly regarding event store activity, dispatcher health, and the state of slowest table replication, facilitating better debugging and performance analysis.

Highlights

  • Metric Variable Renaming: Internal Go variables CounterKv and CounterResolved in the event store were renamed to kvEventCount and resolvedEventCount respectively for improved clarity.
  • New Event Store Notification Metric: A new Prometheus counter, EventStoreNotifyDispatcherCount, was introduced to accurately track the total number of dispatcher notifications sent by the event store, reflecting the number of subscribers notified.
  • New Event Service Reset Metric: A new Prometheus counter, EventServiceResetDispatcherCount, was added to monitor the frequency of event dispatcher reset operations.
  • Grafana Update: Updated the Grafana dashboard to make the metrics easier to check.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

@ti-chi-bot ti-chi-bot bot added the do-not-merge/needs-linked-issue and release-note labels Dec 2, 2025
@gemini-code-assist

Summary of Changes

Hello @lidezhu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's monitoring capabilities by introducing a new metric for tracking dispatcher resets and refining the collection logic for existing event-related metrics. These improvements aim to provide more detailed and accurate insights into the event service's operation, which will be crucial for better performance analysis and debugging, particularly when integrated with Grafana dashboards.

Highlights

  • Metric Renaming: Renamed internal metric counters CounterKv to kvEventCount and CounterResolved to resolvedEventCount in the event store for improved clarity and consistency.
  • Refined Resolved Event Counting: The resolvedEventCount metric now increments by the actual number of subscribers notified, providing a more accurate representation of resolved event processing.
  • New Dispatcher Reset Metric: Introduced a new Prometheus counter, EventServiceResetDispatcherCount, to track the frequency of event dispatcher reset operations, enhancing observability into dispatcher management.

@ti-chi-bot ti-chi-bot bot added the size/XXL label and removed the do-not-merge/needs-linked-issue label Dec 2, 2025

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces improvements to metrics. It adds a new counter EventServiceResetDispatcherCount to track dispatcher resets, which is a valuable addition for monitoring. It also refactors the naming of some metric variables for better style. However, the change to the EventStoreReceivedEventCount metric for resolved events alters its meaning from counting received events to counting outgoing notifications to dispatchers. This makes the metric's name and help text misleading and inconsistent with how other event types for the same metric are handled. I've provided a suggestion to address this for better clarity and maintainability of the metrics.
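
The distinction raised in this review can be sketched in Go. The following is a minimal illustration under stated assumptions, not the PR's code: `sync/atomic` counters stand in for the Prometheus counters, and the `storeMetrics` type and `onResolvedTs` method are hypothetical names. It assumes that one resolved-timestamp event is counted once as a received event, while the number of notified subscribers goes to a separate notification counter.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// storeMetrics is a hypothetical stand-in for the event store's counters.
// The real code uses Prometheus counters; atomic integers keep this sketch
// free of external dependencies.
type storeMetrics struct {
	resolvedEventCount    atomic.Int64 // resolved-ts events received
	notifyDispatcherCount atomic.Int64 // notifications sent to dispatchers
}

// onResolvedTs records one received resolved-ts event and, separately,
// one notification per subscriber.
func (m *storeMetrics) onResolvedTs(subscribers int) {
	m.resolvedEventCount.Add(1)
	m.notifyDispatcherCount.Add(int64(subscribers))
}

func main() {
	m := &storeMetrics{}
	m.onResolvedTs(3) // one event, three subscribers notified
	m.onResolvedTs(2) // one event, two subscribers notified
	fmt.Println(m.resolvedEventCount.Load(), m.notifyDispatcherCount.Load()) // prints "2 5"
}
```

Keeping the two counters apart means the received-event count stays comparable across event types, while fan-out to dispatchers can be read from its own series.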

lidezhu and others added 4 commits December 2, 2025 17:11
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@lidezhu
Collaborator Author

lidezhu commented Dec 2, 2025

/gemini summary

@gemini-code-assist

Summary of Changes

This pull request aims to significantly enhance the monitoring and observability of the TiCDC system. It introduces new metrics to track critical event processing and dispatcher management operations, alongside comprehensive updates to the Grafana dashboard. These changes provide more granular insights into system performance, particularly regarding event store activity, dispatcher health, and the state of slowest table replication, facilitating better debugging and performance analysis.

Highlights

  • Metric Variable Renaming: Internal Go variables CounterKv and CounterResolved in the event store were renamed to kvEventCount and resolvedEventCount respectively for improved clarity.
  • New Event Store Notification Metric: A new Prometheus counter, EventStoreNotifyDispatcherCount, was introduced to accurately track the total number of dispatcher notifications sent by the event store, reflecting the number of subscribers notified.
  • New Event Service Reset Metric: A new Prometheus counter, EventServiceResetDispatcherCount, was added to monitor the frequency of event dispatcher reset operations.
  • Grafana Dashboard Overhaul: The ticdc_new_arch_with_keyspace_name.json dashboard underwent significant updates, including reorganization and re-positioning of various panels, introduction of new panels to visualize "Slowest Table Checkpoint" and "Slowest Table Replication State", refinement of existing metric queries and titles (notably swapping the descriptions and metrics for "Operator Cost Duration" and "Split Span Check Duration" panels), and re-purposing a panel to specifically track "Memory Quota" for the event-collector-redo module.

Changelog

  • logservice/eventstore/event_store.go
    • Renamed internal metric variables (CounterKv to kvEventCount, CounterResolved to resolvedEventCount).
    • Added an increment for the newly introduced EventStoreNotifyDispatcherCount metric, which counts the number of subscribers notified when a resolved timestamp is processed.
  • metrics/grafana/ticdc_new_arch_with_keyspace_name.json
    • Updated the dashboard's metadata (iteration, version, title).
    • Adjusted the module filter for ticdc_dynamic_stream_memory_usage.
    • Extensively re-positioned panels by modifying their y coordinates.
    • Updated legend configurations (e.g., current, max, values).
    • Added queryType: "randomWalk" to some targets.
    • Swapped the expr and title for "Operator Cost Duration" and "Split Span Check Duration" panels.
    • Introduced new panels for "Slowest Table Checkpoint" and "Slowest Table Replication State".
    • Re-configured an existing panel to display "Memory Quota" for the event-collector-redo module.
  • pkg/eventservice/event_broker.go
    • Incorporated an increment for the new EventServiceResetDispatcherCount metric.
  • pkg/metrics/event_service.go
    • Defined the EventServiceResetDispatcherCount Prometheus counter, which tracks the number of event dispatcher reset operations.
  • pkg/metrics/event_store.go
    • Defined the EventStoreNotifyDispatcherCount Prometheus counter, which tracks the number of times the event store notifies dispatchers with resolved timestamps.
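
As a rough illustration of the event_broker.go change listed above, the sketch below increments a counter each time a dispatcher is reset. It is an example under stated assumptions rather than the PR's code: the `broker` type, its `resetDispatcher` method, and the per-dispatcher `epochs` bookkeeping are hypothetical, and an atomic integer stands in for the EventServiceResetDispatcherCount Prometheus counter.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// broker is a hypothetical, simplified event broker. resetCount stands in
// for the EventServiceResetDispatcherCount Prometheus counter.
type broker struct {
	epochs     map[string]uint64 // hypothetical per-dispatcher epoch
	resetCount atomic.Int64
}

// newBroker returns a broker with an empty epoch table.
func newBroker() *broker {
	return &broker{epochs: make(map[string]uint64)}
}

// resetDispatcher bumps the dispatcher's epoch and counts the reset,
// so the total number of resets is observable from the counter.
func (b *broker) resetDispatcher(id string) {
	b.epochs[id]++
	b.resetCount.Add(1)
}

func main() {
	b := newBroker()
	b.resetDispatcher("d1")
	b.resetDispatcher("d1")
	b.resetDispatcher("d2")
	fmt.Println(b.resetCount.Load()) // prints "3"
}
```
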

Activity

  • The pull request is currently awaiting approval from flowbehappy, as indicated by the ti-chi-bot[bot].
  • gemini-code-assist[bot] provided an initial summary and highlights of the changes.
  • A review comment from gemini-code-assist[bot] pointed out a potential semantic change in how resolvedEventCount was being updated, suggesting a separate metric for dispatcher notifications. The actual implementation in the PR addresses this by introducing a new metric for notifications while keeping the original event count.
  • The author, lidezhu, subsequently requested another summary from gemini-code-assist[bot].

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm and approved labels Dec 2, 2025
@ti-chi-bot

ti-chi-bot bot commented Dec 2, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongyunyan, tenfyzhong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [hongyunyan,tenfyzhong]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the lgtm label and removed the needs-1-more-lgtm label Dec 2, 2025
@ti-chi-bot

ti-chi-bot bot commented Dec 2, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-12-02 10:26:01.237670864 +0000 UTC m=+345506.051448446: ☑️ agreed by hongyunyan.
  • 2025-12-02 10:33:28.317660759 +0000 UTC m=+345953.131438331: ☑️ agreed by tenfyzhong.

@lidezhu lidezhu added the needs-cherry-pick-release-8.5 label Dec 2, 2025
@lidezhu
Collaborator Author

lidezhu commented Dec 3, 2025

/retest

@lidezhu
Collaborator Author

lidezhu commented Dec 3, 2025

/test pull-cdc-mysql-integration-heavy


@ti-chi-bot ti-chi-bot bot merged commit f0d4d6f into master Dec 3, 2025
19 checks passed
@ti-chi-bot ti-chi-bot bot deleted the ldz/improve-metrics12021 branch December 3, 2025 08:22
ti-chi-bot pushed a commit to ti-chi-bot/ticdc-1 that referenced this pull request Dec 3, 2025
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #3447.
But this PR has conflicts, please resolve them!

lidezhu added a commit that referenced this pull request Dec 3, 2025
* This is an automated cherry-pick of #3430

Signed-off-by: ti-chi-bot <[email protected]>

* fix check

---------

Signed-off-by: ti-chi-bot <[email protected]>
Co-authored-by: lidezhu <[email protected]>
Co-authored-by: lidezhu <[email protected]>

Labels

  • approved
  • lgtm
  • needs-cherry-pick-release-8.5: Should cherry pick this PR to release-8.5 branch.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.

5 participants