[Store] Add support for static labels (host IP/cluster name) in client metrics #1081

cocktail828 · 2025-11-19T15:20:37Z

Description

Client metrics currently lack context information. When troubleshooting, details such as host IP and cluster name make it much faster to locate a problem, and they are an essential part of production-grade instrumentation.

Changes in This PR

Add support for static labels in client metrics, including core fields such as host IP and cluster name.
Keep the original metric logic unchanged: static labels are only attached, with no breaking changes.
Label values follow existing configuration standards, requiring no additional manual setup and adapting to the current deployment workflow.

gemini-code-assist · 2025-11-19T15:20:55Z

Summary of Changes

Hello @cocktail828, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the observability of client metrics by integrating essential static context information, specifically the host IP and cluster name. By introducing a new 'hybrid' metric mechanism, these labels are seamlessly attached to existing metrics without disrupting their core functionality. This enhancement is crucial for effective production-grade monitoring and troubleshooting, allowing operators to quickly identify the source and context of metric data. The configuration is straightforward, relying on standard environment variables, and is thoroughly documented.

Highlights

Enhanced Client Metrics: Client metrics now include static labels such as host IP (instance_id) and cluster name (cluster_name), providing richer context for monitoring and troubleshooting.
New Metric Types for Static Labels: Introduced hybrid_counter and hybrid_histogram metric types to support the inclusion of static labels alongside dynamic labels, ensuring existing metric logic remains unchanged.
Environment Variable Configuration: Static labels are configured via new environment variables: MC_STORE_NODE_IP for host IP and MC_STORE_CLUSTER_NAME for the cluster name. Documentation has been updated to reflect these additions.
Metric Serialization Verification: Added test cases to confirm that the new instance_id and cluster_name labels are correctly serialized and present in the client metrics output.
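As a rough illustration of the highlights above, the sketch below reads the two environment variables and renders one sample line in the Prometheus text format. It is not the PR's implementation: `env_or`, `static_labels`, `render_sample`, the `"unknown"` fallback, the metric name, and the choice to emit static labels before dynamic ones are all assumptions made for this example, and label-value escaping is omitted.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

using Label = std::pair<std::string, std::string>;

// Hypothetical helper (not the PR's get_env_or_default): returns the value of
// an environment variable, or `fallback` when the variable is unset.
inline std::string env_or(const char* name, const std::string& fallback) {
    const char* v = std::getenv(name);
    return v ? std::string(v) : fallback;
}

// Static labels resolved from the environment variables named in the PR.
// The "unknown" fallback is an assumption of this sketch.
inline std::vector<Label> static_labels() {
    return {{"instance_id", env_or("MC_STORE_NODE_IP", "unknown")},
            {"cluster_name", env_or("MC_STORE_CLUSTER_NAME", "unknown")}};
}

// Renders `name{static...,dynamic...} value`. Escaping of label values
// (backslash, quote, newline) is left out to keep the sketch short.
inline std::string render_sample(const std::string& name,
                                 const std::vector<Label>& fixed,
                                 const std::vector<Label>& dynamic,
                                 long long value) {
    assert(!name.empty());
    std::string out = name + "{";
    bool first = true;
    auto append_all = [&out, &first](const std::vector<Label>& labels) {
        for (const auto& kv : labels) {
            if (!first) out += ",";
            out += kv.first + "=\"" + kv.second + "\"";
            first = false;
        }
    };
    append_all(fixed);
    append_all(dynamic);
    out += "} " + std::to_string(value);
    return out;
}
```

Calling `render_sample("ops_total", static_labels(), {{"op", "put"}}, 3)` would yield a single line whose label set starts with `instance_id` and `cluster_name`, followed by the dynamic `op` label.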

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a valuable feature by adding static labels like host IP and cluster name to client metrics, which will significantly improve observability and troubleshooting. The implementation adds a new hybrid_metric.h to support metrics with both static and dynamic labels, and the changes are accompanied by relevant test updates. My review identifies a critical bug in the serialization logic of the new hybrid histogram, along with several opportunities for performance improvements and minor corrections in hybrid_metric.h. I have also included a suggestion to improve the clarity of the new documentation.

gemini-code-assist · 2025-11-19T15:23:42Z

mooncake-store/include/hybrid_metric.h

+    void serialize(std::string& str) override {
+        auto value_map = sum_->copy();
+        if (value_map.empty()) {
+            return;
+        }
+
+        serialize_head(str);
+
+        std::string value_str;
+        auto bucket_counts = get_bucket_counts();
+        for (auto& e : value_map) {
+            auto& labels_value = e->label;
+            auto& value = e->value;
+            if (value == 0) {
+                continue;
+            }
+
+            value_type count = 0;
+            for (size_t i = 0; i < bucket_counts.size(); i++) {
+                auto counter = bucket_counts[i];
+                value_str.append(name_).append("_bucket{");
+                if (!labels_name_.empty()) {
+                    build_label_string_with_static(value_str, labels_name_,
+                                                   labels_value);
+                    value_str.append(",");
+                }
+
+                if (i == bucket_boundaries_.size()) {
+                    value_str.append("le=\"").append("+Inf").append("\"} ");
+                } else {
+                    value_str.append("le=\"")
+                        .append(std::to_string(bucket_boundaries_[i]))
+                        .append("\"} ");
+                }
+
+                count += counter->value(labels_value);
+                value_str.append(std::to_string(count));
+                value_str.append("\n");
+            }
+
+            str.append(value_str);
+
+            str.append(name_);
+            str.append("_sum{");
+            build_label_string_with_static(str, sum_->labels_name(),
+                                           labels_value);
+            str.append("} ");
+
+            str.append(std::to_string(value));
+            str.append("\n");
+
+            str.append(name_).append("_count{");
+            build_label_string_with_static(str, sum_->labels_name(),
+                                           labels_value);
+            str.append("} ");
+            str.append(std::to_string(count));
+            str.append("\n");
+        }
+        if (value_str.empty()) {
+            str.clear();
+        }
+    }


There is a critical bug in the serialize method. The value_str is declared outside the main loop and never cleared, causing it to accumulate bucket data from all label sets. This results in incorrect and duplicated metric output. Additionally, the if (value_str.empty()) check at the end is logically flawed and will cause a compilation error once value_str is correctly scoped inside the loop.

The suggested change fixes this by moving value_str inside the loop and removing the problematic final check.

    void serialize(std::string& str) override {
        auto value_map = sum_->copy();
        if (value_map.empty()) {
            return;
        }

        serialize_head(str);

        auto bucket_counts = get_bucket_counts();
        for (auto& e : value_map) {
            auto& labels_value = e->label;
            auto& value = e->value;
            if (value == 0) {
                continue;
            }

            std::string value_str;
            value_type count = 0;
            for (size_t i = 0; i < bucket_counts.size(); i++) {
                auto counter = bucket_counts[i];
                value_str.append(name_).append("_bucket{");
                if (!labels_name_.empty()) {
                    build_label_string_with_static(value_str, labels_name_,
                                                   labels_value);
                    value_str.append(",");
                }

                if (i == bucket_boundaries_.size()) {
                    value_str.append("le=\"").append("+Inf").append("\"} ");
                } else {
                    value_str.append("le=\"")
                        .append(std::to_string(bucket_boundaries_[i]))
                        .append("\"} ");
                }

                count += counter->value(labels_value);
                value_str.append(std::to_string(count));
                value_str.append("\n");
            }

            str.append(value_str);

            str.append(name_);
            str.append("_sum{");
            build_label_string_with_static(str, sum_->labels_name(),
                                           labels_value);
            str.append("} ");

            str.append(std::to_string(value));
            str.append("\n");

            str.append(name_).append("_count{");
            build_label_string_with_static(str, sum_->labels_name(),
                                           labels_value);
            str.append("} ");
            str.append(std::to_string(count));
            str.append("\n");
        }
    }
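To make the accumulation problem concrete, here is a stripped-down sketch that isolates only the buffer handling. The function names are hypothetical and each input string stands in for the bucket lines of one label set; it is not the histogram code itself.

```cpp
#include <string>
#include <vector>

// Buffer declared outside the loop and never cleared, mirroring the pattern
// flagged in the review: every iteration re-appends all earlier lines.
inline std::string join_shared_buffer(const std::vector<std::string>& lines) {
    std::string out;
    std::string buf;
    for (const auto& line : lines) {
        buf.append(line).append("\n");
        out.append(buf);  // buf still holds the previous iterations' lines
    }
    return out;
}

// Buffer scoped to a single iteration, mirroring the suggested change:
// each label set contributes its own lines exactly once.
inline std::string join_scoped_buffer(const std::vector<std::string>& lines) {
    std::string out;
    for (const auto& line : lines) {
        std::string buf;
        buf.append(line).append("\n");
        out.append(buf);
    }
    return out;
}
```

With the inputs `"a"` and `"b"`, the shared-buffer version emits `a` twice before `b`, while the scoped version emits each line once.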

gemini-code-assist · 2025-11-19T15:23:42Z

doc/en/mooncake-store.md

+- MC_STORE_CLUSTER_ID: Identify the metadata when multiple cluster share the same master, default 'mooncake'
+- MC_STORE_MEMCPY: Enables or disables local memcpy optimization, set to 1/true to enable, 0/false to disable
+- MC_STORE_NODE_IP: Used by client metrics, the node's IP address
+- MC_STORE_CLUSTER_NAME: Note: the envariable identify the LLM model name (e.g., Qwen-Max, Qwen-Plus)


There's a typo here. envariable should be environment variable. Also, the description could be rephrased for better clarity.

Suggested change

- MC_STORE_CLUSTER_NAME: Note: the envariable identify the LLM model name (e.g., Qwen-Max, Qwen-Plus)

- MC_STORE_CLUSTER_NAME: This environment variable identifies the LLM model or cluster name (e.g., Qwen-Max, Qwen-Plus).

gemini-code-assist · 2025-11-19T15:23:42Z

mooncake-store/include/hybrid_metric.h

+    bool has_label_value(const std::string& value) override {
+        auto map = Base::copy();
+        for (auto& e : map) {
+            auto& label_value = e->label;
+            if (auto it =
+                    std::find(label_value.begin(), label_value.end(), value);
+                it != label_value.end()) {
+                return true;
+            }
+        }
+
+        return false;
+    }


This implementation of has_label_value can be inefficient. It calls Base::copy(), which likely copies the entire underlying map of metrics, and then performs a linear scan. For a large number of metrics, this could become a performance bottleneck if called frequently. If this function is in a hot path, consider a more optimized approach, perhaps using an auxiliary data structure for faster lookups.
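One possible shape for such an auxiliary structure is sketched below, assuming a simplified counter that owns its own map. `IndexedCounter` and all of its members are hypothetical and do not reflect the `ylt::metric` base classes; the idea is only that a per-value reference count, updated on insert and removal, turns `has_label_value` into a hash lookup.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical labelled counter with a side index of label values.
class IndexedCounter {
public:
    void inc(const std::vector<std::string>& labels, int64_t delta = 1) {
        assert(delta >= 0);  // counters only grow
        std::lock_guard<std::mutex> lock(mu_);
        auto res = values_.emplace(labels, int64_t{0});
        if (res.second) {
            // New label set: register each of its values in the index.
            for (const auto& v : labels) ++index_[v];
        }
        res.first->second += delta;
    }

    void remove(const std::vector<std::string>& labels) {
        std::lock_guard<std::mutex> lock(mu_);
        if (values_.erase(labels) == 0) return;
        for (const auto& v : labels) {
            auto it = index_.find(v);
            if (it != index_.end() && --it->second == 0) index_.erase(it);
        }
    }

    // Average O(1) lookup; no copy of the metric map and no linear scan.
    bool has_label_value(const std::string& value) const {
        std::lock_guard<std::mutex> lock(mu_);
        return index_.count(value) > 0;
    }

    int64_t value(const std::vector<std::string>& labels) const {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = values_.find(labels);
        return it == values_.end() ? 0 : it->second;
    }

private:
    mutable std::mutex mu_;
    std::map<std::vector<std::string>, int64_t> values_;
    std::unordered_map<std::string, int> index_;  // label value -> use count
};
```

The trade-off is extra bookkeeping on every insert and removal, which is only worthwhile if `has_label_value` really is called on a hot path.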

gemini-code-assist · 2025-11-19T15:23:42Z

mooncake-store/include/hybrid_metric.h

+            if (value == 0) {
+                continue;
+            }


The check if (value == 0) to skip serialization might be incorrect. If negative values can be observed, it's possible for the sum (value) to be zero while the count is non-zero. In such cases, the histogram should still be serialized. It would be more robust to check if the count is zero instead of the sum.
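A small sketch of the difference, using a simplified histogram state rather than the PR's types (`HistogramState`, `observe`, `skip_by_sum`, and `skip_by_count` are hypothetical names). Note that the sum can also be zero without any negative values, for example when every observation is exactly 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical, simplified histogram state for a single label set.
struct HistogramState {
    std::vector<int64_t> boundaries;      // upper bounds, ascending
    std::vector<uint64_t> bucket_counts;  // boundaries.size() + 1, last is +Inf
    int64_t sum = 0;
};

inline HistogramState make_histogram(std::vector<int64_t> boundaries) {
    HistogramState h;
    h.bucket_counts.assign(boundaries.size() + 1, 0);
    h.boundaries = std::move(boundaries);
    return h;
}

inline void observe(HistogramState& h, int64_t x) {
    assert(h.bucket_counts.size() == h.boundaries.size() + 1);
    std::size_t i = 0;
    while (i < h.boundaries.size() && x > h.boundaries[i]) ++i;
    ++h.bucket_counts[i];
    h.sum += x;
}

inline uint64_t total_count(const HistogramState& h) {
    return std::accumulate(h.bucket_counts.begin(), h.bucket_counts.end(),
                           uint64_t{0});
}

// Skip decision as in the reviewed code (based on the sum).
inline bool skip_by_sum(const HistogramState& h) { return h.sum == 0; }

// Skip decision as suggested in the review (based on the count).
inline bool skip_by_count(const HistogramState& h) {
    return total_count(h) == 0;
}
```

After observing -5 and 5, `skip_by_sum` would drop the histogram even though two observations were recorded, while `skip_by_count` keeps it.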

gemini-code-assist · 2025-11-19T15:23:42Z

mooncake-store/include/hybrid_metric.h

+                        .append("\"} ");
+                }
+
+                count += counter->value(labels_value);


Inside the serialize method, counter->value(labels_value) is called within a loop for every bucket. This leads to multiple map lookups for the same labels_value, which is inefficient. For better performance, consider calculating the total count for all buckets with a single pass or optimizing the lookup strategy.
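One reading of this suggestion is to separate the lookups from the formatting: read each bucket's count once into a vector, derive the cumulative counts in a single pass, and only then build the output. Each bucket is a separate counter, so one lookup per bucket is likely still needed. The sketch below assumes the per-bucket counts have already been collected; `cumulative_counts` and `render_buckets` are hypothetical names, and integer boundaries are used only to keep the formatting simple.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Converts per-bucket counts into the cumulative counts that the
// Prometheus `_bucket` series expects.
inline std::vector<uint64_t> cumulative_counts(
    const std::vector<uint64_t>& per_bucket) {
    std::vector<uint64_t> out(per_bucket.size());
    std::partial_sum(per_bucket.begin(), per_bucket.end(), out.begin());
    return out;
}

// Renders the `_bucket` lines for one label set. `labels` is an already
// formatted label string such as `op="put"`, or empty.
inline std::string render_buckets(const std::string& name,
                                  const std::string& labels,
                                  const std::vector<int64_t>& boundaries,
                                  const std::vector<uint64_t>& per_bucket) {
    // One count per boundary plus the trailing +Inf bucket.
    assert(per_bucket.size() == boundaries.size() + 1);
    const std::vector<uint64_t> cumulative = cumulative_counts(per_bucket);
    std::string out;
    for (std::size_t i = 0; i < cumulative.size(); ++i) {
        out += name + "_bucket{";
        if (!labels.empty()) out += labels + ",";
        out += "le=\"";
        out += (i == boundaries.size()) ? std::string("+Inf")
                                        : std::to_string(boundaries[i]);
        out += "\"} " + std::to_string(cumulative[i]) + "\n";
    }
    return out;
}
```

Having the counts in hand before formatting also makes it cheap to skip a label set whose total count is zero, since the last cumulative entry is the total.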

gemini-code-assist · 2025-11-19T15:23:43Z

mooncake-store/include/hybrid_metric.h

+
+using hybrid_histogram_5t = basic_hybrid_histogram<int64_t, 5>;
+using hybrid_histogram_5d = basic_hybrid_histogram<double, 5>;
+}  // namespace ylt::metric


This file is missing a newline at the end. It's a common convention to end files with a newline to avoid issues with some tools and scripts.

} // namespace ylt::metric

stmatengss · 2025-11-20T09:38:33Z

mooncake-store/include/client_metric.h

+// Static labels include but are not limited to machine address, cluster name,
+// etc. These labels remain constant during the lifetime of the application
+const std::string kInstanceID = get_env_or_default("MC_STORE_NODE_IP");
+const std::string kClusterName = get_env_or_default("MC_STORE_CLUSTER_NAME");


Why not define this variable as a parameter on the master side?

cocktail828 · 2025-11-20

The master isn't intended to handle this; it only exports RPC-related metrics.
Node IP, cluster name, transfer-engine metrics, and the like are client-side metrics.

For instance, to troubleshoot transfer latency issues, node IP information is essential for pinpointing single-point failures. Otherwise, we only have global metrics or non-specific errors, with no visibility into which node is at fault.

[Store] enhance client metric

35a8790

gemini-code-assist bot reviewed Nov 19, 2025

[Store] enhance client metric - fix typos

df8baec

stmatengss reviewed Nov 20, 2025

[Store] remove unnecessary move operations and improve performance

923a7e3
