@cocktail828
Contributor

Description

Client metrics currently lack the context needed for troubleshooting. Details such as host IP and cluster name make it much faster to pinpoint a problem, and they are an essential part of production-grade instrumentation.

Changes in This PR

  • Add support for static labels in client metrics, including core fields such as host IP and cluster name.
  • Keep the original metric logic unchanged: static labels are only attached, with no breaking changes.
  • Label values follow existing configuration standards, requiring no additional manual setup and adapting to the current deployment workflow.

@gemini-code-assist
Contributor

Summary of Changes

Hello @cocktail828, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the observability of client metrics by integrating essential static context information, specifically the host IP and cluster name. By introducing a new 'hybrid' metric mechanism, these labels are seamlessly attached to existing metrics without disrupting their core functionality. This enhancement is crucial for effective production-grade monitoring and troubleshooting, allowing operators to quickly identify the source and context of metric data. The configuration is straightforward, relying on standard environment variables, and is thoroughly documented.

Highlights

  • Enhanced Client Metrics: Client metrics now include static labels such as host IP (instance_id) and cluster name (cluster_name), providing richer context for monitoring and troubleshooting.
  • New Metric Types for Static Labels: Introduced hybrid_counter and hybrid_histogram metric types to support the inclusion of static labels alongside dynamic labels, ensuring existing metric logic remains unchanged.
  • Environment Variable Configuration: Static labels are configured via new environment variables: MC_STORE_NODE_IP for host IP and MC_STORE_CLUSTER_NAME for the cluster name. Documentation has been updated to reflect these additions.
  • Metric Serialization Verification: Added test cases to confirm that the new instance_id and cluster_name labels are correctly serialized and present in the client metrics output.
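To make the environment-variable configuration above concrete, here is a minimal sketch of how static labels might be read once at startup and rendered in Prometheus text format. The helper names `env_or`, `make_static_labels`, and `render_labels` are hypothetical and are not taken from the PR; only the variable names MC_STORE_NODE_IP and MC_STORE_CLUSTER_NAME and the label names instance_id and cluster_name come from the summary.

```cpp
#include <cstdlib>
#include <map>
#include <string>

// Hypothetical helper: returns the variable's value, or `fallback` if unset.
inline std::string env_or(const char* name, const std::string& fallback = "") {
    const char* v = std::getenv(name);
    return v ? std::string(v) : fallback;
}

// Builds the static label set once; the values stay fixed for the process
// lifetime. Label and variable names follow the PR summary.
inline std::map<std::string, std::string> make_static_labels() {
    return {
        {"instance_id", env_or("MC_STORE_NODE_IP")},
        {"cluster_name", env_or("MC_STORE_CLUSTER_NAME")},
    };
}

// Renders labels as key="value",key="value". This sketch does not escape
// quotes or backslashes inside values, which a real exporter would need to do.
inline std::string render_labels(
    const std::map<std::string, std::string>& labels) {
    std::string out;
    for (const auto& [key, value] : labels) {
        if (!out.empty()) {
            out.append(",");
        }
        out.append(key).append("=\"").append(value).append("\"");
    }
    return out;
}
```

Because `std::map` iterates in key order, the rendered labels come out in a stable order regardless of how they were inserted, which keeps the scraped output deterministic.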
Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable feature by adding static labels like host IP and cluster name to client metrics, which will significantly improve observability and troubleshooting. The implementation adds a new hybrid_metric.h to support metrics with both static and dynamic labels, and the changes are accompanied by relevant test updates. My review identifies a critical bug in the serialization logic of the new hybrid histogram, along with several opportunities for performance improvements and minor corrections in hybrid_metric.h. I have also included a suggestion to improve the clarity of the new documentation.

Comment on lines 287 to 348
    void serialize(std::string& str) override {
        auto value_map = sum_->copy();
        if (value_map.empty()) {
            return;
        }

        serialize_head(str);

        std::string value_str;
        auto bucket_counts = get_bucket_counts();
        for (auto& e : value_map) {
            auto& labels_value = e->label;
            auto& value = e->value;
            if (value == 0) {
                continue;
            }

            value_type count = 0;
            for (size_t i = 0; i < bucket_counts.size(); i++) {
                auto counter = bucket_counts[i];
                value_str.append(name_).append("_bucket{");
                if (!labels_name_.empty()) {
                    build_label_string_with_static(value_str, labels_name_,
                                                   labels_value);
                    value_str.append(",");
                }

                if (i == bucket_boundaries_.size()) {
                    value_str.append("le=\"").append("+Inf").append("\"} ");
                } else {
                    value_str.append("le=\"")
                        .append(std::to_string(bucket_boundaries_[i]))
                        .append("\"} ");
                }

                count += counter->value(labels_value);
                value_str.append(std::to_string(count));
                value_str.append("\n");
            }

            str.append(value_str);

            str.append(name_);
            str.append("_sum{");
            build_label_string_with_static(str, sum_->labels_name(),
                                           labels_value);
            str.append("} ");

            str.append(std::to_string(value));
            str.append("\n");

            str.append(name_).append("_count{");
            build_label_string_with_static(str, sum_->labels_name(),
                                           labels_value);
            str.append("} ");
            str.append(std::to_string(count));
            str.append("\n");
        }
        if (value_str.empty()) {
            str.clear();
        }
    }
Contributor

critical

There is a critical bug in the serialize method. The value_str is declared outside the main loop and never cleared, causing it to accumulate bucket data from all label sets. This results in incorrect and duplicated metric output. Additionally, the if (value_str.empty()) check at the end is logically flawed and will cause a compilation error once value_str is correctly scoped inside the loop.

The suggested change fixes this by moving value_str inside the loop and removing the problematic final check.

    void serialize(std::string& str) override {
        auto value_map = sum_->copy();
        if (value_map.empty()) {
            return;
        }

        serialize_head(str);

        auto bucket_counts = get_bucket_counts();
        for (auto& e : value_map) {
            auto& labels_value = e->label;
            auto& value = e->value;
            if (value == 0) {
                continue;
            }

            std::string value_str;
            value_type count = 0;
            for (size_t i = 0; i < bucket_counts.size(); i++) {
                auto counter = bucket_counts[i];
                value_str.append(name_).append("_bucket{");
                if (!labels_name_.empty()) {
                    build_label_string_with_static(value_str, labels_name_,
                                                   labels_value);
                    value_str.append(",");
                }

                if (i == bucket_boundaries_.size()) {
                    value_str.append("le=\"").append("+Inf").append("\"} ");
                } else {
                    value_str.append("le=\"")
                        .append(std::to_string(bucket_boundaries_[i]))
                        .append("\"} ");
                }

                count += counter->value(labels_value);
                value_str.append(std::to_string(count));
                value_str.append("\n");
            }

            str.append(value_str);

            str.append(name_);
            str.append("_sum{");
            build_label_string_with_static(str, sum_->labels_name(),
                                           labels_value);
            str.append("} ");

            str.append(std::to_string(value));
            str.append("\n");

            str.append(name_).append("_count{");
            build_label_string_with_static(str, sum_->labels_name(),
                                           labels_value);
            str.append("} ");
            str.append(std::to_string(count));
            str.append("\n");
        }
    }
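
The accumulation problem described in this comment can be reproduced in isolation. The following is a reduced sketch, not the PR's code: the function names and the fixed `m_bucket{...} 1` line are made up for illustration, and each "label set" is just a pre-rendered string.

```cpp
#include <string>
#include <vector>

// Buggy shape: one buffer is shared by every label set and appended to the
// output after each iteration, so earlier label sets are emitted again.
inline std::string serialize_shared_buffer(
    const std::vector<std::string>& label_sets) {
    std::string out;
    std::string value_str;  // declared outside the loop and never cleared
    for (const auto& labels : label_sets) {
        value_str.append("m_bucket{").append(labels).append("} 1\n");
        out.append(value_str);
    }
    return out;
}

// Fixed shape: the buffer is scoped to a single label set.
inline std::string serialize_scoped_buffer(
    const std::vector<std::string>& label_sets) {
    std::string out;
    for (const auto& labels : label_sets) {
        std::string value_str;
        value_str.append("m_bucket{").append(labels).append("} 1\n");
        out.append(value_str);
    }
    return out;
}
```

With two label sets, the shared-buffer version emits three lines (the first set appears twice), while the scoped version emits two. One further point that may be worth checking in the original code: the final `str.clear()` clears the whole output string, which would also discard anything other metrics had already appended to `str`, assuming the caller reuses one string across metrics.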

- MC_STORE_CLUSTER_ID: Identify the metadata when multiple cluster share the same master, default 'mooncake'
- MC_STORE_MEMCPY: Enables or disables local memcpy optimization, set to 1/true to enable, 0/false to disable
- MC_STORE_NODE_IP: Used by client metrics, the node's IP address
- MC_STORE_CLUSTER_NAME: Note: the envariable identify the LLM model name (e.g., Qwen-Max, Qwen-Plus)
Contributor

medium

There's a typo here. envariable should be environment variable. Also, the description could be rephrased for better clarity.

Suggested change
- MC_STORE_CLUSTER_NAME: Note: the envariable identify the LLM model name (e.g., Qwen-Max, Qwen-Plus)
- MC_STORE_CLUSTER_NAME: This environment variable identifies the LLM model or cluster name (e.g., Qwen-Max, Qwen-Plus).

Comment on lines +85 to +97
    bool has_label_value(const std::string& value) override {
        auto map = Base::copy();
        for (auto& e : map) {
            auto& label_value = e->label;
            if (auto it =
                    std::find(label_value.begin(), label_value.end(), value);
                it != label_value.end()) {
                return true;
            }
        }

        return false;
    }
Contributor

medium

This implementation of has_label_value can be inefficient. It calls Base::copy(), which likely copies the entire underlying map of metrics, and then performs a linear scan. For a large number of metrics, this could become a performance bottleneck if called frequently. If this function is in a hot path, consider a more optimized approach, perhaps using an auxiliary data structure for faster lookups.
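
As one possible shape for the auxiliary data structure mentioned above, here is a sketch of a reference-counted index of label values. The class `label_value_index` is hypothetical and does not exist in the PR or in the metric library; it is also not synchronized, so a real metric class would need to update it under whatever lock protects its series map.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical index: counts how many live series use each label value, so
// membership checks avoid copying and scanning the whole series map.
class label_value_index {
   public:
    // Call when a series with these label values is created.
    void add_series(const std::vector<std::string>& label_values) {
        for (const auto& v : label_values) {
            ++refs_[v];
        }
    }

    // Call when a series with these label values is removed.
    void remove_series(const std::vector<std::string>& label_values) {
        for (const auto& v : label_values) {
            auto it = refs_.find(v);
            if (it != refs_.end() && --it->second == 0) {
                refs_.erase(it);
            }
        }
    }

    // Average O(1) lookup instead of a copy plus a linear scan.
    bool has_label_value(const std::string& value) const {
        return refs_.find(value) != refs_.end();
    }

   private:
    std::unordered_map<std::string, std::size_t> refs_;
};
```

The trade-off is extra bookkeeping on every series insert and removal, which only pays off if has_label_value is called often enough to matter.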

Comment on lines +300 to +302
            if (value == 0) {
                continue;
            }
Contributor

medium

The check if (value == 0) to skip serialization might be incorrect. If negative values can be observed, it's possible for the sum (value) to be zero while the count is non-zero. In such cases, the histogram should still be serialized. It would be more robust to check if the count is zero instead of the sum.
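
The case described above is easy to reproduce with a stand-alone sketch. The struct and function names below are illustrative only and are not part of the PR; the same situation also arises without negative values when every observed sample is exactly 0.

```cpp
#include <cstdint>
#include <vector>

// Illustrative totals for one label set of a histogram.
struct histogram_totals {
    double sum = 0;
    std::uint64_t count = 0;
};

// Accumulates sum and count over a list of observations.
inline histogram_totals observe_all(const std::vector<double>& samples) {
    histogram_totals totals;
    for (double s : samples) {
        totals.sum += s;
        ++totals.count;
    }
    return totals;
}

// Skip decision based on the observation count rather than the sum.
inline bool should_serialize(const histogram_totals& totals) {
    return totals.count != 0;
}
```

Observing -5 and 5 yields a sum of 0 with a count of 2, so a sum-based check would drop a series that has data, while the count-based check keeps it.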

                        .append("\"} ");
                }

                count += counter->value(labels_value);
Contributor

medium

Inside the serialize method, counter->value(labels_value) is called within a loop for every bucket. This leads to multiple map lookups for the same labels_value, which is inefficient. For better performance, consider calculating the total count for all buckets with a single pass or optimizing the lookup strategy.
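
One way to get a single lookup per label set is to store all bucket counts for a label set together, then derive the cumulative counts with a running sum. The sketch below assumes a different storage layout from the PR, which appears to keep one counter object per bucket; `bucket_store` is a hypothetical name, and the caller is assumed to pass a bucket index smaller than N.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical layout: per label set, one array holding every bucket count.
template <std::size_t N>
class bucket_store {
   public:
    using labels = std::vector<std::string>;

    // Records one observation in `bucket` (caller guarantees bucket < N).
    void observe(const labels& key, std::size_t bucket) {
        ++counts_[key][bucket];  // operator[] zero-initializes a new array
    }

    // One map lookup, then a running sum: Prometheus `le` buckets are
    // cumulative, so each entry includes all lower buckets.
    std::array<std::uint64_t, N> cumulative(const labels& key) const {
        std::array<std::uint64_t, N> out{};
        auto it = counts_.find(key);
        if (it == counts_.end()) {
            return out;
        }
        std::partial_sum(it->second.begin(), it->second.end(), out.begin());
        return out;
    }

   private:
    std::map<labels, std::array<std::uint64_t, N>> counts_;
};
```

The last element of the cumulative array is the total observation count, which is the value the `_count` line needs.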


using hybrid_histogram_5t = basic_hybrid_histogram<int64_t, 5>;
using hybrid_histogram_5d = basic_hybrid_histogram<double, 5>;
}  // namespace ylt::metric
\ No newline at end of file
Contributor

medium

This file is missing a newline at the end. It's a common convention to end files with a newline to avoid issues with some tools and scripts.

}  // namespace ylt::metric

// Static labels include but are not limited to machine address, cluster name,
// etc. These labels remain constant during the lifetime of the application
const std::string kInstanceID = get_env_or_default("MC_STORE_NODE_IP");
const std::string kClusterName = get_env_or_default("MC_STORE_CLUSTER_NAME");
Collaborator

Why not define this variable as a parameter on the master side?

Contributor Author

@cocktail828 Nov 20, 2025

The master isn’t intended to handle this; it only exports RPC-related metrics.
Node IP, cluster name, transfer-engine metrics, and the like are client-side metrics.

Contributor Author

For instance, to troubleshoot transfer latency issues, node IP information is essential for pinpointing single-point failures—otherwise, we only have global metrics or non-specific errors with no visibility into which node is at fault.
