-
Notifications
You must be signed in to change notification settings - Fork 433
[Store] Add support for static labels (host IP/cluster name) in client metrics #1081
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @cocktail828, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves the observability of client metrics by integrating essential static context information, specifically the host IP and cluster name. By introducing a new 'hybrid' metric mechanism, these labels are seamlessly attached to existing metrics without disrupting their core functionality. This enhancement is crucial for effective production-grade monitoring and troubleshooting, allowing operators to quickly identify the source and context of metric data. The configuration is straightforward, relying on standard environment variables, and is thoroughly documented. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a valuable feature by adding static labels like host IP and cluster name to client metrics, which will significantly improve observability and troubleshooting. The implementation adds a new hybrid_metric.h to support metrics with both static and dynamic labels, and the changes are accompanied by relevant test updates. My review identifies a critical bug in the serialization logic of the new hybrid histogram, along with several opportunities for performance improvements and minor corrections in hybrid_metric.h. I have also included a suggestion to improve the clarity of the new documentation.
| void serialize(std::string& str) override { | ||
| auto value_map = sum_->copy(); | ||
| if (value_map.empty()) { | ||
| return; | ||
| } | ||
|
|
||
| serialize_head(str); | ||
|
|
||
| std::string value_str; | ||
| auto bucket_counts = get_bucket_counts(); | ||
| for (auto& e : value_map) { | ||
| auto& labels_value = e->label; | ||
| auto& value = e->value; | ||
| if (value == 0) { | ||
| continue; | ||
| } | ||
|
|
||
| value_type count = 0; | ||
| for (size_t i = 0; i < bucket_counts.size(); i++) { | ||
| auto counter = bucket_counts[i]; | ||
| value_str.append(name_).append("_bucket{"); | ||
| if (!labels_name_.empty()) { | ||
| build_label_string_with_static(value_str, labels_name_, | ||
| labels_value); | ||
| value_str.append(","); | ||
| } | ||
|
|
||
| if (i == bucket_boundaries_.size()) { | ||
| value_str.append("le=\"").append("+Inf").append("\"} "); | ||
| } else { | ||
| value_str.append("le=\"") | ||
| .append(std::to_string(bucket_boundaries_[i])) | ||
| .append("\"} "); | ||
| } | ||
|
|
||
| count += counter->value(labels_value); | ||
| value_str.append(std::to_string(count)); | ||
| value_str.append("\n"); | ||
| } | ||
|
|
||
| str.append(value_str); | ||
|
|
||
| str.append(name_); | ||
| str.append("_sum{"); | ||
| build_label_string_with_static(str, sum_->labels_name(), | ||
| labels_value); | ||
| str.append("} "); | ||
|
|
||
| str.append(std::to_string(value)); | ||
| str.append("\n"); | ||
|
|
||
| str.append(name_).append("_count{"); | ||
| build_label_string_with_static(str, sum_->labels_name(), | ||
| labels_value); | ||
| str.append("} "); | ||
| str.append(std::to_string(count)); | ||
| str.append("\n"); | ||
| } | ||
| if (value_str.empty()) { | ||
| str.clear(); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a critical bug in the serialize method. The value_str is declared outside the main loop and never cleared, causing it to accumulate bucket data from all label sets. This results in incorrect and duplicated metric output. Additionally, the if (value_str.empty()) check at the end is logically flawed and will cause a compilation error once value_str is correctly scoped inside the loop.
The suggested change fixes this by moving value_str inside the loop and removing the problematic final check.
void serialize(std::string& str) override {
auto value_map = sum_->copy();
if (value_map.empty()) {
return;
}
serialize_head(str);
auto bucket_counts = get_bucket_counts();
for (auto& e : value_map) {
auto& labels_value = e->label;
auto& value = e->value;
if (value == 0) {
continue;
}
std::string value_str;
value_type count = 0;
for (size_t i = 0; i < bucket_counts.size(); i++) {
auto counter = bucket_counts[i];
value_str.append(name_).append("_bucket{");
if (!labels_name_.empty()) {
build_label_string_with_static(value_str, labels_name_,
labels_value);
value_str.append(",");
}
if (i == bucket_boundaries_.size()) {
value_str.append("le=\"").append("+Inf").append("\"} ");
} else {
value_str.append("le=\"")
.append(std::to_string(bucket_boundaries_[i]))
.append("\"} ");
}
count += counter->value(labels_value);
value_str.append(std::to_string(count));
value_str.append("\n");
}
str.append(value_str);
str.append(name_);
str.append("_sum{");
build_label_string_with_static(str, sum_->labels_name(),
labels_value);
str.append("} ");
str.append(std::to_string(value));
str.append("\n");
str.append(name_).append("_count{");
build_label_string_with_static(str, sum_->labels_name(),
labels_value);
str.append("} ");
str.append(std::to_string(count));
str.append("\n");
}
}
doc/en/mooncake-store.md
Outdated
| - MC_STORE_CLUSTER_ID: Identify the metadata when multiple cluster share the same master, default 'mooncake' | ||
| - MC_STORE_MEMCPY: Enables or disables local memcpy optimization, set to 1/true to enable, 0/false to disable | ||
| - MC_STORE_NODE_IP: Used by client metrics, the node's IP address | ||
| - MC_STORE_CLUSTER_NAME: Note: the envariable identify the LLM model name (e.g., Qwen-Max, Qwen-Plus) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a typo here. envariable should be environment variable. Also, the description could be rephrased for better clarity.
| - MC_STORE_CLUSTER_NAME: Note: the envariable identify the LLM model name (e.g., Qwen-Max, Qwen-Plus) | |
| - MC_STORE_CLUSTER_NAME: This environment variable identifies the LLM model or cluster name (e.g., Qwen-Max, Qwen-Plus). |
| bool has_label_value(const std::string& value) override { | ||
| auto map = Base::copy(); | ||
| for (auto& e : map) { | ||
| auto& label_value = e->label; | ||
| if (auto it = | ||
| std::find(label_value.begin(), label_value.end(), value); | ||
| it != label_value.end()) { | ||
| return true; | ||
| } | ||
| } | ||
|
|
||
| return false; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation of has_label_value can be inefficient. It calls Base::copy(), which likely copies the entire underlying map of metrics, and then performs a linear scan. For a large number of metrics, this could become a performance bottleneck if called frequently. If this function is in a hot path, consider a more optimized approach, perhaps using an auxiliary data structure for faster lookups.
| if (value == 0) { | ||
| continue; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check if (value == 0) to skip serialization might be incorrect. If negative values can be observed, it's possible for the sum (value) to be zero while the count is non-zero. In such cases, the histogram should still be serialized. It would be more robust to check if the count is zero instead of the sum.
| .append("\"} "); | ||
| } | ||
|
|
||
| count += counter->value(labels_value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inside the serialize method, counter->value(labels_value) is called within a loop for every bucket. This leads to multiple map lookups for the same labels_value, which is inefficient. For better performance, consider calculating the total count for all buckets with a single pass or optimizing the lookup strategy.
|
|
||
| using hybrid_histogram_5t = basic_hybrid_histogram<int64_t, 5>; | ||
| using hybrid_histogram_5d = basic_hybrid_histogram<double, 5>; | ||
| } // namespace ylt::metric No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Static labels include but are not limited to machine address, cluster name, | ||
| // etc. These labels remain constant during the lifetime of the application | ||
| const std::string kInstanceID = get_env_or_default("MC_STORE_NODE_IP"); | ||
| const std::string kClusterName = get_env_or_default("MC_STORE_CLUSTER_NAME"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not define this variable as parameter in the master side
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The master isn’t intended to handle this—it only exports RPC-related metrics.
Node IP, cluster name, transfer engine related metric and the like are client-side metric.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For instance, to troubleshoot transfer latency issues, node IP information is essential for pinpointing single-point failures—otherwise, we only have global metrics or non-specific errors with no visibility into which node is at fault.
Description
Current client metrics lack necessary context information. In actual troubleshooting, key details like host IP and cluster name can effectively improve problem-solving efficiency, which are essential supplements for production-grade instrumentation.
Changes in This PR