9 changes: 9 additions & 0 deletions doc/en/mooncake-store.md
@@ -570,6 +570,15 @@ The HTTP metadata server can be configured using the following parameters:

- **`http_metadata_server_host`** (string, default: `"0.0.0.0"`): Specifies the host address for the HTTP metadata server to bind to. Use `"0.0.0.0"` to listen on all available network interfaces, or specify a specific IP address for security purposes.

#### Environment Variables

- MC_STORE_CLUSTER_ID: Identifies the metadata when multiple clusters share the same master; defaults to 'mooncake'.
- MC_STORE_MEMCPY: Enables or disables the local memcpy optimization; set to 1/true to enable or 0/false to disable.
- MC_STORE_NODE_IP: The node's IP address, used as a label in client metrics.
- MC_STORE_CLUSTER_NAME: Identifies the LLM model or cluster name (e.g., Qwen-Max, Qwen-Plus). This is distinct from MC_STORE_CLUSTER_ID, which identifies the Mooncake cluster.
- MC_STORE_CLIENT_METRIC: Enables client metric reporting; enabled by default, set to 0/false to disable.
- MC_STORE_CLIENT_METRIC_INTERVAL: Metric reporting interval in seconds; the default of 0 collects metrics but does not report them.
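
The conventions in this list (an unset variable falls back to a default, 0/false disables a flag, the interval is a whole number of seconds) can be sketched in C++ as follows. This is a minimal illustration only: `env_or`, `flag_enabled`, and `interval_seconds` are hypothetical helper names, and the exact parsing rules Mooncake applies (case sensitivity, handling of invalid input) are assumptions rather than something this document states.

```cpp
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper: value of the environment variable, or `fallback`
// when the variable is not set.
std::string env_or(const char* name, const std::string& fallback) {
    const char* val = std::getenv(name);
    return val ? std::string(val) : fallback;
}

// Hypothetical helper: "0" and "false" (assumed case-insensitive) disable a
// flag; any other value enables it.
bool flag_enabled(std::string value) {
    for (char& c : value) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return !(value == "0" || value == "false");
}

// Hypothetical helper: parses a non-negative number of seconds. Empty or
// non-numeric input is treated as 0 (an assumption), which per the list
// above means "collect but do not report".
unsigned long interval_seconds(const std::string& value) {
    if (value.empty()) return 0;
    for (unsigned char c : value) {
        if (!std::isdigit(c)) return 0;
    }
    return std::strtoul(value.c_str(), nullptr, 10);
}
```

With these helpers, `flag_enabled(env_or("MC_STORE_CLIENT_METRIC", "1"))` would be true unless the variable is set to 0 or false, and `interval_seconds(env_or("MC_STORE_CLIENT_METRIC_INTERVAL", "0"))` would be 0 when the variable is unset.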

#### Usage Example

To start the master service with the HTTP metadata server enabled:
9 changes: 9 additions & 0 deletions doc/zh/mooncake-store.md
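@@ -570,6 +570,15 @@ The HTTP metadata server can be configured using the following parameters:

- **`http_metadata_server_host`** (string, default: `"0.0.0.0"`): Specifies the host address the HTTP metadata server binds to. Use `"0.0.0.0"` to listen on all available network interfaces, or specify a particular IP address for better security.

#### Environment Variables

- **MC_STORE_CLUSTER_ID**: Identifies the metadata when multiple clusters share one master; defaults to 'mooncake'
- **MC_STORE_MEMCPY**: Controls whether the local memcpy optimization is enabled; 1/true enables it, 0/false disables it
- **MC_STORE_NODE_IP**: The node's IP address, used by client metrics
- **MC_STORE_CLUSTER_NAME**: Note: this environment variable identifies the LLM model or cluster name, e.g. Qwen-Max, Qwen-Plus
- **MC_STORE_CLIENT_METRIC**: Enables client metric reporting; enabled by default, set to 0/false to disable
- **MC_STORE_CLIENT_METRIC_INTERVAL**: Metric reporting interval in seconds; default 0 (collect only, no reporting)

#### Usage Example

To start the master service with the HTTP metadata server enabled, run: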
43 changes: 32 additions & 11 deletions mooncake-store/include/client_metric.h
@@ -8,6 +8,7 @@
#include <ylt/metric/histogram.hpp>
#include <ylt/metric/summary.hpp>
#include "utils.h"
#include "hybrid_metric.h"

namespace mooncake {

@@ -21,23 +22,42 @@ const std::vector<double> kLatencyBucket = {
// safeguards for long tails
50000, 100000, 200000, 500000, 1000000};

static inline std::string get_env_or_default(
const char* env_var, const std::string& default_val = "") {
const char* val = getenv(env_var);
return val ? val : default_val;
}

// In production mode, more labels are needed for monitoring and troubleshooting
// Static labels include but are not limited to machine address, cluster name,
// etc. These labels remain constant during the lifetime of the application
const std::string kInstanceID = get_env_or_default("MC_STORE_NODE_IP");
const std::string kClusterName = get_env_or_default("MC_STORE_CLUSTER_NAME");
Collaborator

Why not define this variable as a parameter on the master side?

Contributor Author @cocktail828, Nov 20, 2025

The master is not intended to handle this; it only exports RPC-related metrics.
Node IP, cluster name, transfer-engine metrics, and the like are client-side metrics.

Contributor Author

For instance, when troubleshooting transfer latency issues, the node IP is essential for pinpointing single-point failures. Without it, we only have global metrics or non-specific errors, with no visibility into which node is at fault.
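
To illustrate the point about per-node visibility, here is a minimal sketch of how static labels distinguish otherwise identical series. It assumes the metrics are exported in the Prometheus text format; `render_sample` is a hypothetical helper written for this illustration, not part of ylt or of this PR, and it does not escape special characters in label values.

```cpp
#include <map>
#include <string>

// Hypothetical helper: renders one sample line in the Prometheus text
// format, e.g. name{key="value",...} 42. Label values are assumed to need
// no escaping.
std::string render_sample(const std::string& name,
                          const std::map<std::string, std::string>& labels,
                          unsigned long long value) {
    std::string out = name;
    if (!labels.empty()) {
        out += "{";
        bool first = true;
        for (const auto& kv : labels) {
            if (!first) out += ",";
            out += kv.first + "=\"" + kv.second + "\"";
            first = false;
        }
        out += "}";
    }
    out += " " + std::to_string(value);
    return out;
}
```

Called with the name `mooncake_transfer_read_bytes`, the labels `instance_id=10.0.0.1` and `cluster_name=Qwen-Max`, and the value 4096, this produces `mooncake_transfer_read_bytes{cluster_name="Qwen-Max",instance_id="10.0.0.1"} 4096`. A second node reporting under a different `instance_id` yields a separate series, which is what makes a single slow node identifiable.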

const std::map<std::string, std::string> static_labels = {
// instance id is node ip
{"instance_id", kInstanceID},
// NOTE: this is not cluster_id (which identifies the Mooncake cluster), but
// the LLM cluster name, e.g. Qwen-Max, Qwen-Plus
{"cluster_name", kClusterName},
};

 struct TransferMetric {
     ylt::metric::counter_t total_read_bytes{"mooncake_transfer_read_bytes",
-                                            "Total bytes read"};
-    ylt::metric::counter_t total_write_bytes{"mooncake_transfer_write_bytes",
-                                             "Total bytes written"};
+                                            "Total bytes read", static_labels};
+    ylt::metric::counter_t total_write_bytes{
+        "mooncake_transfer_write_bytes", "Total bytes written", static_labels};
     ylt::metric::histogram_t batch_put_latency_us{
         "mooncake_transfer_batch_put_latency",
-        "Batch Put transfer latency (us)", kLatencyBucket};
+        "Batch Put transfer latency (us)", kLatencyBucket, static_labels};
     ylt::metric::histogram_t batch_get_latency_us{
         "mooncake_transfer_batch_get_latency",
-        "Batch Get transfer latency (us)", kLatencyBucket};
+        "Batch Get transfer latency (us)", kLatencyBucket, static_labels};
     ylt::metric::histogram_t get_latency_us{"mooncake_transfer_get_latency",
                                             "Get transfer latency (us)",
-                                            kLatencyBucket};
+                                            kLatencyBucket, static_labels};
     ylt::metric::histogram_t put_latency_us{"mooncake_transfer_put_latency",
                                             "Put transfer latency (us)",
-                                            kLatencyBucket};
+                                            kLatencyBucket, static_labels};

     void serialize(std::string& str) {
         total_read_bytes.serialize(str);
@@ -137,13 +157,14 @@ struct MasterClientMetric {

     MasterClientMetric()
         : rpc_count("mooncake_client_rpc_count",
-                    "Total number of RPC calls made by the client", rpc_names),
+                    "Total number of RPC calls made by the client",
+                    static_labels, rpc_names),
           rpc_latency("mooncake_client_rpc_latency",
                       "Latency of RPC calls made by the client (in us)",
-                      kLatencyBucket, rpc_names) {}
+                      kLatencyBucket, static_labels, rpc_names) {}

-    ylt::metric::dynamic_counter_1t rpc_count;
-    ylt::metric::dynamic_histogram_1t rpc_latency;
+    ylt::metric::hybrid_counter_1t rpc_count;
+    ylt::metric::hybrid_histogram_1t rpc_latency;
     void serialize(std::string& str) {
         rpc_count.serialize(str);
         rpc_latency.serialize(str);
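
The switch from `dynamic_*` to `hybrid_*` types suggests a metric that carries both the fixed `static_labels` and one label that varies per RPC. The contents of `hybrid_metric.h` are not shown in this diff, so the following is only a conceptual sketch of that idea in plain C++. `LabeledCounter` and the label name `rpc_name` are hypothetical and are not the API of `hybrid_counter_1t`.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical illustration: a counter with a fixed set of static labels
// plus one dynamic label whose value is supplied on each increment.
class LabeledCounter {
public:
    LabeledCounter(std::string name,
                   std::map<std::string, std::string> static_labels,
                   std::string dynamic_label)
        : name_(std::move(name)),
          static_labels_(std::move(static_labels)),
          dynamic_label_(std::move(dynamic_label)) {}

    // Adds n to the series identified by the dynamic label value.
    void inc(const std::string& label_value, std::uint64_t n = 1) {
        counts_[label_value] += n;
    }

    // Emits one line per dynamic label value; every line repeats the static
    // labels. Label values are assumed to need no escaping.
    std::string serialize() const {
        std::string out;
        for (const auto& entry : counts_) {
            out += name_ + "{";
            for (const auto& kv : static_labels_) {
                out += kv.first + "=\"" + kv.second + "\",";
            }
            out += dynamic_label_ + "=\"" + entry.first + "\"} " +
                   std::to_string(entry.second) + "\n";
        }
        return out;
    }

private:
    std::string name_;
    std::map<std::string, std::string> static_labels_;
    std::string dynamic_label_;
    std::map<std::string, std::uint64_t> counts_;
};
```

In this sketch the static labels are stored once at construction, so call sites pass only the per-call label value, and every exported series still identifies the node it came from.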