[draft] add warmup and startup metrics #69
Conversation
👋 Hi! Thank you for contributing to vLLM support on Spyre.
Now you are good to go 🚀
Export the startup variable configuration and warmup metrics to the /metrics endpoint. The gauges are prefixed with vllm:plugin:spyre in hopes of avoiding namespace conflicts with other users.
For vllm-project#58
TODO: also port to the V1 plugin.
Signed-off-by: Paul Murphy <[email protected]>
The GitHub Actions don't seem to entirely work on ppc64le:
@tdoublep FYI. First stab at pulling startup metrics.
CI usage leads to recycling gauges. Attempt to clean them up before installing them.
```python
prometheus_client.REGISTRY.unregister(collector)
labels = {}
for k in envs_spyre.environment_variables.keys():
    labels[k] = str(envs_spyre.environment_variables[k]())
```
Interesting, I think this is kinda cool to dump out all the environment variables as labels, but I think this will be pretty hard to interpret since the values that we're interested in are in the labels and not the gauge values. If I understand this correctly, we're just gonna have a bunch of gauges all set to the value 1, with a lot of labels attached to them saying what the values for these environment variables are, and trying to visualize that info with a tool like Instana or Grafana sounds like it would be pretty clunky at best and confusing at worst. For instance, if I tried to filter on one of these config values, I might expect to see all the runtime metrics filtered by pods with that configuration, but instead I would just get this one gauge.
I'm not sure that prometheus is the best tool to use to report static deployment configuration like this. Shouldn't the logs already report configuration, and the user's CD solution should have their deployment configs checked in as well?
This is the only label gauge, so far. I should be consistent with other usage within vllm which exposes configuration options. It's much less fickle than parsing logs, which may not be available to testing clients.
The desired usage here isn't to fill out a dashboard, but to provide performance testing a means to query the data in a structured way. Parsing the contents of `curl localhost:8000/metrics` is preferable to finding and reconstructing logs.
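As a rough sketch of that structured consumption (the port and the metric prefix are taken from this PR; nothing below is an existing client), a perf-test script could scrape and filter the prefixed gauges with `prometheus_client`'s parser:

```python
# Hedged sketch: scrape /metrics and pull out the spyre-prefixed samples.
import urllib.request

from prometheus_client.parser import text_string_to_metric_families

text = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        if sample.name.startswith("vllm:plugin:spyre"):
            # e.g. warmup timings and config values keyed by their labels
            print(sample.name, sample.labels, sample.value)
```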
Ah, gotcha. I've never seen config exposed this way for automated perf tests before, but that does sound easier than trying to inject new rest endpoints into the running app to expose the config.
Since this is designed specifically for use during performance testing, it would be really great to have that documented in a comment here. It'd also be great to have the ability to turn this off when we're not running performance tests, so that we can reduce the load on prometheus and avoid confusing users with any of these labels
```python
# Capture a oneshot metric (e.g. warmup)
def _spyre_metric_oneshot(name, labels, value):
    name = "vllm:plugin:spyre:" + name
    # Cleanup existing, for CI.
```
Why is this cleanup required?
The prometheus client is independent of the vllm objects being created and destroyed while inside the same python vm. The original PR didn't clean up, and the result was many failures due to already registered gauges. vllm handles this in a similar way.
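A minimal sketch of the cleanup pattern being described (it leans on `REGISTRY._names_to_collectors`, the same private mapping used in this diff, so treat it as best-effort rather than a stable API):

```python
import prometheus_client

def _unregister_if_present(name: str) -> None:
    # A collector with this name may survive from a previous vLLM instance
    # inside the same Python VM (e.g. across CI test cases); drop it so the
    # gauge can be registered again.
    collector = prometheus_client.REGISTRY._names_to_collectors.get(name)
    if collector is not None:
        prometheus_client.REGISTRY.unregister(collector)
```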
```python
prometheus_client.REGISTRY.unregister(collector)
gauge = prometheus_client.Gauge(name=name,
                                documentation="Oneshot spyre metric",
                                multiprocess_mode='mostrecent',
```
v0 uses multiprocess mode to collect metrics from processes that aren't the main server process, but my last understanding of the prometheus integration in v1 was that it didn't use multiprocessing. I thought it instead collected all the stats to be reported as part of the `EngineCoreOutputs` class and then reported those from the frontend process.
If that's still true, then we're gonna have to do something a lot different to capture any extra metrics for v1.
Using the default multiprocess mode adds a pid label to each gauge. Instead, rank is used to differentiate each sampling. It can be removed without harm if we don't mind the extra noise.
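For reference, a sketch of the gauge shape being discussed; the metric name is illustrative, and `multiprocess_mode='mostrecent'` assumes a prometheus_client release that accepts it (the mode only has an effect when Prometheus multiprocess collection is actually active):

```python
import prometheus_client

# 'mostrecent' keeps the last written value per label set instead of the
# default per-process aggregation, and a 'rank' label replaces the pid label.
gauge = prometheus_client.Gauge(
    name="vllm:plugin:spyre:warmup_total_seconds",  # illustrative name
    documentation="Oneshot spyre metric",
    labelnames=["rank"],
    multiprocess_mode="mostrecent",
)
gauge.labels(rank="0").set(12.3)
```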
Right, what I'm saying is that I know this works for v0 because multiprocessing mode is enabled, but I think that trying this same strategy for v1 might just fail because vllm doesn't set up prometheus multiprocessing at all. So your `TODO: also port to V1 plugin` might require a very different design.
Looking into the V1 changes, I don't think we can use prometheus unless the multiprocess support is turned back on. I am not sure if or when that would happen given it is turned off for performance reasons.
I am not sure what other options beyond finding and scraping logs exist when using V1. Any hints?
Dang, yeah that's what I suspected would be the case. I also doubt the multiprocess support will be turned back on, though we could ask.
We could write out the info to files? Then you could mount a shared remote storage when you deploy and run, and collect the data you need from there.
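A very rough sketch of that file-based option (the directory, filename scheme, and JSON layout are all made up here, not an agreed design):

```python
import json
import os
import time

def write_oneshot_metric(name: str, labels: dict, value: float,
                         out_dir: str = "/mnt/spyre-metrics") -> None:
    # Each process writes its own file to shared storage; a perf-test job
    # mounted on the same volume can collect the records afterwards.
    os.makedirs(out_dir, exist_ok=True)
    record = {"name": name, "labels": labels, "value": value,
              "timestamp": time.time()}
    path = os.path.join(out_dir, f"{os.getpid()}-{name.replace(':', '_')}.json")
    with open(path, "w") as f:
        json.dump(record, f)
```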
```python
collector = prometheus_client.REGISTRY._names_to_collectors[name]
prometheus_client.REGISTRY.unregister(collector)
gauge = prometheus_client.Gauge(name=name,
                                documentation="Oneshot spyre metric",
```
Documentation about the specific metric would be great for users here
"prompt_len": str(prompt_len), | ||
"num_decode_tokens": str(num_decode_tokens), | ||
"batch_size": str(batch_size) | ||
}, time) |
Instead of emitting a uniquely labeled gauge for each warmup pass, it might be sufficient to use a histogram metric here without all the labels. The system logs will still show the time spent in each warmup pass, so if the histogram reveals some warmup passes are too slow we'll still be able to identify which pass it was.
Having lots of labels is probably fine for smaller deployments, but this adds `3 * len(warmup_shapes)` new timeseries for each replica, instead of just one, which could cause problems with prometheus at scale.
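Something like the following is what that suggestion amounts to (bucket boundaries and the metric name are guesses, not code from this PR):

```python
import prometheus_client

# One histogram per replica instead of 3 * len(warmup_shapes) labeled gauges.
warmup_histogram = prometheus_client.Histogram(
    name="vllm:plugin:spyre:warmup_pass_seconds",
    documentation="Duration of individual Spyre warmup passes",
    buckets=(1, 5, 15, 30, 60, 120, 300, 600, float("inf")),
)

# Each warmup pass records only its duration; which shape was slow can still
# be recovered from the existing warmup logs.
warmup_histogram.observe(42.0)
```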
How many shapes do you think will be used at one time? Similarly, specific timing needs to be maintained to line up with performance logging.
What are the thoughts on adding another envvar to enable metrics, e.g. `VLLM_SPYRE_WARMUP_ENABLE_METRICS` or similar?
> How many shapes do you think will be used at one time?

I don't know, but once this is all actually ready for production I imagine there could be quite a lot of shapes for some models. If these are all meant just to enable performance testing then I'm in favor of putting all of them behind one single flag, maybe something like `VLLM_SPYRE_PERF_TESTING_METRICS` to be specific about it.
For regular deployment monitoring, I'd expect that all deployments would use cached graphs so this warmup will always be fast, and maybe we'd only need the single total warmup time metric to potentially alert on any invalid caches.
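If the single-flag route is taken, the gating could be as simple as the sketch below (the variable name comes from this thread and is not settled; the helper is hypothetical):

```python
import os

def _perf_metrics_enabled() -> bool:
    # Proposed opt-in flag for performance-testing metrics; off by default.
    return os.environ.get("VLLM_SPYRE_PERF_TESTING_METRICS", "0") == "1"

# At registration time, e.g. at the top of _spyre_metric_oneshot():
#     if not _perf_metrics_enabled():
#         return
```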
Back to the drawing board...