Empty result when step is larger than 300s #705
Comments
@arkbriar I am working on upgrading promxy to support native histograms (change, #696 discussion), based on the latest Prometheus (post Nov 2024). I am facing the same issue, but with very large steps (like 3h). Did you find any resolution for this? |
@rishabhkumar92 It was the same on my side. Large steps make the issue much easier to reproduce, though I was also able to reproduce it with my tiny query with a step of 500s.
Solution, no. I found only a workaround which I posted in the description.
|
@arkbriar There were 2 key things mentioned above:
did you find any workaround for the second case? |
This is the trigger of the exact issue here: the |
@arkbriar I validated that the delta was indeed set correctly, but the problem was due to the max lookback limit. |
Sorry, I'm a bit confused. How does the jump happen when
t0 := t - b.delta
if b.valueType != chunkenc.ValNone && t0 > b.lastTime { |
I think I'm actually a bit lost on this error report. The subject seems to indicate that there are empty results with a large step (>300s), but I am able to get data with a 5000s step spread over the course of a day (equivalent upstream query: Upstream 5000s). There is also some mention of time alignment with Grafana; I played around with the start/end times but can't seem to get it to return an empty dataset either. Is it possible to create a repro case using some public datasource? If not, could we attach some trace log output (which would include the query and data)? |
Sorry for the confusion... I'm pretty sure I can reproduce it with my VM setup since I scripted the MRE. Let me have a try with the open data source. |
I found I couldn't reproduce it with the open datasource simply because it's a Prometheus, which returns results with aligned timestamps starting from the query's start, so it avoids the problem. The branch will be correctly skipped if However, in the case I was and am able to reproduce, the timestamps don't align. FYI, the following is the line from the trace log with all the values.
The query start is I think this is a VictoriaMetrics specific problem. Let me see if there's an open VM datasource. |
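To make the alignment point above concrete, here is a minimal Go sketch that checks whether the timestamps returned by a downstream fall on `start + k*step`. The function name `misaligned` is hypothetical and this is not code from promxy or Prometheus; it only illustrates the check one could run by hand against the values in a trace log.

```go
package main

// misaligned returns the timestamps that do not fall on start + k*step
// for some k >= 0. All values share one unit (seconds or milliseconds).
// It assumes step > 0.
func misaligned(start, step int64, timestamps []int64) []int64 {
	var bad []int64
	for _, ts := range timestamps {
		if ts < start || (ts-start)%step != 0 {
			bad = append(bad, ts)
		}
	}
	return bad
}
```

An empty result means every returned point sits exactly on an evaluation timestamp, which is what a Prometheus downstream is described as returning in this thread.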
Reproduced with the public VM playground.
promxy:
  server_groups:
    - static_configs:
        - targets: ["play.victoriametrics.com"]
      path_prefix: /select/0/prometheus
      scheme: https
      http_client:
        dial_timeout: 1s
        tls_config:
          insecure_skip_verify: true
Result from VM playground: Result from promxy: empty. Trace logs:
|
Thanks for providing a repro case; SUPER helpful in these situations. So some bad news and good news. Bad news: this is actually a known issue with VictoriaMetrics (where it doesn't return the correct timestamp) -- #202 Good news: there is a trivial workaround to this issue, I added a feature in promxy to pass in query parameters downstream and VM added a |
@jacksontj Is the fix applicable only to the VictoriaMetrics backend? |
The fix I linked above only works on VM AFAIK. The underlying issue is that the downstream (VM in this case) isn't actually honoring the range from the query.

To illustrate a bit: let's say the range query was from 100-200 with a step of 10. The API dictates (and Prometheus implements) returning 100, 110, 120... through 200. VM has some logic internally to cache datapoints -- but will return "incorrect" times. So instead of the above, VM might return something like 101, 111, 121... -- which is 1s "in the future". This misalignment is what causes the interaction with the lookback delta -- because the range query should have a datapoint EXACTLY at the time (no need for lookback). Note: technically this means that VM (with caching enabled) is not conformant to Prometheus.

So if you are seeing similar issues in Mimir, I would expect that they have a similar caching/alignment issue. You'd be able to see this in the trace log results from promxy -- as the data returned from the downstream wouldn't align with the range/step that the query sent. I'm not SUPER familiar with the specifics in Mimir -- but this seems to be the case based on some searching I just did (seemingly this -- https://grafana.com/docs/mimir/latest/references/architecture/components/query-frontend/#caching). So in VM there is an ability to disable this per-query (hence the no-cache query param); it's unclear if Mimir has that -- or if it requires the config flag. |
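The interaction described above can be modelled in a few lines of Go. This is a deliberately simplified sketch, not promxy's or Prometheus's actual engine code: it assumes a sample is visible at evaluation time `t` only if its timestamp lies in `(t-lookback, t]`, and the helper names `shifted` and `countHits` are hypothetical.

```go
package main

// shifted builds the timestamps a downstream would return for the range
// [start, end] if every point were offset by `offset` seconds.
func shifted(start, end, step, offset int64) []int64 {
	var out []int64
	for t := start; t <= end; t += step {
		out = append(out, t+offset)
	}
	return out
}

// countHits reports how many evaluation timestamps in [start, end] find at
// least one sample inside the lookback window (t-lookback, t].
// All values are in seconds.
func countHits(start, end, step, lookback int64, samples []int64) int {
	hits := 0
	for t := start; t <= end; t += step {
		for _, ts := range samples {
			if ts > t-lookback && ts <= t {
				hits++
				break
			}
		}
	}
	return hits
}
```

Under this model, the 100-200 range with a step of 10 and points shifted by 1s still resolves 10 of its 11 evaluation timestamps, because the previous point is well inside a 300s lookback. With a step of 1800s and the same 1s shift, every point is either in the future or more than 300s old, so no evaluation timestamp finds data.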
It seems the commit prometheus/prometheus#9946 is missing from the forked Prometheus. Therefore, there's a high chance that no data is returned when the step is larger than 300s (which is the default lookback delta).
An example: start=1736607436000, end=1736780236000, step=1800s
The overall result from the query engine was empty. I traced through the code and found that data was skipped because of this, where the delta is the default value 300000. A workaround is to align start and end to step, which Grafana does. The fact that Grafana always works while my program doesn't confused me for a long time, until I captured the HTTP request :(
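The alignment workaround can be sketched in Go as follows. It assumes that aligning means rounding both ends of the range down to a multiple of the step, and that timestamps are non-negative milliseconds; the function names `alignToStep` and `alignRange` are hypothetical and not part of promxy or Grafana.

```go
package main

// alignToStep rounds a millisecond timestamp down to the nearest multiple
// of stepMs. It assumes tsMs >= 0 and stepMs > 0.
func alignToStep(tsMs, stepMs int64) int64 {
	return tsMs - tsMs%stepMs
}

// alignRange applies the same rounding to both ends of a range query.
func alignRange(startMs, endMs, stepMs int64) (int64, int64) {
	return alignToStep(startMs, stepMs), alignToStep(endMs, stepMs)
}
```

For the example in this issue (step=1800s, i.e. 1800000 ms), `alignRange(1736607436000, 1736780236000, 1800000)` yields start=1736605800000 and end=1736778600000, both exact multiples of the step.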