Error getting metrics from the prometheus plugin after bumping to the 3.7.1 release #14160

Open
rodolfobrunner opened this issue Jan 14, 2025 · 3 comments

@rodolfobrunner

Is there an existing issue for this?

  • I have searched the existing issues

Kong version ($ kong version)

3.7.1 / 3.9.0

Current Behavior

I am having problems with metrics and the prometheus plugin after bumping to the 3.7.1 release. (I have since bumped Kong up to 3.9.0 and the issue still persists.)

I have the following entry in my logs:
[lua] prometheus.lua:1020: log_error(): Error getting 'request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"}': nil, client: 10.145.40.1, server: kong_status, request: "GET /metrics HTTP/1.1", host: "10.145.12.54:8100"
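
For context, the error message appears to come from the metric read loop that serves /metrics. Simplified sketch (based on the upstream nginx-lua-prometheus library that Kong's prometheus.lua is derived from, not the exact Kong code):

-- Simplified: how the exporter reads the shared dict on every scrape
local keys = self.dict:get_keys(0)        -- list every key stored in the shared dict
table.sort(keys)
for _, key in ipairs(keys) do
  local value, err = self.dict:get(key)   -- read the value for that key
  if value then
    -- serialize "key value" into the response body
  else
    -- this is the log line above: get_keys() listed the key,
    -- but get() returned nil for it
    self:log_error("Error getting '", key, "': ", err)
  end
end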

Interesting facts:

  • It is always the same service and route that hit this error; in my case it is always the same two routes, for the same bucket.
  • When we revert back to 3.6.1, the problem goes away.
  • After a few months, we bumped Kong to version 3.9.0 and the problem started happening again after a couple of hours, for the same routes and buckets.
  • It goes away with a pod rotation but comes back after a while.

I already tried:

  • nginx_http_lua_shared_dict: 'prometheus_metrics 15m'; shared dict memory usage now sits at around 20% (see the snippet below for how we set it)
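
For reference, this is the kind of setting meant above; the values are just what we used and the exact mechanism depends on how Kong is deployed (in our EKS setup it is an environment variable on the Kong container):

# kong.conf
nginx_http_lua_shared_dict = prometheus_metrics 15m

# or as an environment variable
KONG_NGINX_HTTP_LUA_SHARED_DICT="prometheus_metrics 15m"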

One pod contains:

kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="50"} 1
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="80"} 1
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="100"} 1
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="250"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="400"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="700"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="1000"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="2000"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="5000"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="10000"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="30000"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="60000"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="+Inf"} 2

While another pod is missing the le="80" bucket:

kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="50"} 1
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="100"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="250"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="400"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="700"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="1000"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="2000"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="5000"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="10000"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="30000"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="60000"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="+Inf"} 3

We are running Kong on AWS EKS, upgraded from 3.6.1.

Expected Behavior

The bucket should not disappear, but if it does for any reason I would expect Kong to be able to recover from the inconsistent state (maybe by resetting the metric?).

Steps To Reproduce

No response

Anything else?

No response

@ProBrian
Contributor

@rodolfobrunner Does this issue happen while using the same deployment as #14144? I'm trying to reproduce it.

@brunomiguelsantos

@rodolfobrunner Does this issue happen while using the same deployment as #14144? I'm trying to reproduce it.

Hey @ProBrian, I am part of the same team as @rodolfobrunner. Yes, it's the same deployment.

@jmadureira

Hello @ProBrian, here is some additional info on what we're seeing. At some point we added some debug instructions to figure out what was being stored in the shared dict. Something like:

-- Adapted from the prometheus metric_data function
-- (locals added here for completeness; the original ran inside a Kong handler)
local buffer = require("string.buffer")                       -- LuaJIT string buffer
local exporter = require("kong.plugins.prometheus.exporter")  -- Kong's prometheus exporter
local table_sort = table.sort
local DATA_BUFFER_SIZE_HINT = 4096                            -- arbitrary size hint
local node_id = kong.node.get_id()                            -- assumes the Kong PDK is available

local function collect()
  ngx.header["Content-Type"] = "text/plain; charset=UTF-8"
  ngx.header["Kong-NodeId"] = node_id

  local prometheus = exporter.get_prometheus()

  local write_fn = ngx.print

  -- prometheus.dict is ngx.shared["prometheus_metrics"]; get_keys(0) returns all keys
  local keys = prometheus.dict:get_keys(0)
  local count = #keys

  table_sort(keys)

  local output = buffer.new(DATA_BUFFER_SIZE_HINT)
  local output_count = 0

  local function buffered_print(fmt, ...)
    if fmt then
      output_count = output_count + 1
      output:putf(fmt, ...)
    end

    if output_count >= 100 or not fmt then
      write_fn(output:get())  -- consume the whole buffer
      output_count = 0
    end
  end

  for i = 1, count do
    local key = keys[i]
    -- Note: indexing a shared dict directly always yields nil, which is why every
    -- value below prints as null; prometheus.dict:get(key) would return the stored value.
    local value = prometheus.dict[key]

    buffered_print("%s: %s\n", key, value)
  end

  buffered_print(nil)  -- flush whatever is left in the buffer

  output:free()
end

... which outputs (when the error occurs):

request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00050.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00100.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00250.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00400.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00700.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="01000.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="02000.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="05000.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="10000.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="30000.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="60000.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="Inf"} null

How can a dictionary support duplicate keys? Even if it's a shared dictionary?
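
For what it's worth, a quick check along these lines (a hypothetical snippet, not something that exists in the plugin) can confirm whether get_keys() itself returns the same key more than once:

-- Hypothetical duplicate-key check against the prometheus shared dict
local dict = ngx.shared["prometheus_metrics"]

local seen = {}
for _, key in ipairs(dict:get_keys(0)) do   -- 0 = return all keys
  seen[key] = (seen[key] or 0) + 1
end

for key, occurrences in pairs(seen) do
  if occurrences > 1 then
    ngx.log(ngx.WARN, "duplicate key in shared dict: ", key, " (", occurrences, "x)")
  end
end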
