Error getting metrics from the prometheus plugin after bumping to the 3.7.1 release #14160

Open
rodolfobrunner opened this issue Jan 14, 2025 · 3 comments

@rodolfobrunner

Is there an existing issue for this?

  • I have searched the existing issues

Kong version ($ kong version)

3.7.1 / 3.9.0

Current Behavior

I am having problems with metrics and the prometheus plugin after bumping to the 3.7.1 release. (I have since bumped Kong up to 3.9.0 and the issue still persists.)

I have the following entry in my logs:
[lua] prometheus.lua:1020: log_error(): Error getting 'request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"}': nil, client: 10.145.40.1, server: kong_status, request: "GET /metrics HTTP/1.1", host: "10.145.12.54:8100"
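
For context, the error message appears to come from the metric read loop that serves /metrics. Simplified sketch (based on the upstream nginx-lua-prometheus library that Kong's prometheus.lua is derived from, not the exact Kong code):

-- Simplified: how the exporter reads the shared dict on every scrape
local keys = self.dict:get_keys(0)        -- list every key stored in the shared dict
table.sort(keys)
for _, key in ipairs(keys) do
  local value, err = self.dict:get(key)   -- read the value for that key
  if value then
    -- serialize "key value" into the response body
  else
    -- this is the log line above: get_keys() listed the key,
    -- but get() returned nil for it
    self:log_error("Error getting '", key, "': ", err)
  end
end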

Interesting facts:

  • It is always the same service and route that hit this error; in my case it is always the same two routes, for the same bucket.
  • When we revert back to 3.6.1, the problem goes away.
  • After a few months, we bumped Kong to version 3.9.0 and the problem started happening again after a couple of hours, for the same routes and buckets.
  • It goes away with a pod rotation but comes back after a while.

I already tried:

  • nginx_http_lua_shared_dict: 'prometheus_metrics 15m'; shared dict memory usage now sits at around 20% (see the snippet below for how we set it)
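
For reference, this is the kind of setting meant above; the values are just what we used and the exact mechanism depends on how Kong is deployed (in our EKS setup it is an environment variable on the Kong container):

# kong.conf
nginx_http_lua_shared_dict = prometheus_metrics 15m

# or as an environment variable
KONG_NGINX_HTTP_LUA_SHARED_DICT="prometheus_metrics 15m"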

One pod contains:

kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="50"} 1
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="80"} 1
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="100"} 1
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="250"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="400"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="700"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="1000"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="2000"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="5000"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="10000"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="30000"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="60000"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="+Inf"} 2

While another pod is missing the le="80" bucket:

kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="50"} 1
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="100"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="250"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="400"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="700"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="1000"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="2000"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="5000"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="10000"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="30000"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="60000"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="+Inf"} 3

We are running Kong on AWS EKS, upgraded from 3.6.1.

Expected Behavior

The bucket should not disappear, but if it does for any reason I would expect Kong to be able to recover from the inconsistent state (maybe by resetting the metric?).

Steps To Reproduce

No response

Anything else?

No response

@ProBrian
Contributor

@rodolfobrunner Does this issue happen while using the same deployment as #14144? I'm trying to reproduce it.

@brunomiguelsantos

@rodolfobrunner Does this issue happen while using the same deployment as #14144? I'm trying to reproduce it.

Hey @ProBrian, I am part of the same team as @rodolfobrunner. Yes, it's the same deployment.

@jmadureira

Hello @ProBrian, here is some additional info on what we're seeing. At some point we added some debug instructions to figure out what was being stored in the shared dict. Something like:

-- Adapted from the prometheus metric_data function
-- (locals added here for completeness; the original ran inside a Kong handler)
local buffer = require("string.buffer")                       -- LuaJIT string buffer
local exporter = require("kong.plugins.prometheus.exporter")  -- Kong's prometheus exporter
local table_sort = table.sort
local DATA_BUFFER_SIZE_HINT = 4096                            -- arbitrary size hint
local node_id = kong.node.get_id()                            -- assumes the Kong PDK is available

local function collect()
  ngx.header["Content-Type"] = "text/plain; charset=UTF-8"
  ngx.header["Kong-NodeId"] = node_id

  local prometheus = exporter.get_prometheus()

  local write_fn = ngx.print

  -- prometheus.dict is ngx.shared["prometheus_metrics"]; get_keys(0) returns all keys
  local keys = prometheus.dict:get_keys(0)
  local count = #keys

  table_sort(keys)

  local output = buffer.new(DATA_BUFFER_SIZE_HINT)
  local output_count = 0

  local function buffered_print(fmt, ...)
    if fmt then
      output_count = output_count + 1
      output:putf(fmt, ...)
    end

    if output_count >= 100 or not fmt then
      write_fn(output:get())  -- consume the whole buffer
      output_count = 0
    end
  end

  for i = 1, count do
    local key = keys[i]
    -- Note: indexing a shared dict directly always yields nil, which is why every
    -- value below prints as null; prometheus.dict:get(key) would return the stored value.
    local value = prometheus.dict[key]

    buffered_print("%s: %s\n", key, value)
  end

  buffered_print(nil)  -- flush whatever is left in the buffer

  output:free()
end

... which outputs (when the error occurs):

request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00050.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00100.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00250.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00400.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00700.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="01000.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="02000.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="05000.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="10000.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="30000.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="60000.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="Inf"} null

How can a dictionary support duplicate keys? Even if it's a shared dictionary?
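
For what it's worth, a quick check along these lines (a hypothetical snippet, not something that exists in the plugin) can confirm whether get_keys() itself returns the same key more than once:

-- Hypothetical duplicate-key check against the prometheus shared dict
local dict = ngx.shared["prometheus_metrics"]

local seen = {}
for _, key in ipairs(dict:get_keys(0)) do   -- 0 = return all keys
  seen[key] = (seen[key] or 0) + 1
end

for key, occurrences in pairs(seen) do
  if occurrences > 1 then
    ngx.log(ngx.WARN, "duplicate key in shared dict: ", key, " (", occurrences, "x)")
  end
end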
