[metrics]: Exporter randomly detached from service-account #670
Comments
Same here; I don't know why it's not using the service account, and it prefers to use the Node role.
Same here, on a fresh deployment using the chart at https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-cloudwatch-exporter. Restarting the Deployment does not fix it; it uses the node role instead.
Hmm, the exporter itself does not do anything special with the configuration you provided – as long as you don't configure AWS authentication in the exporter config, it uses the default authentication chain in the AWS SDK. The only explanation I have is that retrieving or exchanging the service account OIDC token for AWS credentials failed, and the SDK proceeded further in the chain, eventually finding the node credentials. Are there any earlier logs related to this, possibly when running with debug log verbosity?

Unfortunately I am very unfamiliar with and eternally confused by the Java logging ecosystem, so I'm not sure how exactly to achieve that.

Considering this both broke and recovered without changes to the exporter or configuration, I'm not sure what we can do about it tbh.
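To make the fallback described above concrete, here is a minimal, hypothetical Java sketch (not part of the exporter; it only uses AWS SDK for Java v2 classes and assumes the `sts` module is on the classpath) of how a chain like the default one behaves: if the web-identity (IRSA) provider fails, the chain silently moves on and can end up on the node's instance-profile credentials.

```java
import software.amazon.awssdk.auth.credentials.AwsCredentials;
import software.amazon.awssdk.auth.credentials.AwsCredentialsProviderChain;
import software.amazon.awssdk.auth.credentials.InstanceProfileCredentialsProvider;
import software.amazon.awssdk.auth.credentials.WebIdentityTokenFileCredentialsProvider;
import software.amazon.awssdk.core.exception.SdkClientException;

public class CredentialChainProbe {
    public static void main(String[] args) {
        // First, try the web-identity (IRSA) provider on its own so a failure is
        // visible instead of being swallowed by the chain.
        try {
            AwsCredentials creds =
                    WebIdentityTokenFileCredentialsProvider.create().resolveCredentials();
            // Only the access key id is printed, never the secret.
            System.out.println("IRSA credentials resolved, access key id: " + creds.accessKeyId());
        } catch (SdkClientException e) {
            System.out.println("IRSA provider failed: " + e.getMessage());
        }

        // A chain similar to the default one: if the web-identity provider throws,
        // the chain falls through to the instance profile (node role) without
        // surfacing the earlier failure to the caller.
        AwsCredentialsProviderChain chain = AwsCredentialsProviderChain.builder()
                .credentialsProviders(
                        WebIdentityTokenFileCredentialsProvider.create(),
                        InstanceProfileCredentialsProvider.create())
                .build();
        AwsCredentials effective = chain.resolveCredentials();
        System.out.println("Chain resolved access key id: " + effective.accessKeyId());
    }
}
```

If the token retrieval or STS exchange fails only intermittently, this is exactly the quiet fallback the issue describes: no hard error from the chain, just different credentials. The SDK logs which provider it settles on at DEBUG level via SLF4J, so raising the log level for `software.amazon.awssdk.auth.credentials` (with whatever logging backend the exporter image ships) should reveal which provider actually won.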
Context information
Exporter configuration
Service Account
IAM Role
Deployment
Exporter logs
What do you expect to happen?
I expected the cloudwatch-exporter to use the attached service-account with the permissions necessary to retrieve metric data.
What happened instead?
What actually happened was that the cloudwatch-exporter stopped using the service-account and tried to use the k8s node's IAM role. Nothing changed on our side; we simply stopped receiving the metrics in Prometheus and then found the logs.
Restarting the deployment fixed the problem and the exporter started using the service-account again, but if this had happened in a production environment, the Prometheus alerts we've set up to monitor these metrics wouldn't have met the threshold needed to fire.
Also, without looking at the logs, the pod appeared to be running normally.
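One way to narrow this down the next time it happens is to check, from inside the running pod, whether the IRSA wiring injected by the EKS pod identity webhook is still intact. The sketch below is a standalone, hypothetical helper, not part of the exporter; it only reads the well-known environment variables and the projected token file.

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class IrsaEnvCheck {
    public static void main(String[] args) throws Exception {
        // Injected by the EKS pod identity webhook when IRSA is configured.
        String roleArn = System.getenv("AWS_ROLE_ARN");
        String tokenFile = System.getenv("AWS_WEB_IDENTITY_TOKEN_FILE");

        System.out.println("AWS_ROLE_ARN=" + roleArn);
        System.out.println("AWS_WEB_IDENTITY_TOKEN_FILE=" + tokenFile);

        if (tokenFile != null) {
            Path path = Path.of(tokenFile);
            System.out.println("token file exists: " + Files.exists(path));
            System.out.println("token file readable: " + Files.isReadable(path));
            if (Files.isReadable(path)) {
                // The projected token is rotated periodically by the kubelet;
                // a stale or empty file would explain a failed STS exchange.
                System.out.println("token length: " + Files.readString(path).trim().length());
            }
        }
    }
}
```

If `AWS_ROLE_ARN` and the token file look healthy while the exporter is already using the node role, the failure is more likely in the STS exchange at that moment than in the pod spec.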