[metrics]: Exporter randomly detached from service-account #670

Open
b-lancaster opened this issue Mar 15, 2024 · 3 comments

@b-lancaster

Context information

  • AWS service: OpenSearch and Lambda
  • CloudWatch namespace: AWS/ES and AWS/Lambda
  • Link to metrics documentation for this service: OpenSearch and Lambda
  • AWS region of the exporter: us-east-1
  • AWS region of the service: us-east-1
  • Exporter version: 0.15.5
Exporter configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: cloudwatch-monitoring-general
  namespace: monitoring
data:
  config.yml: |
      ---
      region: us-east-1
      delay_seconds: 0
      set_timestamp: false
      use_get_metric_data: true
      metrics:
      - aws_namespace: AWS/Lambda
        aws_metric_name: Errors
        aws_dimensions: [FunctionName]
        aws_statistics: [Sum]

      - aws_namespace: AWS/Lambda
        aws_metric_name: Invocations
        aws_dimensions: [FunctionName]
        aws_statistics: [Sum]

      - aws_namespace: AWS/Lambda
        aws_metric_name: Duration
        aws_dimensions: [FunctionName]
        aws_statistics: [Average]

      - aws_namespace: AWS/Lambda
        aws_metric_name: Throttles
        aws_dimensions: [FunctionName]
        aws_statistics: [Sum]

      - aws_namespace: AWS/Lambda
        aws_metric_name: OffsetLag
        aws_dimensions: [FunctionName]
        aws_statistics: [Maximum]

      - aws_namespace: AWS/ES
        aws_metric_name: ThreadpoolIndexQueue
        aws_dimensions: [ClientId, DomainName]
        aws_extended_statistics: [p95]

      - aws_namespace: AWS/ES
        aws_metric_name: ThreadpoolWriteQueue
        aws_dimensions: [ClientId, DomainName]
        aws_extended_statistics: [p95]

      - aws_namespace: AWS/ES
        aws_metric_name: ThreadpoolSearchQueue
        aws_dimensions: [ClientId, DomainName]
        aws_extended_statistics: [p95]

      - aws_namespace: AWS/ES
        aws_metric_name: ThreadpoolIndexQueue
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Average]

      - aws_namespace: AWS/ES
        aws_metric_name: ThreadpoolWriteQueue
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Average]

      - aws_namespace: AWS/ES
        aws_metric_name: ThreadpoolSearchQueue
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Average]

      - aws_namespace: AWS/ES
        aws_metric_name: WriteLatency
        aws_dimensions: [ClientId, DomainName]
        aws_extended_statistics: [p95]

      - aws_namespace: AWS/ES
        aws_metric_name: WriteLatency
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Average]

      - aws_namespace: AWS/ES
        aws_metric_name: ReadLatency
        aws_dimensions: [ClientId, DomainName]
        aws_extended_statistics: [p95]

      - aws_namespace: AWS/ES
        aws_metric_name: SearchLatency
        aws_dimensions: [ClientId, DomainName]
        aws_extended_statistics: [p95]

      - aws_namespace: AWS/ES
        aws_metric_name: SearchLatency
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Average]

      - aws_namespace: AWS/ES
        aws_metric_name: IndexingLatency
        aws_dimensions: [ClientId, DomainName]
        aws_extended_statistics: [p95]

      - aws_namespace: AWS/ES
        aws_metric_name: IndexingLatency
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Average]

      - aws_namespace: AWS/ES
        aws_metric_name: IndexingRate
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Average]

      - aws_namespace: AWS/ES
        aws_metric_name: IndexingRate
        aws_dimensions: [ClientId, DomainName]
        aws_extended_statistics: [p95]

      - aws_namespace: AWS/ES
        aws_metric_name: SearchRate
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Average]

      - aws_namespace: AWS/ES
        aws_metric_name: SearchRate
        aws_dimensions: [ClientId, DomainName]
        aws_extended_statistics: [p95]

      - aws_namespace: AWS/ES
        aws_metric_name: 5xx
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Sum]

      - aws_namespace: AWS/ES
        aws_metric_name: 2xx
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Sum]

      - aws_namespace: AWS/ES
        aws_metric_name: 3xx
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Sum]

      - aws_namespace: AWS/ES
        aws_metric_name: 4xx
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Sum]

      - aws_namespace: AWS/ES
        aws_metric_name: ClusterStatus.red
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Maximum]

      - aws_namespace: AWS/ES
        aws_metric_name: ClusterStatus.yellow
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Maximum]

      - aws_namespace: AWS/ES
        aws_metric_name: ClusterIndexWritesBlocked
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Average]

      - aws_namespace: AWS/ES
        aws_metric_name: Nodes
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Minimum]

      - aws_namespace: AWS/ES
        aws_metric_name: AutomatedSnapshotFailure
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Maximum]

      - aws_namespace: AWS/ES
        aws_metric_name: KibanaHealthyNodes
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Minimum]

      - aws_namespace: AWS/ES
        aws_metric_name: CPUUtilization
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Average]

      - aws_namespace: AWS/ES
        aws_metric_name: FreeStorageSpace
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Minimum]

      - aws_namespace: AWS/ES
        aws_metric_name: JVMMemoryPressure
        aws_dimensions: [ClientId, DomainName]
        aws_statistics: [Maximum]
Service Account
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: cloudwatch-exporter
  name: cloudwatch-exporter
  namespace: monitoring 
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<REDACTED_ACCOUNT_NUMBER>:role/CloudWatchMetricsReadOnlyRole
IAM Role
  # Role for cloudwatch metrics exporter
  rCloudWatchMetricsReadOnlyRole:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument: !Sub
        - |
          {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "Federated": "${IamOidcProviderArn}"
                    },
                    "Action": "sts:AssumeRoleWithWebIdentity",
                    "Condition": {
                        "StringEquals": {
                            "${OidcProviderEndpoint}:sub":  "system:serviceaccount:monitoring:cloudwatch-exporter"
                        }
                    }
                }
            ]
          }
        - {
            "IamOidcProviderArn": !Ref pOidcProviderArn,
            "OidcProviderEndpoint": !Ref pIssuerHostPath
          }
      Path: /
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/CloudWatchReadOnlyAccess
Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudwatch-metrics-exporter-general
  labels:
    app.kubernetes.io/name: cloudwatch-metrics-exporter
    app.kubernetes.io/instance: cloudwatch-metrics-exporter-general
  namespace: monitoring
  annotations:
    reloader.stakater.com/auto: "true"
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: cloudwatch-metrics-exporter
      app.kubernetes.io/instance: cloudwatch-metrics-exporter-general
  template:
    metadata:
      labels:
        app.kubernetes.io/name: cloudwatch-metrics-exporter
        app.kubernetes.io/instance: cloudwatch-metrics-exporter-general
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: k8s.swacorp.com/instancegroup
                    operator: In
                    values:
                      - operations-job-nodes
                      - arm-operations-job-nodes
                  - key: kubernetes.io/arch
                    operator: In
                    values:
                      - arm64
                      - amd64
          preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                preference:
                  matchExpressions:
                    - key: kubernetes.io/arch
                      operator: In
                      values:
                      - arm64
      tolerations:
        - effect: NoExecute
          key: k8s.swacorp.com/dedicated
          operator: Equal
          value: operations-server
        - effect: NoSchedule
          key: kubernetes.io/arch
          operator: Equal
          value: arm64
      serviceAccountName: cloudwatch-exporter
      containers:
        - name: cloudwatch-metrics-exporter
          image: quay.io/prometheus/cloudwatch-exporter:v0.15.5
          ports:
            - containerPort: 9106
          resources:
            requests:
              cpu: 100m
              memory: 600Mi
          volumeMounts:
            - mountPath: /config
              name: cloudwatch-metric-general
      volumes:
        - configMap:
            name: cloudwatch-monitoring-general
          name: cloudwatch-metric-general
Exporter logs
Mar 15, 2024 4:15:46 PM io.prometheus.cloudwatch.CloudWatchCollector collect
WARNING: CloudWatch scrape failed
software.amazon.awssdk.services.cloudwatch.model.CloudWatchException: User: arn:aws:sts::<REDACTED_ACCOUNT_NUMBER>:assumed-role/<REDACTED_NODE_IAM_ROLE_NAME>/<REDACTED_INSTANCE_ID> is not authorized to perform: cloudwatch:GetMetricData because no identity-based policy allows the cloudwatch:GetMetricData action (Service: CloudWatch, Status Code: 403, Request ID: 7079164c-7404-48cf-98e2-8b13d5ccf27a)
	at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleErrorResponse(CombinedResponseHandler.java:125)
	at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleResponse(CombinedResponseHandler.java:82)
	at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handle(CombinedResponseHandler.java:60)
	at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handle(CombinedResponseHandler.java:41)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:50)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:38)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:72)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:55)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:39)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
	at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
	at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:224)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
	at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
	at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
	at software.amazon.awssdk.services.cloudwatch.DefaultCloudWatchClient.getMetricData(DefaultCloudWatchClient.java:1249)
	at io.prometheus.cloudwatch.GetMetricDataDataGetter.fetchAllDataPoints(GetMetricDataDataGetter.java:138)
	at io.prometheus.cloudwatch.GetMetricDataDataGetter.<init>(GetMetricDataDataGetter.java:185)
	at io.prometheus.cloudwatch.CloudWatchCollector.scrape(CloudWatchCollector.java:486)
	at io.prometheus.cloudwatch.CloudWatchCollector.collect(CloudWatchCollector.java:642)
	at io.prometheus.client.Collector.collect(Collector.java:45)
	at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.findNextElement(CollectorRegistry.java:204)
	at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.<init>(CollectorRegistry.java:162)
	at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.<init>(CollectorRegistry.java:190)
	at io.prometheus.client.CollectorRegistry.metricFamilySamples(CollectorRegistry.java:129)
	at io.prometheus.client.servlet.common.exporter.Exporter.doGet(Exporter.java:75)
	at io.prometheus.client.servlet.jakarta.exporter.MetricsServlet.doGet(MetricsServlet.java:52)
	at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:500)
	at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:587)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:764)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:529)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1381)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1303)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
	at org.eclipse.jetty.server.Server.handle(Server.java:563)
	at org.eclipse.jetty.server.HttpChannel$RequestDispatchable.dispatch(HttpChannel.java:1598)
	at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:753)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:501)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:287)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
	at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:421)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:390)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:277)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.run(AdaptiveExecutionStrategy.java:199)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:411)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:969)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1194)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1149)
	at java.base/java.lang.Thread.run(Unknown Source)

What do you expect to happen?

I expected the cloudwatch-exporter to use the attached service-account with the permissions necessary to retrieve metric data.

What happened instead?

What actually happened was that the cloudwatch-exporter stopped using the service account and tried to use the k8s node's IAM role. Nothing changed; we just stopped receiving the metrics in Prometheus and then found the logs.

Restarting the deployment fixed the problem, and it started using the service account again. The concern is that, had this happened in a production environment, the Prometheus alerts we've set up to monitor these metrics would not have met the threshold needed to fire.

Also, without looking at the logs, the pod appeared to be running as normal.
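
One way to notice this without reading the logs is to watch the exporter's own health metric rather than the CloudWatch metrics themselves. The sketch below assumes the exporter exposes a `cloudwatch_exporter_scrape_error` gauge (non-zero when a scrape failed) on its `/metrics` endpoint, which is worth confirming against the exporter README for the version in use. The function names are hypothetical, and the port 9106 in the usage note comes from the Deployment above.

```python
"""Detect a failed CloudWatch scrape from the exporter's /metrics output."""
import urllib.request

# Assumed name of the exporter's self-reported scrape health gauge.
METRIC = "cloudwatch_exporter_scrape_error"


def fetch_metrics(url, timeout=10):
    """Download the Prometheus text exposition from the exporter."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def scrape_error_value(metrics_text):
    """Return the value of the scrape error gauge, or None if it is absent."""
    for line in metrics_text.splitlines():
        line = line.strip()
        # Comment lines start with "#", so they never match here.
        if not line.startswith(METRIC):
            continue
        rest = line[len(METRIC):]
        if rest.startswith("{"):
            # Skip a label set such as {job="x"} before the value.
            rest = rest.rpartition("}")[2]
        elif not rest.startswith(" "):
            continue  # a different metric that merely shares the prefix
        fields = rest.split()
        if fields:
            return float(fields[0])
    return None


def scrape_unhealthy(metrics_text):
    """Treat a missing gauge or a non-zero value as an unhealthy scrape."""
    value = scrape_error_value(metrics_text)
    return value is None or value != 0.0
```

For example, `scrape_unhealthy(fetch_metrics("http://localhost:9106/metrics"))` could be run against a port-forwarded pod. The same condition could instead be written as a Prometheus alert on that gauge, so that an authorization failure like the one in the logs above fires an alert even though the pod looks healthy.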

@b-lancaster b-lancaster added the metrics-configuration How to configure specific metrics for collection label Mar 15, 2024
@b-lancaster b-lancaster changed the title [metrics]: Exporter detached from service-account and [metrics]: Exporter randomly detached from service-account Mar 15, 2024
@saarw-opti

saarw-opti commented Mar 19, 2024

Same here; I don't know why it's not using the service account, and it prefers to use the Node role.

@Ca-moes

Ca-moes commented Mar 26, 2024

Same here, on a fresh deployment using https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-cloudwatch-exporter

Restarting the Deployment does not fix it. It uses the Karpenter role.

@matthiasr matthiasr removed the metrics-configuration How to configure specific metrics for collection label Dec 13, 2024
@matthiasr
Contributor

Hmm, the exporter itself does not do anything special with the configuration you provided – as long as you don't configure AWS authentication in the exporter config, it uses the default authentication chain in the AWS SDK. The only explanation I have is that retrieving or exchanging the service account OIDC token for AWS credentials failed, and the SDK proceeded further in the chain, eventually finding the node credentials.
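
To narrow down where in the chain things go wrong, it may help to check the inputs that the web identity step depends on. As far as I know, EKS injects `AWS_ROLE_ARN` and `AWS_WEB_IDENTITY_TOKEN_FILE` into pods whose service account carries the role annotation, and the SDK reads both. The following is only a rough sketch of that check: the function name `irsa_problems` is hypothetical, and it has to run with the same environment and mounts as the exporter container, which may not be easy to arrange.

```python
"""List obvious reasons the web identity credentials step could be skipped."""
import os


def irsa_problems(env=None, exists=os.path.isfile):
    """Return human-readable problems with the IRSA-related environment.

    `env` defaults to the current process environment; `exists` is
    injectable so the check can be exercised without touching the disk.
    """
    env = os.environ if env is None else env
    problems = []

    if not env.get("AWS_ROLE_ARN"):
        problems.append("AWS_ROLE_ARN is not set")

    token_file = env.get("AWS_WEB_IDENTITY_TOKEN_FILE")
    if not token_file:
        problems.append("AWS_WEB_IDENTITY_TOKEN_FILE is not set")
    elif not exists(token_file):
        problems.append(f"web identity token file {token_file} does not exist")

    return problems


def report(problems):
    """Format the result for printing."""
    if not problems:
        return "IRSA environment looks complete"
    return "IRSA environment incomplete: " + "; ".join(problems)
```

Calling `print(report(irsa_problems()))` prints either the list of problems or a message that the environment looks complete. An empty list only means the inputs are present; the token exchange with STS can still fail afterwards, which would also make the SDK fall through to the node credentials.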

Are there any earlier logs related to this, possibly when running with debug log verbosity? Unfortunately I am very unfamiliar with and eternally confused by the Java logging ecosystem, so I'm not sure how exactly to achieve that.

Considering this both broke and recovered without changes to the exporter or configuration, I'm not sure what we can do about it tbh.
