OBSDOCS-1801: Loki - Adjust Queries for both OTEL and ViaQ #94828
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| <title>Logging alerts</title> | ||
| <productname>{product-title}</productname> | ||
| <productnumber>{product-version}</productnumber> | ||
| <subtitle>Configuring logging alerts.</subtitle> | ||
| <abstract> | ||
| <para>This document provides information about configuring logging alerts. | ||
| </para> | ||
| </abstract> | ||
| <authorgroup> | ||
| <orgname>Red Hat OpenShift Documentation Team</orgname> | ||
| </authorgroup> | ||
| <xi:include href="Common_Content/Legal_Notice.xml" xmlns:xi="http://www.w3.org/2001/XInclude" /> |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,12 +1,12 @@ | ||
| // Module included in the following assemblies: | ||
| // | ||
| // * logging/logging_alerts/default-logging-alerts.adoc | ||
| // * logging_alerts/default-logging-alerts.adoc | ||
|
|
||
| :_content-type: REFERENCE | ||
| [id="logging-collector-alerts_{context}"] | ||
| = Logging collector alerts | ||
|
|
||
| In logging 5.8 and later versions, the following alerts are generated by the {clo}. You can view these alerts in the {ocp-product-title} web console. | ||
| The following alerts are generated by the {clo}. You can view these alerts in the {ocp-product-title} web console. | ||
|
Reviewer comment: CollectorHighErrorRate and CollectorVeryHighErrorRate are for 5.x only.
Contributor (Author) reply: This file is not included, so the content won't appear in the docs. |
||
|
|
||
| [cols="4", options="header"] | ||
| |=== | ||
|
|
||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -1,6 +1,6 @@ | ||||||
| // Module included in the following assemblies: | ||||||
| // | ||||||
| // * observability/logging/logging_alerts/custom-logging-alerts.adoc | ||||||
| // * logging_alerts/custom-logging-alerts.adoc | ||||||
|
|
||||||
| :_mod-docs-content-type: PROCEDURE | ||||||
| [id="logging-enabling-loki-alerts_{context}"] | ||||||
|
|
@@ -12,14 +12,14 @@ The `AlertingRule` CR contains a set of specifications and webhook validation de | |||||
| * If an `AlertingRule` CR includes an invalid `for` period, it is an invalid alerting rule. | ||||||
| * If an `AlertingRule` CR includes an invalid LogQL `expr`, it is an invalid alerting rule. | ||||||
| * If an `AlertingRule` CR includes two groups with the same name, it is an invalid alerting rule. | ||||||
| * If none of above applies, an alerting rule is considered valid. | ||||||
| * If none of the above applies, an alerting rule is considered valid. | ||||||
|
|
||||||
| [options="header"] | ||||||
| |================================================ | ||||||
| | Tenant type | Valid namespaces for `AlertingRule` CRs | ||||||
| | application | | ||||||
| | audit | `openshift-logging` | ||||||
| | infrastructure | `openshift-/\*`, `kube-/\*`, `default` | ||||||
| | infrastructure | `openshift-\*`, `kube-*`, `default` | ||||||
|
Reviewer comment (Contributor): Suggested change
Contributor (Author) reply: The "\" is an escape sequence character. This renders as:
Reviewer reply (Contributor): Ok, I'm not sure what it is escaping then, because if it's the |
||||||
| | application | All other namespaces. | ||||||
| |================================================ | ||||||
|
|
||||||
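The validation rules listed above are easier to follow against a concrete spec. The following minimal sketch is not part of this PR; the `loki.grafana.com/v1` API group and all names are assumed for illustration. It shows an `AlertingRule` CR that the validation webhook would reject because two groups share the same name:

```yaml
apiVersion: loki.grafana.com/v1    # assumed API group for the AlertingRule CRD
kind: AlertingRule
metadata:
  name: invalid-duplicate-groups   # hypothetical name, for illustration only
  namespace: app-ns
spec:
  tenantID: application
  groups:
  - name: AppErrors                # first group
    rules:
    - alert: AppHighErrorRate
      expr: |
        sum(rate({kubernetes_namespace_name="app-ns"} |= "error" [1m])) by (job) > 0.01
      for: 10s
  - name: AppErrors                # duplicate group name: the CR is rejected as invalid
    rules:
    - alert: AppVeryHighErrorRate
      expr: |
        sum(rate({kubernetes_namespace_name="app-ns"} |= "error" [1m])) by (job) > 0.05
      for: 10s
```

Renaming either group, or merging the two rules under one group, satisfies the group-name check; the `for` period and the LogQL `expr` are still validated separately.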
| .Prerequisites | ||||||
|
|
@@ -38,30 +38,30 @@ The `AlertingRule` CR contains a set of specifications and webhook validation de | |||||
| kind: AlertingRule | ||||||
| metadata: | ||||||
| name: loki-operator-alerts | ||||||
| namespace: openshift-operators-redhat <1> | ||||||
| labels: <2> | ||||||
| openshift.io/<label_name>: "true" | ||||||
| namespace: openshift-operators-redhat #<1> | ||||||
| labels: #<2> | ||||||
| openshift.io/cluster-monitoring: "true" | ||||||
| spec: | ||||||
| tenantID: "infrastructure" <3> | ||||||
| tenantID: infrastructure #<3> | ||||||
| groups: | ||||||
| - name: LokiOperatorHighReconciliationError | ||||||
| rules: | ||||||
| - alert: HighPercentageError | ||||||
| expr: | <4> | ||||||
| expr: | #<4> | ||||||
| sum(rate({kubernetes_namespace_name="openshift-operators-redhat", kubernetes_pod_name=~"loki-operator-controller-manager.*"} |= "error" [1m])) by (job) | ||||||
| / | ||||||
| sum(rate({kubernetes_namespace_name="openshift-operators-redhat", kubernetes_pod_name=~"loki-operator-controller-manager.*"}[1m])) by (job) | ||||||
| > 0.01 | ||||||
| for: 10s | ||||||
| labels: | ||||||
| severity: critical <5> | ||||||
| severity: critical #<5> | ||||||
| annotations: | ||||||
| summary: High Loki Operator Reconciliation Errors <6> | ||||||
| description: High Loki Operator Reconciliation Errors <7> | ||||||
| summary: High Loki Operator Reconciliation Errors #<6> | ||||||
| description: High Loki Operator Reconciliation Errors #<7> | ||||||
| ---- | ||||||
| <1> The namespace where this `AlertingRule` CR is created must have a label matching the LokiStack `spec.rules.namespaceSelector` definition. | ||||||
| <2> The `labels` block must match the LokiStack `spec.rules.selector` definition. | ||||||
| <3> `AlertingRule` CRs for `infrastructure` tenants are only supported in the `openshift-\*`, `kube-\*`, or `default` namespaces. | ||||||
| <3> `AlertingRule` CRs for `infrastructure` tenants are only supported in the `openshift-\*`, `kube-*`, or `default` namespaces. | ||||||
|
Reviewer comment (Contributor): Suggested change
Contributor (Author) reply: same as above |
||||||
| <4> The value for `kubernetes_namespace_name:` must match the value for `metadata.namespace`. | ||||||
| <5> The value of this mandatory field must be `critical`, `warning`, or `info`. | ||||||
| <6> This field is mandatory. | ||||||
|
|
@@ -74,23 +74,23 @@ The `AlertingRule` CR contains a set of specifications and webhook validation de | |||||
| kind: AlertingRule | ||||||
| metadata: | ||||||
| name: app-user-workload | ||||||
| namespace: app-ns <1> | ||||||
| labels: <2> | ||||||
| openshift.io/<label_name>: "true" | ||||||
| namespace: app-ns #<1> | ||||||
| labels: #<2> | ||||||
| openshift.io/cluster-monitoring: "true" | ||||||
| spec: | ||||||
| tenantID: "application" | ||||||
| tenantID: application | ||||||
| groups: | ||||||
| - name: AppUserWorkloadHighError | ||||||
| rules: | ||||||
| - alert: | ||||||
| expr: | <3> | ||||||
| sum(rate({kubernetes_namespace_name="app-ns", kubernetes_pod_name=~"podName.*"} |= "error" [1m])) by (job) | ||||||
| expr: | #<3> | ||||||
| sum(rate({kubernetes_namespace_name="app-ns", kubernetes_pod_name=~"podName.*"} |= "error" [1m])) by (job) | ||||||
| for: 10s | ||||||
| labels: | ||||||
| severity: critical <4> | ||||||
| severity: critical #<4> | ||||||
| annotations: | ||||||
| summary: <5> | ||||||
| description: <6> | ||||||
| summary: This is an example summary. #<5> | ||||||
| description: This is an example description. #<6> | ||||||
| ---- | ||||||
| <1> The namespace where this `AlertingRule` CR is created must have a label matching the LokiStack `spec.rules.namespaceSelector` definition. | ||||||
| <2> The `labels` block must match the LokiStack `spec.rules.selector` definition. | ||||||
|
|
||||||
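Callouts <1> and <2> in both examples point at the LokiStack `spec.rules` definition. For reviewers, this is a minimal sketch of what that block can look like when it matches the `openshift.io/cluster-monitoring: "true"` label used above; the LokiStack name and the choice of label are illustrative, not mandated by this module:

```yaml
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki                      # hypothetical LokiStack name
  namespace: openshift-logging
spec:
  rules:
    enabled: true
    selector:                             # must match the labels block of the AlertingRule CR
      matchLabels:
        openshift.io/cluster-monitoring: "true"
    namespaceSelector:                    # must match a label set on the AlertingRule CR's namespace
      matchLabels:
        openshift.io/cluster-monitoring: "true"
```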
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,36 +1,30 @@ | ||
| // Module included in the following assemblies: | ||
| // | ||
| // * observability/logging/logging_alerts/default-logging-alerts.adoc | ||
| // * logging_alerts/default-logging-alerts.adoc | ||
|
|
||
| :_mod-docs-content-type: REFERENCE | ||
| [id="logging-vector-collector-alerts_{context}"] | ||
| = Vector collector alerts | ||
| = {clo} alerts | ||
|
|
||
| In logging 5.7 and later versions, the following alerts are generated by the Vector collector. You can view these alerts in the {ocp-product-title} web console. | ||
| The following alerts are generated by the Vector collector. You can view these alerts in the {ocp-product-title} web console. | ||
|
|
||
| .Vector collector alerts | ||
| [cols="2,2,2,1",options="header"] | ||
| |=== | ||
| |Alert |Message |Description |Severity | ||
|
|
||
| |`CollectorHighErrorRate` | ||
| |`<value> of records have resulted in an error by vector <instance>.` | ||
| |The number of vector output errors is high, by default more than 10 in the previous 15 minutes. | ||
| |Warning | ||
|
|
||
| |`CollectorNodeDown` | ||
| |`Prometheus could not scrape vector <instance> for more than 10m.` | ||
| |Vector is reporting that Prometheus could not scrape a specific Vector instance. | ||
| |Critical | ||
|
|
||
| |`CollectorVeryHighErrorRate` | ||
| |`<value> of records have resulted in an error by vector <instance>.` | ||
| |The number of Vector component errors is very high, by default more than 25 in the previous 15 minutes. | ||
| |Critical | ||
|
|
||
| |`FluentdQueueLengthIncreasing` | ||
| |`In the last 1h, fluentd <instance> buffer queue length constantly increased more than 1. Current value is <value>.` | ||
| |Fluentd is reporting that the queue size is increasing. | ||
| |`DiskBufferUsage` | ||
| |`Collectors potentially consuming too much node disk, <value>` | ||
| |Collectors are consuming too much node disk on the host. | ||
| |Warning | ||
|
|
||
| |`CollectorHigh403ForbiddenResponseRate` | ||
|
Reviewer comment: This alert is for 6.3 only now.
Contributor (Author) reply: OK, I revised the target for this PR to 6.3. I'll create a separate PR for 6.2 and older versions. |
||
| |`High rate of "HTTP 403 Forbidden" responses detected for collector <instance> in namespace <namespace> for output <label>. The rate of 403 responses is <rate> over the last 2 minutes, persisting for more than 5 minutes. This could indicate an authorization issue.` | ||
| |At least 10% of sent requests responded with "HTTP 403 Forbidden" for collector "<instance>" in namespace <namespace> for the output "<output>". | ||
| |Critical | ||
| |=== | ||
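For context while reviewing the thresholds above: collector alerts like these are delivered as a `PrometheusRule` object rather than configured by users. The sketch below is illustrative only and is not the rule the {clo} actually ships; the resource name, group name, and `job` selector are assumptions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: collector-alerts-example     # hypothetical name, for illustration only
  namespace: openshift-logging
spec:
  groups:
  - name: collector-example-rules    # hypothetical group name
    rules:
    - alert: CollectorNodeDown
      expr: up{job="collector"} == 0   # assumed selector; the shipped rule may differ
      for: 10m                         # matches the "more than 10m" condition in the table
      labels:
        severity: critical
      annotations:
        summary: Prometheus could not scrape a collector instance for more than 10m.
```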
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,71 @@ | ||||||
| // Module included in the following assemblies: | ||||||
| // | ||||||
| // * logging_alerts/default-logging-alerts.adoc | ||||||
|
|
||||||
| :_mod-docs-content-type: REFERENCE | ||||||
| [id="loki-alerts_{context}"] | ||||||
| = {loki-op} alerts | ||||||
|
|
||||||
| The following alerts are generated by the {loki-op}. You can view these alerts in the {ocp-product-title} web console. | ||||||
|
|
||||||
| .{loki-op} alerts | ||||||
| [cols="2,2,2,1",options="header"] | ||||||
| |=== | ||||||
| |Alert |Message |Description |Severity | ||||||
|
|
||||||
| |`LokiRequestErrors` | ||||||
| |`{{ $labels.job }} {{ $labels.route }} is experiencing <value>% errors.` | ||||||
| |At least 10% of requests result in `5xx` server errors. | ||||||
| |critical | ||||||
|
|
||||||
| |`LokiStackWriteRequestErrors` | ||||||
| |`<value>% of write requests from {{ $labels.job }} in <namespace> are returned with server errors.` | ||||||
| |At least 10% of write requests to the lokistack-gateway result in `5xx` server errors. | ||||||
| |critical | ||||||
|
|
||||||
| |`LokiStackReadRequestErrors` | ||||||
| |`<value>% of query requests from {{ $labels.job }} in <namespace> are returned with server errors.` | ||||||
| |At least 10% of query requests to the lokistack-gateway result in `5xx` server errors. | ||||||
| |critical | ||||||
|
|
||||||
| |`LokiRequestPanics` | ||||||
| |`{{ $labels.job }} is experiencing an increase of <value> panics.` | ||||||
| |A panic was triggered. | ||||||
| |critical | ||||||
|
|
||||||
| |`LokiRequestLatency` | ||||||
| |`{{ $labels.job }} {{ $labels.route }} is experiencing <value>s 99th percentile latency.` | ||||||
| |The 99th percentile is experiencing latency higher than 1 second. | ||||||
| |critical | ||||||
|
|
||||||
| |`LokiTenantRateLimit` | ||||||
| |`{{ $labels.job }} {{ $labels.route }} is experiencing 429 errors.` | ||||||
| |At least 10% of requests receive the rate limit error code. | ||||||
| |warning | ||||||
|
|
||||||
| |`LokiStorageSlowWrite` | ||||||
| |`The storage path is experiencing slow write response rates.` | ||||||
| |The storage path is experiencing slow write response rates. | ||||||
| |warning | ||||||
|
|
||||||
| |`LokiWritePathHighLoad` | ||||||
| |`The write path is experiencing high load.` | ||||||
| |The write path is experiencing high load causing backpressure storage flushing. | ||||||
| |warning | ||||||
|
|
||||||
| |`LokiReadPathHighLoad` | ||||||
| |`The read path is experiencing high load.` | ||||||
| |The read path has a high volume of queries, causing longer response times. | ||||||
| |warning | ||||||
|
|
||||||
| |`LokiDiscardedSamplesWarning` | ||||||
| |`Loki in namespace "<namespace>" is discarding samples in the "<tenant>" tenant during ingestion. Samples are discarded because of "<reason>" at a rate of <value> samples per second.` | ||||||
| |Loki is discarding samples during ingestion because they fail validation. | ||||||
| |warning | ||||||
|
|
||||||
| |`LokistackSchemaUpgradesRequired` | ||||||
| |`The LokiStack "{{ $labels.stack_name }}" in namespace "<namespace>" is using a storage schema | ||||||
| configuration that does not contain the latest schema version. It is recommended to update the schema configuration to update the schema version to the latest` | ||||||
|
Reviewer comment (Contributor): Not sure about this one, but the sentence is confusing to me.
Suggested change
Contributor (Author) reply: Good catch, @mburke5678! This is the output text. I believe this should be fixed upstream first, cc @xperimental |
||||||
| |One or more of the deployed LokiStacks contains an outdated storage schema configuration. | ||||||
| |warning | ||||||
| |=== | ||||||
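The `LokistackSchemaUpgradesRequired` alert refers to the storage schema configuration of the LokiStack CR. As a reference point, a minimal sketch of the relevant block is shown below; `v13` as the newest schema version is an assumption based on upstream Loki, and the date is a placeholder that must be set to a future date when a new schema is added:

```yaml
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki              # hypothetical LokiStack name
  namespace: openshift-logging
spec:
  storage:
    schemas:
    - version: v13                # assumed newest schema version referenced by the alert
      effectiveDate: "2024-10-25" # placeholder; use a future date when adding a schema
```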