Merged
18 changes: 9 additions & 9 deletions _topic_maps/_topic_map.yml
Original file line number Diff line number Diff line change
@@ -52,7 +52,6 @@ Topics:
# File: logging-affinity-and-anti-afinity
#---

#---
#Name: Configuring your Logging deployment
#Dir: config
#Distros: openshift-logging
@@ -103,14 +102,15 @@ Topics:
# File: cluster-logging-dashboards
#- Name: Log visualization with Kibana
# File: logging-kibana
#---
#Name: Logging alerts
#Dir: logging_alerts
#Topics:
#- Name: Default logging alerts
# File: default-logging-alerts
#- Name: Custom logging alerts
# File: custom-logging-alerts
---
Name: Logging alerts
Dir: logging_alerts
Distros: openshift-logging
Topics:
- Name: Default logging alerts
  File: default-logging-alerts
- Name: Custom logging alerts
  File: custom-logging-alerts
#---
#Name: Performance and reliability tuning
#Dir: performance_reliability
6 changes: 3 additions & 3 deletions logging_alerts/custom-logging-alerts.adoc
@@ -6,11 +6,11 @@ include::_attributes/common-attributes.adoc[]

toc::[]

In logging 5.7 and later versions, users can configure the LokiStack deployment to produce customized alerts and recorded metrics. If you want to use customized link:https://grafana.com/docs/loki/latest/alert/[alerting and recording rules], you must enable the LokiStack ruler component.
You can configure the LokiStack deployment to produce customized alerts and recorded metrics. If you want to use customized link:https://grafana.com/docs/loki/latest/alert/[alerting and recording rules], you must enable the LokiStack ruler component.

LokiStack log-based alerts and recorded metrics are triggered by providing link:https://grafana.com/docs/loki/latest/query/[LogQL] expressions to the ruler component. The {loki-op} manages a ruler that is optimized for the selected LokiStack size, which can be `1x.extra-small`, `1x.small`, or `1x.medium`.
LokiStack log-based alerts and recorded metrics are triggered by providing link:https://grafana.com/docs/loki/latest/query/[LogQL] (Grafana documentation) expressions to the ruler component.

To provide these expressions, you must create an `AlertingRule` custom resource (CR) containing Prometheus-compatible link:https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/[alerting rules], or a `RecordingRule` CR containing Prometheus-compatible link:https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/[recording rules].
To provide these expressions, you must create an `AlertingRule` custom resource (CR) containing link:https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/[alerting rules], or a `RecordingRule` CR containing Prometheus-compatible link:https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/[recording rules] (Prometheus documentation).

Administrators can configure log-based alerts or recorded metrics for `application`, `audit`, or `infrastructure` tenants. Users without administrator permissions can configure log-based alerts or recorded metrics for `application` tenants of the applications that they have access to.
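
For illustration, a `RecordingRule` CR might look like the following minimal sketch. The resource name, the `app-ns` namespace, the `example.com/logging-rules` label, and the recorded metric name are hypothetical placeholders rather than values taken from this documentation. The label must match whatever the `LokiStack` `spec.rules.selector` defines in your cluster.

[source,yaml]
----
apiVersion: loki.grafana.com/v1
kind: RecordingRule
metadata:
  name: app-error-rate                  # hypothetical name
  namespace: app-ns                     # hypothetical application namespace
  labels:
    example.com/logging-rules: "true"   # hypothetical label; must match spec.rules.selector
spec:
  tenantID: application
  groups:
    - name: AppErrorRate
      interval: 1m
      rules:
        - record: app_ns:error_lines:rate1m   # hypothetical recorded metric name
          expr: |
            sum(rate({kubernetes_namespace_name="app-ns"} |= "error" [1m]))
----

A recording rule of this shape precomputes the LogQL expression at each interval and stores the result as a metric, which is usually cheaper than re-running the same query from every dashboard or alert.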

8 changes: 6 additions & 2 deletions logging_alerts/default-logging-alerts.adoc
@@ -11,11 +11,15 @@ Logging alerts are installed as part of the {clo} installation. Alerts depend on
Default logging alerts are sent to the {ocp-product-title} monitoring stack Alertmanager in the `openshift-monitoring` namespace, unless you have disabled the local Alertmanager instance.

// TODO MONITORING REMOVE DEPENDENCY
include::modules/monitoring-accessing-the-alerting-ui.adoc[leveloffset=+1]
include::modules/logging-collector-alerts.adoc[leveloffset=+1]
include::modules/monitoring-accessing-the-alerting-ui.adoc[leveloffset=+1,tag=ADM]
//include::modules/logging-collector-alerts.adoc[leveloffset=+1]
include::modules/logging-vector-collector-alerts.adoc[leveloffset=+1]
include::modules/loki-alerts.adoc[leveloffset=+1]

////
include::modules/logging-fluentd-collector-alerts.adoc[leveloffset=+1]
include::modules/cluster-logging-elasticsearch-rules.adoc[leveloffset=+1]
////

[role="_additional-resources"]
[id="additional-resources_default-logging-alerts"]
12 changes: 12 additions & 0 deletions logging_alerts/docinfo.xml
@@ -0,0 +1,12 @@
<title>Logging alerts</title>
<productname>{product-title}</productname>
<productnumber>{product-version}</productnumber>
<subtitle>Configuring logging alerts.</subtitle>
<abstract>
<para>This document provides information about configuring logging alerts.
</para>
</abstract>
<authorgroup>
<orgname>Red Hat OpenShift Documentation Team</orgname>
</authorgroup>
<xi:include href="Common_Content/Legal_Notice.xml" xmlns:xi="http://www.w3.org/2001/XInclude" />
20 changes: 11 additions & 9 deletions modules/configuring-logging-loki-ruler.adoc
@@ -1,12 +1,12 @@
// Module included in the following assemblies:
//
// * observability/logging/logging_alerts/custom-logging-alerts.adoc
// * logging_alerts/custom-logging-alerts.adoc

:_mod-docs-content-type: PROCEDURE
[id="configuring-logging-loki-ruler_{context}"]
= Configuring the ruler

When the LokiStack ruler component is enabled, users can define a group of link:https://grafana.com/docs/loki/latest/query/[LogQL] expressions that trigger logging alerts or recorded metrics.
When the `LokiStack` ruler component is enabled, users can define a group of link:https://grafana.com/docs/loki/latest/query/[LogQL] (Grafana documentation) expressions that trigger logging alerts or recorded metrics.

Administrators can enable the ruler by modifying the `LokiStack` custom resource (CR).

@@ -18,7 +18,7 @@ Administrators can enable the ruler by modifying the `LokiStack` custom resource

.Procedure

* Enable the ruler by ensuring that the `LokiStack` CR contains the following spec configuration:
* Enable the ruler by ensuring that the `LokiStack` CR has the following spec configuration:
+
[source,yaml]
----
@@ -30,14 +30,16 @@ metadata:
 spec:
 # ...
   rules:
-    enabled: true <1>
-    selector:
+    enabled: true #<1>
+    selector: #<2>
       matchLabels:
-        openshift.io/<label_name>: "true" <2>
-    namespaceSelector:
+        <label_name>: "true" #<3>
+    namespaceSelector: #<4>
       matchLabels:
-        openshift.io/<label_name>: "true" <3>
+        <label_name>: "true" #<5>
----
<1> Enable Loki alerting and recording rules in your cluster.
<2> Add a custom label that can be added to namespaces where you want to enable the use of logging alerts and metrics.
<2> Specify the selector for the alerting and recording resources.
<3> Add a custom label that can be added to namespaces where you want to enable the use of logging alerts and metrics.
<4> Specify the namespaces in which the alerting and recording rules are defined for the {loki-op}. If undefined, only the rules defined in the same namespace as the `LokiStack` are used.
<5> Add a custom label that can be added to namespaces where you want to enable the use of logging alerts and metrics.
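
As a sketch of how the `namespaceSelector` above is satisfied, a namespace that is allowed to hold alerting and recording rules carries the label named in `namespaceSelector.matchLabels`. The namespace name `app-ns` and the label key `example.com/logging-alerts` are hypothetical stand-ins for `<label_name>`.

[source,yaml]
----
apiVersion: v1
kind: Namespace
metadata:
  name: app-ns                           # hypothetical namespace
  labels:
    example.com/logging-alerts: "true"   # hypothetical label; must match spec.rules.namespaceSelector.matchLabels
----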
4 changes: 2 additions & 2 deletions modules/logging-collector-alerts.adoc
@@ -1,12 +1,12 @@
// Module included in the following assemblies:
//
// * logging/logging_alerts/default-logging-alerts.adoc
// * logging_alerts/default-logging-alerts.adoc

:_content-type: REFERENCE
[id="logging-collector-alerts_{context}"]
= Logging collector alerts

In logging 5.8 and later versions, the following alerts are generated by the {clo}. You can view these alerts in the {ocp-product-title} web console.
The following alerts are generated by the {clo}. You can view these alerts in the {ocp-product-title} web console.

`CollectorHighErrorRate` and `CollectorVeryHighErrorRate` are for 5.x only.

Contributor Author

This file is not included so the content won't appear in the docs.


[cols="4", options="header"]
|===
44 changes: 22 additions & 22 deletions modules/logging-enabling-loki-alerts.adoc
@@ -1,6 +1,6 @@
// Module included in the following assemblies:
//
// * observability/logging/logging_alerts/custom-logging-alerts.adoc
// * logging_alerts/custom-logging-alerts.adoc

:_mod-docs-content-type: PROCEDURE
[id="logging-enabling-loki-alerts_{context}"]
@@ -12,14 +12,14 @@ The `AlertingRule` CR contains a set of specifications and webhook validation de
* If an `AlertingRule` CR includes an invalid `for` period, it is an invalid alerting rule.
* If an `AlertingRule` CR includes an invalid LogQL `expr`, it is an invalid alerting rule.
* If an `AlertingRule` CR includes two groups with the same name, it is an invalid alerting rule.
* If none of above applies, an alerting rule is considered valid.
* If none of the above applies, an alerting rule is considered valid.

[options="header"]
|================================================
| Tenant type | Valid namespaces for `AlertingRule` CRs
| application |
| audit | `openshift-logging`
| infrastructure | `openshift-/\*`, `kube-/\*`, `default`
| infrastructure | `openshift-\*`, `kube-*`, `default`
Contributor

Suggested change
| infrastructure | `openshift-\*`, `kube-*`, `default`
| infrastructure | `openshift-*`, `kube-*`, `default`

Contributor Author

@theashiot theashiot Jul 8, 2025

The "\" is an escape sequence character. This renders as:

openshift-*, kube-*, default

Contributor

Ok, I'm not sure what it is escaping then, because if it's the * we should need one for the second occurrence as well. I'm also not sure if the current preview is up-to-date with the state of the PR.

| application | All other namespaces.
|================================================

.Prerequisites
@@ -38,30 +38,30 @@ The `AlertingRule` CR contains a set of specifications and webhook validation de
 kind: AlertingRule
 metadata:
   name: loki-operator-alerts
-  namespace: openshift-operators-redhat <1>
-  labels: <2>
-    openshift.io/<label_name>: "true"
+  namespace: openshift-operators-redhat #<1>
+  labels: #<2>
+    openshift.io/cluster-monitoring: "true"
 spec:
-  tenantID: "infrastructure" <3>
+  tenantID: infrastructure #<3>
   groups:
     - name: LokiOperatorHighReconciliationError
       rules:
         - alert: HighPercentageError
-          expr: | <4>
+          expr: | #<4>
             sum(rate({kubernetes_namespace_name="openshift-operators-redhat", kubernetes_pod_name=~"loki-operator-controller-manager.*"} |= "error" [1m])) by (job)
               /
             sum(rate({kubernetes_namespace_name="openshift-operators-redhat", kubernetes_pod_name=~"loki-operator-controller-manager.*"}[1m])) by (job)
               > 0.01
           for: 10s
           labels:
-            severity: critical <5>
+            severity: critical #<5>
           annotations:
-            summary: High Loki Operator Reconciliation Errors <6>
-            description: High Loki Operator Reconciliation Errors <7>
+            summary: High Loki Operator Reconciliation Errors #<6>
+            description: High Loki Operator Reconciliation Errors #<7>
----
<1> The namespace where this `AlertingRule` CR is created must have a label matching the LokiStack `spec.rules.namespaceSelector` definition.
<2> The `labels` block must match the LokiStack `spec.rules.selector` definition.
<3> `AlertingRule` CRs for `infrastructure` tenants are only supported in the `openshift-\*`, `kube-\*`, or `default` namespaces.
<3> `AlertingRule` CRs for `infrastructure` tenants are only supported in the `openshift-\*`, `kube-*`, or `default` namespaces.
Contributor

Suggested change
<3> `AlertingRule` CRs for `infrastructure` tenants are only supported in the `openshift-\*`, `kube-*`, or `default` namespaces.
<3> `AlertingRule` CRs for `infrastructure` tenants are only supported in the `openshift-*`, `kube-*`, or `default` namespaces.

Contributor Author

same as above

<4> The value for `kubernetes_namespace_name:` must match the value for `metadata.namespace`.
<5> The value of this mandatory field must be `critical`, `warning`, or `info`.
<6> This field is mandatory.
@@ -74,23 +74,23 @@ The `AlertingRule` CR contains a set of specifications and webhook validation de
 kind: AlertingRule
 metadata:
   name: app-user-workload
-  namespace: app-ns <1>
-  labels: <2>
-    openshift.io/<label_name>: "true"
+  namespace: app-ns #<1>
+  labels: #<2>
+    openshift.io/cluster-monitoring: "true"
 spec:
-  tenantID: "application"
+  tenantID: application
   groups:
     - name: AppUserWorkloadHighError
       rules:
         - alert:
-          expr: | <3>
-            sum(rate({kubernetes_namespace_name="app-ns", kubernetes_pod_name=~"podName.*"} |= "error" [1m])) by (job)
+          expr: | #<3>
+            sum(rate({kubernetes_namespace_name="app-ns", kubernetes_pod_name=~"podName.*"} |= "error" [1m])) by (job)
           for: 10s
           labels:
-            severity: critical <4>
+            severity: critical #<4>
           annotations:
-            summary: <5>
-            description: <6>
+            summary: This is an example summary. #<5>
+            description: This is an example description. #<6>
----
<1> The namespace where this `AlertingRule` CR is created must have a label matching the LokiStack `spec.rules.namespaceSelector` definition.
<2> The `labels` block must match the LokiStack `spec.rules.selector` definition.
26 changes: 10 additions & 16 deletions modules/logging-vector-collector-alerts.adoc
@@ -1,36 +1,30 @@
// Module included in the following assemblies:
//
// * observability/logging/logging_alerts/default-logging-alerts.adoc
// * logging_alerts/default-logging-alerts.adoc

:_mod-docs-content-type: REFERENCE
[id="logging-vector-collector-alerts_{context}"]
= Vector collector alerts
= {clo} alerts

In logging 5.7 and later versions, the following alerts are generated by the Vector collector. You can view these alerts in the {ocp-product-title} web console.
The following alerts are generated by the Vector collector. You can view these alerts in the {ocp-product-title} web console.

.Vector collector alerts
[cols="2,2,2,1",options="header"]
|===
|Alert |Message |Description |Severity

|`CollectorHighErrorRate`
|`<value> of records have resulted in an error by vector <instance>.`
|The number of vector output errors is high, by default more than 10 in the previous 15 minutes.
|Warning

|`CollectorNodeDown`
|`Prometheus could not scrape vector <instance> for more than 10m.`
|Vector is reporting that Prometheus could not scrape a specific Vector instance.
|Critical

|`CollectorVeryHighErrorRate`
|`<value> of records have resulted in an error by vector <instance>.`
|The number of Vector component errors are very high, by default more than 25 in the previous 15 minutes.
|Critical

|`FluentdQueueLengthIncreasing`
|`In the last 1h, fluentd <instance> buffer queue length constantly increased more than 1. Current value is <value>.`
|Fluentd is reporting that the queue size is increasing.
|`DiskBufferUsage`
|`Collectors potentially consuming too much node disk, <value>`
|Collectors are consuming too much node disk on the host.
|Warning

|`CollectorHigh403ForbiddenResponseRate`

This alert is for 6.3 only now.

Contributor Author

OK, I revised the target for this PR to 6.3. I'll create a separate PR for 6.2 and older versions.

|`High rate of "HTTP 403 Forbidden" responses detected for collector <instance> in namespace <namespace> for output <label>. The rate of 403 responses is <rate> over the last 2 minutes, persisting for more than 5 minutes. This could indicate an authorization issue.`
|At least 10% of sent requests responded with "HTTP 403 Forbidden" for collector "<instance>" in namespace <namespace> for the output "<output>".
|Critical
|===
71 changes: 71 additions & 0 deletions modules/loki-alerts.adoc
@@ -0,0 +1,71 @@
// Module included in the following assemblies:
//
// * logging_alerts/default-logging-alerts.adoc

:_mod-docs-content-type: REFERENCE
[id="loki-alerts_{context}"]
= {loki-op} alerts

The following alerts are generated by the {loki-op}. You can view these alerts in the {ocp-product-title} web console.

.{loki-op} alerts
[cols="2,2,2,1",options="header"]
|===
|Alert |Message |Description |Severity

|`LokiRequestErrors`
|`{{ $labels.job }} {{ $labels.route }} is experiencing <value>% errors.`
|At least 10% of requests result in `5xx` server errors.
|critical

|`LokiStackWriteRequestErrors`
|`<value>% of write requests from {{ $labels.job }} in <namespace> are returned with server errors.`
|At least 10% of write requests to the lokistack-gateway result in `5xx` server errors.
|critical

|`LokiStackReadRequestErrors`
|`<value>% of query requests from {{ $labels.job }} in <namespace> are returned with server errors.`
|At least 10% of query requests to the lokistack-gateway result in `5xx` server errors.
|critical

|`LokiRequestPanics`
|`{{ $labels.job }} is experiencing an increase of <value> panics.`
|A panic was triggered.
|critical

|`LokiRequestLatency`
|`{{ $labels.job }} {{ $labels.route }} is experiencing <value>s 99th percentile latency.`
|The 99th percentile is experiencing latency higher than 1 second.
|critical

|`LokiTenantRateLimit`
|`{{ $labels.job }} {{ $labels.route }} is experiencing 429 errors.`
|At least 10% of requests received the rate limit error code.
|warning

|`LokiStorageSlowWrite`
|`The storage path is experiencing slow write response rates.`
|The storage path is experiencing slow write response rates.
|warning

|`LokiWritePathHighLoad`
|`The write path is experiencing high load.`
|The write path is experiencing high load causing backpressure storage flushing.
|warning

|`LokiReadPathHighLoad`
|`The read path is experiencing high load.`
|The read path has a high volume of queries, causing longer response times.
|warning

|`LokiDiscardedSamplesWarning`
|`Loki in namespace "<namespace>" is discarding samples in the "<tenant>" tenant during ingestion. Samples are discarded because of "<reason>" at a rate of <value> samples per second.`
|Loki is discarding samples during ingestion because they fail validation.
|warning

|`LokistackSchemaUpgradesRequired`
|`The LokiStack "{{ $labels.stack_name }}" in namespace "<namespace>" is using a storage schema
configuration that does not contain the latest schema version. It is recommended to update the schema configuration to update the schema version to the latest`
Contributor

Not sure about this one, but the sentence is confusing to me.

Suggested change
configuration that does not contain the latest schema version. It is recommended to update the schema configuration to update the schema version to the latest`
configuration that does not contain the latest schema version. It is recommended to update the schema configuration with the correct schema version to the latest`

Contributor Author

Good catch, @mburke5678! This is the output text. I believe this should be fixed upstream first, cc @xperimental

|One or more of the deployed LokiStacks contains an outdated storage schema configuration.
|warning
|===
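
As a hedged sketch of the kind of change that the `LokistackSchemaUpgradesRequired` alert asks for, a newer schema version can be appended to the `LokiStack` storage schema list. The resource name `logging-loki`, the dates, and the assumption that `v13` is the latest schema version are illustrative only; check the {loki-op} documentation for the versions that your release supports. The effective date of a newly added schema is typically set in the future so that existing data keeps using the older schema.

[source,yaml]
----
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki               # assumed name
  namespace: openshift-logging
spec:
  storage:
    schemas:
      - version: v12               # existing schema entry, kept unchanged
        effectiveDate: "2022-06-01"    # hypothetical date
      - version: v13               # assumed to be the latest schema version
        effectiveDate: "2025-08-01"    # hypothetical date
----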
4 changes: 2 additions & 2 deletions modules/loki-rbac-rules-permissions.adoc
@@ -1,6 +1,6 @@
// Module is included in the following assemblies:
// Module included in the following assemblies:
//
// * configuring/configuring-the-log-store.adoc
// * logging_alerts/custom-logging-alerts.adoc

:_mod-docs-content-type: REFERENCE
[id="loki-rbac-rules-permissions_{context}"]