# Logging Policy

This policy defines the logging requirements for services developed and
supported by software engineering teams at UKHO.

## Contents of Logs

Logs must contain the following information:
- projectName (The overarching project that the service belongs to)
- serviceName (The name of an individual service)
- environment (One of "Development", "Test", "PreProduction", "Production")
- level ("Information", "Warning", "Error", "Fatal")
- Information is for regular messages that can be aggregated into a metric
- Warning is for expected problems that the service can recover from
- Error is for unexpected problems that require investigation and should trigger an alert
- Fatal is for situations where the service is broken and requires immediate human intervention
- message (The data being logged. This should be a JSON formatted blob)
- traceId (If the process that generates the log was started from a request external to the project,
the traceId should be included to allow the logs to be correlated with the request)
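For example, a single log document carrying the required fields might look like this (all values are illustrative; the `message` blob here is one possible shape):

```json
{
  "projectName": "SalesCatalogue",
  "serviceName": "SalesCatalogueService",
  "environment": "Production",
  "level": "Warning",
  "message": {
    "event": "ProductLookupRetried",
    "productId": "GB301",
    "attempt": 2
  },
  "traceId": "0af7651916cd43dd8448eb211c80319c"
}
```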

## Log types and levels

Services should produce logs that fall into three main categories: diagnostic,
audit/metric, and request/response.

### Diagnostic logs

These are logs that provide information about the service's operation and are
generally useful for debugging. Projects need to balance the need to provide
useful information with the risk of logging too much information.

Expected minimum log levels for Production: Warning or Error

### Audit/Metric logs

These are logs that provide information about the service's operation and are
generally useful for auditing and metrics. Where possible these logs should be
numeric or small enums to allow for aggregation.

Expected minimum log levels for Production: Information, but may differ based
on the individual type of audit or metric log

### Request/response logging

These are logs that record requests to and responses from a service. These logs
are useful to trace a request through different services and to identify
problems with requests.

Expected minimum log levels for Production: Information

### General logging guidance

- Health check logging is not essential and can cause log saturation. Prefer
using health check endpoints that can be polled.

- Consider the value of logs – avoid logging information that provides no
diagnostic or audit value.

- Consider the GDPR: avoid logging customer personal information.

- Avoid logging binary data or large files. Log references to objects stored in
appropriate storage instead.

- Be consistent with data that is being logged across a project.

- Teams should be selective about what is included in logs. Avoid logging
many different properties in the hope of capturing everything.

- On-premise services need to have known mitigations for failures that may
occur upon trying to ingest logs (e.g., log to EventViewer, send an email
notification to supporting team).

## Security

When implementing logging, consider the following secure design practices:

- Use structured logging instead of string concatenation to avoid log injection vulnerabilities.

- Ensure that no sensitive information gets stored in logs, for example,
passwords, secret keys, and session IDs.

- Ensure that no personal information gets stored in logs, for example,
customer names, addresses, and email addresses.
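The structured-logging point above can be sketched in Serilog (illustrative only; `userName` stands in for any externally supplied value):

```csharp
using Serilog;

Log.Logger = new LoggerConfiguration().WriteTo.Console().CreateLogger();

var userName = "alice\nFAKE LOG LINE"; // imagine attacker-controlled input

// Avoid: concatenation embeds raw input in the message text, so an
// injected newline can forge an extra log line (log injection).
// Log.Information("User " + userName + " signed in");

// Prefer: structured logging records the value as a named property,
// keeping data separate from the message template when written to
// structured sinks such as Elastic.
Log.Information("User {UserName} signed in", userName);

Log.CloseAndFlush();
```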

## Retention

Log retention is managed by the Observability team and is set to:

- **Non-live environments**: 7 days
- **Live environments**: 90 days

Logs are not recoverable after this period.

## Testing and assuring logging

Teams should leverage their definitions of Ready and Done to drive logging
practices:

**Definition of Ready** should include:

- Agreeing a common field or value, for example a TraceId or CorrelationId, and
how to test for this property across logs
- A common dictionary for log messages ensuring that values are comparable
across services within a project.

- A consideration of the different log types and how to develop towards them

**Definition of Done** should include:

- Ensure log levels are correct for each environment

### Implementation

Teams should use the [UKHO.Logging.Serilog](https://github.com/UKHO/UKHO.Logging.Serilog)
package to implement logging in their services. This package provides a
standardised way of logging across services and provides built-in log
enrichments.
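As a sketch, plain Serilog can attach the fields this policy requires as enrichment properties. This is illustrative only: UKHO.Logging.Serilog provides its own configuration surface, which is not shown here, and the property values below are example values.

```csharp
using Serilog;

// Illustrative plain-Serilog setup. UKHO.Logging.Serilog wraps this kind
// of configuration; its actual API may differ. Values are examples only.
Log.Logger = new LoggerConfiguration()
    .MinimumLevel.Information()
    .Enrich.WithProperty("projectName", "SalesCatalogue")
    .Enrich.WithProperty("serviceName", "SalesCatalogueService")
    .Enrich.WithProperty("environment", "Production")
    .WriteTo.Console()
    .CreateLogger();

Log.Information("Service started");
Log.CloseAndFlush();
```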

**Note**: Teams should prefer UKHO.Logging.Serilog over the legacy
UKHO.EventHubLogging provider. Existing projects using the legacy provider
should plan to migrate.

### Testing requirements

- **Test Approach and TSR documents**: Teams must demonstrate observability for
the service and prove this is working as expected.

- **Support team handovers**: Support/CI teams will ensure that good logging
practice has been adhered to, and this should be demonstrated.

- **Smoke test/monitor log ingestion**: Teams should ensure logs have been
  ingested successfully and are discoverable. This could be via an automated
  test or a manual check.

- **Unit tests**: Logging is a first class citizen. Unit tests should assert
that logs are logging to the expected level.

- **Load testing**: Load tests should use Production level logging, so the
capacity generated from logs targeting Production is understood. The load
testing environment should be as live-like as possible.
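The unit-testing point above can be sketched with a small in-memory Serilog sink (an illustrative pattern, not a prescribed implementation; assertion framework omitted):

```csharp
using System.Collections.Generic;
using Serilog;
using Serilog.Core;
using Serilog.Events;

// Arrange: a logger configured with a Production-like minimum level,
// writing to an in-memory sink we can inspect.
var sink = new CapturingSink();
var logger = new LoggerConfiguration()
    .MinimumLevel.Warning()
    .WriteTo.Sink(sink)
    .CreateLogger();

// Act: events below the configured minimum should be suppressed.
logger.Information("dropped - below the minimum level");
logger.Warning("retained - at the minimum level");

// Assert (xUnit-style, for illustration):
// Assert.Single(sink.Events);
// Assert.Equal(LogEventLevel.Warning, sink.Events[0].Level);

// Minimal in-memory sink that captures emitted events for assertions.
class CapturingSink : ILogEventSink
{
    public List<LogEvent> Events { get; } = new List<LogEvent>();
    public void Emit(LogEvent logEvent) => Events.Add(logEvent);
}
```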

***

# Guidance for Elastic

This section provides technical guidance for implementing the Logging Policy
using ElasticSearch and Elastic Cloud.

## Technology Choice

Elastic is the tool of choice for log aggregation and analysis. If there are
technical considerations which prevent use of Elastic, consider appropriate
alternatives and detail how these make a best effort to meet this policy in
your design documentation.

## Index naming convention

Logs should be indexed in ElasticSearch according to the following convention:
**FullServiceName-environment-category**

### Diagnostic logs

Diagnostic logs should ingest into a dedicated ElasticSearch index, e.g.
`SalesCatalogueService-Dev1-Diagnostic`.

### Audit/Metric logs

Audit/metric logs should ingest into a dedicated ElasticSearch index, e.g.
`SalesCatalogueService-Dev1-Audit`.

### Request/response logging

Request/response logs should ingest into a dedicated ElasticSearch index, e.g.
`SalesCatalogueService-Dev1-HTTP`.

## Retention implementation in Elastic Cloud

Data logged to Elastic Cloud follows this retention implementation:

- Logs are ingested into the "Hot" tier for best indexing and search
performance.

- After **2 days** logs are automatically moved to the "Cold" tier, optimal for
data that is still likely to be searched but infrequently updated.

- After **7 days**, logs in **non-live** are deleted and cannot be recovered.

- After **90 days**, logs in **live** are deleted and cannot be recovered.
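The tiering above corresponds to an index lifecycle management (ILM) policy along these lines. This is an illustrative sketch only: the actual policies are managed by the Observability team, and the `delete` age shown is the **live** value (a non-live policy would use `7d`).

```json
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {}
      },
      "cold": {
        "min_age": "2d",
        "actions": {
          "allocate": { "require": { "data": "cold" } }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
```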

## Log ingestion patterns

### Cloud services - Elastic Cloud

Elastic Agent policy has an integration that pulls from all Event Hubs, using a
dedicated storage account container to track the processing of logs.

An automated process discovers new Event Hubs, adds them to the Elastic Agent
policy, and sets up necessary indexes and index lifecycle management.

#### Cloud native logs

becoming cloud services. The
[UKHO.Logging.Serilog](https://github.com/UKHO/UKHO.Logging.Serilog) package
provides support for logging to Event Hubs.

## Legacy patterns and migration

### Legacy - LogShipper and on-premise ElasticSearch

Using LogShipper to ingest logs from on-premise services to on-premise
ElasticSearch has been deprecated and should no longer be used in new code.
Existing projects using this pattern should look to migrate to Elastic Cloud as
soon as possible.

### Legacy - UKHO.EventHubLogging provider

The UKHO.EventHubLogging provider has been superseded by UKHO.Logging.Serilog.
Teams using the legacy provider should plan their migration path.

### Migrating services to Elastic Cloud

Existing services currently using the legacy Azure Event Hub > LogStash >
on-premise ElasticSearch pattern should have no errors (and no warnings) in
LogStash before they are migrated to Elastic Cloud.

If teams are using the legacy Azure Event Hub > LogStash > on-premise
ElasticSearch pattern for ingesting logs, they must check LogStash for errors
at every stage of the development process, using the DDC Grafana monitor set up
for this purpose.

Services should adhere to the logging policy before migration.