Document monitoring, metrics, tracing, observability and alerting #4797
Description
As an engineer on GOV.UK,
I want to know how to configure monitoring, metrics, tracing, observability and alerting for my applications,
so that we can detect and resolve issues proactively, ensure optimal performance, and improve reliability through real-time insight into application health and behaviour, as well as inform product decisions.
Current documentation
Logging
How logging works on GOV.UK
Request tracing
Monitoring
Debug underperforming search - I've asked the Search team to review this and probably remove it
How we handle errors
Pingdom
Sentry
Alerting
Pingdom Bouncer canary check
Router error ratio too high
Travel Advice or Drug and Medical Device email alerts not sent
Signon API user token expires soon
PagerDuty
Things that may contact on-call - I suggest the specifics get taken out of here and instead link to the relevant pages
[WIP] Missing documentation
- Grafana - there appears to have been a page at https://docs.publishing.service.gov.uk/manual/grafana.html. Access to Grafana is mentioned here. Example steps to create dashboards for Prometheus metrics are included in [this card](https://trello.com/c/SQro2G8f/3476-monitor-how-long-content-datas-csv-exports-take-5).
- App metrics, Prometheus
- Configuring Alertmanager alerts
[WIP] Documentation that could do with a refresh
- PagerDuty alerts section, Alertmanager alerts section and the Monitoring section - consolidate all of this and more under a new section called Monitoring and alerting.