Commit e2001a2

[doc] Add monitoring and observability guidelines
Signed-off-by: Jibin Varghese <[email protected]>
1 parent e9f032a commit e2001a2

2 files changed: +196 -0


Observability-Guidelines.md

Lines changed: 196 additions & 0 deletions
@@ -0,0 +1,196 @@
# GitHub Runners Observability Guidelines

> [!NOTE]
> These guidelines are an unreleased DRAFT.

## Overview

PyTorch's CI infrastructure serves as the backbone for continuous integration across multiple cloud providers, supporting thousands of developers and contributors worldwide. The reliability of this infrastructure directly impacts PyTorch's development velocity, code quality, and release cadence. Without comprehensive monitoring, issues such as runner failures, performance degradation, or capacity bottlenecks can go undetected, leading to delayed builds, frustrated contributors, and potential release delays. Monitoring enables proactive identification of problems, ensures optimal resource utilization, and provides the data necessary for capacity planning and infrastructure optimization. This is especially critical given PyTorch's position as a leading machine learning framework, where build reliability directly affects the broader AI/ML ecosystem.

This document defines the mandatory monitoring and observability requirements for GitHub runners added to the PyTorch multi-cloud CI infrastructure. All runners must comply with these guidelines to ensure proper tracking of health, performance, and availability.

This document is split into two parts:

1. [Requirements](#requirements): guidelines on `what` is required to onboard a new runner system, and
2. [Implementation](#implementation): guidelines on `how` to fulfill those requirements in a manner consistent with the rest of the PyTorch CI infrastructure.

## Requirements

### Runner Pool Stability

A candidate runner pool must:

- Undergo stability assessment before deployment in critical CI/CD workflows
- Maintain performance metrics during test jobs
- Track resource utilization and stability patterns
- Document baseline performance metrics for each runner type

### Incident Management

Runner pools must:

- Implement real-time status monitoring
- Configure automated alerts for:
  - Runner pool offline events
  - Capacity reduction incidents
  - Performance degradation
  - Resource exhaustion
- Establish alert routing to:
  - CI infrastructure team
  - Community maintainers
  - System administrators

### Metrics Requirements

All runners must collect and expose the following metrics on [hud.pytorch.org/metrics](https://hud.pytorch.org/metrics).

#### Lifecycle Metrics

Runners must track:

- Registration/unregistration events
- Job start/completion times
- Queue wait times
- Job execution duration
- Resource utilization during jobs
- Error rates and types
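
As a non-normative illustration, the lifecycle metrics above could be modelled as OpenTelemetry instruments (the [OpenTelemetry Integration](#opentelemetry-integration) section below makes the format mandatory). The metric names, units, and attribute keys in this sketch are assumptions, not an agreed naming scheme:

```python
# Illustrative only: names, units, and attributes are assumptions, not the agreed schema.
from opentelemetry import metrics

meter = metrics.get_meter("github-runner-lifecycle")

# Counter for registration/unregistration events, split by an "event" attribute.
runner_events = meter.create_counter(
    "runner.lifecycle.events",
    unit="1",
    description="Runner registration/unregistration events",
)

# Histograms for queue wait time and job execution duration.
queue_wait = meter.create_histogram(
    "runner.job.queue_wait", unit="s", description="Time a job waited for a runner"
)
job_duration = meter.create_histogram(
    "runner.job.duration", unit="s", description="Job execution time"
)

# Example usage inside a runner agent (attribute keys are hypothetical):
attrs = {"runner_pool": "linux.g5.4xlarge", "provider": "example-cloud"}
runner_events.add(1, {**attrs, "event": "registered"})
queue_wait.record(42.0, attrs)
job_duration.record(1830.5, {**attrs, "status": "success"})
```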

#### Health Metrics

Runners must monitor:

- Heartbeat status
- System resource usage (CPU, memory, disk)
- Network connectivity
- GitHub API response times
- Runner process health
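
A minimal sketch of how a runner agent might sample these health signals, assuming `psutil` for host metrics and the OpenTelemetry API for reporting; metric names and attribute keys are again illustrative:

```python
# Sketch only: psutil is one possible collection library; names/attributes are assumptions.
import psutil
from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

meter = metrics.get_meter("github-runner-health")

def _system_usage(options: CallbackOptions):
    # Report CPU, memory, and disk utilization as percentages.
    yield Observation(psutil.cpu_percent(), {"resource": "cpu"})
    yield Observation(psutil.virtual_memory().percent, {"resource": "memory"})
    yield Observation(psutil.disk_usage("/").percent, {"resource": "disk"})

meter.create_observable_gauge(
    "runner.system.utilization",
    callbacks=[_system_usage],
    unit="%",
    description="Runner host CPU/memory/disk utilization",
)

# GitHub API responsiveness can be sampled similarly, e.g. by timing a lightweight
# request and recording the latency into a histogram.
github_api_latency = meter.create_histogram(
    "runner.github_api.latency", unit="s", description="GitHub API response time"
)
```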

## Technical Requirements

### OpenTelemetry Integration

All monitoring implementations must:

- Expose metrics in OpenTelemetry format
- Follow standardized metric naming conventions
- Use consistent labeling across all runners
- Implement proper metric aggregation and sampling
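
A minimal export-pipeline sketch using the OpenTelemetry Python SDK with an OTLP exporter; the collector endpoint, resource attributes, and export interval are placeholders rather than the actual PyTorch CI configuration:

```python
# Placeholder endpoint and attributes; real values depend on the team's collector setup.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Consistent labels shared by every metric this runner emits (keys are assumptions).
resource = Resource.create({
    "service.name": "github-runner",
    "runner.pool": "linux.g5.4xlarge",
    "runner.provider": "example-cloud",
})

reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://otel-collector.example:4317"),
    export_interval_millis=60_000,  # aggregate locally and export once a minute
)
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))
```

Attaching pool and provider labels at the resource level keeps labeling consistent across all instruments without repeating attributes at every call site.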

### Service Level Requirements

Production runners must maintain:

- Minimum uptime of 99.9%
- Maximum job queue time of 5 minutes
- Job execution time variance within ±10% of baseline
- Response time to critical alerts within 15 minutes
- Maximum capacity reduction of 10%
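
For intuition, the back-of-the-envelope arithmetic below shows what these targets translate to; the 30-day month and the 30-minute baseline job are illustrative assumptions:

```python
# Rough error-budget arithmetic for the SLOs above (a sketch, not tooling).
MONTH_MINUTES = 30 * 24 * 60                    # 43,200 minutes in a 30-day month

uptime_target = 0.999                           # 99.9% uptime
downtime_budget = MONTH_MINUTES * (1 - uptime_target)
print(f"Allowed downtime: {downtime_budget:.1f} min/month")        # ~43.2 minutes

baseline_job_seconds = 1800                     # hypothetical 30-minute baseline job
variance = 0.10                                 # ±10% of baseline
low, high = baseline_job_seconds * (1 - variance), baseline_job_seconds * (1 + variance)
print(f"Acceptable job duration: {low:.0f}-{high:.0f} s")          # 1620-1980 s
```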

### Dashboard Requirements

#### HUD Integration

The PyTorch CI HUD is a dashboard that consolidates metrics and dashboards for tracking the Continuous Integration (CI) system of PyTorch, including metrics related to runners.

The HUD provides a centralized view of these metrics, with dashboards such as the Metrics Dashboard, Flambeau, and Benchmark Metrics offering insights into runner performance and CI health.

Teams providing runners to the pool must:

- Implement OpenTelemetry data source integration with the HUD
- Support a real-time status overview
- Support resource utilization graphs
- Provide alert history and status
- Provide runner pool capacity visualization

#### Alternative Dashboards

Teams may implement:

- A Grafana dashboard
- Custom metrics visualization
- An alert management interface
- Performance reporting

### Documentation Requirements

Teams must:

- Maintain up-to-date monitoring documentation
- Document an architecture diagram detailing their runner CI infrastructure setup
- Document all custom metrics, monitoring endpoints, and escalation routes
- Document thresholds for raising and resolving alerts
- Document alert response procedures and playbooks for internal SREs/maintainers to follow when resolving alerts
- Review and update the documentation regularly to keep it from going stale

### Maintenance Requirements

Teams must:

- Conduct regular metric review
- Perform alert threshold tuning
- Optimize performance
- Plan for capacity

### Compliance Requirements

Teams must:

- Conduct regular review of monitoring effectiveness
- Perform quarterly metric analysis
- Update monitoring strategy annually
- Implement continuous improvement process

## Implementation

### System Architecture

To provide a clear separation between the PyTorch Foundation runners and community/partner runners, the following guidelines must be followed.

For details on getting started with onboarding a new runner, refer to the [Partners PyTorch CI Runners](https://github.com/pytorch/test-infra/blob/main/docs/partners_pytorch_ci_runners.md) guide.

#### PyTorch Runners

Must implement:

- Dedicated monitoring namespace
- Resource quotas and limits
- Custom metrics for PyTorch-specific workloads
- Integration with existing PyTorch monitoring infrastructure

#### Community Runners

Must implement:

- Separate monitoring namespace
- Basic resource monitoring
- Job execution metrics
- Error tracking and reporting

### Alerting

All CI runners should post alerts to the [#pytorch-infra-alerts](https://pytorch.slack.com/archives/C082SHB006Q) channel in case of service degradation.

Teams must define clear alert thresholds as part of the runner [documentation requirements](#documentation-requirements).

Alerts fall into three severity levels:

1. Raise `warning` alerts when observed values degrade past the P50 nominal threshold
2. Raise `error` alerts when observed values degrade past the P90 nominal threshold
3. Raise `critical` alerts when observed values degrade past the P99 nominal threshold

Every alert needs both a raise threshold and a clear threshold defined.

[<img src="assets/threshold-setting.png" width="400"/>](threshold-setting.png)

In general, a high raise threshold must be greater than its clear threshold, and a low raise threshold must be less than its clear threshold.

A common use case for a high raise threshold is runner HTTP 5XX error rate, while a low raise threshold suits metrics such as available runner node disk space.
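
The sketch below shows raise/clear hysteresis for a "high" alert such as HTTP 5XX error rate; the threshold values are made up, and real values should come from the documented P50/P90/P99 baselines. A "low" alert such as free disk space simply inverts the comparisons:

```python
# Hysteresis sketch: the alert raises above raise_threshold and clears only once the
# metric drops back below clear_threshold, which avoids flapping near the boundary.
from dataclasses import dataclass

@dataclass
class HighAlert:
    raise_threshold: float   # raise when the metric rises above this
    clear_threshold: float   # clear only when it falls below this (< raise_threshold)
    active: bool = False

    def update(self, value: float) -> bool:
        """Return True while the alert is active."""
        if not self.active and value > self.raise_threshold:
            self.active = True
        elif self.active and value < self.clear_threshold:
            self.active = False
        return self.active

error_rate_alert = HighAlert(raise_threshold=0.05, clear_threshold=0.02)
for observed in (0.01, 0.06, 0.03, 0.019):   # sampled 5XX error rates
    print(observed, error_rate_alert.update(observed))
# Raises at 0.06, stays raised at 0.03, and clears only once the rate drops below 0.02.
```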

In addition to the above, to manage alerts successfully teams must:

- Implement best-effort alert deduplication to reduce redundant posts in the channel (see the sketch below)
- Establish proper escalation paths for tagging maintainers
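
One possible shape for best-effort deduplication before posting to the channel, assuming a Slack incoming webhook; the webhook URL, dedup window, and message format are placeholders:

```python
# Best-effort dedup sketch: suppress identical alerts within a fixed window.
import time
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
DEDUP_WINDOW_S = 30 * 60          # assumed 30-minute suppression window
_last_posted: dict[str, float] = {}

def post_alert(severity: str, runner_pool: str, message: str) -> None:
    fingerprint = f"{severity}:{runner_pool}:{message}"
    now = time.time()
    if now - _last_posted.get(fingerprint, 0.0) < DEDUP_WINDOW_S:
        return  # identical alert posted recently; skip to reduce channel noise
    _last_posted[fingerprint] = now
    # Escalation (tagging maintainers) would be appended here per the team's runbook.
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"[{severity}] {runner_pool}: {message}"},
        timeout=10,
    )
```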

### Metric Collection

The PyTorch HUD uses a ClickHouse Cloud database as the data source for its dashboards; the schema is defined [here](https://github.com/pytorch/test-infra/tree/main/clickhouse_db_schema).

All runners must publish the metrics marked `mandatory` in the tables below.
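
Purely as an illustration of publishing a metric row, a sketch using the `clickhouse-connect` client is shown below; the table name and columns are hypothetical and are not taken from the schema repository, so the linked schema and the (TODO) tables below remain authoritative:

```python
# Hypothetical table and columns; consult the clickhouse_db_schema repo for real definitions.
import datetime
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="clickhouse.example.com", username="runner_writer", password="<redacted>",
)

client.insert(
    "runner_metrics",                                        # hypothetical table
    [[datetime.datetime.utcnow(), "linux.g5.4xlarge", "queue_wait_seconds", 42.0]],
    column_names=["timestamp", "runner_pool", "metric", "value"],
)
```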

> [!NOTE]
> TODO :: Fill in this section based on current state of metrics from WG meeting.

assets/threshold-setting.png

104 KB