feat: Prometheus metrics #860

matanbaruch · 2025-05-24T15:53:46Z

This pull request introduces a comprehensive metrics collection and monitoring system for the STF (Smartphone Test Farm) application. The changes include adding Prometheus metrics support, implementing a metrics collector, and integrating real-time hooks for updating metrics. Additionally, the ESLint configuration has been updated for modern JavaScript support.

Metrics Collection and Monitoring:

Prometheus Metrics Integration:
- Added Prometheus prom-client library for metrics collection (package.json, package.jsonR87).
- Implemented metrics.js to define and manage custom metrics, including devices, users, and groups (lib/util/metrics.js, lib/util/metrics.jsR1-R161).
- Created a /metrics endpoint to serve Prometheus-compatible metrics (lib/units/api/controllers/metrics.js, [1]; lib/units/api/swagger/api_v1.yaml, [2].
Metrics Collection Service:
- Added MetricsCollector class to periodically gather metrics from the database and update Prometheus metrics (lib/util/metrics-collector.js, lib/util/metrics-collector.jsR1-R148).
- Integrated the metrics collector into the API lifecycle, starting and stopping it with the application (lib/units/api/index.js, [1] [2].
Real-Time Metrics Hooks:
- Introduced MetricsHooks to update metrics in response to changes in devices, users, and groups (lib/util/metrics-hooks.js, lib/util/metrics-hooks.jsR1-R115).
- Added support for real-time updates to group metrics in the frontend (res/app/group-list/group-list-controller.js, res/app/group-list/group-list-controller.jsR67-R75).

Codebase Enhancements:

ESLint Configuration Update:
- Updated .eslintrc to support ES6+ features by enabling the es6 environment and setting ecmaVersion to 2017 (.eslintrc, .eslintrcL4-R8).
Database API Enhancements:
- Added getDeviceMetrics function to securely aggregate device statistics without exposing sensitive data (lib/db/api.js, lib/db/api.jsR1664-R1715).

These changes collectively enhance the observability and maintainability of the STF system by providing detailed metrics for monitoring system health and usage.

…al-time metrics tracking Signed-off-by: matanbaruch <[email protected]>

… metrics collector code formatting Signed-off-by: matanbaruch <[email protected]>

Signed-off-by: matanbaruch <[email protected]>

Copilot

Pull Request Overview

This PR adds a Prometheus-based metrics system to STF, including real-time hooks, periodic collection, and a new /metrics endpoint.

Integrates prom-client with custom gauges and helper functions (lib/util/metrics.js)
Adds MetricsCollector service for periodic DB metric collection and lifecycle integration (lib/util/metrics-collector.js, lib/units/api/index.js)
Implements real-time update hooks (lib/util/metrics-hooks.js), frontend reporting (group-list-controller.js), and exposes /metrics via a controller and Swagger docs

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
res/app/group-list/group-list-controller.js	Sends group stats from the UI to backend for metrics
package.json	Adds `prom-client` dependency
lib/util/metrics.js	Defines Prometheus metrics and update helpers
lib/util/metrics-hooks.js	Implements real-time hooks for entity changes
lib/util/metrics-collector.js	Collects metrics periodically from DB
lib/units/api/swagger/api_v1.yaml	Documents the `/metrics` endpoint
lib/units/api/index.js	Integrates `MetricsCollector` into the API lifecycle
lib/units/api/controllers/metrics.js	Serves Prometheus metrics via `/metrics`
lib/db/api.js	Exports `getDevices()` for metrics collection
.eslintrc	Updates ESLint config to ES6/2017

lib/util/metrics-hooks.js

lib/util/metrics-collector.js

res/app/group-list/group-list-controller.js

lib/units/api/index.js

denis99999 · 2025-05-25T02:04:05Z

I have to review seriously this PR (I saw some problems, e.g. you can't export getDevice() and bypass the device access control, etc.), but later because no time currently.

…cs collector Signed-off-by: matanbaruch <[email protected]>

…improved performance Signed-off-by: matanbaruch <[email protected]>

…etter compatibility Signed-off-by: matanbaruch <[email protected]>

Signed-off-by: matanbaruch <[email protected]>

matanbaruch · 2025-05-25T21:46:17Z

I have to review seriously this PR (I saw some problems, e.g. you can't export getDevice() and bypass the device access control, etc.), but later because no time currently.

You are right. Fixed.

Copilot

Pull Request Overview

This PR integrates Prometheus-based metrics into the STF platform, adding real-time hooks, a periodic collector service, and an exposed /metrics endpoint.

Adds prom-client and defines custom gauges in lib/util/metrics.js
Implements MetricsCollector for scheduled metric aggregation and MetricsHooks for real-time updates
Exposes a new /metrics API endpoint and updates frontend to emit group metrics

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
res/app/group-list/group-list-controller.js	Sends group counts to backend for metrics
package.json	Adds `prom-client` dependency
lib/util/metrics.js	Defines Prometheus registry and custom gauges
lib/util/metrics-hooks.js	Updates metrics on entity events in real time
lib/util/metrics-collector.js	Periodically gathers and updates metrics from DB
lib/units/api/swagger/api_v1.yaml	Defines `/metrics` path in Swagger spec
lib/units/api/index.js	Starts/stops `MetricsCollector` in app lifecycle
lib/units/api/controllers/metrics.js	Implements the `/metrics` controller
lib/db/api.js	Adds `getDeviceMetrics` aggregation function
.eslintrc	Updates parser options for ES6

Comments suppressed due to low confidence (2)

lib/util/metrics-collector.js:13

There are no tests covering the MetricsCollector class or its collectMetrics and lifecycle behavior. Adding unit tests for these methods will help ensure reliability of the new metrics feature.

class MetricsCollector {

lib/db/api.js:1705

The log variable is not defined in this scope. Please require the logger and create a log instance (e.g., const logger = require('../util/logger'); const log = logger.createLogger('dbapi');) at the top of the file.

log.error('Error getting device metrics:', error)

lib/util/metrics-hooks.js

lib/units/api/controllers/metrics.js

Signed-off-by: matanbaruch <[email protected]>

denis99999 · 2025-05-26T09:07:30Z

@matanbaruch , your PR raises some concerns, and I would need more time to respond, which I currently lack; I will have more time later this summer or later.

To quickly summarize, certain elements make it unacceptable in its current state.

Here are some comments in bulk:

I think you have not understood how access control works for groups, devices and users, in particular a simple user can only see the objects that he is authorized to see, but in your case as implemented your Prometheus server will see all the objects which is forbidden, unless the metrics API is tagged as a privileged operation with the admin tag, and therefore only the admin of the STF platform will be able to use it. In this case, it is possible, in your code, to directly use the getDevices() function. Besides, your implementation of getDeviceMetrics() did not solve any problem and is useless. You can therefore remove it.
It is not allowed to put your metrics service code in the controller/index.js file, you should ideally put it elsewhere as an independent service with which your API can communicate through the ZMQ bus, or at worst put everything in the metrics controller code..
it is not allowed to leave log.debug messages lying around in the code of a PR, and even less so when these messages appear every 30s.
The master branch of STF is not a testing ground, a PR must provide a complete functionality that's ready to use in production mode and therefore must not contain debug code or dead code. In your case, I may have misunderstood, but I haven't seen where the code from lib/util/metrics-hooks.js is used elsewhere. It may be in anticipation of a subsequent PR, but that's not acceptable.
Normally, you should not push directly from your master branch but from a branch dedicated to your PR, providing if possible a single commit, otherwise it quickly becomes unmanageable.
for this PR which assumes a third-party element which is a Prometheus server and possibly Grafana, and which I do not know a priori, you would have to provide the basic instructions to quickly set up a minimal test environment, otherwise we cannot observe anything and therefore validate anything because we do not necessarily have the time for that, it is up to you to show your white paw on the test of the functionality that you propose.
I'm also wondering about the merits of integrating this Prometheus API into the STF code, or will it be necessary to integrate other alternative solutions to meet specific needs? What need does your request meet since we can't see anything on your fork?
when you insert a copyright in your files, put your company and/or your name with the email, not the name of the feature
your API does not return JSON, this is not acceptable a priori
Your code has nothing to do with the res/.../group-list-controller.js file, and more importantly, we don't understand what it's for. In your comment, it's supposed to cause an update of the metrics on the backend side, but we don't see how this can be done via the window object? Normally, UI actions on STF objects are passed through the websocket or the API that actually performs the operations, so we don't need anything else a priori. We can nevertheless imagine the display of metrics on the UI, but in this case, we must create a new Metrics tab and not pollute the existing tabs.

If you agree with all my comments, I think it would be better to close this PR and propose a more mature one later taking into account all my observations?

matanbaruch · 2025-05-28T01:49:18Z

@matanbaruch , your PR raises some concerns, and I would need more time to respond, which I currently lack; I will have more time later this summer or later.

To quickly summarize, certain elements make it unacceptable in its current state.

Thank you for the detailed feedback. I appreciate you taking the time despite your limited availability. Below are my responses to your comments.

I think you have not understood how access control works for groups, devices and users, in particular a simple user can only see the objects that he is authorized to see, but in your case as implemented your Prometheus server will see all the objects which is forbidden, unless the metrics API is tagged as a privileged operation with the admin tag, and therefore only the admin of the STF platform will be able to use it. In this case, it is possible, in your code, to directly use the getDevices() function. Besides, your implementation of getDeviceMetrics() did not solve any problem and is useless. You can therefore remove it.

I see your point regarding getDeviceMetrics, and I agree. I can remove it and instead expose the Metrics API only behind basic authentication for the admin user. The intention is that this endpoint be used strictly by the platform admin to monitor devices—not by regular users.

It is not allowed to put your metrics service code in the controller/index.js file, you should ideally put it elsewhere as an independent service with which your API can communicate through the ZMQ bus, or at worst put everything in the metrics controller code.

Understood. I’ll refactor it to reside in a more appropriate location as per your suggestion.

it is not allowed to leave log.debug messages lying around in the code of a PR, and even less so when these messages appear every 30s.

You're right. I will clean up the debug logs accordingly.

The master branch of STF is not a testing ground, a PR must provide a complete functionality that's ready to use in production mode and therefore must not contain debug code or dead code. In your case, I may have misunderstood, but I haven't seen where the code from lib/util/metrics-hooks.js is used elsewhere. It may be in anticipation of a subsequent PR, but that's not acceptable.

I understand. I'll remove the unused code and ensure the PR reflects only production-ready functionality.

Normally, you should not push directly from your master branch but from a branch dedicated to your PR, providing if possible a single commit, otherwise it quickly becomes unmanageable.

Thanks for the note. While I typically rely on squash-and-merge to keep history clean, I understand the importance of using dedicated branches for clarity. I’ll adopt that approach going forward.

for this PR which assumes a third-party element which is a Prometheus server and possibly Grafana, and which I do not know a priori, you would have to provide the basic instructions to quickly set up a minimal test environment, otherwise we cannot observe anything and therefore validate anything because we do not necessarily have the time for that, it is up to you to show your white paw on the test of the functionality that you propose.

That makes sense. I’ll include a README.md with setup instructions to make it easy to test the feature. Prometheus has become a widely adopted CNCF Graduated project, so the integration should feel familiar to many, but I agree clear guidance is still essential.

I'm also wondering about the merits of integrating this Prometheus API into the STF code, or will it be necessary to integrate other alternative solutions to meet specific needs? What need does your request meet since we can't see anything on your fork?

The goal here is to expose basic device-level metrics in a standardized way using Prometheus, which is a common tool in many monitoring stacks. It provides a strong starting point, and being open-source and extensible, it can accommodate further integrations if needed.

when you insert a copyright in your files, put your company and/or your name with the email, not the name of the feature

Understood, I’ll update the copyright accordingly. (Since it's an opensource didn't knew contributors names are inside the code)

your API does not return JSON, this is not acceptable a priori

Prometheus has a specific plain text exposition format designed for scraping, as outlined in their [data model documentation]. That said, I’ll double-check to ensure the output conforms strictly to their expected format.

Your code has nothing to do with the res/.../group-list-controller.js file, and more importantly, we don't understand what it's for. In your comment, it's supposed to cause an update of the metrics on the backend side, but we don't see how this can be done via the window object? Normally, UI actions on STF objects are passed through the websocket or the API that actually performs the operations, so we don't need anything else a priori. We can nevertheless imagine the display of metrics on the UI, but in this case, we must create a new Metrics tab and not pollute the existing tabs.

Got it. To clarify, the Metrics API is not meant for the STF UI at all—it's purely backend-facing and intended to be scraped by Prometheus. I’ll ensure there are no UI hooks or unrelated changes.

If you agree with all my comments, I think it would be better to close this PR and propose a more mature one later taking into account all my observations?

I agree with most of the feedback and will incorporate all necessary changes. However, I’m not sure how closing and reopening the PR would help in this case—especially as the ongoing discussion and review context would be lost. I’d prefer to revise this PR directly, clean up the commit history, and continue reviewing here. Once the work is complete, we can squash and merge as needed.

DaniilSmirnov · 2025-06-04T07:19:29Z

@matanbaruch, If you’d like, feel free to open the same PR on our fork of OpenSTF: https://github.com/VKCOM/devicehub.
We’re really interested in these changes, as our goal is to collect statistics on device usage.

matanbaruch added 3 commits May 24, 2025 18:55

feat: Implement Prometheus metrics collection and update hooks for re…

0a783ad

…al-time metrics tracking Signed-off-by: matanbaruch <[email protected]>

fix: Update ESLint configuration for better compatibility and improve…

145db2e

… metrics collector code formatting Signed-off-by: matanbaruch <[email protected]>

feat: Expose getDevices function in dbapi module

bd23bc0

Signed-off-by: matanbaruch <[email protected]>

matanbaruch force-pushed the master branch from 069cf62 to bd23bc0 Compare May 24, 2025 15:55

koral-- requested a review from Copilot May 25, 2025 00:49

Copilot AI reviewed May 25, 2025

View reviewed changes

denis99999 self-requested a review May 25, 2025 02:00

matanbaruch added 4 commits May 26, 2025 00:33

feat: Add secure device metrics aggregation function and update metri…

c73c31f

…cs collector Signed-off-by: matanbaruch <[email protected]>

fix: Update device metrics query to use group instead of groupBy for …

c44a40c

…improved performance Signed-off-by: matanbaruch <[email protected]>

fix: Update metrics collection to use $window instead of window for b…

8e22469

…etter compatibility Signed-off-by: matanbaruch <[email protected]>

Fix Copilot suggestions in metrics-hooks.js

ffcb820

Signed-off-by: matanbaruch <[email protected]>

matanbaruch requested a review from Copilot May 25, 2025 21:45

Copilot AI reviewed May 25, 2025

View reviewed changes

lib/util/metrics-hooks.js Show resolved Hide resolved

lib/units/api/controllers/metrics.js Show resolved Hide resolved

fix: Add $window dependency to GroupListCtrl for improved compatibility

89225ad

Signed-off-by: matanbaruch <[email protected]>

Merge branch 'master' into master

83ef628

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Prometheus metrics #860

feat: Prometheus metrics #860

Uh oh!

matanbaruch commented May 24, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

denis99999 commented May 25, 2025 •

edited

Loading

Uh oh!

matanbaruch commented May 25, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

denis99999 commented May 26, 2025 •

edited

Loading

Uh oh!

matanbaruch commented May 28, 2025 •

edited

Loading

Uh oh!

DaniilSmirnov commented Jun 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

feat: Prometheus metrics #860

Are you sure you want to change the base?

feat: Prometheus metrics #860

Uh oh!

Conversation

matanbaruch commented May 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Metrics Collection and Monitoring:

Codebase Enhancements:

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

denis99999 commented May 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

matanbaruch commented May 25, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

denis99999 commented May 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

matanbaruch commented May 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DaniilSmirnov commented Jun 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

matanbaruch commented May 24, 2025 •

edited

Loading

denis99999 commented May 25, 2025 •

edited

Loading

denis99999 commented May 26, 2025 •

edited

Loading

matanbaruch commented May 28, 2025 •

edited

Loading