Skip to content

Conversation

matanbaruch
Copy link

@matanbaruch matanbaruch commented May 24, 2025

This pull request introduces a comprehensive metrics collection and monitoring system for the STF (Smartphone Test Farm) application. The changes include adding Prometheus metrics support, implementing a metrics collector, and integrating real-time hooks for updating metrics. Additionally, the ESLint configuration has been updated for modern JavaScript support.

Metrics Collection and Monitoring:

  • Prometheus Metrics Integration:

    • Added Prometheus prom-client library for metrics collection (package.json, package.jsonR87).
    • Implemented metrics.js to define and manage custom metrics, including devices, users, and groups (lib/util/metrics.js, lib/util/metrics.jsR1-R161).
    • Created a /metrics endpoint to serve Prometheus-compatible metrics (lib/units/api/controllers/metrics.js, [1]; lib/units/api/swagger/api_v1.yaml, [2].
  • Metrics Collection Service:

    • Added MetricsCollector class to periodically gather metrics from the database and update Prometheus metrics (lib/util/metrics-collector.js, lib/util/metrics-collector.jsR1-R148).
    • Integrated the metrics collector into the API lifecycle, starting and stopping it with the application (lib/units/api/index.js, [1] [2].
  • Real-Time Metrics Hooks:

Codebase Enhancements:

  • ESLint Configuration Update:

    • Updated .eslintrc to support ES6+ features by enabling the es6 environment and setting ecmaVersion to 2017 (.eslintrc, .eslintrcL4-R8).
  • Database API Enhancements:

    • Added getDeviceMetrics function to securely aggregate device statistics without exposing sensitive data (lib/db/api.js, lib/db/api.jsR1664-R1715).

These changes collectively enhance the observability and maintainability of the STF system by providing detailed metrics for monitoring system health and usage.

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a Prometheus-based metrics system to STF, including real-time hooks, periodic collection, and a new /metrics endpoint.

  • Integrates prom-client with custom gauges and helper functions (lib/util/metrics.js)
  • Adds MetricsCollector service for periodic DB metric collection and lifecycle integration (lib/util/metrics-collector.js, lib/units/api/index.js)
  • Implements real-time update hooks (lib/util/metrics-hooks.js), frontend reporting (group-list-controller.js), and exposes /metrics via a controller and Swagger docs

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
res/app/group-list/group-list-controller.js Sends group stats from the UI to backend for metrics
package.json Adds prom-client dependency
lib/util/metrics.js Defines Prometheus metrics and update helpers
lib/util/metrics-hooks.js Implements real-time hooks for entity changes
lib/util/metrics-collector.js Collects metrics periodically from DB
lib/units/api/swagger/api_v1.yaml Documents the /metrics endpoint
lib/units/api/index.js Integrates MetricsCollector into the API lifecycle
lib/units/api/controllers/metrics.js Serves Prometheus metrics via /metrics
lib/db/api.js Exports getDevices() for metrics collection
.eslintrc Updates ESLint config to ES6/2017

@denis99999 denis99999 self-requested a review May 25, 2025 02:00
@denis99999
Copy link

denis99999 commented May 25, 2025

I have to review seriously this PR (I saw some problems, e.g. you can't export getDevice() and bypass the device access control, etc.), but later because no time currently.

@matanbaruch matanbaruch requested a review from Copilot May 25, 2025 21:45
@matanbaruch
Copy link
Author

I have to review seriously this PR (I saw some problems, e.g. you can't export getDevice() and bypass the device access control, etc.), but later because no time currently.

You are right. Fixed.

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR integrates Prometheus-based metrics into the STF platform, adding real-time hooks, a periodic collector service, and an exposed /metrics endpoint.

  • Adds prom-client and defines custom gauges in lib/util/metrics.js
  • Implements MetricsCollector for scheduled metric aggregation and MetricsHooks for real-time updates
  • Exposes a new /metrics API endpoint and updates frontend to emit group metrics

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
res/app/group-list/group-list-controller.js Sends group counts to backend for metrics
package.json Adds prom-client dependency
lib/util/metrics.js Defines Prometheus registry and custom gauges
lib/util/metrics-hooks.js Updates metrics on entity events in real time
lib/util/metrics-collector.js Periodically gathers and updates metrics from DB
lib/units/api/swagger/api_v1.yaml Defines /metrics path in Swagger spec
lib/units/api/index.js Starts/stops MetricsCollector in app lifecycle
lib/units/api/controllers/metrics.js Implements the /metrics controller
lib/db/api.js Adds getDeviceMetrics aggregation function
.eslintrc Updates parser options for ES6
Comments suppressed due to low confidence (2)

lib/util/metrics-collector.js:13

  • There are no tests covering the MetricsCollector class or its collectMetrics and lifecycle behavior. Adding unit tests for these methods will help ensure reliability of the new metrics feature.
class MetricsCollector {

lib/db/api.js:1705

  • The log variable is not defined in this scope. Please require the logger and create a log instance (e.g., const logger = require('../util/logger'); const log = logger.createLogger('dbapi');) at the top of the file.
log.error('Error getting device metrics:', error)

@denis99999
Copy link

denis99999 commented May 26, 2025

@matanbaruch , your PR raises some concerns, and I would need more time to respond, which I currently lack; I will have more time later this summer or later.

To quickly summarize, certain elements make it unacceptable in its current state.

Here are some comments in bulk:

  • I think you have not understood how access control works for groups, devices and users, in particular a simple user can only see the objects that he is authorized to see, but in your case as implemented your Prometheus server will see all the objects which is forbidden, unless the metrics API is tagged as a privileged operation with the admin tag, and therefore only the admin of the STF platform will be able to use it. In this case, it is possible, in your code, to directly use the getDevices() function. Besides, your implementation of getDeviceMetrics() did not solve any problem and is useless. You can therefore remove it.
  • It is not allowed to put your metrics service code in the controller/index.js file, you should ideally put it elsewhere as an independent service with which your API can communicate through the ZMQ bus, or at worst put everything in the metrics controller code..
  • it is not allowed to leave log.debug messages lying around in the code of a PR, and even less so when these messages appear every 30s.
  • The master branch of STF is not a testing ground, a PR must provide a complete functionality that's ready to use in production mode and therefore must not contain debug code or dead code. In your case, I may have misunderstood, but I haven't seen where the code from lib/util/metrics-hooks.js is used elsewhere. It may be in anticipation of a subsequent PR, but that's not acceptable.
  • Normally, you should not push directly from your master branch but from a branch dedicated to your PR, providing if possible a single commit, otherwise it quickly becomes unmanageable.
  • for this PR which assumes a third-party element which is a Prometheus server and possibly Grafana, and which I do not know a priori, you would have to provide the basic instructions to quickly set up a minimal test environment, otherwise we cannot observe anything and therefore validate anything because we do not necessarily have the time for that, it is up to you to show your white paw on the test of the functionality that you propose.
  • I'm also wondering about the merits of integrating this Prometheus API into the STF code, or will it be necessary to integrate other alternative solutions to meet specific needs? What need does your request meet since we can't see anything on your fork?
  • when you insert a copyright in your files, put your company and/or your name with the email, not the name of the feature
  • your API does not return JSON, this is not acceptable a priori
  • Your code has nothing to do with the res/.../group-list-controller.js file, and more importantly, we don't understand what it's for. In your comment, it's supposed to cause an update of the metrics on the backend side, but we don't see how this can be done via the window object? Normally, UI actions on STF objects are passed through the websocket or the API that actually performs the operations, so we don't need anything else a priori. We can nevertheless imagine the display of metrics on the UI, but in this case, we must create a new Metrics tab and not pollute the existing tabs.

If you agree with all my comments, I think it would be better to close this PR and propose a more mature one later taking into account all my observations?

@matanbaruch
Copy link
Author

matanbaruch commented May 28, 2025

@matanbaruch , your PR raises some concerns, and I would need more time to respond, which I currently lack; I will have more time later this summer or later.

To quickly summarize, certain elements make it unacceptable in its current state.

Thank you for the detailed feedback. I appreciate you taking the time despite your limited availability. Below are my responses to your comments.

  • I think you have not understood how access control works for groups, devices and users, in particular a simple user can only see the objects that he is authorized to see, but in your case as implemented your Prometheus server will see all the objects which is forbidden, unless the metrics API is tagged as a privileged operation with the admin tag, and therefore only the admin of the STF platform will be able to use it. In this case, it is possible, in your code, to directly use the getDevices() function. Besides, your implementation of getDeviceMetrics() did not solve any problem and is useless. You can therefore remove it.

I see your point regarding getDeviceMetrics, and I agree. I can remove it and instead expose the Metrics API only behind basic authentication for the admin user. The intention is that this endpoint be used strictly by the platform admin to monitor devices—not by regular users.

  • It is not allowed to put your metrics service code in the controller/index.js file, you should ideally put it elsewhere as an independent service with which your API can communicate through the ZMQ bus, or at worst put everything in the metrics controller code.

Understood. I’ll refactor it to reside in a more appropriate location as per your suggestion.

  • it is not allowed to leave log.debug messages lying around in the code of a PR, and even less so when these messages appear every 30s.

You're right. I will clean up the debug logs accordingly.

  • The master branch of STF is not a testing ground, a PR must provide a complete functionality that's ready to use in production mode and therefore must not contain debug code or dead code. In your case, I may have misunderstood, but I haven't seen where the code from lib/util/metrics-hooks.js is used elsewhere. It may be in anticipation of a subsequent PR, but that's not acceptable.

I understand. I'll remove the unused code and ensure the PR reflects only production-ready functionality.

  • Normally, you should not push directly from your master branch but from a branch dedicated to your PR, providing if possible a single commit, otherwise it quickly becomes unmanageable.

Thanks for the note. While I typically rely on squash-and-merge to keep history clean, I understand the importance of using dedicated branches for clarity. I’ll adopt that approach going forward.

  • for this PR which assumes a third-party element which is a Prometheus server and possibly Grafana, and which I do not know a priori, you would have to provide the basic instructions to quickly set up a minimal test environment, otherwise we cannot observe anything and therefore validate anything because we do not necessarily have the time for that, it is up to you to show your white paw on the test of the functionality that you propose.

That makes sense. I’ll include a README.md with setup instructions to make it easy to test the feature. Prometheus has become a widely adopted CNCF Graduated project, so the integration should feel familiar to many, but I agree clear guidance is still essential.

  • I'm also wondering about the merits of integrating this Prometheus API into the STF code, or will it be necessary to integrate other alternative solutions to meet specific needs? What need does your request meet since we can't see anything on your fork?

The goal here is to expose basic device-level metrics in a standardized way using Prometheus, which is a common tool in many monitoring stacks. It provides a strong starting point, and being open-source and extensible, it can accommodate further integrations if needed.

  • when you insert a copyright in your files, put your company and/or your name with the email, not the name of the feature

Understood, I’ll update the copyright accordingly. (Since it's an opensource didn't knew contributors names are inside the code)

  • your API does not return JSON, this is not acceptable a priori

Prometheus has a specific plain text exposition format designed for scraping, as outlined in their [data model documentation]. That said, I’ll double-check to ensure the output conforms strictly to their expected format.

  • Your code has nothing to do with the res/.../group-list-controller.js file, and more importantly, we don't understand what it's for. In your comment, it's supposed to cause an update of the metrics on the backend side, but we don't see how this can be done via the window object? Normally, UI actions on STF objects are passed through the websocket or the API that actually performs the operations, so we don't need anything else a priori. We can nevertheless imagine the display of metrics on the UI, but in this case, we must create a new Metrics tab and not pollute the existing tabs.

Got it. To clarify, the Metrics API is not meant for the STF UI at all—it's purely backend-facing and intended to be scraped by Prometheus. I’ll ensure there are no UI hooks or unrelated changes.

If you agree with all my comments, I think it would be better to close this PR and propose a more mature one later taking into account all my observations?

I agree with most of the feedback and will incorporate all necessary changes. However, I’m not sure how closing and reopening the PR would help in this case—especially as the ongoing discussion and review context would be lost. I’d prefer to revise this PR directly, clean up the commit history, and continue reviewing here. Once the work is complete, we can squash and merge as needed.

@DaniilSmirnov
Copy link

@matanbaruch, If you’d like, feel free to open the same PR on our fork of OpenSTF: https://github.com/VKCOM/devicehub.
We’re really interested in these changes, as our goal is to collect statistics on device usage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants