
HPC Job Observability Service

A microservice for tracking and monitoring HPC (High Performance Computing) job resource utilization with Prometheus metrics export.

This service integrates with Slurm via prolog/epilog scripts to provide real-time job tracking and resource monitoring. It collects CPU, memory, and GPU metrics from running jobs using Linux cgroups v2, stores historical data in PostgreSQL, and exports metrics in Prometheus format for Grafana dashboards.

The Problem

In many HPC clusters, it is difficult for both administrators and users to understand how resources are actually utilized during a job’s execution. Common questions such as “Is my job using the GPU?” or “Did it fail due to running out of memory?” are surprisingly hard to answer without detailed, job-scoped time-series metrics.

Traditional monitoring solutions typically aggregate metrics at the host or node level, which obscures the behavior of individual jobs and makes debugging performance issues or failures time-consuming and error-prone.

Features

  • Slurm Integration: Event-based integration via prolog/epilog scripts
  • Real-time Job Tracking: Jobs created instantly when Slurm starts them
  • Accurate State Detection: Exit codes and signals captured for correct state mapping
  • Resource Metrics: CPU, memory, and GPU usage collected via Linux cgroups v2
  • Prometheus Export: Metrics in Prometheus format for Grafana dashboards
  • PostgreSQL Storage: Persistent storage with full audit trail
  • GPU Support: NVIDIA and AMD GPU metrics via vendor tools
  • Demo Mode: Mock backend with sample data for testing
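To illustrate the cgroups v2 collection mentioned above, the sketch below parses the `usage_usec` counter from `cpu.stat` content. In production the data would be read from a path such as `/sys/fs/cgroup/.../job_<id>/cpu.stat` (the exact layout depends on the Slurm cgroup plugin configuration); the sample string and function name here are illustrative assumptions, not the service's actual collector code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUStat extracts the usage_usec counter from cgroup v2 cpu.stat
// content. It returns the value and whether parsing succeeded.
func parseCPUStat(data string) (uint64, bool) {
	for _, line := range strings.Split(data, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == "usage_usec" {
			v, err := strconv.ParseUint(fields[1], 10, 64)
			return v, err == nil
		}
	}
	return 0, false
}

func main() {
	// Sample cpu.stat content; a real collector would read this from
	// the job's cgroup directory under /sys/fs/cgroup.
	sample := "usage_usec 1234567\nuser_usec 1000000\nsystem_usec 234567\n"
	usec, ok := parseCPUStat(sample)
	fmt.Println(usec, ok) // 1234567 true
}
```

Memory usage is even simpler: `memory.current` contains a single integer number of bytes.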

Architecture

┌─────────────────┐                          ┌─────────────────┐
│   HPC Cluster   │   Lifecycle Events       │   Prometheus    │
│  (Slurm Jobs)   │ ─────────────────────┐   │   + Grafana     │
└─────────────────┘                      │   └────────┬────────┘
        │                                │            │ scrape
        │ prolog: job-started            ▼            ▼
        │ epilog: job-finished    ┌─────────────────────────┐
        │ collector: metrics      │  Observability Service  │
        └────────────────────────▶│     (Go + REST API)     │
                                  └───────────┬─────────────┘
                                              │
                                              ▼
                                  ┌─────────────────────────┐
                                  │      PostgreSQL         │
                                  │  (jobs, metrics, audit) │
                                  └─────────────────────────┘

Quick Start

Prerequisites

  • Go 1.22+
  • Docker and Docker Compose (optional)

Running Locally

# Clone the repository, then build the server
go build -o server ./cmd/server

# Copy and configure environment variables
cp .env.example .env
# Edit .env with your settings

# Run with PostgreSQL
DATABASE_URL="postgres://user:pass@localhost/hpc?sslmode=disable" ./server

# Run with demo data (mock backend only)
SEED_DEMO=true SCHEDULER_BACKEND=mock DATABASE_URL="postgres://user:pass@localhost/hpc?sslmode=disable" ./server

Security Note: Never commit .env files containing secrets to version control. The .env file is already in .gitignore.

Running with Docker Compose

# Copy environment file (required for secrets)
cp .env.example .env
# Edit .env with your settings (especially passwords!)

# Start the full stack with Slurm integration
docker-compose --profile slurm up --build --force-recreate

# (Optional) Seed demo data when using mock backend
# Note: demo seeding is ignored when SCHEDULER_BACKEND=slurm
SEED_DEMO=true SCHEDULER_BACKEND=mock docker-compose up --build

# Or start only the Slurm container (for testing scheduler module)
docker-compose --profile slurm up slurm

# View Prometheus at http://localhost:9090
# View Grafana at http://localhost:3000 (credentials from .env)
# View app at http://localhost:8080

Grafana Dashboards

The project ships with a pre-provisioned Grafana dashboard. Example views from it are shown below:

Job Metrics

Job metrics dashboard

Node Metrics (overview)

Node metrics dashboard overview

Node Metrics (detail)

Node metrics dashboard detail

Storage Metrics

Storage metrics dashboard

Go Runtime Metrics

Go runtime metrics dashboard

Configuration

Server Configuration

Variable   Default   Description
PORT       8080      Server port
HOST       0.0.0.0   Server host

Database Configuration

Variable            Default          Description
DATABASE_URL        postgres://...   PostgreSQL connection string
POSTGRES_USER       hpc              PostgreSQL username (Docker)
POSTGRES_PASSWORD   -                PostgreSQL password (Docker)
POSTGRES_DB         hpc_jobs         PostgreSQL database name (Docker)

Metrics & Grafana Configuration

Variable                     Default   Description
METRICS_RETENTION_DAYS       7         Days to retain metrics before cleanup
GF_SECURITY_ADMIN_USER       admin     Grafana admin username
GF_SECURITY_ADMIN_PASSWORD   -         Grafana admin password

Demo Data

Use the SEED_DEMO environment variable to seed demo data on startup (mock backend only).

Development

Running Tests

Slurm Integration

The service uses an event-based architecture for Slurm integration:

Component       Endpoint                       Purpose
Prolog script   POST /v1/events/job-started    Creates job when Slurm starts it
Epilog script   POST /v1/events/job-finished   Updates job with exit code/signal
Collector       POST /v1/jobs/{id}/metrics     Records CPU/memory/GPU metrics

State detection uses exit codes and signals:

  • Exit code 0 = completed
  • Exit code non-zero = failed
  • Signal 9 (SIGKILL) or 15 (SIGTERM) = cancelled
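The rules above can be expressed as a small mapping function. This is a sketch of the described logic, not the service's actual implementation; the type and constant names are illustrative.

```go
package main

import "fmt"

// JobState is the terminal state derived from Slurm epilog data.
type JobState string

const (
	StateCompleted JobState = "completed"
	StateFailed    JobState = "failed"
	StateCancelled JobState = "cancelled"
)

// mapState applies the rules above: SIGKILL (9) or SIGTERM (15) means the
// job was cancelled; otherwise the exit code decides success or failure.
func mapState(exitCode, signal int) JobState {
	switch {
	case signal == 9 || signal == 15:
		return StateCancelled
	case exitCode == 0:
		return StateCompleted
	default:
		return StateFailed
	}
}

func main() {
	fmt.Println(mapState(0, 0))  // completed
	fmt.Println(mapState(1, 0))  // failed
	fmt.Println(mapState(0, 15)) // cancelled
}
```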

For detailed setup instructions, see:

Further development

  • Run collector on compute nodes:
    • Deploy a collector agent on each Slurm compute node that pushes metrics to the central service via API.
    • Development so far has been done on a single machine; further work is needed to containerize the collector and deploy it across nodes.

Documentation

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

License

Apache 2.0 License. See LICENSE for details.
