
HPC Job Observability Service

A microservice for tracking and monitoring HPC (High Performance Computing) job resource utilization with Prometheus metrics export.

This service integrates with Slurm via prolog/epilog scripts to provide real-time job tracking and resource monitoring. It collects CPU, memory, and GPU metrics from running jobs using Linux cgroups v2, stores historical data in PostgreSQL, and exports metrics in Prometheus format for Grafana dashboards.

The Problem

In many HPC clusters, it is difficult for both administrators and users to understand how resources are actually utilized during a job’s execution. Common questions such as “Is my job using the GPU?” or “Did it fail due to running out of memory?” are surprisingly hard to answer without detailed, job-scoped time-series metrics.

Traditional monitoring solutions typically aggregate metrics at the host or node level, which obscures the behavior of individual jobs and makes debugging performance issues or failures time-consuming and error-prone.

Features

  • Slurm Integration: Event-based integration via prolog/epilog scripts
  • Real-time Job Tracking: Jobs created instantly when Slurm starts them
  • Accurate State Detection: Exit codes and signals captured for correct state mapping
  • Resource Metrics: CPU, memory, and GPU usage collected via Linux cgroups v2
  • Prometheus Export: Metrics in Prometheus format for Grafana dashboards
  • PostgreSQL Storage: Persistent storage with full audit trail
  • GPU Support: NVIDIA and AMD GPU metrics via vendor tools
  • Demo Mode: Mock backend with sample data for testing
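To illustrate the cgroups v2 collection mentioned above, the sketch below parses the `usage_usec` counter from `cpu.stat` content. In production the data would be read from a path such as `/sys/fs/cgroup/.../job_<id>/cpu.stat` (the exact layout depends on the Slurm cgroup plugin configuration); the sample string and function name here are illustrative assumptions, not the service's actual collector code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUStat extracts the usage_usec counter from cgroup v2 cpu.stat
// content. It returns the value and whether parsing succeeded.
func parseCPUStat(data string) (uint64, bool) {
	for _, line := range strings.Split(data, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == "usage_usec" {
			v, err := strconv.ParseUint(fields[1], 10, 64)
			return v, err == nil
		}
	}
	return 0, false
}

func main() {
	// Sample cpu.stat content; a real collector would read this from
	// the job's cgroup directory under /sys/fs/cgroup.
	sample := "usage_usec 1234567\nuser_usec 1000000\nsystem_usec 234567\n"
	usec, ok := parseCPUStat(sample)
	fmt.Println(usec, ok) // 1234567 true
}
```

Memory usage is even simpler: `memory.current` contains a single integer number of bytes.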

Architecture

┌─────────────────┐                          ┌─────────────────┐
│   HPC Cluster   │   Lifecycle Events       │   Prometheus    │
│  (Slurm Jobs)   │ ─────────────────────┐   │   + Grafana     │
└─────────────────┘                      │   └────────┬────────┘
        │                                │            │ scrape
        │ prolog: job-started            ▼            ▼
        │ epilog: job-finished    ┌─────────────────────────┐
        │ collector: metrics      │  Observability Service  │
        └────────────────────────▶│     (Go + REST API)     │
                                  └───────────┬─────────────┘
                                              │
                                              ▼
                                  ┌─────────────────────────┐
                                  │      PostgreSQL         │
                                  │  (jobs, metrics, audit) │
                                  └─────────────────────────┘

Quick Start

Prerequisites

  • Go 1.22+
  • Docker and Docker Compose (optional)

Running Locally

# Clone the repository, then build the server
go build -o server ./cmd/server

# Copy and configure environment variables
cp .env.example .env
# Edit .env with your settings

# Run with PostgreSQL
DATABASE_URL="postgres://user:pass@localhost/hpc?sslmode=disable" ./server

# Run with demo data (mock backend only)
SEED_DEMO=true SCHEDULER_BACKEND=mock DATABASE_URL="postgres://user:pass@localhost/hpc?sslmode=disable" ./server

Security Note: Never commit .env files containing secrets to version control. The .env file is already in .gitignore.

Running with Docker Compose

# Copy environment file (required for secrets)
cp .env.example .env
# Edit .env with your settings (especially passwords!)

# Start the full stack with Slurm integration
docker-compose --profile slurm up --build --force-recreate

# (Optional) Seed demo data when using mock backend
# Note: demo seeding is ignored when SCHEDULER_BACKEND=slurm
SEED_DEMO=true SCHEDULER_BACKEND=mock docker-compose up --build

# Or start only the Slurm container (for testing scheduler module)
docker-compose --profile slurm up slurm

# View Prometheus at http://localhost:9090
# View Grafana at http://localhost:3000 (credentials from .env)
# View app at http://localhost:8080

Grafana Dashboards

The project ships with a pre-provisioned Grafana dashboard. Example views from it are shown below:

Job Metrics

Job metrics dashboard

Node Metrics (overview)

Node metrics dashboard overview

Node Metrics (detail)

Node metrics dashboard detail

Storage Metrics

Storage metrics dashboard

Go Runtime Metrics

Go runtime metrics dashboard

Configuration

Server Configuration

Variable   Default   Description
PORT       8080      Server port
HOST       0.0.0.0   Server host

Database Configuration

Variable            Default          Description
DATABASE_URL        postgres://...   PostgreSQL connection string
POSTGRES_USER       hpc              PostgreSQL username (Docker)
POSTGRES_PASSWORD   -                PostgreSQL password (Docker)
POSTGRES_DB         hpc_jobs         PostgreSQL database name (Docker)

Metrics & Grafana Configuration

Variable                     Default   Description
METRICS_RETENTION_DAYS       7         Days to retain metrics before cleanup
GF_SECURITY_ADMIN_USER       admin     Grafana admin username
GF_SECURITY_ADMIN_PASSWORD   -         Grafana admin password

Demo Data

Use the SEED_DEMO environment variable to seed demo data on startup (mock backend only).

Development

Running Tests

Slurm Integration

The service uses an event-based architecture for Slurm integration:

Component       Endpoint                       Purpose
Prolog script   POST /v1/events/job-started    Creates job when Slurm starts it
Epilog script   POST /v1/events/job-finished   Updates job with exit code/signal
Collector       POST /v1/jobs/{id}/metrics     Records CPU/memory/GPU metrics

State detection uses exit codes and signals:

  • Exit code 0 = completed
  • Exit code non-zero = failed
  • Signal 9 (SIGKILL) or 15 (SIGTERM) = cancelled
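The rules above can be expressed as a small mapping function. This is a sketch of the described logic, not the service's actual implementation; the type and constant names are illustrative.

```go
package main

import "fmt"

// JobState is the terminal state derived from Slurm epilog data.
type JobState string

const (
	StateCompleted JobState = "completed"
	StateFailed    JobState = "failed"
	StateCancelled JobState = "cancelled"
)

// mapState applies the rules above: SIGKILL (9) or SIGTERM (15) means the
// job was cancelled; otherwise the exit code decides success or failure.
func mapState(exitCode, signal int) JobState {
	switch {
	case signal == 9 || signal == 15:
		return StateCancelled
	case exitCode == 0:
		return StateCompleted
	default:
		return StateFailed
	}
}

func main() {
	fmt.Println(mapState(0, 0))  // completed
	fmt.Println(mapState(1, 0))  // failed
	fmt.Println(mapState(0, 15)) // cancelled
}
```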

For detailed setup instructions, see:

Further development

  • Run collector on compute nodes:
    • Deploy a collector agent on each Slurm compute node that pushes metrics to the central service via API.
    • Development so far has been done on a single machine; further work is needed to containerize the collector and deploy it across nodes.

Documentation

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

License

Apache 2.0 License. See LICENSE for details.
