A microservice for tracking and monitoring HPC (High Performance Computing) job resource utilization with Prometheus metrics export.
This service integrates with Slurm via prolog/epilog scripts to provide real-time job tracking and resource monitoring. It collects CPU, memory, and GPU metrics from running jobs using Linux cgroups v2, stores historical data in PostgreSQL, and exports metrics in Prometheus format for Grafana dashboards.
In many HPC clusters, it is difficult for both administrators and users to understand how resources are actually utilized during a job’s execution. Common questions such as “Is my job using the GPU?” or “Did it fail due to running out of memory?” are surprisingly hard to answer without detailed, job-scoped time-series metrics.
Traditional monitoring solutions typically aggregate metrics at the host or node level, which obscures the behavior of individual jobs and makes debugging performance issues or failures time-consuming and error-prone.
- Slurm Integration: Event-based integration via prolog/epilog scripts
- Real-time Job Tracking: Jobs created instantly when Slurm starts them
- Accurate State Detection: Exit codes and signals captured for correct state mapping
- Resource Metrics: CPU, memory, and GPU usage collected via Linux cgroups v2
- Prometheus Export: Metrics in Prometheus format for Grafana dashboards
- PostgreSQL Storage: Persistent storage with full audit trail
- GPU Support: NVIDIA and AMD GPU metrics via vendor tools
- Demo Mode: Mock backend with sample data for testing
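The cgroups v2 collection mentioned above can be sketched in shell. `memory.current` and `cpu.stat` are standard cgroups v2 interface files; the exact per-job cgroup path depends on your Slurm cgroup plugin configuration, so the directory argument here is an assumption.

```shell
#!/bin/sh
# Sketch: read per-job usage from a cgroups v2 directory.
# The cgroup path is hypothetical; the real layout depends on the
# Slurm cgroup plugin (e.g. somewhere under /sys/fs/cgroup/).
read_job_usage() {
    cgroup_dir="$1"
    # memory.current: current memory usage in bytes (cgroups v2)
    mem_bytes=$(cat "$cgroup_dir/memory.current")
    # cpu.stat: cumulative CPU time; usage_usec is in microseconds
    cpu_usec=$(awk '/^usage_usec/ {print $2}' "$cgroup_dir/cpu.stat")
    echo "memory_bytes=$mem_bytes cpu_usec=$cpu_usec"
}
```

A collector would sample these files periodically and push the deltas to the service's metrics endpoint.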
┌─────────────────┐ ┌─────────────────┐
│ HPC Cluster │ Lifecycle Events │ Prometheus │
│ (Slurm Jobs) │ ─────────────────────┐ │ + Grafana │
└─────────────────┘ │ └────────┬────────┘
│ │ │ scrape
│ prolog: job-started ▼ ▼
│ epilog: job-finished ┌─────────────────────────┐
│ collector: metrics │ Observability Service │
└────────────────────────▶│ (Go + REST API) │
└───────────┬─────────────┘
│
▼
┌─────────────────────────┐
│ PostgreSQL │
│ (jobs, metrics, audit) │
└─────────────────────────┘
- Go 1.22+
- Docker and Docker Compose (optional)
# Clone and build
go build -o server ./cmd/server
# Copy and configure environment variables
cp .env.example .env
# Edit .env with your settings
# Run with PostgreSQL
DATABASE_URL="postgres://user:pass@localhost/hpc?sslmode=disable" ./server
# Run with demo data (mock backend only)
SEED_DEMO=true SCHEDULER_BACKEND=mock DATABASE_URL="postgres://user:pass@localhost/hpc?sslmode=disable" ./server

Security Note: Never commit .env files containing secrets to version control. The .env file is already in .gitignore.
# Copy environment file (required for secrets)
cp .env.example .env
# Edit .env with your settings (especially passwords!)
# Start the full stack with Slurm integration
docker-compose --profile slurm up --build --force-recreate
# (Optional) Seed demo data when using mock backend
# Note: demo seeding is ignored when SCHEDULER_BACKEND=slurm
SEED_DEMO=true SCHEDULER_BACKEND=mock docker-compose up --build
# Or start only the Slurm container (for testing scheduler module)
docker-compose --profile slurm up slurm
# View Prometheus at http://localhost:9090
# View Grafana at http://localhost:3000 (credentials from .env)
# View app at http://localhost:8080

The project ships with a pre-provisioned Grafana dashboard. Below are example views from that dashboard:
Job Metrics
Node Metrics (overview)
Node Metrics (detail)
| Variable | Default | Description |
|---|---|---|
| `PORT` | `8080` | Server port |
| `HOST` | `0.0.0.0` | Server host |
| Variable | Default | Description |
|---|---|---|
| `DATABASE_URL` | `postgres://...` | PostgreSQL connection string |
| `POSTGRES_USER` | `hpc` | PostgreSQL username (Docker) |
| `POSTGRES_PASSWORD` | - | PostgreSQL password (Docker) |
| `POSTGRES_DB` | `hpc_jobs` | PostgreSQL database name (Docker) |
| Variable | Default | Description |
|---|---|---|
| `METRICS_RETENTION_DAYS` | `7` | Days to retain metrics before cleanup |
| `GF_SECURITY_ADMIN_USER` | `admin` | Grafana admin username |
| `GF_SECURITY_ADMIN_PASSWORD` | - | Grafana admin password |
Use the SEED_DEMO environment variable to seed demo data on startup (mock backend only).
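Putting the variables above together, a minimal `.env` might look like the following. All values are placeholders; set real secrets locally and never commit them.

```shell
# .env — example values only
PORT=8080
HOST=0.0.0.0
DATABASE_URL=postgres://hpc:change-me@localhost/hpc_jobs?sslmode=disable
POSTGRES_USER=hpc
POSTGRES_PASSWORD=change-me
POSTGRES_DB=hpc_jobs
METRICS_RETENTION_DAYS=7
GF_SECURITY_ADMIN_USER=admin
GF_SECURITY_ADMIN_PASSWORD=change-me
SEED_DEMO=false
```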
The service uses an event-based architecture for Slurm integration:
| Component | Endpoint | Purpose |
|---|---|---|
| Prolog script | `POST /v1/events/job-started` | Creates job when Slurm starts it |
| Epilog script | `POST /v1/events/job-finished` | Updates job with exit code/signal |
| Collector | `POST /v1/jobs/{id}/metrics` | Records CPU/memory/GPU metrics |
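A prolog script along these lines could post the job-started event. `SLURM_JOB_ID` and `SLURM_JOB_USER` are standard Slurm prolog/epilog environment variables, but the JSON field names below are assumptions; check the API reference for the real request schema.

```shell
#!/bin/sh
# Sketch of a Slurm prolog script posting a job-started event.
# The payload field names are assumptions, not the service's schema.
SERVICE_URL="${SERVICE_URL:-http://localhost:8080}"

build_started_payload() {
    printf '{"job_id":"%s","user":"%s","node":"%s"}' \
        "$SLURM_JOB_ID" "$SLURM_JOB_USER" "$(hostname)"
}

notify_job_started() {
    curl -fsS -X POST "$SERVICE_URL/v1/events/job-started" \
        -H 'Content-Type: application/json' \
        -d "$(build_started_payload)"
}
```

An epilog script would POST to `/v1/events/job-finished` in the same way, adding the job's exit code and signal to the payload.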
State detection uses exit codes and signals:
- Exit code 0 = completed
- Exit code non-zero = failed
- Signal 9 (SIGKILL) or 15 (SIGTERM) = cancelled
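The mapping above can be sketched as a small shell function. Checking the signal before the exit code is an assumption here, since a killed job typically also reports a non-zero exit status.

```shell
#!/bin/sh
# Sketch of the state-mapping rules: exit code + signal -> job state.
job_state() {
    exit_code="$1"
    signal="$2"   # 0 when the job exited normally (no signal)
    if [ "$signal" -eq 9 ] || [ "$signal" -eq 15 ]; then
        echo cancelled    # SIGKILL / SIGTERM => cancelled
    elif [ "$exit_code" -eq 0 ]; then
        echo completed
    else
        echo failed
    fi
}
```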
For detailed setup instructions, see:
- Run the collector on compute nodes:
  - Deploy a collector agent on each Slurm compute node that pushes metrics to the central service via the API.
  - Development so far has been done on a single machine; further work is needed to containerize the agent and deploy it across nodes.
- Architecture - System design and component overview
- API Reference - Detailed endpoint documentation
- Development Guide - Setup and contribution guidelines
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Apache 2.0 License. See LICENSE for details.




