KeepGPU keeps shared GPUs from being reclaimed while you prep data, debug, or coordinate multi-stage pipelines. It allocates just enough VRAM and issues lightweight CUDA work so schedulers observe an “active” device, without running a full training job.
- 🧾 License: MIT
- 📚 Docs: https://keepgpu.readthedocs.io
On many clusters, idle GPUs are reaped or silently shared after a short grace period. The cost of losing your reservation (or discovering another job has taken your card) can dwarf the cost of a tiny keep-alive loop. KeepGPU is a minimal, auditable guardrail:
- Predictable – Single-purpose controller with explicit resource knobs (VRAM size, interval, utilization backoff).
- Polite – Uses NVML to read utilization and backs off when the GPU is busy.
- Portable – Typer/Rich CLI for humans; Python API for orchestrators and notebooks.
- Observable – Structured logging and optional file logs for auditing what kept the GPU alive.
- Power-aware – Uses intervalled elementwise ops instead of heavy matmul floods to present “busy” utilization while keeping power and thermals lower (see `CudaGPUController._run_mat_batch` for the loop; a conceptual sketch follows this list).
- NVML-backed – GPU telemetry comes from `nvidia-ml-py` (the `pynvml` module), with optional `rocm-smi` support when you install the `rocm` extra.
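To make the Power-aware bullet concrete, here is a conceptual sketch of that pattern, assuming only PyTorch: pin a buffer of VRAM and issue small elementwise bursts on an interval. This illustrates the idea only; it is not the code inside `CudaGPUController._run_mat_batch`.

```python
import time
import torch


def keep_alive_sketch(device: int = 0, bytes_to_pin: int = 1 << 30,
                      interval: float = 60.0, bursts: int = 3) -> None:
    """Pin some VRAM and issue light elementwise work so the GPU looks active."""
    dev = torch.device(f"cuda:{device}")
    # Allocate roughly `bytes_to_pin` worth of float32 values and keep the tensor alive.
    buf = torch.empty(bytes_to_pin // 4, dtype=torch.float32, device=dev)
    for _ in range(bursts):
        buf.mul_(1.0).add_(0.0)        # cheap elementwise ops, far lighter than a matmul flood
        torch.cuda.synchronize(dev)    # make sure the burst actually hits the device
        time.sleep(interval)           # idle between bursts to keep power and thermals low


# keep_alive_sketch(device=0)  # requires a CUDA-capable PyTorch build
```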
```bash
pip install keep-gpu
```

```bash
# Hold GPU 0 with 1 GiB VRAM and throttle if utilization exceeds 25%
keep-gpu --gpu-ids 0 --vram 1GiB --busy-threshold 25 --interval 60
```

Need a platform-specific PyTorch build? Install it first, then KeepGPU:

- CUDA (example: cu121)

  ```bash
  pip install --index-url https://download.pytorch.org/whl/cu121 torch
  pip install keep-gpu
  ```

- ROCm (example: rocm6.1)

  ```bash
  pip install --index-url https://download.pytorch.org/whl/rocm6.1 torch
  pip install keep-gpu[rocm]
  ```

- CPU-only

  ```bash
  pip install torch
  pip install keep-gpu
  ```
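After installing one of the variants above, a quick, illustrative way to confirm that the PyTorch build you picked actually sees an accelerator (this check is not part of KeepGPU itself):

```python
import torch

print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("ROCm/HIP build:", torch.version.hip is not None)
```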
Flags that matter:
- `--vram` (`1GiB`, `750MB`, or bytes): how much memory to pin.
- `--interval` (seconds): sleep between keep-alive bursts.
- `--busy-threshold`: skip work when NVML reports higher utilization.
- `--gpu-ids`: target a subset; otherwise all visible GPUs are guarded.
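Since the CLI and the Python API share the same controllers (see the parity note further down), the quickstart command can be approximated in Python. This is a sketch built from the constructor arguments shown elsewhere in this README; the stand-in workload is hypothetical:

```python
import time

from keep_gpu.global_gpu_controller.global_gpu_controller import GlobalGPUController

# Rough Python equivalent of the quickstart command:
#   keep-gpu --gpu-ids 0 --vram 1GiB --busy-threshold 25 --interval 60
with GlobalGPUController(gpu_ids=[0], vram_to_keep="1GiB", interval=60, busy_threshold=25):
    time.sleep(600)  # stand-in for whatever CPU-side work you need the reservation for
```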
```python
from keep_gpu.single_gpu_controller.cuda_gpu_controller import CudaGPUController

with CudaGPUController(rank=0, interval=0.5, vram_to_keep="1GiB", busy_threshold=20):
    preprocess_dataset()  # GPU is marked busy while you run CPU-heavy code

train_model()  # GPU freed after exiting the context
```

Need multiple devices?
```python
from keep_gpu.global_gpu_controller.global_gpu_controller import GlobalGPUController

with GlobalGPUController(gpu_ids=[0, 1], vram_to_keep="750MB", interval=90, busy_threshold=30):
    run_pipeline_stage()
```

- Battle-tested keep-alive loop built on PyTorch.
- NVML-based utilization monitoring (via `nvidia-ml-py`) to avoid hogging busy GPUs; optional ROCm SMI support via `pip install keep-gpu[rocm]` (see the sketch after this list).
- CLI + API parity: same controllers power both code paths.
- Continuous docs + CI: mkdocs + mkdocstrings build in CI to keep guidance up to date.
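For context on the back-off behaviour, the utilization number comes from NVML via `nvidia-ml-py`. A minimal sketch of that kind of query (not KeepGPU's internal code):

```python
import pynvml  # installed as the nvidia-ml-py package

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)

busy_threshold = 25
if util.gpu > busy_threshold:
    print(f"GPU 0 is {util.gpu}% busy; a polite keeper skips this burst")
pynvml.nvmlShutdown()
```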
- Install dev extras: `pip install -e ".[dev]"` (add `.[rocm]` if you need ROCm SMI).
- Fast CUDA checks: `pytest tests/cuda_controller tests/global_controller tests/utilities/test_platform_manager.py tests/test_cli_thresholds.py`
- ROCm-only tests carry `@pytest.mark.rocm`; run with `pytest --run-rocm tests/rocm_controller` (a sketch follows this list).
- Markers: `rocm` (needs ROCm stack) and `large_memory` (opt-in locally).
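For reference, a ROCm-only test carrying the marker above might look like this sketch (the test name and assertion are hypothetical):

```python
import pytest
import torch


@pytest.mark.rocm  # collected but skipped unless the suite runs with --run-rocm
def test_hip_backend_visible():
    # Hypothetical check: ROCm builds of PyTorch expose a HIP version string.
    assert torch.version.hip is not None
```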
Contributions are welcome—especially around ROCm support, platform fallbacks, and scheduler-specific recipes. Open an issue or PR if you hit edge cases on your cluster.
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
If you find KeepGPU useful in your research or work, please cite it as:
```bibtex
@software{Wangmerlyn_KeepGPU_2025,
  author    = {Wang, Siyuan and Shi, Yaorui and Liu, Yida and Yin, Yuqi},
  title     = {KeepGPU: a simple CLI app that keeps your GPUs running},
  year      = {2025},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.17129114},
  url       = {https://github.com/Wangmerlyn/KeepGPU},
  note      = {GitHub repository},
  keywords  = {ai, hpc, gpu, cluster, cuda, torch, debug}
}
```