Compact Automated Reproducible Assessment of Machine Learning (CARAML) is a benchmark framework designed to assess mainstream Computer Vision (CV) and Natural Language Processing (NLP) workloads on novel accelerators. It has been developed and extensively tested on systems at the Jülich Supercomputing Centre (JSC).
CARAML provides a compact and automated benchmarking tool that leverages JUBE, a scripting-based framework for creating benchmark sets, running them across different systems, and evaluating results. Additionally, it includes power/energy measurements through the jpwr tool.
CARAML has been tested on the JURECA-DC Evaluation Platform, JURECA-DC, JEDI, and the WEST-AI Nodes. These systems include the following accelerators:
- AMD MI200 node with 4× MI250 GPUs (tag: `MI250`)
- Graphcore IPU-POD4 M2000 with 4× GC200 IPUs (tag: `GC200`)
- NVIDIA Ampere node (SXM) with 4× A100 GPUs (tag: `A100`)
- NVIDIA Hopper node (PCIe) with 4× H100 GPUs (tag: `H100`)
- NVIDIA Hopper node (NVLink) with 4× H100 GPUs (tag: `WAIH100`)
- NVIDIA Grace-Hopper chip with 1× GH200 GPU (tag: `GH200`)
- NVIDIA Grace-Hopper node with 4× GH200 GPUs (tag: `JEDI`)
CARAML currently provides two main benchmarks implemented in Python:
The `image_classification` model training benchmark is implemented in PyTorch. It is designed to test image classification models such as ResNet50 on various accelerators. For IPUs, graphcore/examples is used. Performance is measured in images/sec and energy is measured in Wh.
Note: Support for the Image Classification benchmark in TensorFlow has been discontinued.
The LLM training benchmark is implemented in PyTorch with:

- Megatron-LM at commit `f7727433293427bef04858f67b2889fe9b177d88`, with a patch applied, for NVIDIA
- Megatron-LM-ROCm at commit `21045b59127cd2d5509f1ca27d81fae7b485bd22`, with a patch applied, for AMD
- graphcore/examples (forked version) for Graphcore

Performance is measured in tokens/sec and energy is recorded in Wh.
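For orientation, the following is a minimal sketch of how such a pinned commit plus patch is reproduced manually; the patch file path is a placeholder, and in CARAML the checkout and patching are handled by the benchmark automation:

```bash
# Pin Megatron-LM to the commit used by CARAML for NVIDIA systems
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout f7727433293427bef04858f67b2889fe9b177d88
# Apply a patch on top of the pinned commit (file name is a placeholder)
git apply /path/to/megatron_nvidia.patch
```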
To run the benchmarks, you must install JUBE. Follow the JUBE Installation Documentation for setup instructions. The benchmarks are deployed using Apptainer containers and executed using SLURM on the tested accelerators.
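A minimal installation sketch, assuming a recent JUBE release is available from PyPI (the JUBE Installation Documentation remains the authoritative reference):

```bash
# Install JUBE into a fresh virtual environment
python3 -m venv jube-env
source jube-env/bin/activate
pip install JUBE
jube --version  # verify that the jube command is available
```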
- Image Classification: Synthetic data is generated on the host machine for benchmarking. For IPUs, the `synthetic` tag additionally allows generating the synthetic data directly on the IPU.
- LLM Training: A subset of the OSCAR dataset (790 samples, ~10 MB) is pre-processed using the GPT-2 tokenizer. This data is provided in the `llm_data` directory.
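Since the pre-processed data ships with the repository, no preparation step is needed. For reference only, here is a sketch of how such a subset is typically tokenized with Megatron-LM's `tools/preprocess_data.py`; the input and vocabulary file names are placeholders:

```bash
# Illustrative only: tokenize a JSONL text subset with the GPT-2 BPE tokenizer
python tools/preprocess_data.py \
    --input oscar_subset.jsonl \
    --output-prefix oscar_gpt2 \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file gpt2-vocab.json \
    --merge-file gpt2-merges.txt \
    --append-eod \
    --workers 8
```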
Clone the repository and navigate into it:
```bash
git clone https://github.com/FZJ-JSC/CARAML.git
cd CARAML
```
For the image classification benchmark:

- Set the `system` and `model` parameters in the JUBE config `image_classification/image_classification_torch_benchmark.xml`.
- To pull the required container, use the `container` tag:

  ```bash
  jube run image_classification/image_classification_torch_benchmark.xml --tag container H100
  ```

  For JSC systems, `H100` can be replaced with `GH200`, `MI250`, or `GC200` as required.
- To run the benchmark with the defined configurations:

  ```bash
  jube run image_classification/image_classification_torch_benchmark.xml --tag H100
  ```

  `H100` can be replaced with `A100`, `WAIH100`, `GH200`, `JEDI`, `MI250`, or `GC200` as required.
- After the benchmark has been executed, use `jube continue` to post-process the results:

  ```bash
  jube continue image_classification/image_classification_torch_benchmark_run -i last
  ```
- To generate the result table:

  ```bash
  jube result image_classification/image_classification_torch_benchmark_run -i last
  ```
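If a run fails or you want to inspect intermediate state, JUBE's `info` and `status` subcommands are useful; a short sketch, assuming the default output directory created by the steps above:

```bash
# Show metadata and the parameter space of the most recent run
jube info image_classification/image_classification_torch_benchmark_run -i last
# Check whether the most recent run has finished
jube status image_classification/image_classification_torch_benchmark_run -i last
```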
For the LLM training benchmark:

- Set the required `system` and `model` parameters in `llm_training/llm_benchmark_nvidia_amd.yaml` for NVIDIA and AMD devices, and in `llm_training/llm_benchmark_ipu.yaml` for Graphcore.
- To run the benchmark with the defined configurations for the `800M` GPT model with OSCAR data:

  ```bash
  jube run llm_training/llm_benchmark_nvidia_amd.yaml --tag 800M A100
  ```

  `A100` can be replaced with `H100`, `WAIH100`, `GH200`, `JEDI`, or `MI250` for the respective systems, and `800M` can be replaced with `13B` or `175B` for systems with more node resources, such as `JEDI`, `H100`, `A100`, and `MI250`.
- To run the benchmark with the defined configurations for the `117M` GPT model on Graphcore with synthetic data:

  ```bash
  jube run llm_training/llm_benchmark_ipu.yaml --tag 117M synthetic
  ```

  If the `synthetic` tag is not given, the benchmark uses OSCAR data.
- After the benchmark has been executed, use `jube continue` to post-process the results:

  ```bash
  jube continue llm_training/llm_benchmark_nvidia_amd_run -i last
  ```
- To generate the result table:

  ```bash
  jube result llm_training/llm_benchmark_nvidia_amd_run -i last
  ```
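The same post-processing applies to Graphcore runs; assuming JUBE's default output directory mirrors the YAML file name (a naming assumption, analogous to the NVIDIA/AMD case above), the commands would be:

```bash
# Assumed output directory name for the IPU benchmark
jube continue llm_training/llm_benchmark_ipu_run -i last
jube result llm_training/llm_benchmark_ipu_run -i last
```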
In order to use the PyTorch `torchrun` API on JSC systems, the fixed_torch_run.py fix is required. The fix solves the issue described here. Additionally, the hostname is appended with an `i` to allow communication over InfiniBand, as described here.
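A sketch of what this hostname convention looks like in a launch command; the training script name and port are illustrative, not CARAML's actual launcher:

```bash
# Resolve the first node of the SLURM allocation and append "i" so the
# rendezvous address resolves to the node's InfiniBand interface
MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)i"
torchrun --nnodes="$SLURM_NNODES" \
         --nproc_per_node=4 \
         --rdzv_backend=c10d \
         --rdzv_endpoint="${MASTER_ADDR}:29500" \
         train.py  # placeholder script
```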
If you use CARAML in your work, please cite:

```bibtex
@INPROCEEDINGS{10820809,
  author={John, Chelsea Maria and Nassyr, Stepan and Penke, Carolin and Herten, Andreas},
  booktitle={SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis},
  title={Performance and Power: Systematic Evaluation of AI Workloads on Accelerators with CARAML},
  year={2024},
  pages={1164-1176},
  doi={10.1109/SCW63240.2024.00158}
}
```