This repository implements a GPU-accelerated tiny neural network framework for Intel hardware, based on the original CUDA implementation tiny-cuda-nn. The implementation uses the Intel DPC++ compiler and relies on the SYCL language with optional ESIMD acceleration.
The network is optimized for loading both activation and weight matrices into the GPU's fast L1 memory and registers. Matrix multiplications are computed using Intel's joint_matrix extension, a high-level wrapper for systolic array operations.
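As a rough illustration of how a single tile-sized matrix product maps onto the joint_matrix API, the sketch below multiplies one bf16 tile pair into a float accumulator. It is not the framework's actual kernel: the tile shape, the packed B layout, and the pointer handling are simplifying assumptions, and the exact joint_matrix signatures vary between DPC++ releases.

```cpp
// Hedged sketch, not tiny-dpcpp-nn's actual kernel: one sub-group computes a
// single TM x TN accumulator tile as C += A * B using the joint_matrix
// extension. A is row-major bf16, B is assumed pre-packed (VNNI) bf16, and
// C is a float accumulator. All pointers are assumed to be USM device memory.
#include <sycl/sycl.hpp>

namespace jm = sycl::ext::oneapi::experimental::matrix;
using bf16 = sycl::ext::oneapi::bfloat16;

constexpr size_t TM = 8, TN = 16, TK = 16; // example XMX-friendly tile shape

void tile_matmul(sycl::queue &q, const bf16 *A, const bf16 *B, float *C,
                 size_t lda, size_t ldb_packed, size_t ldc) {
  q.parallel_for(
       sycl::nd_range<1>(sycl::range<1>(16), sycl::range<1>(16)), // one sub-group
       [=](sycl::nd_item<1> it) [[sycl::reqd_sub_group_size(16)]] {
         auto sg = it.get_sub_group();
         jm::joint_matrix<sycl::sub_group, bf16, jm::use::a, TM, TK,
                          jm::layout::row_major> a;
         jm::joint_matrix<sycl::sub_group, bf16, jm::use::b, TK, TN,
                          jm::layout::ext_intel_packed> b;
         jm::joint_matrix<sycl::sub_group, float, jm::use::accumulator, TM, TN> c;

         jm::joint_matrix_fill(sg, c, 0.0f);
         jm::joint_matrix_load(
             sg, a,
             sycl::address_space_cast<sycl::access::address_space::global_space,
                                      sycl::access::decorated::no>(A), lda);
         jm::joint_matrix_load(
             sg, b,
             sycl::address_space_cast<sycl::access::address_space::global_space,
                                      sycl::access::decorated::no>(B), ldb_packed);
         // The multiply-add is issued to the systolic array: c = a * b + c.
         jm::joint_matrix_mad(sg, c, a, b, c);
         jm::joint_matrix_store(
             sg, c,
             sycl::address_space_cast<sycl::access::address_space::global_space,
                                      sycl::access::decorated::no>(C), ldc,
             jm::layout::row_major);
       })
      .wait();
}
```

In the fused kernels, such tile operations are chained across layers while activations stay in registers and shared local memory; the sketch only shows the innermost building block.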
We benchmarked the training and inference throughput of our network on both Intel Data Center GPU Max Series (Ponte Vecchio) and Intel Arc Series GPUs and compared it with PyTorch.
To replicate the performance of the dpcpp code, please set `BUILD_BENCHMARK=ON` in `tiny-dpcpp-nn/CMakeLists.txt`, build `benchmark-all`, and run the benchmark from the `build/` folder using

```bash
I_MPI_DEBUG=3 I_MPI_OFFLOAD=1 I_MPI_OFFLOAD_DOMAIN=[1,2] mpirun -n 2 ./benchmarks/benchmark-all
```
To replicate the performance of the pytorch code, please run

```bash
cd python/ && python benchmark_pytorch.py
```

Finally, plot the results using

```bash
python benchmarks/plot_results.py
```
We reach speed-ups of up to 60x and up to 20x over PyTorch, depending on the benchmark configuration; the corresponding plots are produced by `plot_results.py`.
- High-Performance Computing: Optimized to run efficiently on Intel Data Center GPUs, enabling high-throughput training and inference with up to a 60x speed-up over PyTorch.
- Compatibility with PyTorch: Provides Python bindings that integrate seamlessly with the PyTorch ecosystem, enabling users to include GPU-accelerated MLPs in PyTorch applications.
- Versatile Neural Network Structures: Supports networks with multiple hidden layers and a variety of neuron configurations to fit different use cases and performance requirements.
- Multi-Resolution Hash Encoding: Includes an implementation of Multi-Resolution Hash Encoding, allowing the network to handle high-frequency features effectively.
- Cross-Platform Utilization: Designed to be run on various Intel GPUs, maximizing the portability and usability of the framework across different systems.
For detailed documentation, please refer to the tiny-dpcpp-nn documentation, and for a detailed description of our fully-fused algorithm, please refer to our paper.
To build the tiny-dpcpp-nn library, clone the GitHub repository onto your machine and put your code in the source folder. After cloning, if you choose to use the pybindings, please recursively pull the pybind11 repository via

```bash
git submodule update --init -- extern/pybind11
```
Then you can build the library using:

```bash
source /opt/intel/oneapi/setvars.sh
mkdir build && cd build/
cmake -D<options>=<ON/OFF> ..
make
```
where `<options>` are build options that can be toggled ON or OFF. See Build Options.
Note: To make use of the network, you have to disable implicit scaling on PVC, which can be done by uncommenting the portion of the code indicated in the sample when creating the queue.
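The sample code marks the exact lines to uncomment. Purely as an illustration of the idea, the sketch below creates the SYCL queue on a single PVC tile (sub-device) rather than on the root device, which is one standard way to avoid implicit scaling across tiles; it is an assumption about the mechanism, not the samples' actual code.

```cpp
// Hedged sketch: build the SYCL queue on a single tile (sub-device) of a
// multi-tile GPU such as PVC, so work is not implicitly scaled across tiles.
// This only illustrates the idea; follow the commented section in the samples
// for the code path the framework actually expects.
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::device root{sycl::gpu_selector_v};
  sycl::queue q{root}; // fallback: root device (may be implicitly scaled)
  try {
    // Partition the root device into its tiles and build the queue on tile 0.
    auto tiles = root.create_sub_devices<
        sycl::info::partition_property::partition_by_affinity_domain>(
        sycl::info::partition_affinity_domain::next_partitionable);
    q = sycl::queue{tiles.at(0)};
  } catch (const sycl::exception &) {
    // Single-tile devices cannot be partitioned; keep the root-device queue.
  }
  std::cout << "Running on: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";
  return 0;
}
```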
We provide a pybind wrapper of our tiny-dpcpp-nn implementation for seamless integration into PyTorch. Please refer to the tiny-dpcpp-nn pybind documentation for details.
Please recursively pull the pybind11 library:

```bash
git submodule update --init -- extern/pybind11
```
[Optional] - Load correct drivers, i.e., ensure that the oneAPI and agama versions match the required IPEX version:

```bash
module load intel-comp-rt/agama-ci-devel/803.29 intel/oneapi/2024.1 cmake/3.26.0
```
[Optional] - Create a conda environment:

```bash
conda create -n tiny-dpcpp-nn python=3.10 -y
conda activate tiny-dpcpp-nn
```
Install the latest ipex via

```bash
python -m pip install torch==2.1.0.post2 torchvision==0.16.0.post2 torchaudio==2.1.0.post2 intel-extension-for-pytorch==2.1.30+xpu oneccl_bind_pt==2.1.300+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
Note: please ensure that the IPEX version (2.1.30 in this example) is the same as the `IPEX_VERSION` used in `tiny-dpcpp-nn/CMakeLists.txt`.
Install the module (if no `TARGET_DEVICE` is set, the `target_device` in setup.py defaults to `ARC`; currently `PVC` and `ARC` are supported):

```bash
cd dpcpp_bindings
TARGET_DEVICE=ARC pip install -e .
```
Finally, to test the sample scripts and tests, install the requirements:

```bash
cd python && pip install -r requirements.txt
```
To test that the installation was successful, you can do the following four tests. Run the pytests:

```bash
cd test/python/ && pytest
```

and run the two python sample scripts:

```bash
cd samples && python benchmark_pytorch.py
cd samples && python mlp_learning_an_image_pytorch.py
```
When setting the additional flag `BUILD_TORCH_TEST=ON`, the libtorch tests (`tnn_api.h`) will be built. To build all tests, run:

```bash
cmake -DTARGET_DEVICE="PVC" -DBUILD_TORCH_TEST="ON" ..
```

After all tests are built into `build/`, you can run

```bash
cd build/ && make tests
```

to verify that the setup is correct. Please note that we provide tests for both the core dpcpp implementation and the libtorch wrapper implementation.
To test whether the pytorch bindings were installed correctly, please run:
- `pytest python/tests/test_compare_torch_dpcpp.py` to ensure that forward and backward passes work properly, and
- `python/tests/test_training_classification.py` and `python/tests/test_training_regression.py` to see if integration into PyTorch's optimiser works.
- The repository was developed and is maintained by Christoph Bauinger ([email protected]) and Kai Yuan ([email protected]).
- The original implementation of SwiftNet was developed by Darius Dabert (DariusDabert) and Adrien Tousnakhoff (Tousnaaa).
If you found this work useful, please consider citing it as:
```bibtex
@software{tiny-dpcpp-nn,
    author = {Bauinger, Christoph and Yuan, Kai},
    license = {BSD-3-Clause},
    month = {3},
    title = {{tiny-dpcpp-nn}},
    url = {https://github.com/intel/tiny-dpcpp-nn/},
    version = {0.1},
    year = {2024}
}
```