AccelOpt is a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI accelerators, eliminating the need for expert-provided, hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs.
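The loop described above can be sketched in miniature as follows. All names here are hypothetical placeholders for illustration only, not AccelOpt's actual API; the LLM generation step and the profiler are stubbed out with toy stand-ins so the sketch runs on its own.

```python
# Minimal sketch of an optimization-memory loop (hypothetical names,
# not AccelOpt's actual API): each iteration proposes a new kernel
# variant, profiles it, and curates slow->fast pairs as experience.

def propose(kernel, memory):
    # Placeholder for the LLM generation step; a real system would
    # condition the prompt on the curated memory.
    return kernel + "_opt"

def profile(kernel):
    # Placeholder cost model standing in for hardware profiling.
    return 10.0 / (1 + kernel.count("_opt"))

def optimize(kernel, iterations=3):
    memory = []  # curated (slow, fast, insight) experiences
    best, best_latency = kernel, profile(kernel)
    for _ in range(iterations):
        candidate = propose(best, memory)
        latency = profile(candidate)
        if latency < best_latency:
            memory.append((best, candidate, "candidate was faster"))
            best, best_latency = candidate, latency
    return best, best_latency, memory
```

The point of the memory is that later generations are informed by earlier slow-fast pairs rather than starting from scratch each iteration.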
🚧 This repository is still under construction.
EC2 Instance: trn1.32xlarge
AMI: Deep Learning AMI Neuron (Ubuntu 22.04)
source /opt/aws_neuronx_venv_pytorch_2_7/bin/activate # Check the PyTorch version of your AMI
pip install logfire
pip install openai-agents
git clone [email protected]:zhang677/AccelOpt.git
cd AccelOpt
python setup.py install
experiments/full_complete_local shows how to run AccelOpt on NKIBench with a locally served gpt-oss-120b.
EC2 Instance: p5.48xlarge
AMI: Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.8 (Ubuntu 24.04)
# Create a PyTorch environment first
pip install logfire
pip install openai-agents
git clone https://github.com/zhang677/flashinfer-bench.git # DPS is currently not consistent with flashinfer-trace
cd flashinfer-bench
pip install -v -e .
cd ..
git clone [email protected]:zhang677/AccelOpt.git
cd AccelOpt
python setup.py install
experiments/flb_full_complete_local shows how to run AccelOpt on FlashInfer-Bench with a locally served gpt-oss-120b.
NKIBench is a new benchmark suite of AWS Trainium accelerator kernels of varying complexity, extracted from real-world LLM workloads, used to evaluate the effectiveness of AccelOpt.
All kernels are under /NKIBench, grouped by operator name and configuration in a structured storage format, as shown in /NKIBench/summary.json.
AccelOpt provides an NKIKernel class that is pluggable into any AI optimizer. /tests shows how to use the profiling API for a single NKI kernel and for groups of NKI kernels.
from accelopt.kernel_wrapper import NKIKernel
nki_kernel = NKIKernel(program_path, base_numpy_path)
result = nki_kernel.profile(save_fields)
NKIBench estimates the best achievable performance offered by the Trainium hardware, which provides additional insight into how effectively AccelOpt has explored the entire optimization landscape. The best achievable performance is calculated in experiments/full_complete_local/calculate_percentage_of_peak.py.
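As a rough illustration of the metric (not the actual code of calculate_percentage_of_peak.py), the percentage of peak compares a kernel's measured latency against the estimated best achievable latency; since latency is inversely proportional to throughput, the ratio is best over measured:

```python
def percentage_of_peak(measured_latency_us, best_achievable_latency_us):
    """Fraction of the hardware's estimated best performance, as a percentage.

    A kernel that matches the estimated best achievable latency scores 100%;
    a kernel twice as slow scores 50%.
    """
    return 100.0 * best_achievable_latency_us / measured_latency_us

print(percentage_of_peak(20.0, 10.0))  # 2x slower than best -> 50.0
```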
This work is a follow-up to Adaptive Self-improvement LLM Agentic System for ML Library Development (ICML 2025) [paper] [blog] [code]. If you find this project useful, please cite:
@inproceedings{zhang2025adaptive,
title={Adaptive Self-improvement LLM Agentic System for ML Library Development},
author={Zhang, Genghan and Liang, Weixin and Hsu, Olivia and Olukotun, Kunle},
booktitle={Forty-second International Conference on Machine Learning},
year={2025}
}
@article{zhang2025accelopt,
title={AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization},
author={Zhang, Genghan and Zhu, Shaowei and Wei, Anjiang and Song, Zhenyu and Nie, Allen and Jia, Zhen and Vijaykumar, Nandita and Wang, Yida and Olukotun, Kunle},
journal={arXiv preprint arXiv:2511.15915},
year={2025}
}