- Mumbai
- 03:20 (UTC +05:30)
- in/shlok-l-50180120b
- @shlok_fx
- https://leetcode.com/u/Shlok_Fx/
Pinned
- Mini-Attention (Public): FP16 Flash Attention 2 from scratch in CUDA C++, achieving 96% of cuDNN performance on SM120 (RTX 5090). CUDA.
- 100-days-cuda (Public): Documents my 100-day journey of learning and writing CUDA kernels.
- xDiT (Public, forked from xdit-project/xDiT): A scalable inference engine for Diffusion Transformers (DiTs) with massive parallelism. Python.
- HazyResearch/ThunderKittens (Public): Tile primitives for speedy kernels.
- SageAttention (Public, forked from thu-ml/SageAttention): [ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight] Quantized attention achieves a 2-5x speedup compared to FlashAttention, without losing end-to-end metrics across language, image, and video models. CUDA.