- Mumbai
- 03:20 (UTC +05:30)
- in/shlok-l-50180120b
- @shlok_fx
- https://leetcode.com/u/Shlok_Fx/
Pinned
- Mini-Attention (Public): FP16 Flash Attention 2 from scratch in CUDA C++, achieving 96% of cuDNN performance on SM120 (RTX 5090). CUDA.
- 100-days-cuda (Public): Documents my 100-day journey of learning and writing CUDA kernels.
- xDiT (Public, forked from xdit-project/xDiT): A scalable inference engine for Diffusion Transformers (DiTs) with massive parallelism. Python.
- HazyResearch/ThunderKittens (Public): Tile primitives for speedy kernels.
- SageAttention (Public, forked from thu-ml/SageAttention): [ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight] Quantized attention achieves a 2-5x speedup compared to FlashAttention, without losing end-to-end metrics across language, image, and video models. CUDA.