ShlokVFX

Organizations

@bits-bsc-cs


Pinned

  1. Mini-Attention

    FP16 Flash Attention 2 from scratch in CUDA C++, achieving 96% of cuDNN performance on SM120 (RTX 5090)

    CUDA · 5 stars

  2. 100-days-cuda

    This repository documents my 100-day journey of learning and writing CUDA kernels.

    Jupyter Notebook · 28 stars · 1 fork

  3. xDiT

    Forked from xdit-project/xDiT

    xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism

    Python

  4. HazyResearch/ThunderKittens

    Tile primitives for speedy kernels

    CUDA · 3.3k stars · 273 forks

  5. SageAttention

    Forked from thu-ml/SageAttention

    [ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

    CUDA
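Several of the pinned repositories center on Flash Attention-style kernels. The core idea those kernels share is the online (streaming) softmax: scores are consumed one block at a time while maintaining only a running max, running denominator, and running weighted sum, so the full attention matrix is never materialized. The sketch below is purely illustrative, not code from any of the repositories above; it shows the single-query case in plain Python and checks it against a naive softmax reference.

```python
import math

def attention_online(q, K, V):
    """Single-query attention via the online-softmax trick used by
    Flash Attention: visit keys/values one at a time, carrying a
    running max m, running denominator d, and running value sum acc,
    rescaling the partial results whenever the max increases."""
    m = float("-inf")          # running max of scores seen so far
    d = 0.0                    # running softmax denominator
    acc = [0.0] * len(V[0])    # running weighted sum of value rows
    for k, v in zip(K, V):
        s = sum(qi * ki for qi, ki in zip(q, k))  # dot-product score
        m_new = max(m, s)
        scale = math.exp(m - m_new)               # rescale old partials
        p = math.exp(s - m_new)                   # weight of this row
        d = d * scale + p
        acc = [a * scale + p * vi for a, vi in zip(acc, v)]
        m = m_new
    return [a / d for a in acc]

def attention_naive(q, K, V):
    """Reference: materialize all scores, then take a full softmax."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    mx = max(scores)
    w = [math.exp(s - mx) for s in scores]
    z = sum(w)
    return [sum(w[i] * V[i][j] for i in range(len(V))) / z
            for j in range(len(V[0]))]
```

Both functions return the same output; the streaming version simply never holds more than one score at a time, which is what lets the CUDA kernels keep the whole computation in registers and shared memory.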