100 days of learning CUDA
Read chapter 2 of PMPP, wrote CUDA program to add two 1D vectors
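
A minimal sketch of what a 1D vector-addition program of this kind might look like. This is not the code written that day; `vecAddKernel` and the host wrapper `vecAdd` are hypothetical names, the block size of 256 is an assumption, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread adds one pair of elements; the bounds check covers the
// last block when n is not a multiple of the block size.
__global__ void vecAddKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical host wrapper: copies inputs to the device, launches the
// kernel, and copies the result back. Assumes a.size() == b.size().
std::vector<float> vecAdd(const std::vector<float>& a, const std::vector<float>& b) {
    int n = static_cast<int>(a.size());
    std::vector<float> c(n);
    if (n == 0) return c;
    size_t bytes = n * sizeof(float);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                          // assumed block size
    int blocks = (n + threads - 1) / threads;   // ceiling division
    vecAddKernel<<<blocks, threads>>>(dA, dB, dC, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```

Calling `vecAdd` from a `main` with two equal-length vectors returns their element-wise sum.
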
Read chapter 3 of PMPP, wrote CUDA programs to convert an RGB image to greyscale, blur a 2D image, and multiply 2D matrices
Read chapter 4 of PMPP, learnt about warps, blocks, SMs, how threads are assigned to an SM, and occupancy.
Read chapter 5 of PMPP, wrote CUDA program to do matrix multiplication using shared memory to reduce the number of global memory accesses
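
A rough sketch of shared-memory tiling for matrix multiplication, under some assumptions: row-major storage, a tile width of 16, and the hypothetical names `tiledMatMulKernel` and `matMul`. It is an illustration of the idea, not the program written that day, and it skips error checking.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;  // assumed tile width

// C (M x N) = A (M x K) * B (K x N), all row-major.
// Each block loads one tile of A and one tile of B into shared memory,
// so each global element is read once per tile instead of once per thread.
__global__ void tiledMatMulKernel(const float* A, const float* B, float* C,
                                  int M, int K, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Out-of-range elements are padded with zeros.
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before use

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // finish using the tile before overwriting it
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

// Hypothetical host wrapper; assumes A has M*K elements and B has K*N.
std::vector<float> matMul(const std::vector<float>& A, const std::vector<float>& B,
                          int M, int K, int N) {
    std::vector<float> C(static_cast<size_t>(M) * N);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    tiledMatMulKernel<<<grid, block>>>(dA, dB, dC, M, K, N);

    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

With a tile width of 16, each element of A and B is fetched from global memory roughly 16 times less often than in the one-thread-per-output version without tiling.
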
Read chapter 6 of PMPP, wrote CUDA programs to do matrix multiplication (A * B.T) using corner turning, and matrix multiplication using thread coarsening
Read chapter 7 of PMPP, wrote CUDA program to do a 2D convolution, reducing global memory accesses using shared memory and local memory
Read chapter 9 of PMPP, wrote CUDA program to compute a histogram. Applied shared memory, thread coarsening and memory coalescing to optimise the vanilla implementation.
Read chapter 10 of PMPP, wrote CUDA program to do a sum reduction on a 1D array. Applied memory coalescing, thread coarsening, hierarchical reduction etc. to optimise.
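
One possible way the three optimisations mentioned above can fit together, as a hedged sketch: the names `sumKernel` and `sumReduce`, the block size of 256, and the coarsening factor of 4 are all assumptions, and error checking is left out. Each thread first sums several elements with coalesced loads (thread coarsening), each block then reduces its partial sums in shared memory, and the per-block results are combined with `atomicAdd` (hierarchical reduction).

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr unsigned int BLOCK_DIM = 256;  // assumed threads per block
constexpr unsigned int COARSE = 4;       // assumed elements summed per thread

__global__ void sumKernel(const float* in, float* out, unsigned int n) {
    __shared__ float partial[BLOCK_DIM];
    unsigned int t = threadIdx.x;
    unsigned int base = blockIdx.x * BLOCK_DIM * COARSE + t;

    // Thread coarsening: neighbouring threads read neighbouring
    // addresses in every iteration, so the loads are coalesced.
    float sum = 0.0f;
    for (unsigned int k = 0; k < COARSE; ++k) {
        unsigned int i = base + k * BLOCK_DIM;
        if (i < n) sum += in[i];
    }
    partial[t] = sum;

    // Tree reduction in shared memory; active threads stay contiguous,
    // which limits control divergence.
    for (unsigned int stride = BLOCK_DIM / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (t < stride) partial[t] += partial[t + stride];
    }

    // Hierarchical step: one atomic per block combines block results.
    if (t == 0) atomicAdd(out, partial[0]);
}

// Hypothetical host wrapper returning the sum of all elements.
float sumReduce(const std::vector<float>& data) {
    unsigned int n = static_cast<unsigned int>(data.size());
    if (n == 0) return 0.0f;

    float *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, sizeof(float));  // accumulator starts at zero

    unsigned int perBlock = BLOCK_DIM * COARSE;
    unsigned int blocks = (n + perBlock - 1) / perBlock;
    sumKernel<<<blocks, BLOCK_DIM>>>(dIn, dOut, n);

    float result = 0.0f;
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Because floating-point addition is not associative, the result for general inputs can differ slightly from a sequential sum, and the order of the atomic updates is not fixed between runs.
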
Read the first half of chapter 11 of PMPP, wrote CUDA program to compute a prefix sum using the Kogge-Stone algorithm.
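
A minimal sketch of a Kogge-Stone inclusive scan, restricted to a single block for simplicity. The names `koggeStoneScanKernel` and `inclusiveScan` and the section size of 256 are assumptions for illustration; inputs longer than one section would need a multi-block scheme that is not shown here.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>
#include <vector>

constexpr unsigned int SECTION_SIZE = 256;  // assumed elements per block

// Inclusive prefix sum of up to SECTION_SIZE elements within one block.
__global__ void koggeStoneScanKernel(const float* x, float* y, unsigned int n) {
    __shared__ float xy[SECTION_SIZE];
    unsigned int t = threadIdx.x;
    xy[t] = (t < n) ? x[t] : 0.0f;

    // In each step, every element adds the value `stride` positions to
    // its left; the stride doubles until it covers the whole section.
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        float temp = 0.0f;
        if (t >= stride) temp = xy[t] + xy[t - stride];
        __syncthreads();  // all reads finish before any write (avoids a race)
        if (t >= stride) xy[t] = temp;
    }
    if (t < n) y[t] = xy[t];
}

// Hypothetical host wrapper; only handles inputs that fit in one section.
std::vector<float> inclusiveScan(const std::vector<float>& in) {
    unsigned int n = static_cast<unsigned int>(in.size());
    if (n > SECTION_SIZE)
        throw std::invalid_argument("sketch handles at most SECTION_SIZE elements");
    std::vector<float> out(n);
    if (n == 0) return out;

    float *dX, *dY;
    cudaMalloc(&dX, n * sizeof(float));
    cudaMalloc(&dY, n * sizeof(float));
    cudaMemcpy(dX, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    koggeStoneScanKernel<<<1, SECTION_SIZE>>>(dX, dY, n);

    cudaMemcpy(out.data(), dY, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX);
    cudaFree(dY);
    return out;
}
```

For the input 1, 2, 3, 4, 5 the inclusive scan is 1, 3, 6, 10, 15. The second `__syncthreads()` in the loop is what keeps a thread from overwriting a value that a neighbour still needs to read in the same step.
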
Read the rest of chapter 11 of PMPP, wrote CUDA programs to compute a prefix sum using the Brent-Kung algorithm and its thread-coarsened version.
Implemented 1D vector addition using Triton. Implemented FlashAttention (forward pass) using Triton.