What's Changed
- Add grpo_compute_loss_logits.py to support different GRPO_LOSS function parameters by @qescccczmr in #57
- Add unit tests for scatter & gather. by @pdx1989 in #58
- Fix grpo unit_test. by @pdx1989 in #59
- add bwd grpo loss logits by @qescccczmr in #60
- add matmul for dicp triton by @zhaochaoxing in #63
- lightning_attn by @SHshenhao in #61
- support npu by @zhaochaoxing in #64
- Szl/grouped gemm by @sukoncon in #65
- add permute unpermute triton op by @qescccczmr in #66
- add kernels by @zhaochaoxing in #67
- add ascend unittest by @hellozmz in #68
- refactor grouped_gemm and matmul by @zhaochaoxing in #69
- add flash_attention_v2__fwd_kernel_hdim96 by @wxk-cmd in #71
- add k_grouped_matmul by @zhaochaoxing in #70
- add docs by @zhaochaoxing in #73
- register matmul_v1 and matmul_v2 by @zhaochaoxing in #74
- add layernorm by @zhaochaoxing in #75
- add lighting_indexer sparse_mla and sink_attention by @zhaochaoxing in #78
- fix for npu by @zhaochaoxing in #79
- Add 4d_matrix_multiply op. by @pdx1989 in #82
- fix BLOCK_SIZE_T by @rz2778 in #80
- upgrade deep_gemm 2.1.1+c9f8b34 by @zhaochaoxing in #83
- Add fusion gemm & matrix. by @pdx1989 in #84
- Sh/update deepep support by @SHshenhao in #86
New Contributors
- @qescccczmr made their first contribution in #57
- @sukoncon made their first contribution in #65
- @wxk-cmd made their first contribution in #71
- @rz2778 made their first contribution in #80
Full Changelog: v0.0.5...0.0.6