forked from NVIDIA/TransformerEngine
[pull] main from NVIDIA:main #10
Merged
Replaced deprecated pkg_resources with packaging
Signed-off-by: Alp Dener <[email protected]>
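The swap is mechanical wherever pkg_resources was only used for version comparisons. A minimal sketch of the before/after, with `is_at_least` as an illustrative helper rather than TE's actual code:

```python
# Version comparison with the maintained `packaging` library instead of the
# deprecated pkg_resources. `is_at_least` is a hypothetical helper, not TE code.
from packaging.version import Version, parse

def is_at_least(installed: str, required: str) -> bool:
    # Before: from pkg_resources import parse_version
    #         parse_version(installed) >= parse_version(required)
    return parse(installed) >= Version(required)

assert is_at_least("9.1.0.70", "9.0")
```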
* added alignment requirements for cuBLAS heuristics
* minor rewords
* added unit test for GEMM with unaligned inputs
* added pytest skip if FP8 is not available
* changed offset so that it is 128-byte aligned
Signed-off-by: Phuong Nguyen <[email protected]>
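cuBLASLt heuristics can select different (or no) kernels depending on pointer alignment, which is why the test deliberately constructs unaligned inputs. A sketch of how one might measure a tensor's alignment from Python; TE's actual check lives in its C++ GEMM path, and `ptr_alignment` is illustrative:

```python
import torch

def ptr_alignment(t: torch.Tensor, max_align: int = 256) -> int:
    """Largest power-of-two alignment (in bytes) of a tensor's data pointer."""
    addr = t.data_ptr()
    align = 1
    while align < max_align and addr % (align * 2) == 0:
        align *= 2
    return align

x = torch.empty(1024)  # allocators typically return well-aligned memory
y = x[1:]              # a one-element (4-byte) offset breaks 128-byte alignment
print(ptr_alignment(x), ptr_alignment(y))
```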
* Fixed the shape mismatch issue in MLP
* Added a corresponding test
Signed-off-by: Ming Huang <[email protected]>
Co-authored-by: Phuong Nguyen <[email protected]>
TE checkpoint now preserves the torch autocast context from the forward pass during the recompute phase
Signed-off-by: Alp Dener <[email protected]>
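Without this, a layer checkpointed inside `torch.autocast` would be recomputed outside it, so the two passes could run in different dtypes. A minimal sketch of the idea, assuming that capturing and replaying the autocast state is sufficient; `make_autocast_preserving` is a hypothetical wrapper, not TE's implementation:

```python
import torch
from torch.utils.checkpoint import checkpoint

def make_autocast_preserving(fn):
    """Capture the ambient autocast state now; replay it when fn is recomputed."""
    enabled = torch.is_autocast_enabled()
    dtype = torch.get_autocast_gpu_dtype()

    def wrapped(*args, **kwargs):
        with torch.autocast("cuda", dtype=dtype, enabled=enabled):
            return fn(*args, **kwargs)

    return wrapped

# Usage inside a forward pass:
#   out = checkpoint(make_autocast_preserving(block), x, use_reentrant=False)
```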
* Fixed Llama tutorial: changed batch size and added fused=True
* Tutorial updated but not complete yet
* Tutorial notebook reset; removed fused=True
* Removed fused=True
* Batch size back to 8
* Fixed a typo and a commented-out line
* Fixed whitespace
* Added comment to attention line; fixed potential bug with loading weights - loading now works correctly, confirmed by the generation code
* Comments
* Model cast added again
* Weight download info
* Moved parameter gate_proj_size to config
* Removed gate_proj_size and used immediate_size instead
* Added Llama 3 to the tutorial
* Typo fixes
* Fixed model loading
* Loading fix
* Different dim for attention
* Reverted the previous commit
* Changed name to kv_channels
* Fixed typo
* Back to kv_channels in transformer layer
* Small bug fix
* Test fix
* Changed file modes
* Lint fix and resolved conflict
* Lint fix, hopefully the last
Signed-off-by: Pawel Gadzinski <[email protected]>
Co-authored-by: Sudhakar Singh <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Allow bias support on sm80/86/89 with cuDNN 9+
Signed-off-by: Charlene Yang <[email protected]>
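The gate is a function of both the GPU architecture and the cuDNN version. A sketch of the shape of such a predicate, assuming (per the commit subject) that sm90 already had support and that Ampere/Ada gain it with cuDNN 9; the real dispatch happens in TE's C++ backend selection, and `bias_supported` is a hypothetical name:

```python
import torch

def bias_supported(cudnn_version: int) -> bool:
    """Fused-attention bias gate (illustrative, not TE's real predicate)."""
    major, minor = torch.cuda.get_device_capability()
    sm = 10 * major + minor
    if sm in (80, 86, 89):           # Ampere / Ada: needs cuDNN 9+
        return cudnn_version >= 90000
    return sm >= 90                  # Hopper: already supported

# torch.backends.cudnn.version() returns e.g. 90100 for cuDNN 9.1
print(bias_supported(torch.backends.cudnn.version()))
```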
Signed-off-by: Tim Moon <[email protected]>
* Use correct FP8 group in multi-GPU docs: the FP8 process group should be the tensor-parallel group
* Synchronize FP8 scales over the world group in multi-GPU docs
Signed-off-by: Tim Moon <[email protected]>
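The knob in question is `fp8_autocast`'s `fp8_group` argument, which sets the process group over which FP8 amax/scale statistics are reduced. A minimal sketch of the documented pattern, with `model` and `inp` as placeholders:

```python
import torch.distributed as dist
import transformer_engine.pytorch as te

dist.init_process_group(backend="nccl")

# Reduce FP8 amax/scale statistics across every rank so all copies of a
# weight see the same scaling factors.
world_group = dist.new_group(ranks=list(range(dist.get_world_size())))

with te.fp8_autocast(enabled=True, fp8_group=world_group):
    out = model(inp)  # `model` and `inp` are placeholders
```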
Make sure RoPE frequencies are in FP32
Signed-off-by: Tim Moon <[email protected]>
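Building the rotary table in half precision loses resolution at large position indices, so the frequencies should be computed in FP32 regardless of the model dtype. A minimal sketch of the standard construction (not TE's exact code):

```python
import torch

def rope_freqs(seq_len: int, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotary position frequencies, deliberately computed in FP32."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    t = torch.arange(seq_len, dtype=torch.float32)
    return torch.outer(t, inv_freq)  # shape [seq_len, dim // 2], dtype float32

freqs = rope_freqs(seq_len=4096, dim=128)
assert freqs.dtype == torch.float32
```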
* Change the documentation footer
* Update docs toolchain versions
Signed-off-by: Przemek Tredak <[email protected]>
* Initial refactor of FP8 workspaces in Linear module
* Remove extra kernel launch
* Minor perf optimizations: Tensor base class functions in Float8Tensor have significant overhead
* Debug FP8 recipe test
* Refactor FP8 workspaces in LayerNormLinear and LayerNormMLP
* Document FP8 workspace function
* Revert changes to FP8 recipe tests
* Add support for lazy FP8 transpose caching: the previous caching behavior (always fill cache) incorrectly filled the cache during CUDA graph warmup steps
* Fix Pylint warnings
* Debug ONNX export: ONNX FP8 cast ops assumed that FP8 scales were created during model export (i.e. not initialized during training)
* Debug fused attention tests
* Make sure Float8Tensor.transpose_2d is backward compatible
* Revert changes to ONNX export operations: work around ONNX test failures by filling FP8 scale tensors instead of copying into them
* Debug scale factor update in Float8Tensor transpose_2d
Signed-off-by: Tim Moon <[email protected]>
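The lazy-caching item is the interesting piece: FP8 GEMMs often need both a tensor and its transpose, and recomputing the transpose every step wastes bandwidth, but eagerly filling the cache during CUDA-graph warmup captures stale data. A toy sketch of the lazy pattern, assuming the caller controls when the cache may be filled; `CachedFP8` is illustrative, not Float8Tensor:

```python
import torch

class CachedFP8:
    """Toy stand-in for a tensor that lazily caches its 2D transpose."""

    def __init__(self, data: torch.Tensor):
        self._data = data
        self._transpose = None  # filled on demand, never eagerly

    def transpose_2d(self, fill_cache: bool = True) -> torch.Tensor:
        if self._transpose is not None:
            return self._transpose
        t = self._data.t().contiguous()
        if fill_cache:            # skipped e.g. during CUDA-graph warmup
            self._transpose = t
        return t

    def invalidate(self) -> None:
        self._transpose = None    # must be called whenever _data changes
```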
…TE (#867)
* add multi-tensor kernels
* add FusedAdam
* add test to qa
* add FusedSGD
* fix lint
Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
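After this port, the fused multi-tensor optimizers ship with TE itself rather than requiring Apex. A short usage sketch, assuming the `transformer_engine.pytorch.optimizers` module path that this PR introduces:

```python
import torch
from transformer_engine.pytorch.optimizers import FusedAdam

model = torch.nn.Linear(1024, 1024).cuda()
opt = FusedAdam(model.parameters(), lr=1e-4)  # multi-tensor fused update

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
opt.step()
opt.zero_grad()
```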
* add THD support
* add seq_offsets_o and use new offset calculation
* addition to previous commit; fix unit test
* add None for offset_o gradient
* fix lint
* WIP: test padding between sequences
* WIP: fix tests for padding between sequences
* fix tests for sbhd/bshd layouts; clean up
* update cudnn-frontend and add tests for max_seqlen_q=1 and d=256 for inference
* test sbhd/bshd layouts for the sq=1, d=256 inference case
* fix lint
* replace wording from accumulative to cumulative
* add offset tensors to custom FP8 MHA tests
* add version control for cuDNN
* add sm>=90 constraint for THD support
* fix cuDNN support for sq=1, d=256
* fix lint and minor tweak for FP8 tests
* modify cuDNN version and restrict MQA/GQA support for THD
* add notes for seq offset tensors
* add dummy tensor to pass JAX build
* add dummy tensor to pass Paddle build
* fix JAX CI
Signed-off-by: Charlene Yang <[email protected]>
Signed-off-by: cyanguwa <[email protected]>
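In the THD layout, variable-length sequences are packed into one [total_tokens, heads, dim] tensor and addressed via cumulative sequence-length offsets instead of a batch dimension. A minimal sketch of that bookkeeping (TE derives these offset tensors internally; the names here are illustrative):

```python
import torch

# Three packed sequences of lengths 5, 3, and 7 tokens.
seq_lens = torch.tensor([5, 3, 7], dtype=torch.int32)

cu_seqlens = torch.zeros(seq_lens.numel() + 1, dtype=torch.int32)
cu_seqlens[1:] = torch.cumsum(seq_lens, dim=0)
print(cu_seqlens)  # tensor([ 0,  5,  8, 15], dtype=torch.int32)

# Tokens of sequence i occupy rows cu_seqlens[i]:cu_seqlens[i+1]
# of the packed [total_tokens, num_heads, head_dim] tensor.
```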
See Commits and Changes for more details.

Created by pull[bot]. Can you help keep this open source service alive? 💖 Please sponsor : )