forked from NVIDIA/TransformerEngine
[pull] main from NVIDIA:main #10
Merged
Replaced deprecated pkg_resources with packaging
Signed-off-by: Alp Dener <[email protected]>
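The swap is mechanical wherever pkg_resources was only used for version comparisons. A minimal sketch of the before/after, with `is_at_least` as an illustrative helper rather than TE's actual code:

```python
# Version comparison with the maintained `packaging` library instead of the
# deprecated pkg_resources. `is_at_least` is a hypothetical helper, not TE code.
from packaging.version import Version, parse

def is_at_least(installed: str, required: str) -> bool:
    # Before: from pkg_resources import parse_version
    #         parse_version(installed) >= parse_version(required)
    return parse(installed) >= Version(required)

assert is_at_least("9.1.0.70", "9.0")
```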
* added alignment requirements for cuBLAS heuristics
* minor rewords
* added unit test for GEMM with unaligned inputs
* added pytest skip if FP8 is not available
* changed offset so that it is 128-byte aligned
Signed-off-by: Phuong Nguyen <[email protected]>
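cuBLASLt heuristics can select different (or no) kernels depending on pointer alignment, which is why the test deliberately constructs unaligned inputs. A sketch of how one might measure a tensor's alignment from Python; TE's actual check lives in its C++ GEMM path, and `ptr_alignment` is illustrative:

```python
import torch

def ptr_alignment(t: torch.Tensor, max_align: int = 256) -> int:
    """Largest power-of-two alignment (in bytes) of a tensor's data pointer."""
    addr = t.data_ptr()
    align = 1
    while align < max_align and addr % (align * 2) == 0:
        align *= 2
    return align

x = torch.empty(1024)  # allocators typically return well-aligned memory
y = x[1:]              # a one-element (4-byte) offset breaks 128-byte alignment
print(ptr_alignment(x), ptr_alignment(y))
```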
* Fixed the shape mismatch issue in MLP
* Added a corresponding test
Signed-off-by: Ming Huang <[email protected]>
Co-authored-by: Phuong Nguyen <[email protected]>
TE checkpoint now preserves the torch autocast context from the forward pass during the recompute phase
Signed-off-by: Alp Dener <[email protected]>
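Without this, a layer checkpointed inside `torch.autocast` would be recomputed outside it, so the two passes could run in different dtypes. A minimal sketch of the idea, assuming that capturing and replaying the autocast state is sufficient; `make_autocast_preserving` is a hypothetical wrapper, not TE's implementation:

```python
import torch
from torch.utils.checkpoint import checkpoint

def make_autocast_preserving(fn):
    """Capture the ambient autocast state now; replay it when fn is recomputed."""
    enabled = torch.is_autocast_enabled()
    dtype = torch.get_autocast_gpu_dtype()

    def wrapped(*args, **kwargs):
        with torch.autocast("cuda", dtype=dtype, enabled=enabled):
            return fn(*args, **kwargs)

    return wrapped

# Usage inside a forward pass:
#   out = checkpoint(make_autocast_preserving(block), x, use_reentrant=False)
```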
* Fixed Llama tutorial: changed batch size and added fused=True
* Tutorial updated but not complete yet
* Tutorial notebook reset; removed fused=True
* Removed fused=True
* Batch size back to 8
* Fixed a typo and a commented-out line
* Fixed whitespace
* Added comment to attention line; fixed potential bug with loading weights - loading now works correctly, confirmed by the generation code
* Comments
* Model cast added again
* Weight download info
* Moved parameter gate_proj_size to config
* Removed gate_proj_size and used immediate_size instead
* Added Llama 3 to the tutorial
* Typo fixes
* Fixed model loading
* Loading fix
* Different dim for attention
* Reverted the previous commit
* Changed name to kv_channels
* Fixed typo
* Back to kv_channels in transformer layer
* Small bug fix
* Test fix
* Changed file modes
* Lint fix and resolved conflict
* Lint fix, hopefully the last
Signed-off-by: Pawel Gadzinski <[email protected]>
Co-authored-by: Sudhakar Singh <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Allow bias support on sm80/86/89 with cuDNN 9+
Signed-off-by: Charlene Yang <[email protected]>
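The gate is a function of both the GPU architecture and the cuDNN version. A sketch of the shape of such a predicate, assuming (per the commit subject) that sm90 already had support and that Ampere/Ada gain it with cuDNN 9; the real dispatch happens in TE's C++ backend selection, and `bias_supported` is a hypothetical name:

```python
import torch

def bias_supported(cudnn_version: int) -> bool:
    """Fused-attention bias gate (illustrative, not TE's real predicate)."""
    major, minor = torch.cuda.get_device_capability()
    sm = 10 * major + minor
    if sm in (80, 86, 89):           # Ampere / Ada: needs cuDNN 9+
        return cudnn_version >= 90000
    return sm >= 90                  # Hopper: already supported

# torch.backends.cudnn.version() returns e.g. 90100 for cuDNN 9.1
print(bias_supported(torch.backends.cudnn.version()))
```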
Signed-off-by: Tim Moon <[email protected]>
* Use correct FP8 group in multi-GPU docs: the FP8 process group should be the tensor-parallel group
* Synchronize FP8 scales over the world group in multi-GPU docs
Signed-off-by: Tim Moon <[email protected]>
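The knob in question is `fp8_autocast`'s `fp8_group` argument, which sets the process group over which FP8 amax/scale statistics are reduced. A minimal sketch of the documented pattern, with `model` and `inp` as placeholders:

```python
import torch.distributed as dist
import transformer_engine.pytorch as te

dist.init_process_group(backend="nccl")

# Reduce FP8 amax/scale statistics across every rank so all copies of a
# weight see the same scaling factors.
world_group = dist.new_group(ranks=list(range(dist.get_world_size())))

with te.fp8_autocast(enabled=True, fp8_group=world_group):
    out = model(inp)  # `model` and `inp` are placeholders
```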
Make sure RoPE frequencies are in FP32
Signed-off-by: Tim Moon <[email protected]>
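Building the rotary table in half precision loses resolution at large position indices, so the frequencies should be computed in FP32 regardless of the model dtype. A minimal sketch of the standard construction (not TE's exact code):

```python
import torch

def rope_freqs(seq_len: int, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotary position frequencies, deliberately computed in FP32."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    t = torch.arange(seq_len, dtype=torch.float32)
    return torch.outer(t, inv_freq)  # shape [seq_len, dim // 2], dtype float32

freqs = rope_freqs(seq_len=4096, dim=128)
assert freqs.dtype == torch.float32
```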
* Change the documentation footer
* Update docs toolchain versions
Signed-off-by: Przemek Tredak <[email protected]>
* Initial refactor of FP8 workspaces in Linear module
* Remove extra kernel launch
* Minor perf optimizations: Tensor base class functions in Float8Tensor have significant overhead
* Debug FP8 recipe test
* Refactor FP8 workspaces in LayerNormLinear and LayerNormMLP
* Document FP8 workspace function
* Revert changes to FP8 recipe tests
* Add support for lazy FP8 transpose caching: the previous caching behavior (always fill cache) incorrectly filled the cache during CUDA graph warmup steps
* Fix Pylint warnings
* Debug ONNX export: ONNX FP8 cast ops assumed that FP8 scales were created during model export (i.e. not initialized during training)
* Debug fused attention tests
* Make sure Float8Tensor.transpose_2d is backward compatible
* Revert changes to ONNX export operations: work around ONNX test failures by filling FP8 scale tensors instead of copying into them
* Debug scale factor update in Float8Tensor transpose_2d
Signed-off-by: Tim Moon <[email protected]>
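The lazy-caching item is the interesting piece: FP8 GEMMs often need both a tensor and its transpose, and recomputing the transpose every step wastes bandwidth, but eagerly filling the cache during CUDA-graph warmup captures stale data. A toy sketch of the lazy pattern, assuming the caller controls when the cache may be filled; `CachedFP8` is illustrative, not Float8Tensor:

```python
import torch

class CachedFP8:
    """Toy stand-in for a tensor that lazily caches its 2D transpose."""

    def __init__(self, data: torch.Tensor):
        self._data = data
        self._transpose = None  # filled on demand, never eagerly

    def transpose_2d(self, fill_cache: bool = True) -> torch.Tensor:
        if self._transpose is not None:
            return self._transpose
        t = self._data.t().contiguous()
        if fill_cache:            # skipped e.g. during CUDA-graph warmup
            self._transpose = t
        return t

    def invalidate(self) -> None:
        self._transpose = None    # must be called whenever _data changes
```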
…TE (#867)
* add multi-tensor kernels
* add FusedAdam
* add test to qa
* add FusedSGD
* fix lint
Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
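After this port, the fused multi-tensor optimizers ship with TE itself rather than requiring Apex. A short usage sketch, assuming the `transformer_engine.pytorch.optimizers` module path that this PR introduces:

```python
import torch
from transformer_engine.pytorch.optimizers import FusedAdam

model = torch.nn.Linear(1024, 1024).cuda()
opt = FusedAdam(model.parameters(), lr=1e-4)  # multi-tensor fused update

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
opt.step()
opt.zero_grad()
```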
* add THD support
* add seq_offsets_o and use new offset calculation
* addition to previous commit; fix unit test
* add None for offset_o gradient
* fix lint
* WIP: test padding between sequences
* WIP: fix tests for padding between sequences
* fix tests for sbhd/bshd layouts; clean up
* update cudnn-frontend and add tests for max_seqlen_q=1 and d=256 for inference
* test sbhd/bshd layouts for the sq=1, d=256 inference case
* fix lint
* replace wording from accumulative to cumulative
* add offset tensors to custom FP8 MHA tests
* add version control for cuDNN
* add sm>=90 constraint for THD support
* fix cuDNN support for sq=1, d=256
* fix lint and minor tweak for FP8 tests
* modify cuDNN version and restrict MQA/GQA support for THD
* add notes for seq offset tensors
* add dummy tensor to pass JAX build
* add dummy tensor to pass Paddle build
* fix JAX CI
Signed-off-by: Charlene Yang <[email protected]>
Signed-off-by: cyanguwa <[email protected]>
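In the THD layout, variable-length sequences are packed into one [total_tokens, heads, dim] tensor and addressed via cumulative sequence-length offsets instead of a batch dimension. A minimal sketch of that bookkeeping (TE derives these offset tensors internally; the names here are illustrative):

```python
import torch

# Three packed sequences of lengths 5, 3, and 7 tokens.
seq_lens = torch.tensor([5, 3, 7], dtype=torch.int32)

cu_seqlens = torch.zeros(seq_lens.numel() + 1, dtype=torch.int32)
cu_seqlens[1:] = torch.cumsum(seq_lens, dim=0)
print(cu_seqlens)  # tensor([ 0,  5,  8, 15], dtype=torch.int32)

# Tokens of sequence i occupy rows cu_seqlens[i]:cu_seqlens[i+1]
# of the packed [total_tokens, num_heads, head_dim] tensor.
```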
See Commits and Changes for more details.

Created by pull[bot]. Can you help keep this open source service alive? 💖 Please sponsor : )