
[pull] main from NVIDIA:main #10

Merged
14 commits merged into phu0ngng:main from NVIDIA:main on May 31, 2024
Conversation

pull[bot]

@pull pull bot commented May 22, 2024

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

shamisp and others added 3 commits May 20, 2024 20:57
replaced deprecated pkg_resources with packaging

Signed-off-by: Alp Dener <[email protected]>
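
For reference, a minimal sketch of the kind of swap this commit describes: replacing the deprecated `pkg_resources` version parsing with the `packaging` library. The `torch.__version__` comparison below is a hypothetical call site, not the exact code touched in TE.

```python
# Before (deprecated): pkg_resources is part of setuptools and slated for removal.
# import pkg_resources
# if pkg_resources.parse_version(torch.__version__) >= pkg_resources.parse_version("2.0"):
#     ...

# After: the lightweight `packaging` library provides the same comparison semantics.
from packaging import version

import torch

if version.parse(torch.__version__) >= version.parse("2.0"):
    # Enable the newer code path on PyTorch 2.0+.
    ...
```
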
* added alignment requirements for cuBLAS heuristics

Signed-off-by: Phuong Nguyen <[email protected]>

* minor rewords

Signed-off-by: Phuong Nguyen <[email protected]>

* added unit test for gemm with unaligned inputs

Signed-off-by: Phuong Nguyen <[email protected]>

* added pytest skip if fp8 is not available

Signed-off-by: Phuong Nguyen <[email protected]>

* changed offset so that it is aligned to 128

Signed-off-by: Phuong Nguyen <[email protected]>

---------

Signed-off-by: Phuong Nguyen <[email protected]>
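
A hedged sketch of what the "gemm with unaligned inputs" test and the FP8 skip in the commit above can look like. This is a stand-in, not the actual TE test: the capability check replaces TE's own FP8-availability helper, the matrix sizes are arbitrary, and plain `torch.matmul` stands in for TE's GEMM wrapper.

```python
import pytest
import torch

# Stand-in for TE's FP8 availability helper: FP8 GEMM needs Ada (sm89) or
# Hopper (sm90) class hardware.
fp8_available = torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


@pytest.mark.skipif(not fp8_available, reason="FP8 not supported on this device")
def test_gemm_with_unaligned_inputs():
    # Allocate a flat buffer and slice it at a one-element offset so the view
    # does NOT start on a 16-byte boundary; cuBLAS heuristics care about
    # operand alignment, so this path must still pick a valid kernel.
    buf = torch.randn(1024 * 1024 + 1, device="cuda", dtype=torch.float16)
    a = buf[1:].reshape(1024, 1024)   # data pointer offset by 2 bytes (fp16)
    b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)

    assert a.data_ptr() % 16 != 0     # confirm the input really is unaligned
    out = a @ b
    assert out.shape == (1024, 1024)
```
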
@pull pull bot added the ⤵️ pull label May 22, 2024
mingxu1067 and others added 11 commits May 22, 2024 14:33
* Fixed the shape mismatch issue in MLP.

Signed-off-by: Ming Huang <[email protected]>

* Add a corresponding test

Signed-off-by: Ming Huang <[email protected]>

---------

Signed-off-by: Ming Huang <[email protected]>
Co-authored-by: Phuong Nguyen <[email protected]>
TE checkpoint now preserves the torch autocast context from the forward pass during the recompute phase

Signed-off-by: Alp Dener <[email protected]>
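
A conceptual sketch of what "preserving the autocast context" means here, using only stock PyTorch APIs; this is not TE's internal checkpoint code. The forward pass records the autocast state, and the recompute phase replays it so both passes see identical dtypes.

```python
import torch


class _CheckpointAutocastState:
    """Capture the autocast state at forward time and replay it at recompute time."""

    def __init__(self):
        self.enabled = torch.is_autocast_enabled()
        self.dtype = torch.get_autocast_gpu_dtype()

    def recompute_context(self):
        # Re-enter the same autocast settings during the backward-pass recompute.
        return torch.autocast(device_type="cuda", enabled=self.enabled, dtype=self.dtype)


# Usage inside a checkpoint's backward (illustrative):
# with state.recompute_context():
#     outputs = forward_fn(*saved_inputs)
```
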
* Fixed Llama tutorial. Changed batch size and added fused=True.

Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: root <[email protected]>

* Tutorial updated but not complete yet.

Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: root <[email protected]>

* Tutorial notebook reset - removed fused=True

Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: root <[email protected]>

* Removed fused=true

Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: root <[email protected]>

* Batch size back to 8

Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: root <[email protected]>

* Typo and commented out line

Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: root <[email protected]>

* fixed whitespace

Signed-off-by: root <[email protected]>

* fixed whitespace

Signed-off-by: root <[email protected]>

* Added comment to attention line. Fixed potential bug with loading weights - now loading works correctly, confirmed by the generation code.

Signed-off-by: root <[email protected]>

* Comments

Signed-off-by: root <[email protected]>

* Model casts added again

Signed-off-by: root <[email protected]>

* Weight download info

Signed-off-by: Pawel Gadzinski <[email protected]>

* Moved parameter gate_proj_size to config

Signed-off-by: Pawel Gadzinski <[email protected]>

* Removed gate_proj_size and used intermediate_size instead

Signed-off-by: Pawel Gadzinski <[email protected]>

* Llama 3 added to tutorial

Signed-off-by: Pawel Gadzinski <[email protected]>

* Typos fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* Typos fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* Fixed model loading

Signed-off-by: Pawel Gadzinski <[email protected]>

* Loading fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* Different dim for attention

Signed-off-by: Pawel Gadzinski <[email protected]>

* Reverted the other commit

Signed-off-by: Pawel Gadzinski <[email protected]>

* Changed name to kv_channels

Signed-off-by: Pawel Gadzinski <[email protected]>

* Fixed typo

Signed-off-by: Pawel Gadzinski <[email protected]>

* Back to kv_channels in transformer layer

Signed-off-by: Pawel Gadzinski <[email protected]>

* Back to kv_channels in transformer layer

Signed-off-by: Pawel Gadzinski <[email protected]>

* Small bug fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* Small bug fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* Test fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* changed file modes

Signed-off-by: Pawel Gadzinski <[email protected]>

* lint fix and resolved conflict

Signed-off-by: Pawel Gadzinski <[email protected]>

* lint fix and resolved conflict

Signed-off-by: Pawel Gadzinski <[email protected]>

* Lint fix, hopefully last

Signed-off-by: Pawel Gadzinski <[email protected]>

---------

Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: root <[email protected]>
Signed-off-by: root <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Pawel Gadzinski <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Sudhakar Singh <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
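
The kv_channels and intermediate_size items in the commit list above boil down to deriving the per-head dimension from the model config instead of hard-coding it. A hedged sketch of that mapping follows; parameter names come from the Hugging Face LlamaConfig and the te.pytorch.TransformerLayer documentation, and the exact tutorial wiring is not reproduced here.

```python
import transformer_engine.pytorch as te


def build_te_layer(hf_config):
    # Per-head dimension: for Llama 3 8B this is hidden_size // num_attention_heads
    # (4096 // 32 = 128), passed explicitly as kv_channels.
    kv_channels = hf_config.hidden_size // hf_config.num_attention_heads
    return te.TransformerLayer(
        hidden_size=hf_config.hidden_size,
        ffn_hidden_size=hf_config.intermediate_size,   # replaces the old gate_proj_size
        num_attention_heads=hf_config.num_attention_heads,
        num_gqa_groups=hf_config.num_key_value_heads,  # Llama 3 uses GQA
        kv_channels=kv_channels,
    )
```
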
allow bias support for sm80/86/89 for cuDNN 9+

Signed-off-by: Charlene Yang <[email protected]>
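
A hedged sketch of the gating this commit implies: enable the biased fused-attention path on sm80/86/89 only when the runtime cuDNN is 9.0 or newer. The real check lives in TE's C++ backend; this Python version is purely illustrative.

```python
import torch


def bias_supported_on_this_gpu() -> bool:
    """Illustrative gate: post-scale bias on Ampere/Ada needs cuDNN 9+."""
    major, minor = torch.cuda.get_device_capability()
    sm = major * 10 + minor
    # cuDNN 9.0.0 reports 90000 (MAJOR*10000 + MINOR*100 + PATCH).
    cudnn9_or_newer = (torch.backends.cudnn.version() or 0) >= 90000
    if sm in (80, 86, 89):
        return cudnn9_or_newer
    return sm >= 90  # Hopper was already supported before this change
```
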
* Use correct FP8 group in multi-GPU docs

The FP8 process group should be the tensor-parallel group

Signed-off-by: Tim Moon <[email protected]>

* Synchronize FP8 scales over world group in multi-GPU docs

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
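
A hedged sketch of the multi-GPU usage the updated docs describe: pass a torch.distributed process group to fp8_autocast via fp8_group so FP8 amaxes and scales are reduced over the intended group (the final docs synchronize over the world group). The recipe settings and module below are placeholders, and the script assumes a torchrun-style launch.

```python
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling

dist.init_process_group(backend="nccl")
fp8_recipe = DelayedScaling(margin=0, amax_history_len=16)

model = te.Linear(1024, 1024, bias=True).cuda()
inp = torch.randn(32, 1024, device="cuda")

# Reduce FP8 amax/scale state over the world group so every rank ends up
# with consistent scaling factors.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe, fp8_group=dist.group.WORLD):
    out = model(inp)
```
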
Make sure RoPE frequencies are in FP32

Signed-off-by: Tim Moon <[email protected]>
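
A short illustration of the point of this commit: compute the RoPE inverse frequencies (and the resulting cos/sin tables) in FP32 even when the model runs in BF16/FP16, to avoid precision loss at long positions. This is generic RoPE code, not TE's implementation.

```python
import torch


def rope_tables(seq_len: int, dim: int, base: float = 10000.0, device="cuda"):
    # Keep the frequency computation in FP32 regardless of the model's compute dtype.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32, device=device) / dim))
    positions = torch.arange(seq_len, dtype=torch.float32, device=device)
    angles = torch.outer(positions, inv_freq)   # [seq_len, dim // 2], FP32
    # Cast to the working dtype only after cos/sin have been computed.
    return angles.cos(), angles.sin()
```
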
* Change the documentation footer

Signed-off-by: Przemek Tredak <[email protected]>

* Update docs toolchain versions

Signed-off-by: Przemek Tredak <[email protected]>

---------

Signed-off-by: Przemek Tredak <[email protected]>
* Initial refactor of FP8 workspaces in Linear module

Signed-off-by: Tim Moon <[email protected]>

* Remove extra kernel launch

Signed-off-by: Tim Moon <[email protected]>

* Minor perf optimizations

Tensor base class functions in Float8Tensor have significant overhead.

Signed-off-by: Tim Moon <[email protected]>

* Debug FP8 recipe test

Signed-off-by: Tim Moon <[email protected]>

* Refactor FP8 workspaces in LayerNormLinear and LayerNormMLP

Signed-off-by: Tim Moon <[email protected]>

* Document FP8 workspace function

Signed-off-by: Tim Moon <[email protected]>

* Revert changes to FP8 recipe tests

Signed-off-by: Tim Moon <[email protected]>

* Add support for lazy FP8 transpose caching

Previous caching behavior (always fill cache) incorrectly filled cache during CUDA graph warmup steps.

Signed-off-by: Tim Moon <[email protected]>

* Fix Pylint warnings

Signed-off-by: Tim Moon <[email protected]>

* Debug ONNX export

ONNX FP8 cast ops assumed that FP8 scales were created during model export (i.e. not initialized during training).

Signed-off-by: Tim Moon <[email protected]>

* Debug fused attention tests

Signed-off-by: Tim Moon <[email protected]>

* Make sure Float8Tensor.transpose_2d is backward compatible

Signed-off-by: Tim Moon <[email protected]>

* Revert changes to ONNX export operations

Work around ONNX test failures by filling FP8 scale tensors instead of copying into them.

Signed-off-by: Tim Moon <[email protected]>

* Debug scale factor update in Float8Tensor transpose_2d

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
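
The "lazy FP8 transpose caching" item above is about computing the cached transpose only when a caller asks for it, so CUDA-graph warmup iterations do not populate (and then poison) the cache. A toy sketch of that pattern, unrelated to Float8Tensor's actual internals:

```python
import torch


class LazyTransposeCache:
    """Toy illustration: cache a 2D transpose lazily and invalidate it on update."""

    def __init__(self, data: torch.Tensor):
        self._data = data
        self._transpose = None              # nothing is cached up front

    def update(self, new_data: torch.Tensor):
        self._data = new_data
        self._transpose = None              # any cached transpose is now stale

    def transpose(self, cache: bool = True) -> torch.Tensor:
        # Only fill the cache when the caller opts in (e.g. not during CUDA
        # graph warmup, where the results will be thrown away anyway).
        if self._transpose is not None:
            return self._transpose
        t = self._data.t().contiguous()
        if cache:
            self._transpose = t
        return t
```
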
…TE (#867)

* add multi-tensor kernels

Signed-off-by: Xin Yao <[email protected]>

* add FusedAdam

Signed-off-by: Xin Yao <[email protected]>

* add test to qa

Signed-off-by: Xin Yao <[email protected]>

* add FusedSGD

Signed-off-by: Xin Yao <[email protected]>

* fix lint

Signed-off-by: Xin Yao <[email protected]>

---------

Signed-off-by: Xin Yao <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
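
A hedged usage sketch for the fused optimizers added in this commit. The import path below assumes they are exposed under transformer_engine.pytorch.optimizers; the try/except falls back to torch.optim purely so the snippet runs if that assumption does not hold.

```python
import torch

try:
    # Assumed export location for the FusedAdam added in #867.
    from transformer_engine.pytorch.optimizers import FusedAdam
except ImportError:
    from torch.optim import Adam as FusedAdam  # fallback for illustration only

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = FusedAdam(model.parameters(), lr=1e-3, weight_decay=0.01)

inp = torch.randn(32, 1024, device="cuda")
loss = model(inp).square().mean()
loss.backward()
optimizer.step()        # multi-tensor kernel applies the update across all params
optimizer.zero_grad()
```
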
* add THD support

Signed-off-by: Charlene Yang <[email protected]>

* add seq_offsets_o and use new offset calculation

Signed-off-by: Charlene Yang <[email protected]>

* addition to previous commit; fix unit test

Signed-off-by: Charlene Yang <[email protected]>

* add None for offset_o gradient

Signed-off-by: Charlene Yang <[email protected]>

* fix lint

Signed-off-by: Charlene Yang <[email protected]>

* WIP: test padding between sequences

Signed-off-by: Charlene Yang <[email protected]>

* WIP: fix tests for padding between sequences

Signed-off-by: Charlene Yang <[email protected]>

* fix tests for sbhd/bshd layouts; clean up

Signed-off-by: Charlene Yang <[email protected]>

* update cudnn-frontend and add tests for max_seqlen_q=1 and d=256 for inference

Signed-off-by: Charlene Yang <[email protected]>

* test sbhd/bshd layouts for sq1, d256 inference case

Signed-off-by: Charlene Yang <[email protected]>

* fix lint

Signed-off-by: Charlene Yang <[email protected]>

* replace wording from accumulative to cumulative

Signed-off-by: Charlene Yang <[email protected]>

* add offset tensors to custom fp8 mha tests

Signed-off-by: Charlene Yang <[email protected]>

* add version control for cuDNN

Signed-off-by: Charlene Yang <[email protected]>

* add sm>=90 constraint for thd support

Signed-off-by: Charlene Yang <[email protected]>

* fix cuDNN support for sq=1, d=256

Signed-off-by: Charlene Yang <[email protected]>

* fix lint and minor tweak for fp8 tests

Signed-off-by: Charlene Yang <[email protected]>

* modify cudnn version and restrict MQA/GQA support for THD

Signed-off-by: Charlene Yang <[email protected]>

* add notes for seq offset tensors

Signed-off-by: Charlene Yang <[email protected]>

* add dummy tensor to pass jax build

Signed-off-by: Charlene Yang <[email protected]>

* add dummy tensor to pass paddle build

Signed-off-by: Charlene Yang <[email protected]>

* fix Jax CI

Signed-off-by: Charlene Yang <[email protected]>

---------

Signed-off-by: Charlene Yang <[email protected]>
Signed-off-by: cyanguwa <[email protected]>
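
For context on the sequence-offset tensors this commit adds: in the THD ("t, h, d") packed layout, variable-length sequences are concatenated along the token dimension and described by cumulative sequence lengths. A generic sketch of building those offsets follows; the names are illustrative, not TE's exact API.

```python
import torch

# Three sequences of different lengths packed into one batch.
seqlens = torch.tensor([5, 3, 7], dtype=torch.int32, device="cuda")

# Cumulative offsets: [0, 5, 8, 15]; sequence i spans tokens
# cu_seqlens[i] .. cu_seqlens[i + 1] in the packed tensors.
cu_seqlens = torch.zeros(seqlens.numel() + 1, dtype=torch.int32, device="cuda")
cu_seqlens[1:] = torch.cumsum(seqlens, dim=0)

num_heads, head_dim = 8, 64
total_tokens = int(seqlens.sum())

# THD layout: [total_tokens, num_heads, head_dim] with no padding between
# sequences (the commits above also test the case where padding *is* present,
# which is what the extra seq-offset tensors describe).
q = torch.randn(total_tokens, num_heads, head_dim, device="cuda", dtype=torch.float16)
```
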
@phu0ngng phu0ngng merged commit e960607 into phu0ngng:main May 31, 2024