Add transformers backend (Dense model only) #1
base: main
Changes from 115 commits
@@ -0,0 +1 @@
```
transformers==4.57.1
```
@@ -0,0 +1,53 @@
```yaml
name: Transformers Backend 8 GPU Integration Tests

on:
  push:
    branches: [ main ]
    paths:
      - 'torchtitan/experiments/transformers_backend/**'
  pull_request:
    paths:
      - 'torchtitan/experiments/transformers_backend/**'
  schedule:
    # Runs every 12 hours
    - cron: '0 */12 * * *'

concurrency:
  group: unit-test${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
  cancel-in-progress: true

defaults:
  run:
    shell: bash -l -eo pipefail {0}

jobs:
  build-test:
    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
    with:
      runner: linux.g5.48xlarge.nvidia.gpu
      gpu-arch-type: cuda
      gpu-arch-version: "12.6"
      # This image is faster to clone than the default, but it lacks CC needed by triton
      # (1m25s vs 2m37s).
      docker-image: torchtitan-ubuntu-20.04-clang12
      repository: pytorch/torchtitan
      upload-artifact: outputs
      script: |
        set -eux

        # The generic Linux job chooses to use base env, not the one setup by the image
        CONDA_ENV=$(conda env list --json | jq -r ".envs | .[-1]")
        conda activate "${CONDA_ENV}"

        # Log CUDA driver version for debugging.
        DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1 || true)
        echo "CUDA driver version: ${DRIVER_VERSION}"

        pip config --user set global.progress_bar off

        python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126

        USE_CPP=0 python -m pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126

        mkdir artifacts-to-be-uploaded
        python -m torchtitan.experiments.transformers_backend.tests.integration_tests artifacts-to-be-uploaded --ngpu 8
```
```diff
@@ -12,5 +12,6 @@
         "vlm",
         "compiler_toolkit.deepseek_v3",
         "compiler_toolkit.llama3",
+        "transformers_backend",
     ]
 )
```
@@ -0,0 +1,51 @@
# Huggingface Transformers backend

## Quick start

- Requirements: `transformers==4.57.1`

- Config: `torchtitan/experiments/transformers_backend/configs/qwen3_fsdp2_tp2_pp2.toml`

```diff
...
[model]
- name = "llama3"
+ name = "Qwen/Qwen3-4B-Instruct-2507"
flavor = "debugmodel"
hf_assets_path = "./tests/assets/tokenizer"
...
```

Inline review comments on `hf_assets_path`:

> **Member:** I just realized the naming of this field might be a little confusing in the current use case. The user still needs to specify their own tokenizer, or download it from HF, before kicking off a run, right? We need to make this clear in the README, telling users they need to download or prepare their tokenizer. Otherwise some users might be confused about whether the tokenizer is downloaded from […]
>
> **Author:** I forgot the reason why we switched to having a […]
>
> **Member:** We want to keep modules being dynamically imported and follow the current practice in torchtitan, because we used to find that importing torchtitan is super slow. By default the folder/path name is the same as the model name in most use cases. In our case, we need to set the model name to […] I also agree the […]
>
> **Author:** What do you actually think of the proposed solution? I think it is a reasonable one while making the user experience very smooth, because in our case this is not a model but a backend. @tianyu-l @wwwjn
>
> **Member:** I think the proposed solution works for now, with the assumption: any model name containing "/" is automatically recognized as a HuggingFace model ID and will use the […] However, if we need to add other job configs only related to […]
**Note:** Any model name containing "/" is automatically recognized as a HuggingFace model ID and will use the `transformers_backend`.
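The note above describes a simple dispatch rule; the sketch below only illustrates that rule with a hypothetical `resolve_backend` helper and is not the actual torchtitan code:

```python
# Hypothetical helper illustrating the "/" dispatch rule described above;
# not the actual torchtitan implementation.
def resolve_backend(model_name: str) -> str:
    # "Qwen/Qwen3-4B-Instruct-2507" -> HF model ID, routed to transformers_backend
    # "llama3"                      -> a built-in torchtitan model name
    return "transformers_backend" if "/" in model_name else model_name

assert resolve_backend("Qwen/Qwen3-4B-Instruct-2507") == "transformers_backend"
assert resolve_backend("llama3") == "llama3"
```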
- Train: `LOG_RANK=7 CONFIG_FILE=<YOUR_PATH>/torchtitan/experiments/transformers_backend/configs/qwen3_fsdp2_tp2_pp2.toml ./run_train.sh --compile.enable`
  - Make sure you have prepared the tokenizer beforehand (one possible approach is sketched below)
<img width="1334" height="453" alt="image" src="https://github.com/user-attachments/assets/da459448-027b-4af9-8176-6a3e433a272c" />
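As raised in the review thread above, the tokenizer has to be in place before launching a run. One possible way to prepare it is sketched here; the model ID and target directory are taken from the example config, and using `transformers`' `AutoTokenizer` for the download is an assumption, not the method this PR prescribes:

```python
# Illustrative only: download a tokenizer from the Hugging Face Hub and save it
# where hf_assets_path points. Assumes `transformers` is installed and the Hub
# is reachable (or the model is already cached locally).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
tokenizer.save_pretrained("./tests/assets/tokenizer")  # writes tokenizer.json, tokenizer_config.json, ...
```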
## Supported Features

- The following models were tested:
  - Dense (FSDP/CP/TP/PP/`torch.compile`)
    - `meta-llama/Llama-3.2-1B`
    - `microsoft/phi-2`
    - `Qwen/Qwen2.5-7B`
    - `mistralai/Mistral-7B-v0.1`
    - `ByteDance-Seed/Seed-Coder-8B-Instruct`
    - `Qwen/Qwen3-4B-Instruct-2507`
    - `arcee-ai/AFM-4.5B`
    - `ibm-granite/granite-3b-code-base-2k`
    - `baidu/ERNIE-4.5-0.3B-Base-PT`
    - `kyutai/helium-1-preview-2b`
    - `allenai/OLMo-7B-hf`
    - `mistralai/Ministral-8B-Instruct-2410`
  - MoE (upcoming)
## Known issues to address later

- When using the HF modeling, in the `FSDP=2 vs FSDP=2 + PP=2` test the `loss` and `grad_norm` are not bitwise matching (though they converge), whereas they do match with the TorchTitan modeling. This will be addressed in another PR; the likely culprit is `register_buffer` when loading the `seed_checkpoint`.
- The HF modeling has lower MFU than the TorchTitan modeling.
## Further work

- Missing `build_optimizers_with_moe_load_balancing` support for MoE
- Missing TP/PP/EP support for MoE
- Load HF weights
- Add LoRA support
@@ -0,0 +1,73 @@
```python
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.
from dataclasses import dataclass

from torchtitan.components.loss import build_cross_entropy_loss
from torchtitan.components.lr_scheduler import build_lr_schedulers
from torchtitan.components.optimizer import build_optimizers
from torchtitan.components.tokenizer import build_hf_tokenizer
from torchtitan.hf_datasets.text_datasets import build_text_dataloader
from torchtitan.protocols.train_spec import TrainSpec

from .infra.parallelize import parallelize_hf_transformers
from .infra.pipeline import pipeline_hf_transformers
from .model.args import HFTransformerModelArgs
from .model.model import HFTransformerModel


__all__ = [
    "HFTransformerModelArgs",
    "HFTransformerModel",
]


@dataclass
class TitanDenseModelArgs:
    """Arguments for the base TorchTitan model."""

    dim: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: int | None = None
    vocab_size: int | None = None
    multiple_of: int = 256
    ffn_dim_multiplier: float | None = None
    norm_eps: float = 1e-5
    rope_theta: float = 10000
    max_seq_len: int = 2048
    depth_init: bool = True
    use_flex_attn: bool = False
    attn_mask_type: str = "causal"


flavors = {
    "debugmodel": HFTransformerModelArgs(
        titan_dense_args=TitanDenseModelArgs(
            dim=256,
            n_layers=2,
            n_heads=16,
            n_kv_heads=16,
        ),
    ),
    "full": HFTransformerModelArgs(
        titan_dense_args=TitanDenseModelArgs(),
    ),
}


def get_train_spec() -> TrainSpec:
    return TrainSpec(
        model_cls=HFTransformerModel,
        model_args=flavors,
        parallelize_fn=parallelize_hf_transformers,
        pipelining_fn=pipeline_hf_transformers,
        build_optimizers_fn=build_optimizers,
        build_lr_schedulers_fn=build_lr_schedulers,
        build_dataloader_fn=build_text_dataloader,
        build_tokenizer_fn=build_hf_tokenizer,
        build_loss_fn=build_cross_entropy_loss,
    )
```
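As a quick illustration of how the flavors and train spec above fit together, the sketch below looks up the `debugmodel` flavor. It assumes this module is importable as `torchtitan.experiments.transformers_backend` and that `titan_dense_args` is kept as an attribute on `HFTransformerModelArgs`; it is not part of the PR itself.

```python
# Illustrative only: inspect the debug flavor exposed by the train spec above.
from torchtitan.experiments.transformers_backend import get_train_spec

spec = get_train_spec()
debug_args = spec.model_args["debugmodel"]
print(debug_args.titan_dense_args.dim)       # 256
print(debug_args.titan_dense_args.n_layers)  # 2
```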
@@ -0,0 +1,87 @@
```toml
# torchtitan Config.toml

[job]
dump_folder = "./outputs"
description = "Qwen 3 debug training"
print_config = true

[profiling]
enable_profiling = false
save_traces_folder = "profile_trace"
profile_freq = 5
enable_memory_snapshot = false
save_memory_snapshot_folder = "memory_snapshot"

[metrics]
log_freq = 1
disable_color_printing = false
enable_tensorboard = false
save_tb_folder = "tb"
enable_wandb = false

[model]
name = "Qwen/Qwen3-4B-Instruct-2507"
flavor = "debugmodel"
# test folder with tokenizer.json, for debug purpose only
hf_assets_path = "./tests/assets/tokenizer"
# converters = ["float8"]

[optimizer]
name = "AdamW"
lr = 8e-4
eps = 1e-8

[lr_scheduler]
warmup_steps = 2  # lr scheduler warm up, normally 20% of the train steps
decay_ratio = 0.8  # lr scheduler decay ratio, 80% of the train steps
decay_type = "linear"
min_lr_factor = 0.0

[training]
local_batch_size = 2
seq_len = 2048
max_norm = 1.0  # grad norm clipping
steps = 10
dataset = "c4_test"  # supported datasets: c4_test (2K), c4 (177M)
dataset_path = "./tests/assets/c4_test"
mixed_precision_param = "float32"  # force float32 for comparison
mixed_precision_reduce = "float32"

[parallelism]
data_parallel_replicate_degree = 1
data_parallel_shard_degree = 2
fsdp_reshard_after_forward = "default"  # default / never / always
tensor_parallel_degree = 2
enable_async_tensor_parallel = false
pipeline_parallel_degree = 2
pipeline_parallel_schedule = "1F1B"
context_parallel_degree = 1
expert_parallel_degree = 1
expert_tensor_parallel_degree = 1

[checkpoint]
enable = false
folder = "checkpoint"
interval = 10
last_save_model_only = false
export_dtype = "float32"
async_mode = "disabled"  # ["disabled", "async", "async_with_pinned_mem"]

[activation_checkpoint]
mode = "selective"  # ["none", "selective", "full"]
selective_ac_option = '2'  # 'int' = ac every positive int layer or 'op', ac based on ops policy

[compile]
enable = false
components = ["model", "loss"]

[quantize.linear.float8]
enable_fsdp_float8_all_gather = false
precompute_float8_dynamic_scale_for_fsdp = false
filter_fqns = ["output"]

[validation]
enable = false
dataset = "c4_validation"
freq = 5
steps = 10
```
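For reference, the parallelism degrees in this config multiply out to the 8-GPU world size assumed by the README's run command and the integration-test workflow; a minimal illustrative check (variable names chosen here just for clarity):

```python
# Illustrative check: the product of the parallelism degrees above equals the
# world size (8 GPUs) assumed elsewhere in this PR.
dp_replicate, dp_shard, tp, pp, cp = 1, 2, 2, 2, 1
world_size = dp_replicate * dp_shard * tp * pp * cp
assert world_size == 8
```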