
Conversation


@3outeille 3outeille commented Sep 6, 2025

Context

This PR enables:

  • Llama-like HF models to work with 4D parallelism: FSDP, CP, TP, PP (and the combinations between them). The following models were tested:
    • meta-llama/Llama-3.2-1B
    • microsoft/phi-2
    • Qwen/Qwen2.5-7B
    • mistralai/Mistral-7B-v0.1
    • ByteDance-Seed/Seed-Coder-8B-Instruct
    • Qwen/Qwen3-4B-Instruct-2507
    • arcee-ai/AFM-4.5B
    • ibm-granite/granite-3b-code-base-2k
    • baidu/ERNIE-4.5-0.3B-Base-PT
    • kyutai/helium-1-preview-2b
    • allenai/OLMo-7B-hf
    • mistralai/Ministral-8B-Instruct-2410
  • Patching HF model weight initialisation. Without this, the loss and grad_norm start very high (a minimal sketch of the idea follows this list)
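A minimal sketch of the init-patching idea, assuming a simple per-module re-initialisation; the patch_init_weights_ helper below is illustrative only, not the PR's actual code:

```python
# Hypothetical illustration of the init-patching idea: re-initialise Linear and
# Embedding weights with a small std so loss/grad_norm start at sane values.
import torch.nn as nn

def patch_init_weights_(model: nn.Module, std: float = 0.02) -> None:
    for module in model.modules():
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=std)
```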

Usage

  • Config: torchtitan/torchtitan/experiments/transformers_backend/configs/qwen3_fsdp2_tp2_pp2.toml
...
[model]
- name = "llama3"
+ name = "Qwen/Qwen3-4B-Instruct-2507" 
flavor = "debugmodel"
hf_assets_path = "./tests/assets/tokenizer"
...
  • Train: LOG_RANK=7 ./torchtitan/torchtitan/experiments/transformers_backend/run_train.sh

Testing methodology

  • Following the converging.md guidelines, I am comparing the baseline FSDP=2 vs FSDP=2 & <other //-ism>.
  • More precisely, the test_hf_integration.py script generates the following results layout (a sketch of how the per-step diff could be extracted follows the tree):
    results/
        |_ meta-llama
            |_ Llama-3.2-1B
                |_ debugmodel/
                    |_ seed_checkpoint/
                        |_ config.toml
                        |_ seed.slurm
                        |_ step-0/
                           |_ ....
                    |_ fsdp2_tp1_cp1_pp1/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                    |_ fsdp2_tp2_cp1_pp1/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                    |_ fsdp2_tp1_cp1_pp2/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                    |_ fsdp2_tp1_cp2_pp1/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                    |_ fsdp2_tp1_cp2_pp2/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                |_ full/
                ...
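For illustration, here is a minimal sketch of the per-step loss/grad_norm comparison behind diff_baseline_vs_nd_parallelism.log. The log-line regex and CLI below are assumptions about the metrics format, not the actual implementation in test_hf_integration.py:

```python
# Hypothetical sketch: compare per-step loss/grad_norm between two training logs,
# in the spirit of diff_baseline_vs_nd_parallelism.log.
import re
import sys

# Assumed log format, e.g. "step: 10  loss: 8.1234  ...  grad_norm: 1.2345"
PATTERN = re.compile(r"step:\s*(\d+).*?loss:\s*([\d.eE+-]+).*?grad_norm:\s*([\d.eE+-]+)")

def parse(path):
    metrics = {}
    with open(path) as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                metrics[int(m.group(1))] = (float(m.group(2)), float(m.group(3)))
    return metrics

def diff(baseline_log, nd_log):
    base, nd = parse(baseline_log), parse(nd_log)
    for step in sorted(base.keys() & nd.keys()):
        if base[step] != nd[step]:
            print(f"step {step}: baseline={base[step]} vs nd={nd[step]}")

if __name__ == "__main__":
    diff(sys.argv[1], sys.argv[2])
```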
  • Here is the grid-search script used to test the HF modeling (a listing of the parallelism combinations it covers follows the script):
#!/usr/bin/bash
model_names=(
     "meta-llama/Llama-3.2-1B"
     "microsoft/phi-2" 
     "Qwen/Qwen2.5-7B"
     "mistralai/Mistral-7B-v0.1"
     "ByteDance-Seed/Seed-Coder-8B-Instruct"
     "Qwen/Qwen3-4B-Instruct-2507" 
     "arcee-ai/AFM-4.5B" 
     "ibm-granite/granite-3b-code-base-2k" 
     "baidu/ERNIE-4.5-0.3B-Base-PT" 
     "kyutai/helium-1-preview-2b" 
     "allenai/OLMo-7B-hf"
     "mistralai/Ministral-8B-Instruct-2410" 
)

for model_name in "${model_names[@]}"; do
    rm -rf slurm_results/${model_name}

    python test_hf_integration.py create_configs --model_name "$model_name" --out_dir slurm_results --flavor debugmodel
    python test_hf_integration.py submit_jobs --inp_dir slurm_results/${model_name}/debugmodel/seed_checkpoint --qos high
    while [ ! -f slurm_results/${model_name}/debugmodel/seed_checkpoint/status.txt ] || [ "$(cat slurm_results/${model_name}/debugmodel/seed_checkpoint/status.txt)" != "completed" ]; do
        echo "Waiting for seed checkpoint from ${model_name} to complete ..."
        sleep 1
    done
    python test_hf_integration.py submit_jobs --inp_dir slurm_results/${model_name}/debugmodel --qos high
    echo "================"
done
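For readability, these are the per-model runs the grid produces, matching the directory names in the results tree above (illustrative listing only, not the actual create_configs logic):

```python
# Illustrative only: run names matching the results tree above.
PARALLELISM_COMBOS = [
    "fsdp2_tp1_cp1_pp1",  # baseline: FSDP=2 only
    "fsdp2_tp2_cp1_pp1",  # + TP=2
    "fsdp2_tp1_cp1_pp2",  # + PP=2
    "fsdp2_tp1_cp2_pp1",  # + CP=2
    "fsdp2_tp1_cp2_pp2",  # + CP=2 and PP=2
]
```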

Further tasks

  • MoE (handled in PR Add transformer backend (MoE) clean #3)
    • Missing build_optimizers_with_moe_load_balancing support for MoE
    • Missing TP/PP/EP support for MoE
  • When using the HF modeling, in the FSDP=2 vs FSDP=2 + PP=2 test the loss and grad_norm are not bitwise matching (though they still converge), while they do match with the Torchtitan modeling (issue tracked in Fix pp convergence to be bitwise #4)
  • Add convergence tests to CI using a tiny model + gloo backend (once PP is bitwise matching)
  • The HF modeling has lower MFU than the Torchtitan modeling
  • NOTE: set torch._dynamo.config.cache_size_limit = 128 to avoid graph recompilation when using torch.compile with activation checkpointing (a minimal snippet follows)
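A minimal snippet for the note above, assuming the setting is applied once before the model is built and compiled:

```python
# Raise the dynamo recompile cache limit so torch.compile does not hit the limit
# when combined with activation checkpointing (value taken from the note above).
import torch._dynamo.config as dynamo_config

dynamo_config.cache_size_limit = 128
```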

@3outeille 3outeille force-pushed the 3outeille/transformers_backend branch from 5ae8455 to fe691b8 on November 13, 2025 10:54
@3outeille

@wwwjn addressed all the issues mentioned. There is one last point I want to address (cf. here), which I think provides a better user experience

@3outeille 3outeille requested a review from tianyu-l November 14, 2025 10:23
@3outeille 3outeille force-pushed the 3outeille/transformers_backend branch from bcf5355 to c0c273c on November 19, 2025 11:25
tianyu-l pushed a commit to pytorch/torchtitan that referenced this pull request Nov 20, 2025
# Context
Reference PR: huggingface#1

This PR enables:
- Llama-like HF models to work with 4D parallelism: FSDP, CP, TP, PP
(and the combinations between them). The following models were tested:
  - `meta-llama/Llama-3.2-1B`
  - `microsoft/phi-2`
  - `Qwen/Qwen2.5-7B`
  - `mistralai/Mistral-7B-v0.1`
  - `ByteDance-Seed/Seed-Coder-8B-Instruct`
  - `Qwen/Qwen3-4B-Instruct-2507`
  - `arcee-ai/AFM-4.5B`
  - `ibm-granite/granite-3b-code-base-2k`
  - `baidu/ERNIE-4.5-0.3B-Base-PT`
  - `kyutai/helium-1-preview-2b`
  - `allenai/OLMo-7B-hf`
  - `mistralai/Ministral-8B-Instruct-2410`
- Patching HF model weight initialisation. Without this, the `loss` and `grad_norm` start very high

# Usage

- Requirements: `transformers==4.57.1`
- Config:
`torchtitan/torchtitan/experiments/transformers_backend/configs/qwen3.toml`
```diff
...
[model]
- name = "llama3"
+ name = "transformers_backend"
flavor = "debugmodel"
hf_assets_path = "./tests/assets/tokenizer"

+[hf_transformers]
+model = "Qwen/Qwen3-4B-Instruct-2507"
...
```
- Train: `LOG_RANK=7
CONFIG_FILE=<YOUR_PATH>/torchtitan/experiments/transformers_backend/configs/qwen3.toml
./run_train.sh
--job.custom_config_module=torchtitan.experiments.transformers_backend.job_config
--compile.enable`
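
A quick, optional sanity check before launching, using only the public transformers API (the printed attributes are standard HF config fields):

```python
# Optional pre-flight check: confirm the pinned transformers version and that the
# model id from [hf_transformers] resolves to a valid HF config.
import transformers
from transformers import AutoConfig

print("transformers", transformers.__version__)  # expected: 4.57.1
cfg = AutoConfig.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
print(type(cfg).__name__, cfg.hidden_size, cfg.num_hidden_layers)
```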

<img width="1334" height="453" alt="image"
src="https://github.com/user-attachments/assets/da459448-027b-4af9-8176-6a3e433a272c"
/>

# Testing methodology

<img width="2672" height="2018" alt="image"
src="https://github.com/user-attachments/assets/66d8689d-7ede-47e3-b389-d4fc1bdd70f7"
/>

- Following the
[converging.md](https://github.com/pytorch/torchtitan/blob/main/docs/converging.md)
guidelines, I am comparing the baseline `FSDP=2` vs `FSDP=2 & <other
//-ism>`
- More precisely, `test_hf_integration.py` generates the following results layout:

```bash
    results/
        |_ meta-llama
            |_ Llama-3.2-1B
                |_ debugmodel/
                    |_ seed_checkpoint/
                        |_ config.toml
                        |_ seed.slurm
                        |_ step-0/
                           |_ ....
                    |_ fsdp2_tp1_cp1_pp1/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                    |_ fsdp2_tp2_cp1_pp1/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                    |_ fsdp2_tp1_cp1_pp2/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                    |_ fsdp2_tp1_cp2_pp1/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                    |_ fsdp2_tp1_cp2_pp2/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                |_ full/
                ...
```
- Here is the grid-search script used to test the HF modeling:
```shell
#!/usr/bin/bash
model_names=(
     "meta-llama/Llama-3.2-1B"
     "microsoft/phi-2" 
     "Qwen/Qwen2.5-7B"
     "mistralai/Mistral-7B-v0.1"
     "ByteDance-Seed/Seed-Coder-8B-Instruct"
     "Qwen/Qwen3-4B-Instruct-2507" 
     "arcee-ai/AFM-4.5B" 
     "ibm-granite/granite-3b-code-base-2k" 
     "baidu/ERNIE-4.5-0.3B-Base-PT" 
     "kyutai/helium-1-preview-2b" 
     "allenai/OLMo-7B-hf"
     "mistralai/Ministral-8B-Instruct-2410" 
)

for model_name in "${model_names[@]}"; do
    rm -rf slurm_results/${model_name}

    python test_hf_integration.py create_configs --model_name "$model_name" --out_dir slurm_results --flavor debugmodel
    python test_hf_integration.py submit_jobs --inp_dir slurm_results/${model_name}/debugmodel/seed_checkpoint --qos high
    while [ ! -f slurm_results/${model_name}/debugmodel/seed_checkpoint/status.txt ] || [ "$(cat slurm_results/${model_name}/debugmodel/seed_checkpoint/status.txt)" != "completed" ]; do
        echo "Waiting for seed checkpoint from ${model_name} to complete ..."
        sleep 1
    done
    python test_hf_integration.py submit_jobs --inp_dir slurm_results/${model_name}/debugmodel --qos high
    echo "================"
done
```

# Further tasks

- MoE (handled in PR huggingface#3)
	- Missing `build_optimizers_with_moe_load_balancing` support for MoE
	- Missing TP/PP/EP support for MoE
- When using the HF modeling, in the `FSDP=2 vs FSDP=2 + PP=2` test the `loss` and `grad_norm` are not bitwise matching (though they still converge), while they do match with the Torchtitan modeling (issue tracked in huggingface#4)
- Add convergence tests to CI using a tiny model + gloo backend (once PP is bitwise matching)
- The HF modeling has lower MFU than the Torchtitan modeling
- NOTE: set `torch._dynamo.config.cache_size_limit = 128` to avoid graph recompilation when using `torch.compile` with activation checkpointing