Description
Hi! I've been using spaCy over the last few weeks to fine-tune a roberta-base model for NER. So far, the experience has been great and I'm able to train and use the fine-tuned models without any issues.
I now wanted to enable mixed precision to speed up the training process. However, when I do that, I get the following error:
File "/usr/local/lib/python3.10/dist-packages/thinc/shims/pytorch_grad_scaler.py", line 171, in update
torch._amp_update_scale_(
RuntimeError: current_scale must be a float tensor.
Toggling mixed_precision back to false results in successful training.
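For reference, this is the kind of bare-bones PyTorch AMP loop I would run to check whether GradScaler.update() works on its own on this runtime (just a sketch with a placeholder model and optimizer, not my actual pipeline; it uses the same init_scale as the grad_scaler_config in my config further down):

import torch

# Placeholder model/optimizer purely to exercise the scaler end to end.
model = torch.nn.Linear(8, 2).to("cuda:0")
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Same init_scale as grad_scaler_config in my spaCy config below.
scaler = torch.cuda.amp.GradScaler(init_scale=32768)

x = torch.randn(4, 8, device="cuda:0")
y = torch.randint(0, 2, (4,), device="cuda:0")

with torch.cuda.amp.autocast():
    loss = torch.nn.functional.cross_entropy(model(x), y)

scaler.scale(loss).backward()  # scale the loss before backprop
scaler.step(optimizer)         # unscales gradients, then optimizer.step()
scaler.update()                # the step that fails inside the thinc shim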
Traceback
/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
ℹ Saving to output directory: spacy_trained_pipeline_en
ℹ Using GPU: 0
=========================== Initializing pipeline ===========================
/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
✔ Initialized pipeline
============================= Training pipeline =============================
ℹ Pipeline: ['transformer', 'ner']
ℹ Initial learn rate: 0.0
E # LOSS TRANS... LOSS NER ENTS_F ENTS_P ENTS_R SCORE
--- ------ ------------- -------- ------ ------ ------ ------
⚠ Aborting and saving the final best model. Encountered exception:
RuntimeError('current_scale must be a float tensor.')
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/spacy/__main__.py", line 4, in <module>
setup_cli()
File "/usr/local/lib/python3.10/dist-packages/spacy/cli/_util.py", line 87, in setup_cli
command(prog_name=COMMAND)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 783, in main
return _main(
File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 225, in _main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/usr/local/lib/python3.10/dist-packages/spacy/cli/train.py", line 54, in train_cli
train(config_path, output_path, use_gpu=use_gpu, overrides=overrides)
File "/usr/local/lib/python3.10/dist-packages/spacy/cli/train.py", line 84, in train
train_nlp(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)
File "/usr/local/lib/python3.10/dist-packages/spacy/training/loop.py", line 135, in train
raise e
File "/usr/local/lib/python3.10/dist-packages/spacy/training/loop.py", line 118, in train
for batch, info, is_best_checkpoint in training_step_iterator:
File "/usr/local/lib/python3.10/dist-packages/spacy/training/loop.py", line 236, in train_while_improving
proc.finish_update(optimizer) # type: ignore[attr-defined]
File "spacy/pipeline/trainable_pipe.pyx", line 252, in spacy.pipeline.trainable_pipe.TrainablePipe.finish_update
File "/usr/local/lib/python3.10/dist-packages/thinc/model.py", line 342, in finish_update
shim.finish_update(optimizer)
File "/usr/local/lib/python3.10/dist-packages/thinc/shims/pytorch.py", line 180, in finish_update
self._grad_scaler.update()
File "/usr/local/lib/python3.10/dist-packages/thinc/shims/pytorch_grad_scaler.py", line 171, in update
torch._amp_update_scale_(
RuntimeError: current_scale must be a float tensor.
To me, this hints that the grad_scaler_config is somehow not getting to PyTorch, but I'm not sure what I'm doing wrong.
I'm following the example config from spacy-transformers.TransformerModel.v3.
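If I understand the TransformerModel.v3 docs correctly, grad_scaler_config is passed through to thinc.api.PyTorchGradScaler when mixed_precision is enabled, so my settings should boil down to roughly this (my assumption about the plumbing, not something I've verified in the source):

from thinc.api import PyTorchGradScaler

# Assumed equivalent of mixed_precision = true plus
# grad_scaler_config = {"init_scale": 32768} from the config below.
scaler = PyTorchGradScaler(enabled=True, init_scale=32768)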
My config file, trf_config.cfg
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
[system]
gpu_allocator = "pytorch"
seed = 0
[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 64
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}
[components]
[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null
[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"
[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
# mixed_precision = false
mixed_precision = true
grad_scaler_config = {"init_scale": 32768}
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96
[components.transformer.model.tokenizer_config]
use_fast = true
[components.transformer.model.transformer_config]
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null
[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 200
buffer = 256
get_length = null
[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005
[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
[initialize.tokenizer]
How to reproduce the behaviour
I'm running the training on Google Colab, using a Tesla T4 runtime:
!nvidia-smi -L
!export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
GPU 0: Tesla T4 (UUID: GPU-0c3e659f-2933-c77e-7694-6112031f1cef)
I've tried not executing the line !export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, but it doesn't make a difference.
I've also made sure that I call spacy train with --gpu-id 0.
Here are the exact steps of the Colab notebook I use:
Colab notebook
!nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
!pip install spacy[cuda12x,transformers] transformers[sentencepiece]
!pip freeze | grep cupy
cupy-cuda12x==12.2.0
!python -m spacy download en_core_web_trf
!nvidia-smi -L
!export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
GPU 0: Tesla T4 (UUID: GPU-0c3e659f-2933-c77e-7694-6112031f1cef)
!pip3 freeze | grep torch
torch @ https://download.pytorch.org/whl/cu121/torch-2.3.0%2Bcu121-cp310-cp310-linux_x86_64.whl#sha256=0a12aa9aa6bc442dff8823ac8b48d991fd0771562eaa38593f9c8196d65f7007
torchaudio @ https://download.pytorch.org/whl/cu121/torchaudio-2.3.0%2Bcu121-cp310-cp310-linux_x86_64.whl#sha256=38b49393f8c322dcaa29d19e5acbf5a0b1978cf1b719445ab670f1fb486e3aa6
torchsummary==1.5.1
torchtext==0.18.0
torchvision @ https://download.pytorch.org/whl/cu121/torchvision-0.18.0%2Bcu121-cp310-cp310-linux_x86_64.whl#sha256=13e1b48dc5ce41ccb8100ab3dd26fdf31d8f1e904ecf2865ac524493013d0df5
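A quick sanity check along these lines is what I'd run in the same notebook to confirm that torch, CuPy and spaCy all see the GPU (sketch, output omitted):

import torch, cupy, spacy

# Quick check of the CUDA stack the training run will see.
print("torch:", torch.__version__, "built for CUDA", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("device:", torch.cuda.get_device_name(0))
print("cupy CUDA runtime:", cupy.cuda.runtime.runtimeGetVersion())
print("spaCy prefer_gpu:", spacy.prefer_gpu())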
!python -m spacy train ./trf_config.cfg --output ./spacy_trained_pipeline_en --paths.train "train.spacy" --paths.dev "dev.spacy" --gpu-id 0
Could you please give me a hand? Thanks a lot!
Info about spaCy
- spaCy version: 3.7.4
- Platform: Linux-6.1.85+-x86_64-with-glibc2.35
- Python version: 3.10.12
- Pipelines: en_core_web_trf (3.7.3), en_core_web_sm (3.7.1)