
Reuse pre-trained VQ-VAE throws RuntimeError #99

Open
ObscuraDK opened this issue May 31, 2020 · 3 comments

@ObscuraDK
Hi there.
I am trying to reuse the pre-trained VQ-VAE with 118 audio files (44.1 kHz, 16-bit) on a 1080 Ti.

Executing this:
mpiexec -n 1 python jukebox/train.py --hps=vqvae,small_prior,all_fp16,cpu_ema --name=pretrained_vqvae_small_prior --sample_length=1048576 --bs=4 --aug_shift --aug_blend --audio_files_dir='/home/vertigo/jukebox/learning' --labels=False --train --test --prior --levels=3 --level=2 --weight_decay=0.01 --save_iters=1000

I get this:
Using cuda True
0: Found 118 files. Getting durations
0: self.sr=44100, min: 24, max: inf
0: Keeping 118 of 118 files
{'l2': 0.010829454215347525, 'l1': 0.07307693362236023, 'spec': 4325.51904296875}
Creating Data Loader
0: Train 859 samples. Test 96 samples
0: Train sampler: <torch.utils.data.distributed.DistributedSampler object at 0x7f29043f8110>
0: Train loader: 214
Downloading from gce
Restored from /home/vertigo/.cache/jukebox-assets/models/5b/vqvae.pth.tar
0: Loading vqvae in eval mode
Parameters VQVAE:0
Level:2, Cond downsample:None, Raw to tokens:128, Sample length:1048576
0: Converting to fp16 params
0: Loading prior in train mode
Parameters Prior:161862656
{'dynamic': True, 'loss_scale': 65536.0, 'max_loss_scale': 16777216.0, 'scale_factor': 1.0027764359010778, 'scale_window': 1, 'unskipped': 0, 'overflow': False}
Using CPU EMA
Logging to logs/pretrained_vqvae_small_prior
0/214 [00:08<?, ?it/s]
Traceback (most recent call last):
File "jukebox/train.py", line 341, in
fire.Fire(run)
File "/home/vertigo/miniconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 127, in Fire
component_trace = _Fire(component, args, context, name)
File "/home/vertigo/miniconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 366, in _Fire
component, remaining_args)
File "/home/vertigo/miniconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 542, in CallCallable
result = fn(*varargs, **kwargs)
File "jukebox/train.py", line 325, in run
train_metrics = train(distributed_model, model, opt, shd, scalar, ema, logger, metrics, data_processor, hps)
File "jukebox/train.py", line 241, in train
opt.step(scale=clipped_grad_scale(grad_norm, hps.clip, scale))
File "/home/vertigo/jukebox/jukebox/utils/fp16.py", line 213, in step
group["weight_decay"],
File "/home/vertigo/jukebox/jukebox/utils/fp16.py", line 29, in adam_step
p.add_(exp_avg/denom + weight_decay*p.float(), alpha=-step_size)
RuntimeError: output with backend CUDA and dtype Half doesn't match the desired backend CUDA and dtype Float
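
For context, a minimal reproduction of the dtype mismatch behind this error (a sketch, assuming a CUDA device and an older PyTorch build such as the one jukebox targets, where an in-place add_ on a Half tensor with a Float argument raises exactly this RuntimeError):

```python
import torch

# fp16 parameter, as produced by training with all_fp16
p = torch.zeros(4, device='cuda', dtype=torch.float16)
# fp32 update, as computed by exp_avg/denom + weight_decay*p.float()
update = torch.ones(4, device='cuda', dtype=torch.float32)

# p.add_(update, alpha=-0.1)  # RuntimeError: dtype Half vs. dtype Float

# One possible workaround (an assumption, not jukebox's official fix):
# cast the update back to the parameter's dtype before the in-place add.
p.add_(update.to(p.dtype), alpha=-0.1)
```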

@prafullasd
Collaborator

Can you print the data types of all variables in this step, as well as the inputs to the adam_step function?
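
For reference, a hypothetical way to log those dtypes (a debug snippet to paste just above the failing p.add_ call in adam_step in jukebox/utils/fp16.py; the variable names are taken from the traceback above):

```python
# Hypothetical debug print: shows the dtype of each tensor entering the update.
print('p:', p.dtype,
      '| exp_avg:', exp_avg.dtype,
      '| denom:', denom.dtype,
      '| weight_decay*p.float():', (weight_decay * p.float()).dtype)
```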

@ObscuraDK
Author

Hi prafullasd.

Where do I find the step that you would like to see the variables from?
The variables from adam_step (line 29 in fp16.py) are pasted below:

A side note: I tried running without the 'all_fp16' parameter and got a little further.

Thank you for your help.

exp_avg: tensor([[-1.2535e-05, -7.5248e-07, -1.6771e-05, ..., 8.8499e-06,
-5.9975e-06, -9.2271e-06]], device='cuda:0')

denom: tensor([[3.9739e-06, 2.4796e-07, 5.3135e-06, ..., 2.8086e-06, 1.9066e-06,
2.9279e-06]], device='cuda:0')

weight_decay*p.float(): tensor([[ 1.2296e-05, -7.0621e-05, -4.8056e-05, ..., -1.3472e-04,
8.2957e-06, 6.4212e-05]], device='cuda:0')

step_size: 0.0
exp_avg: tensor([[ 3.3316e-06, -6.2995e-06, 4.2114e-07, ..., -1.0788e-05,
2.5579e-06, -9.8463e-06],
[-2.7420e-06, 2.7076e-06, -4.6100e-06, ..., 1.7860e-06,
7.7512e-08, 1.2060e-06],
[ 9.8920e-07, -7.2325e-06, 5.6269e-06, ..., -5.2558e-06,
1.7770e-06, -1.1786e-05],
...,
[-4.0166e-06, 3.6385e-06, -6.9851e-06, ..., 3.4009e-06,
-7.2514e-07, 1.7762e-06],
[-2.6218e-06, 2.4748e-06, -4.4383e-06, ..., 1.9369e-06,
-8.6355e-09, 1.2277e-06],
[-4.2350e-06, 5.1437e-08, -4.2196e-06, ..., -2.2541e-07,
-1.8059e-06, -3.9884e-07]], device='cuda:0')

denom: tensor([[1.0635e-06, 2.0021e-06, 1.4318e-07, ..., 3.4214e-06, 8.1887e-07,
3.1237e-06],
[8.7708e-07, 8.6623e-07, 1.4678e-06, ..., 5.7479e-07, 3.4512e-08,
3.9138e-07],
[3.2281e-07, 2.2971e-06, 1.7894e-06, ..., 1.6720e-06, 5.7193e-07,
3.7370e-06],
...,
[1.2802e-06, 1.1606e-06, 2.2189e-06, ..., 1.0855e-06, 2.3931e-07,
5.7167e-07],
[8.3908e-07, 7.9259e-07, 1.4135e-06, ..., 6.2250e-07, 1.2731e-08,
3.9825e-07],
[1.3492e-06, 2.6266e-08, 1.3444e-06, ..., 8.1282e-08, 5.8107e-07,
1.3613e-07]], device='cuda:0')

weight_decay*p.float(): tensor([[ 4.8816e-05, 5.4078e-05, 1.6971e-04, ..., -4.7752e-05,
-1.0746e-04, 1.4871e-04],
[-1.7761e-04, 1.6470e-05, 6.5703e-05, ..., -1.8517e-04,
1.9216e-04, -1.2899e-04],
[-4.5714e-05, -2.5560e-04, 9.7563e-05, ..., 1.4601e-04,
-3.1741e-05, -2.3758e-04],
...,
[ 2.4744e-04, -1.1324e-04, 2.1962e-05, ..., -1.7920e-04,
1.2483e-04, -1.2064e-05],
[ 5.4863e-05, -1.9226e-05, 6.8406e-05, ..., -8.6350e-05,
-1.7733e-04, 2.1887e-04],
[ 2.1376e-04, -6.3089e-05, -1.3905e-04, ..., -3.5342e-05,
-2.2315e-05, -9.9548e-05]], device='cuda:0')

step_size: 0.0
exp_avg: tensor([[-1.2535e-05, -7.5248e-07, -1.6771e-05, ..., 8.8499e-06,
-5.9975e-06, -9.2271e-06],
[ 9.2020e-07, 4.9647e-07, -1.0082e-05, ..., 1.0352e-06,
5.9144e-07, -8.2584e-06],
[ 5.7311e-07, 1.9648e-06, -8.5980e-06, ..., 6.2198e-07,
7.5082e-07, -1.0802e-05],
...,
[-1.9820e-07, -6.6953e-08, -6.9843e-07, ..., -1.0760e-08,
-2.1478e-07, -1.3698e-07],
[ 1.2750e-07, -1.2141e-07, -4.7240e-07, ..., 2.8985e-07,
-5.4958e-07, -2.6231e-07],
[-1.2559e-07, -4.0557e-07, 2.3975e-07, ..., 3.0800e-07,
7.6356e-07, 3.8858e-07]], device='cuda:0')

denom: tensor([[3.9739e-06, 2.4796e-07, 5.3135e-06, ..., 2.8086e-06, 1.9066e-06,
2.9279e-06],
[3.0099e-07, 1.6700e-07, 3.1981e-06, ..., 3.3734e-07, 1.9703e-07,
2.6215e-06],
[1.9123e-07, 6.3132e-07, 2.7289e-06, ..., 2.0669e-07, 2.4743e-07,
3.4260e-06],
...,
[7.2678e-08, 3.1172e-08, 2.3086e-07, ..., 1.3403e-08, 7.7919e-08,
5.3316e-08],
[5.0320e-08, 4.8394e-08, 1.5939e-07, ..., 1.0166e-07, 1.8379e-07,
9.2950e-08],
[4.9716e-08, 1.3825e-07, 8.5815e-08, ..., 1.0740e-07, 2.5146e-07,
1.3288e-07]], device='cuda:0')

weight_decay*p.float(): tensor([[ 7.7831e-06, -7.0554e-05, 5.0016e-06, ..., 1.0374e-04,
1.8247e-04, 2.1118e-04],
[-1.0152e-04, 1.1799e-05, 1.6475e-05, ..., -1.5426e-05,
3.7702e-05, 8.4314e-05],
[ 1.5354e-04, 2.9751e-05, 4.2645e-05, ..., 3.1049e-05,
5.8405e-05, 2.2491e-05],
...,
[-5.2624e-05, -1.5191e-04, -5.7607e-05, ..., -1.5991e-06,
-3.6916e-05, -7.8471e-05],
[ 4.3725e-05, 4.3821e-05, 2.3359e-05, ..., 2.9689e-05,
-5.9945e-05, 1.8269e-04],
[-8.5394e-05, 3.6782e-05, -2.0110e-05, ..., -1.5885e-05,
-1.1521e-05, 4.9064e-05]], device='cuda:0')

step_size: 0.0
exp_avg: tensor([[ 2.9516e-07, -5.2174e-07, 9.7740e-07, ..., -3.3729e-05,
-2.4057e-06, 3.5328e-05],
[ 2.9855e-06, 1.1829e-06, -3.5897e-06, ..., 3.8940e-06,
3.0810e-06, -1.3207e-04],
[-9.1298e-07, -9.2131e-07, 2.4368e-06, ..., -9.2853e-06,
8.0324e-06, 4.7875e-05],
...,
[ 1.1707e-06, 7.9413e-07, -2.2969e-06, ..., -5.0602e-06,
6.1309e-06, -4.5671e-05],
[-1.1185e-06, -5.1674e-07, 2.8389e-06, ..., -2.6390e-05,
8.6911e-07, 6.6641e-05],
[ 6.2823e-08, -7.5040e-08, -6.5752e-07, ..., -1.1183e-07,
-9.9606e-06, 2.5981e-05]], device='cuda:0')

denom: tensor([[1.0334e-07, 1.7499e-07, 3.1908e-07, ..., 1.0676e-05, 7.7076e-07,
1.1182e-05],
[9.5410e-07, 3.8406e-07, 1.1452e-06, ..., 1.2414e-06, 9.8430e-07,
4.1775e-05],
[2.9871e-07, 3.0134e-07, 7.8059e-07, ..., 2.9463e-06, 2.5501e-06,
1.5149e-05],
...,
[3.8019e-07, 2.6113e-07, 7.3634e-07, ..., 1.6102e-06, 1.9488e-06,
1.4452e-05],
[3.6369e-07, 1.7341e-07, 9.0774e-07, ..., 8.3552e-06, 2.8484e-07,
2.1084e-05],
[2.9866e-08, 3.3730e-08, 2.1793e-07, ..., 4.5364e-08, 3.1598e-06,
8.2259e-06]], device='cuda:0')

weight_decay*p.float(): tensor([[ 2.3758e-04, 1.2268e-04, -1.3771e-04, ..., -2.8629e-05,
4.6849e-06, 9.8801e-05],
[-1.5091e-04, -2.2690e-04, -5.1804e-05, ..., 2.1317e-04,
1.9440e-04, -1.8677e-04],
[-4.9248e-05, 1.4137e-04, 9.7418e-06, ..., 8.4877e-06,
9.6817e-05, 5.9166e-05],
...,
[ 2.5085e-04, -4.0321e-05, -1.9638e-04, ..., 2.2888e-06,
9.4833e-05, 6.6032e-05],
[ 1.0468e-04, -2.8305e-04, -1.6296e-04, ..., 2.0859e-04,
-9.9716e-05, -6.4049e-05],
[-1.3451e-04, -1.6785e-04, 4.7226e-05, ..., 7.4501e-05,
-8.6288e-05, 1.0010e-04]], device='cuda:0')

step_size: 0.0

@worosom

worosom commented Jun 11, 2021

I was also running into this issue.
Then I installed apex, and now everything works.
It seems the "standard" (non-apex) implementation of the adam_step function doesn't work with fp16 models.
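
For what it's worth, here is a minimal sketch of a dtype-safe variant of that update (an assumption about one possible workaround, not jukebox's or apex's actual code): do the arithmetic in fp32, then cast the result back to the parameter's own dtype before the in-place add.

```python
import torch

def adam_step_dtype_safe(p, exp_avg, denom, weight_decay, step_size):
    # Hypothetical helper: compute the Adam update in fp32 for stability,
    # then cast it back to p's dtype so add_ sees matching dtypes.
    with torch.no_grad():
        update = exp_avg / denom + weight_decay * p.float()
        p.add_(update.to(p.dtype), alpha=-step_size)
```

apex's FusedAdam performs the mixed fp16/fp32 update inside a fused CUDA kernel, which may be why installing apex makes the error go away.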
