
Reuse pre-trained VQ-VAE throws RuntimeError #99

Open
ObscuraDK opened this issue May 31, 2020 · 3 comments

@ObscuraDK
Hi there.
I am trying to reuse the pre-trained VQ-VAE with 118 audio files (44.1 kHz, 16-bit) on a 1080 Ti.

Executing this:
mpiexec -n 1 python jukebox/train.py --hps=vqvae,small_prior,all_fp16,cpu_ema --name=pretrained_vqvae_small_prior --sample_length=1048576 --bs=4 --aug_shift --aug_blend --audio_files_dir='/home/vertigo/jukebox/learning' --labels=False --train --test --prior --levels=3 --level=2 --weight_decay=0.01 --save_iters=1000

I get this:
Using cuda True
0: Found 118 files. Getting durations
0: self.sr=44100, min: 24, max: inf
0: Keeping 118 of 118 files
{'l2': 0.010829454215347525, 'l1': 0.07307693362236023, 'spec': 4325.51904296875}
Creating Data Loader
0: Train 859 samples. Test 96 samples
0: Train sampler: <torch.utils.data.distributed.DistributedSampler object at 0x7f29043f8110>
0: Train loader: 214
Downloading from gce
Restored from /home/vertigo/.cache/jukebox-assets/models/5b/vqvae.pth.tar
0: Loading vqvae in eval mode
Parameters VQVAE:0
Level:2, Cond downsample:None, Raw to tokens:128, Sample length:1048576
0: Converting to fp16 params
0: Loading prior in train mode
Parameters Prior:161862656
{'dynamic': True, 'loss_scale': 65536.0, 'max_loss_scale': 16777216.0, 'scale_factor': 1.0027764359010778, 'scale_window': 1, 'unskipped': 0, 'overflow': False}
Using CPU EMA
Logging to logs/pretrained_vqvae_small_prior
0/214 [00:08<?, ?it/s]
Traceback (most recent call last):
File "jukebox/train.py", line 341, in
fire.Fire(run)
File "/home/vertigo/miniconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 127, in Fire
component_trace = _Fire(component, args, context, name)
File "/home/vertigo/miniconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 366, in _Fire
component, remaining_args)
File "/home/vertigo/miniconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 542, in CallCallable
result = fn(*varargs, **kwargs)
File "jukebox/train.py", line 325, in run
train_metrics = train(distributed_model, model, opt, shd, scalar, ema, logger, metrics, data_processor, hps)
File "jukebox/train.py", line 241, in train
opt.step(scale=clipped_grad_scale(grad_norm, hps.clip, scale))
File "/home/vertigo/jukebox/jukebox/utils/fp16.py", line 213, in step
group["weight_decay"],
File "/home/vertigo/jukebox/jukebox/utils/fp16.py", line 29, in adam_step
p.add_(exp_avg/denom + weight_decay*p.float(), alpha=-step_size)
RuntimeError: output with backend CUDA and dtype Half doesn't match the desired backend CUDA and dtype Float
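
For context, a minimal reproduction of the dtype mismatch behind this error (a sketch, assuming a CUDA device and an older PyTorch build such as the one jukebox targets, where an in-place add_ on a Half tensor with a Float argument raises exactly this RuntimeError):

```python
import torch

# fp16 parameter, as produced by training with all_fp16
p = torch.zeros(4, device='cuda', dtype=torch.float16)
# fp32 update, as computed by exp_avg/denom + weight_decay*p.float()
update = torch.ones(4, device='cuda', dtype=torch.float32)

# p.add_(update, alpha=-0.1)  # RuntimeError: dtype Half vs. dtype Float

# One possible workaround (an assumption, not jukebox's official fix):
# cast the update back to the parameter's dtype before the in-place add.
p.add_(update.to(p.dtype), alpha=-0.1)
```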

@prafullasd
Collaborator

Can you print the data types of all variables in this step, as well as the inputs to the adam_step function?
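
For reference, a hypothetical way to log those dtypes (a debug snippet to paste just above the failing p.add_ call in adam_step in jukebox/utils/fp16.py; the variable names are taken from the traceback above):

```python
# Hypothetical debug print: shows the dtype of each tensor entering the update.
print('p:', p.dtype,
      '| exp_avg:', exp_avg.dtype,
      '| denom:', denom.dtype,
      '| weight_decay*p.float():', (weight_decay * p.float()).dtype)
```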

@ObscuraDK
Author

Hi prafullasd.

Where do I find the step that you would like to see the variables from?
The variables from adam_step (line 29 in fp16.py) are pasted below:

A side note: I tried running without the 'all_fp16' parameter and got a little further.

Thank you for your help.

exp_avg: tensor([[-1.2535e-05, -7.5248e-07, -1.6771e-05, ..., 8.8499e-06,
-5.9975e-06, -9.2271e-06]], device='cuda:0')

denom: tensor([[3.9739e-06, 2.4796e-07, 5.3135e-06, ..., 2.8086e-06, 1.9066e-06,
2.9279e-06]], device='cuda:0')

weight_decay*p.float(): tensor([[ 1.2296e-05, -7.0621e-05, -4.8056e-05, ..., -1.3472e-04,
8.2957e-06, 6.4212e-05]], device='cuda:0')

step_size: 0.0
exp_avg: tensor([[ 3.3316e-06, -6.2995e-06, 4.2114e-07, ..., -1.0788e-05,
2.5579e-06, -9.8463e-06],
[-2.7420e-06, 2.7076e-06, -4.6100e-06, ..., 1.7860e-06,
7.7512e-08, 1.2060e-06],
[ 9.8920e-07, -7.2325e-06, 5.6269e-06, ..., -5.2558e-06,
1.7770e-06, -1.1786e-05],
...,
[-4.0166e-06, 3.6385e-06, -6.9851e-06, ..., 3.4009e-06,
-7.2514e-07, 1.7762e-06],
[-2.6218e-06, 2.4748e-06, -4.4383e-06, ..., 1.9369e-06,
-8.6355e-09, 1.2277e-06],
[-4.2350e-06, 5.1437e-08, -4.2196e-06, ..., -2.2541e-07,
-1.8059e-06, -3.9884e-07]], device='cuda:0')

denom: tensor([[1.0635e-06, 2.0021e-06, 1.4318e-07, ..., 3.4214e-06, 8.1887e-07,
3.1237e-06],
[8.7708e-07, 8.6623e-07, 1.4678e-06, ..., 5.7479e-07, 3.4512e-08,
3.9138e-07],
[3.2281e-07, 2.2971e-06, 1.7894e-06, ..., 1.6720e-06, 5.7193e-07,
3.7370e-06],
...,
[1.2802e-06, 1.1606e-06, 2.2189e-06, ..., 1.0855e-06, 2.3931e-07,
5.7167e-07],
[8.3908e-07, 7.9259e-07, 1.4135e-06, ..., 6.2250e-07, 1.2731e-08,
3.9825e-07],
[1.3492e-06, 2.6266e-08, 1.3444e-06, ..., 8.1282e-08, 5.8107e-07,
1.3613e-07]], device='cuda:0')

weight_decay*p.float(): tensor([[ 4.8816e-05, 5.4078e-05, 1.6971e-04, ..., -4.7752e-05,
-1.0746e-04, 1.4871e-04],
[-1.7761e-04, 1.6470e-05, 6.5703e-05, ..., -1.8517e-04,
1.9216e-04, -1.2899e-04],
[-4.5714e-05, -2.5560e-04, 9.7563e-05, ..., 1.4601e-04,
-3.1741e-05, -2.3758e-04],
...,
[ 2.4744e-04, -1.1324e-04, 2.1962e-05, ..., -1.7920e-04,
1.2483e-04, -1.2064e-05],
[ 5.4863e-05, -1.9226e-05, 6.8406e-05, ..., -8.6350e-05,
-1.7733e-04, 2.1887e-04],
[ 2.1376e-04, -6.3089e-05, -1.3905e-04, ..., -3.5342e-05,
-2.2315e-05, -9.9548e-05]], device='cuda:0')

step_size: 0.0
exp_avg: tensor([[-1.2535e-05, -7.5248e-07, -1.6771e-05, ..., 8.8499e-06,
-5.9975e-06, -9.2271e-06],
[ 9.2020e-07, 4.9647e-07, -1.0082e-05, ..., 1.0352e-06,
5.9144e-07, -8.2584e-06],
[ 5.7311e-07, 1.9648e-06, -8.5980e-06, ..., 6.2198e-07,
7.5082e-07, -1.0802e-05],
...,
[-1.9820e-07, -6.6953e-08, -6.9843e-07, ..., -1.0760e-08,
-2.1478e-07, -1.3698e-07],
[ 1.2750e-07, -1.2141e-07, -4.7240e-07, ..., 2.8985e-07,
-5.4958e-07, -2.6231e-07],
[-1.2559e-07, -4.0557e-07, 2.3975e-07, ..., 3.0800e-07,
7.6356e-07, 3.8858e-07]], device='cuda:0')

denom: tensor([[3.9739e-06, 2.4796e-07, 5.3135e-06, ..., 2.8086e-06, 1.9066e-06,
2.9279e-06],
[3.0099e-07, 1.6700e-07, 3.1981e-06, ..., 3.3734e-07, 1.9703e-07,
2.6215e-06],
[1.9123e-07, 6.3132e-07, 2.7289e-06, ..., 2.0669e-07, 2.4743e-07,
3.4260e-06],
...,
[7.2678e-08, 3.1172e-08, 2.3086e-07, ..., 1.3403e-08, 7.7919e-08,
5.3316e-08],
[5.0320e-08, 4.8394e-08, 1.5939e-07, ..., 1.0166e-07, 1.8379e-07,
9.2950e-08],
[4.9716e-08, 1.3825e-07, 8.5815e-08, ..., 1.0740e-07, 2.5146e-07,
1.3288e-07]], device='cuda:0')

weight_decay*p.float(): tensor([[ 7.7831e-06, -7.0554e-05, 5.0016e-06, ..., 1.0374e-04,
1.8247e-04, 2.1118e-04],
[-1.0152e-04, 1.1799e-05, 1.6475e-05, ..., -1.5426e-05,
3.7702e-05, 8.4314e-05],
[ 1.5354e-04, 2.9751e-05, 4.2645e-05, ..., 3.1049e-05,
5.8405e-05, 2.2491e-05],
...,
[-5.2624e-05, -1.5191e-04, -5.7607e-05, ..., -1.5991e-06,
-3.6916e-05, -7.8471e-05],
[ 4.3725e-05, 4.3821e-05, 2.3359e-05, ..., 2.9689e-05,
-5.9945e-05, 1.8269e-04],
[-8.5394e-05, 3.6782e-05, -2.0110e-05, ..., -1.5885e-05,
-1.1521e-05, 4.9064e-05]], device='cuda:0')

step_size: 0.0
exp_avg: tensor([[ 2.9516e-07, -5.2174e-07, 9.7740e-07, ..., -3.3729e-05,
-2.4057e-06, 3.5328e-05],
[ 2.9855e-06, 1.1829e-06, -3.5897e-06, ..., 3.8940e-06,
3.0810e-06, -1.3207e-04],
[-9.1298e-07, -9.2131e-07, 2.4368e-06, ..., -9.2853e-06,
8.0324e-06, 4.7875e-05],
...,
[ 1.1707e-06, 7.9413e-07, -2.2969e-06, ..., -5.0602e-06,
6.1309e-06, -4.5671e-05],
[-1.1185e-06, -5.1674e-07, 2.8389e-06, ..., -2.6390e-05,
8.6911e-07, 6.6641e-05],
[ 6.2823e-08, -7.5040e-08, -6.5752e-07, ..., -1.1183e-07,
-9.9606e-06, 2.5981e-05]], device='cuda:0')

denom: tensor([[1.0334e-07, 1.7499e-07, 3.1908e-07, ..., 1.0676e-05, 7.7076e-07,
1.1182e-05],
[9.5410e-07, 3.8406e-07, 1.1452e-06, ..., 1.2414e-06, 9.8430e-07,
4.1775e-05],
[2.9871e-07, 3.0134e-07, 7.8059e-07, ..., 2.9463e-06, 2.5501e-06,
1.5149e-05],
...,
[3.8019e-07, 2.6113e-07, 7.3634e-07, ..., 1.6102e-06, 1.9488e-06,
1.4452e-05],
[3.6369e-07, 1.7341e-07, 9.0774e-07, ..., 8.3552e-06, 2.8484e-07,
2.1084e-05],
[2.9866e-08, 3.3730e-08, 2.1793e-07, ..., 4.5364e-08, 3.1598e-06,
8.2259e-06]], device='cuda:0')

weight_decay*p.float(): tensor([[ 2.3758e-04, 1.2268e-04, -1.3771e-04, ..., -2.8629e-05,
4.6849e-06, 9.8801e-05],
[-1.5091e-04, -2.2690e-04, -5.1804e-05, ..., 2.1317e-04,
1.9440e-04, -1.8677e-04],
[-4.9248e-05, 1.4137e-04, 9.7418e-06, ..., 8.4877e-06,
9.6817e-05, 5.9166e-05],
...,
[ 2.5085e-04, -4.0321e-05, -1.9638e-04, ..., 2.2888e-06,
9.4833e-05, 6.6032e-05],
[ 1.0468e-04, -2.8305e-04, -1.6296e-04, ..., 2.0859e-04,
-9.9716e-05, -6.4049e-05],
[-1.3451e-04, -1.6785e-04, 4.7226e-05, ..., 7.4501e-05,
-8.6288e-05, 1.0010e-04]], device='cuda:0')

step_size: 0.0

@worosom

worosom commented Jun 11, 2021

I was also running into this issue.
Then I installed apex, and now everything works.
It seems the "standard" (non-apex) implementation of the adam_step function doesn't work with fp16 models.
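
For what it's worth, here is a minimal sketch of a dtype-safe variant of that update (an assumption about one possible workaround, not jukebox's or apex's actual code): do the arithmetic in fp32, then cast the result back to the parameter's own dtype before the in-place add.

```python
import torch

def adam_step_dtype_safe(p, exp_avg, denom, weight_decay, step_size):
    # Hypothetical helper: compute the Adam update in fp32 for stability,
    # then cast it back to p's dtype so add_ sees matching dtypes.
    with torch.no_grad():
        update = exp_avg / denom + weight_decay * p.float()
        p.add_(update.to(p.dtype), alpha=-step_size)
```

apex's FusedAdam performs the mixed fp16/fp32 update inside a fused CUDA kernel, which may be why installing apex makes the error go away.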
