Reuse pre-trained VQ-VAE throws RuntimeError #99
Hi there.

I am trying to reuse the pre-trained VQ-VAE with 118 audio files (44.1 kHz, 16-bit) on a 1080 Ti.

Executing this:

mpiexec -n 1 python jukebox/train.py --hps=vqvae,small_prior,all_fp16,cpu_ema --name=pretrained_vqvae_small_prior --sample_length=1048576 --bs=4 --aug_shift --aug_blend --audio_files_dir='/home/vertigo/jukebox/learning' --labels=False --train --test --prior --levels=3 --level=2 --weight_decay=0.01 --save_iters=1000

I get this:
Using cuda True
0: Found 118 files. Getting durations
0: self.sr=44100, min: 24, max: inf
0: Keeping 118 of 118 files
{'l2': 0.010829454215347525, 'l1': 0.07307693362236023, 'spec': 4325.51904296875}
Creating Data Loader
0: Train 859 samples. Test 96 samples
0: Train sampler: <torch.utils.data.distributed.DistributedSampler object at 0x7f29043f8110>
0: Train loader: 214
Downloading from gce
Restored from /home/vertigo/.cache/jukebox-assets/models/5b/vqvae.pth.tar
0: Loading vqvae in eval mode
Parameters VQVAE:0
Level:2, Cond downsample:None, Raw to tokens:128, Sample length:1048576
0: Converting to fp16 params
0: Loading prior in train mode
Parameters Prior:161862656
{'dynamic': True, 'loss_scale': 65536.0, 'max_loss_scale': 16777216.0, 'scale_factor': 1.0027764359010778, 'scale_window': 1, 'unskipped': 0, 'overflow': False}
Using CPU EMA
Logging to logs/pretrained_vqvae_small_prior
0/214 [00:08<?, ?it/s]
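(An aside on the state dict logged a few lines up, since it matters for fp16 debugging: those fields drive the usual dynamic loss-scaling rule. The sketch below is a generic illustration of that rule using the logged field names, not jukebox's exact code.)

```python
# Generic dynamic loss scaling, using the field names from the logged state
# dict. On overflow the step is skipped and the scale backs off; after
# scale_window clean steps (here 1) the scale grows by scale_factor (~1.0028x),
# capped at max_loss_scale. Illustration only, not jukebox's implementation.
def update_loss_scale(state, grads_overflowed):
    if grads_overflowed:
        state["loss_scale"] /= 2.0
        state["unskipped"] = 0
    else:
        state["unskipped"] += 1
        if state["unskipped"] % state["scale_window"] == 0:
            state["loss_scale"] = min(state["loss_scale"] * state["scale_factor"],
                                      state["max_loss_scale"])
    state["overflow"] = grads_overflowed
```

The run then dies on the very first optimizer step: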
Traceback (most recent call last):
  File "jukebox/train.py", line 341, in <module>
    fire.Fire(run)
  File "/home/vertigo/miniconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/vertigo/miniconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/vertigo/miniconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "jukebox/train.py", line 325, in run
    train_metrics = train(distributed_model, model, opt, shd, scalar, ema, logger, metrics, data_processor, hps)
  File "jukebox/train.py", line 241, in train
    opt.step(scale=clipped_grad_scale(grad_norm, hps.clip, scale))
  File "/home/vertigo/jukebox/jukebox/utils/fp16.py", line 213, in step
    group["weight_decay"],
  File "/home/vertigo/jukebox/jukebox/utils/fp16.py", line 29, in adam_step
    p.add(exp_avg/denom + weight_decay*p.float(), alpha=-step_size)
RuntimeError: output with backend CUDA and dtype Half doesn't match the desired backend CUDA and dtype Float
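(For context, this is PyTorch's generic complaint when an operation would have to write a Float result into a Half tensor: in the failing line, p is fp16 while exp_avg/denom + weight_decay*p.float() is promoted to fp32. A minimal sketch that triggers the same class of error, with made-up tensors rather than jukebox code:)

```python
import torch

# Hypothetical standalone repro: an fp16 (Half) parameter receiving an fp32
# (Float) update in place. The exact message varies with the PyTorch version,
# but the in-place add of a Float update into a Half tensor raises a
# Half-vs-Float RuntimeError.
p = torch.zeros(3, dtype=torch.float16, device="cuda")       # Half parameter
update = torch.zeros(3, dtype=torch.float32, device="cuda")  # Float update
p.add_(update, alpha=-0.1)  # RuntimeError: dtype Half vs. dtype Float
```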
Comments

Can you print the data types of all the variables in this step, as well as the inputs to the function adam_step?

Hi prafullasd. Where do I find the step that you would like to see the variables from? A side note: I tried to run without the 'all_fp16' parameter and got a little bit further. Thank you for your help.

exp_avg: tensor([[-1.2535e-05, -7.5248e-07, -1.6771e-05, ...,  8.8499e-06,
weight_decay*p.float(): tensor([[ 1.2296e-05, -7.0621e-05, -4.8056e-05, ..., -1.3472e-04,
step_size: 0.0
weight_decay*p.float(): tensor([[ 4.8816e-05,  5.4078e-05,  1.6971e-04, ..., -4.7752e-05,
step_size: 0.0
weight_decay*p.float(): tensor([[ 7.7831e-06, -7.0554e-05,  5.0016e-06, ...,  1.0374e-04,
step_size: 0.0
weight_decay*p.float(): tensor([[ 2.3758e-04,  1.2268e-04, -1.3771e-04, ..., -2.8629e-05,

I was also running into this issue.
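(For anyone wanting to gather what prafullasd asked for: a couple of hypothetical print statements placed just above the failing line in adam_step, jukebox/utils/fp16.py line 29, will dump the dtypes. Only the variable names visible in the traceback are used below.)

```python
# Hypothetical debug prints for adam_step, inserted immediately before the
# failing `p.add(...)` call. Names are the ones visible in the traceback;
# any other locals in the real function can be printed the same way.
print(f"p: {p.dtype}, exp_avg: {exp_avg.dtype}, denom: {denom.dtype}")
update = exp_avg / denom + weight_decay * p.float()
print(f"update: {update.dtype}, step_size: {step_size} ({type(step_size).__name__})")
```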