Training issue: AssertionError: Expected torch.Size #93

ObscuraDK · 2020-05-26T12:27:06Z

Hi there.

I am trying to train a VQ-VAE, but I end up with the following error message.
I have added the while statement to audio_utils.py, as described in #59.

0/96 [00:00<?, ?it/s]
Traceback (most recent call last):
File "jukebox/train.py", line 342, in <module>
fire.Fire(run)
File "/home/vertigo/miniconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 127, in Fire
component_trace = _Fire(component, args, context, name)
File "/home/vertigo/miniconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 366, in _Fire
component, remaining_args)
File "/home/vertigo/miniconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 542, in _CallCallable
result = fn(*varargs, **kwargs)
File "jukebox/train.py", line 325, in run
train_metrics = train(distributed_model, model, opt, shd, scalar, ema, logger, metrics, data_processor, hps)
File "jukebox/train.py", line 227, in train
x_out, loss, _metrics = model(x, **forw_kwargs)
File "/home/vertigo/miniconda3/envs/jukebox/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/vertigo/miniconda3/envs/jukebox/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 376, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/vertigo/miniconda3/envs/jukebox/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/vertigo/jukebox/jukebox/vqvae/vqvae.py", line 168, in forward
assert_shape(x_out, x_in.shape)
File "/home/vertigo/jukebox/jukebox/utils/torch_utils.py", line 25, in assert_shape
assert x.shape == exp_shape, f"Expected {exp_shape} got {x.shape}"
AssertionError: Expected torch.Size([4, 1, 130976]) got torch.Size([4, 1, 130816])

ObscuraDK · 2020-05-29T07:25:38Z

I have figured out that it's line 225 in train.py that fails when it runs this:
x_out, loss, _metrics = model(x, **forw_kwargs)

ObscuraDK · 2020-05-29T08:21:37Z

Solved: there was a typo in my sample length setting, plus a failure in the multi-GPU setup.

I am running with two 1080 Ti GPUs.

I ran this and restarted:
sudo nvidia-xconfig -sli=off -multigpu=off

And then ran this:
mpiexec -n 2 python jukebox/train.py --hps=small_vqvae --name=small_vqvae --sample_length=131072 --bs=4 \
--audio_files_dir={audio_files_dir} --labels=False --train --aug_shift --aug_blend
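
For anyone hitting the same assertion: below is a minimal sketch of why the sample length matters. It assumes the model downsamples the audio by a total factor of 256 and that any remainder is dropped on the way through. The thread does not state that factor; 256 is an assumption inferred from the traceback, where the output length 130816 equals 511 * 256. The names HOP_LENGTH and round_trip_length are hypothetical and are not part of jukebox.

```python
# Assumed total downsampling factor (not stated in the thread; inferred
# from the traceback, where the output length 130816 == 511 * 256).
HOP_LENGTH = 256


def round_trip_length(sample_length: int, hop_length: int = HOP_LENGTH) -> int:
    """Length after downsampling by hop_length and upsampling back.

    Any remainder that does not fill a whole hop is dropped.
    """
    return (sample_length // hop_length) * hop_length


if __name__ == "__main__":
    # 130976 is the input length from the traceback,
    # 131072 is the --sample_length used in the working command.
    for n in (130976, 131072):
        out = round_trip_length(n)
        status = "ok" if out == n else "mismatch"
        print(f"{n} -> {out} ({status})")
```

Under that assumption, 130976 comes back as 130816, which matches the shapes in the assertion, while 131072 (512 * 256) comes back unchanged. Picking a sample length that is a whole multiple of the downsampling factor avoids the mismatch.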

ObscuraDK closed this as completed May 29, 2020
