Rendering in Stereo? #131

Open
FlexCouncil opened this issue Jul 31, 2020 · 6 comments

@FlexCouncil

FlexCouncil commented Jul 31, 2020

Does anybody know how to make stereo renders with Jukebox? I tried changing a hyperparameter (hps.channels = 2), but the model expects a tensor with 1 in the second (channel) dimension, implying mono:

AssertionError: Expected (1, 1, 831872) got torch.Size([1, 2, 831872])

Full traceback:

<ipython-input-12-988948e1e679> in <module>()
     15   duration = (int(sample_hps.prompt_length_in_seconds*hps.sr)//top_prior.raw_to_tokens)*top_prior.raw_to_tokens
     16   x = load_prompts(audio_files, duration, hps)
---> 17   zs = top_prior.encode(x, start_level=0, end_level=len(priors), bs_chunks=x.shape[0])
     18   zs = _sample(zs, labels, sampling_kwargs, [None, None, top_prior], [2], hps)
     19 else:

5 frames

/usr/local/lib/python3.6/dist-packages/jukebox/prior/prior.py in encode(self, x, start_level, end_level, bs_chunks)
    218         # Get latents
    219         with t.no_grad():
--> 220             zs = self.encoder(x, start_level=start_level, end_level=end_level, bs_chunks=bs_chunks)
    221         return zs
    222 

/usr/local/lib/python3.6/dist-packages/jukebox/vqvae/vqvae.py in encode(self, x, start_level, end_level, bs_chunks)
    139         zs_list = []
    140         for x_i in x_chunks:
--> 141             zs_i = self._encode(x_i, start_level=start_level, end_level=end_level)
    142             zs_list.append(zs_i)
    143         zs = [t.cat(zs_level_list, dim=0) for zs_level_list in zip(*zs_list)]

/usr/local/lib/python3.6/dist-packages/jukebox/vqvae/vqvae.py in _encode(self, x, start_level, end_level)
    130         for level in range(self.levels):
    131             encoder = self.encoders[level]
--> 132             x_out = encoder(x_in)
    133             xs.append(x_out[-1])
    134         zs = self.bottleneck.encode(xs)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/jukebox/vqvae/encdec.py in forward(self, x)
     71         N, T = x.shape[0], x.shape[-1]
     72         emb = self.input_emb_width
---> 73         assert_shape(x, (N, emb, T))
     74         xs = []
     75 

/usr/local/lib/python3.6/dist-packages/jukebox/utils/torch_utils.py in assert_shape(x, exp_shape)
     23 
     24 def assert_shape(x, exp_shape):
---> 25     assert x.shape == exp_shape, f"Expected {exp_shape} got {x.shape}"
     26 
     27 def count_parameters(model):

AssertionError: Expected (1, 1, 831872) got torch.Size([1, 2, 831872])

The problem seemed to be the variable input_emb_width, which is hardcoded to 1 in make_vqvae. I tried changing it to 2, but then the pretrained checkpoint no longer loads:

size mismatch for encoders.0.level_blocks.0.model.0.0.weight: copying a param with shape torch.Size([64, 1, 4]) from checkpoint, the shape in current model is torch.Size([64, 2, 4]).
size mismatch for encoders.1.level_blocks.0.model.0.0.weight: copying a param with shape torch.Size([32, 1, 4]) from checkpoint, the shape in current model is torch.Size([32, 2, 4]).
size mismatch for encoders.2.level_blocks.0.model.0.0.weight: copying a param with shape torch.Size([32, 1, 4]) from checkpoint, the shape in current model is torch.Size([32, 2, 4]).
size mismatch for decoders.0.out.weight: copying a param with shape torch.Size([1, 64, 3]) from checkpoint, the shape in current model is torch.Size([2, 64, 3]).
size mismatch for decoders.0.out.bias: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([2]).
size mismatch for decoders.1.out.weight: copying a param with shape torch.Size([1, 64, 3]) from checkpoint, the shape in current model is torch.Size([2, 64, 3]).
size mismatch for decoders.1.out.bias: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([2]).
size mismatch for decoders.2.out.weight: copying a param with shape torch.Size([1, 64, 3]) from checkpoint, the shape in current model is torch.Size([2, 64, 3]).
size mismatch for decoders.2.out.bias: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([2]).
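
Inspecting the downloaded checkpoint directly shows the same thing (a quick check, assuming the vqvae.pth.tar download keeps its state dict under a "model" key, which is how the restore code appears to read it):

    import torch

    ckpt = torch.load("vqvae.pth.tar", map_location="cpu")
    w = ckpt["model"]["encoders.0.level_blocks.0.model.0.0.weight"]
    print(w.shape)  # torch.Size([64, 1, 4]) -- in_channels is 1, so mono is baked in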

So mono seems to be baked into the training. Since retraining is cost-prohibitive, is there any way around this at the inference stage? Jukebox would sound so much richer in stereo.

@Cortexelus

One way to make fake stereo during inference: make two upsampled versions of the final tier and pan one to each channel. They'll sound slightly different, and it'll probably be cool. Your bass should stay mono, though, to prevent phase cancellation and mud: below ~200 Hz, use only one of the two renders rather than summing both channels into mono. See the sketch below.
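
A minimal sketch of that crossover mix, assuming the two renders were exported as equal-length mono WAVs (take_a.wav and take_b.wav are hypothetical names) and that numpy, scipy, and soundfile are installed:

    import numpy as np
    import soundfile as sf
    from scipy.signal import butter, sosfiltfilt

    a, sr = sf.read("take_a.wav")  # becomes the left channel
    b, _ = sf.read("take_b.wav")   # becomes the right channel

    # 4th-order Butterworth crossover at ~200 Hz, zero-phase via sosfiltfilt
    lo = butter(4, 200, btype="lowpass", fs=sr, output="sos")
    hi = butter(4, 200, btype="highpass", fs=sr, output="sos")

    bass = sosfiltfilt(lo, a)         # mono bass taken from one render only
    left = sosfiltfilt(hi, a) + bass  # highs from take A + shared bass
    right = sosfiltfilt(hi, b) + bass # highs from take B + the same bass

    sf.write("fake_stereo.wav", np.stack([left, right], axis=1), sr)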

@FlexCouncil
Author

Great idea, and it worked! I had to re-run tiers 0 and 1 instead of just tier 0, but that provided about the right amount of variation; maybe even a third run for the mono bass would sound good. Another cool way to get around the stereo problem is to feed Jukebox a loop as a primer and then pan the sparser continuations left and right in a DAW. The double run looked roughly like the sketch below.
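
A sketch of the double upsampling run, assuming the standard Colab setup (zs already holds the sampled top-level codes; hps, labels, sampling_kwargs, upsamplers, and top_prior are loaded as usual); the seeds and folder names here are arbitrary:

    import torch as t
    from jukebox.sample import upsample

    for take, seed in [("take_a", 0), ("take_b", 1)]:
        t.manual_seed(seed)           # different seed -> a slightly different take
        hps.name = f"samples/{take}"  # each take renders into its own folder
        zs_take = [z.clone() for z in zs]  # protect the shared top-level codes
        upsample(zs_take, labels, sampling_kwargs, [*upsamplers, top_prior], hps)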

@michaelklachko

Wait, how is it possible to get a stereo effect this way? The two channels are supposed to carry spatial positioning information; if you just generate two slightly different variations of the same thing, it is nothing of the sort. Can you please post an example of what you made?


@FlexCouncil
Author

It's fake stereo, but I like that it's not what you'd hear in the real world. At some point Jukebox will be stereo and near-perfect, so I'm taking advantage of its current flaws while they last. I don't want to post an example at this point, but you can try it and see if it works for you.
