Rendering in Stereo? #131

Open
FlexCouncil opened this issue Jul 31, 2020 · 6 comments

@FlexCouncil

FlexCouncil commented Jul 31, 2020

Does anybody know how to make stereo renders with Jukebox? I tried changing a hyperparameter (hps.channels = 2), but the model expects a tensor with 1 in the second (channel) dimension, implying mono:

AssertionError: Expected (1, 1, 831872) got torch.Size([1, 2, 831872])

Full traceback:

<ipython-input-12-988948e1e679> in <module>()
     15   duration = (int(sample_hps.prompt_length_in_seconds*hps.sr)//top_prior.raw_to_tokens)*top_prior.raw_to_tokens
     16   x = load_prompts(audio_files, duration, hps)
---> 17   zs = top_prior.encode(x, start_level=0, end_level=len(priors), bs_chunks=x.shape[0])
     18   zs = _sample(zs, labels, sampling_kwargs, [None, None, top_prior], [2], hps)
     19 else:

5 frames

/usr/local/lib/python3.6/dist-packages/jukebox/prior/prior.py in encode(self, x, start_level, end_level, bs_chunks)
    218         # Get latents
    219         with t.no_grad():
--> 220             zs = self.encoder(x, start_level=start_level, end_level=end_level, bs_chunks=bs_chunks)
    221         return zs
    222 

/usr/local/lib/python3.6/dist-packages/jukebox/vqvae/vqvae.py in encode(self, x, start_level, end_level, bs_chunks)
    139         zs_list = []
    140         for x_i in x_chunks:
--> 141             zs_i = self._encode(x_i, start_level=start_level, end_level=end_level)
    142             zs_list.append(zs_i)
    143         zs = [t.cat(zs_level_list, dim=0) for zs_level_list in zip(*zs_list)]

/usr/local/lib/python3.6/dist-packages/jukebox/vqvae/vqvae.py in _encode(self, x, start_level, end_level)
    130         for level in range(self.levels):
    131             encoder = self.encoders[level]
--> 132             x_out = encoder(x_in)
    133             xs.append(x_out[-1])
    134         zs = self.bottleneck.encode(xs)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/jukebox/vqvae/encdec.py in forward(self, x)
     71         N, T = x.shape[0], x.shape[-1]
     72         emb = self.input_emb_width
---> 73         assert_shape(x, (N, emb, T))
     74         xs = []
     75 

/usr/local/lib/python3.6/dist-packages/jukebox/utils/torch_utils.py in assert_shape(x, exp_shape)
     23 
     24 def assert_shape(x, exp_shape):
---> 25     assert x.shape == exp_shape, f"Expected {exp_shape} got {x.shape}"
     26 
     27 def count_parameters(model):

AssertionError: Expected (1, 1, 831872) got torch.Size([1, 2, 831872])

The problem seemed to be the variable input_emb_width, which is hardcoded to 1 in make_vqvae. I tried changing it to 2, but then the pretrained checkpoint no longer loads:

size mismatch for encoders.0.level_blocks.0.model.0.0.weight: copying a param with shape torch.Size([64, 1, 4]) from checkpoint, the shape in current model is torch.Size([64, 2, 4]).
size mismatch for encoders.1.level_blocks.0.model.0.0.weight: copying a param with shape torch.Size([32, 1, 4]) from checkpoint, the shape in current model is torch.Size([32, 2, 4]).
size mismatch for encoders.2.level_blocks.0.model.0.0.weight: copying a param with shape torch.Size([32, 1, 4]) from checkpoint, the shape in current model is torch.Size([32, 2, 4]).
size mismatch for decoders.0.out.weight: copying a param with shape torch.Size([1, 64, 3]) from checkpoint, the shape in current model is torch.Size([2, 64, 3]).
size mismatch for decoders.0.out.bias: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([2]).
size mismatch for decoders.1.out.weight: copying a param with shape torch.Size([1, 64, 3]) from checkpoint, the shape in current model is torch.Size([2, 64, 3]).
size mismatch for decoders.1.out.bias: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([2]).
size mismatch for decoders.2.out.weight: copying a param with shape torch.Size([1, 64, 3]) from checkpoint, the shape in current model is torch.Size([2, 64, 3]).
size mismatch for decoders.2.out.bias: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([2]).
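
Inspecting the downloaded checkpoint directly shows the same thing (a quick check, assuming the vqvae.pth.tar download keeps its state dict under a "model" key, which is how the restore code appears to read it):

    import torch

    ckpt = torch.load("vqvae.pth.tar", map_location="cpu")
    w = ckpt["model"]["encoders.0.level_blocks.0.model.0.0.weight"]
    print(w.shape)  # torch.Size([64, 1, 4]) -- in_channels is 1, so mono is baked in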

So mono seems to be baked into the training. Since retraining is cost-prohibitive, is there any way around this at the inference stage? Jukebox would sound so much richer in stereo.

@Cortexelus

One way to make fake stereo during inference: make two upsampled versions of the final tier and pan one to each channel. They'll sound slightly different, and it'll probably be cool. Your bass should stay mono, though, to prevent phase cancellation and mud: below ~200 Hz, use only one of the two renders rather than summing both channels into mono. See the sketch below.
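
A minimal sketch of that crossover mix, assuming the two renders were exported as equal-length mono WAVs (take_a.wav and take_b.wav are hypothetical names) and that numpy, scipy, and soundfile are installed:

    import numpy as np
    import soundfile as sf
    from scipy.signal import butter, sosfiltfilt

    a, sr = sf.read("take_a.wav")  # becomes the left channel
    b, _ = sf.read("take_b.wav")   # becomes the right channel

    # 4th-order Butterworth crossover at ~200 Hz, zero-phase via sosfiltfilt
    lo = butter(4, 200, btype="lowpass", fs=sr, output="sos")
    hi = butter(4, 200, btype="highpass", fs=sr, output="sos")

    bass = sosfiltfilt(lo, a)         # mono bass taken from one render only
    left = sosfiltfilt(hi, a) + bass  # highs from take A + shared bass
    right = sosfiltfilt(hi, b) + bass # highs from take B + the same bass

    sf.write("fake_stereo.wav", np.stack([left, right], axis=1), sr)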

@FlexCouncil
Author

Great idea, and it worked! I had to re-run tiers 0 and 1 instead of just tier 0, but that provided about the right amount of variation; maybe even a third run for the mono bass would sound good. Another cool way to get around the stereo problem is to feed Jukebox a loop as a primer and then pan the sparser continuations left and right in a DAW. The double run looked roughly like the sketch below.
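
A sketch of the double upsampling run, assuming the standard Colab setup (zs already holds the sampled top-level codes; hps, labels, sampling_kwargs, upsamplers, and top_prior are loaded as usual); the seeds and folder names here are arbitrary:

    import torch as t
    from jukebox.sample import upsample

    for take, seed in [("take_a", 0), ("take_b", 1)]:
        t.manual_seed(seed)           # different seed -> a slightly different take
        hps.name = f"samples/{take}"  # each take renders into its own folder
        zs_take = [z.clone() for z in zs]  # protect the shared top-level codes
        upsample(zs_take, labels, sampling_kwargs, [*upsamplers, top_prior], hps)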

@michaelklachko

Wait, how is it possible to get a stereo effect this way? The two channels are supposed to carry spatial positioning information; if you just generate two slightly different variations of the same thing, it is nothing of the sort. Can you please post an example of what you made?


@FlexCouncil
Author

It's fake stereo, but I like that it's not what you'd hear in the real world. At some point Jukebox will be stereo and near-perfect, so I'm taking advantage of its current flaws while they last. I don't want to post an example at this point, but you can try it and see if it works for you.
