
bending the re/de-constructed melspectrogram to create new sounds. #4

Closed
johndpope opened this issue Nov 2, 2021 · 3 comments

johndpope commented Nov 2, 2021

ciaua/unagan#8

[Image: pitch activation matrix (frequency × time) extracted with CREPE]

Is it possible? I want to take the above visual and mash it around (change the shapes) to create new vocals...

UPDATE
Basically, I think I want to condition SpecVQGAN on these images (not on a video frame per se).

v-iashin (Owner) commented Nov 2, 2021

Hi there!

Let me try to rephrase the question to make sure I am on the same page. The visual is not a spectrogram, is it? It is an activation matrix (frequency × time). You would like to condition audio synthesis on this information and, hopefully, get audio with a similar activation matrix.

Do I understand it correctly?

johndpope (Author) commented:

"The visual is not a spectrogram, is it?" - that's right.
"It is an activation matrix (frequency × time)." - yes.
I spat this out from a song using the CREPE library by @marl: https://github.com/marl/crepe
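A minimal sketch of how such an activation matrix can be produced and plotted with CREPE ('vocals.wav' is a placeholder path):

```python
import crepe
import matplotlib.pyplot as plt
from scipy.io import wavfile

sr, audio = wavfile.read('vocals.wav')  # placeholder input file

# activation has shape (n_frames, 360): one row per 10 ms frame,
# one column per 20-cent pitch bin.
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)

plt.imshow(activation.T, origin='lower', aspect='auto')  # frequency x time
plt.xlabel('time frame')
plt.ylabel('pitch bin')
plt.show()
```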
The voice is an amazing instrument, but pitch only really moves in four ways: straight, up, down, or in a zigzag pattern.
I want to play with this (using, say, Photoshop) to create new melodies (I don't need intelligible lyrics).
"Do I understand it correctly?" - yes.

v-iashin (Owner) commented Nov 2, 2021

Ok, I see. This seems quite interesting.

I think I saw something similar before: https://magenta.tensorflow.org/music-vae – it is more like a MIDI player.
Maybe this will also be useful: https://sonycslparis.github.io/interactive-spectrogram-inpainting/

Regarding the Spectrogram VQGAN: I don't think this image (the activation matrix) is a good choice as an input here, because you would need to quantize (encode) it into a sequence of codes for the transformer to use as a prime. That would require training another VQGAN just to reconstruct these activation matrices.

What you can do instead is assume that at each time step you have only one frequency. Check out the visual: most of the time there is a single activation per time step. With this, you can simply take the sequence of frequencies and train the transformer to generate audio given this list, as in the sketch below. Maybe you can also add a class (style: male/female) to the conditioning to stylize the output.
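A minimal sketch of that one-frequency-per-step assumption, again using CREPE (the 0.5 confidence threshold is an arbitrary choice):

```python
import numpy as np
import crepe
from scipy.io import wavfile

sr, audio = wavfile.read('vocals.wav')  # placeholder input file
# `frequency` already collapses the activation matrix to the single
# strongest pitch per 10 ms time step.
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)

# Equivalent view on the raw matrix: keep only the strongest bin per frame.
strongest_bin = activation.argmax(axis=1)  # (n_frames,)

# Zero out frames where the model is not confident, so silent or unvoiced
# regions do not contribute spurious frequencies.
freq_sequence = np.where(confidence > 0.5, frequency, 0.0)
```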

For this idea you will need:

  1. a dataset of speech segments with pitch/frequency annotations;
  2. to train the SpecVQGAN to reconstruct the speech spectrograms;
  3. to train the transformer using the encoded frequency annotations (a list of frequencies) as the conditioning (see the sketch after this list);
  4. to build a player that does what you want: a Photoshop-like tool that transforms the user's drawn input into a list of frequencies, which is then plugged in as the input to the transformer.
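To make step 3 a bit more concrete, here is a sketch of turning the frequency list into discrete conditioning tokens for the transformer. The vocabulary size, frequency range, and bin layout are all arbitrary assumptions; SpecVQGAN's actual conditioning code is not reproduced here.

```python
import numpy as np

N_PITCH_TOKENS = 128       # hypothetical conditioning vocabulary size
FMIN, FMAX = 55.0, 1760.0  # assumed vocal range (A1..A6)

def freq_to_token(freq_hz):
    """Map Hz -> integer token on a log-frequency grid; 0 = unvoiced."""
    freq_hz = np.asarray(freq_hz, dtype=float)
    tokens = np.zeros(freq_hz.shape, dtype=np.int64)
    voiced = freq_hz > 0
    logf = np.log2(np.clip(freq_hz[voiced], FMIN, FMAX) / FMIN)
    span = np.log2(FMAX / FMIN)
    tokens[voiced] = 1 + (logf / span * (N_PITCH_TOKENS - 2)).astype(np.int64)
    return tokens

# e.g. the `freq_sequence` from the earlier sketch, or a hand-drawn melody
prime = freq_to_token(np.array([0.0, 220.0, 246.9, 261.6, 0.0]))
# `prime` would then be prepended to the spectrogram-code sequence the
# transformer generates, analogous to priming on encoded visual features.
```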

johndpope referenced this issue in openai/jukebox Nov 6, 2021
Repository owner locked and limited conversation to collaborators Nov 9, 2021
v-iashin closed this as completed Nov 9, 2021

This issue was moved to a discussion.

You can continue the conversation there.
