This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
bending the re/de-constructed melspectrogram to create new sounds. #4
Comments
Hi there! Let me try to rephrase the question to make sure I am on the same page. The visual is not a spectrogram, is it? It is an activation matrix (frequency × time). You would like to condition audio synthesis on this information and, hopefully, get audio with a similar activation matrix. Do I understand it correctly?
The visual is not a spectrogram, is it? - that's right.
OK, I see. This seems quite interesting. I think I saw something similar before: https://magenta.tensorflow.org/music-vae, though it is more like a MIDI player. Regarding the Spectrogram VQGAN: I don't think this image (activation matrix) is a good choice as an input here, because you would need to quantize (encode) it as a sequence of codes that the transformer uses as a prime. This would require training another VQGAN to reconstruct these activation matrices. What you can do instead is assume that each time step has only one frequency. Check out the visual: most of the time there is one activation per time step. With this, you can simply take the sequence of frequencies and train the transformer to generate audio given this list. Maybe you can also add a class (style: male/female) to this condition to stylize the output. For this idea you will need:
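As a rough illustration of the "one frequency per time step" reduction suggested above, here is a minimal NumPy sketch. It is not code from the SpecVQGAN repository: the function name, the silence threshold, and the `-1` silence token are all assumptions made for the example, and how the resulting sequence is fed to the transformer is left open.

```python
import numpy as np


def activation_to_frequency_sequence(activation, silence_threshold=0.0, silence_token=-1):
    """Reduce a (frequency x time) activation matrix to one frequency bin per time step.

    Hypothetical helper: the threshold and the silence token are illustrative choices.
    """
    activation = np.asarray(activation, dtype=float)
    if activation.ndim != 2:
        raise ValueError("expected a 2-D (frequency x time) matrix")

    # Index of the strongest frequency bin in every time column.
    strongest_bin = activation.argmax(axis=0)
    # Strength of that bin, used to detect columns with no activation at all.
    strongest_value = activation.max(axis=0)

    # Columns whose peak does not exceed the threshold are marked as silence.
    return np.where(strongest_value > silence_threshold, strongest_bin, silence_token)


# Toy matrix: 3 frequency bins (rows) x 4 time steps (columns).
toy = np.array([
    [0.0, 0.0, 0.9, 0.0],
    [0.8, 0.0, 0.1, 0.0],
    [0.1, 0.0, 0.0, 0.7],
])

sequence = activation_to_frequency_sequence(toy)
print(sequence.tolist())  # [1, -1, 0, 2]
```

Reshaping the visual then amounts to editing this integer sequence (or the matrix before the reduction) and using the edited sequence as the condition.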
ciaua/unagan#8
Is it possible? I want to take the above visual and mash it around (change the shapes) to create new vocals...
UPDATE
Basically, I think I want to condition the SpecVQGAN on these images (not on a video frame per se).