
Question about training dataset #453

Open
Reinliu opened this issue May 23, 2022 · 2 comments

Comments

@Reinliu

Reinliu commented May 23, 2022

Hi, I have a question regarding the training of DDSP. Since this model has a harmonic + noise architecture, I would assume it could work on bird vocalisation datasets. I have collected a small dataset (around 1400 samples) of sounds from the same bird species, each 4 seconds in length, and I've listened to most of them, so there is no strange noise in the dataset. However, after I put them into the DDSP model as in the Colab demo train_autoencoder, the generated sounds are not very good. The spectral loss and total loss quickly go below 5 in my case, and I've trained for 10,000 steps. Below is an example of the generated output:
WeChat Screenshot_20220523155104
WeChat Screenshot_20220523155113
I have tried modifying the conditioning parameters in the Colab demo, but it did not help. I'm not sure if I'm doing something wrong with the training or if it's a problem with the dataset. Could anyone offer any suggestions?

@jesseengel
Contributor

It seems like the CREPE model might be having trouble with the pitch detection. You probably want to turn off the automatic adjustments, as they aren't detecting any "notes" because the f0_confidence is so low. That will stop it from pitch shifting down, which it is currently doing.
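As a rough illustration of why the auto-adjust misbehaves on this data: a gate of this kind typically checks how many frames are confidently pitched before applying any shift. This is a minimal numpy sketch, not the actual Colab code; `should_auto_adjust` and the threshold/fraction values are illustrative assumptions:

```python
import numpy as np

def should_auto_adjust(f0_confidence, threshold=0.85, min_fraction=0.3):
    """Apply automatic pitch adjustment only if enough frames are
    confidently pitched; otherwise skip it (illustrative values)."""
    confident = np.asarray(f0_confidence) > threshold
    return bool(confident.mean() >= min_fraction)

# Bird clips are mostly silence with short chirps, so few frames pass
# the confidence threshold and the adjustment should be skipped.
conf = np.concatenate([np.zeros(80), np.full(20, 0.9)])
print(should_auto_adjust(conf))  # False
```

When the gate returns False, no pitch shift is applied, avoiding the spurious downward shift described above.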

It also looks like it might be outputting only noise, with very low harmonic amplitude (you can check this). If that's the case, it is likely because f0 is not being detected correctly during training, so the model learns to simply not use the harmonic synthesizer, which is an important part of bird sounds.
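One way to run that check, once you have the harmonic and noise audio streams from the model's outputs (how to extract them from the outputs dict varies by DDSP version, so that part is not shown here), is to compare their RMS energies. A minimal sketch, assuming you already hold the two signals as arrays:

```python
import numpy as np

def rms(x):
    """Root-mean-square amplitude of an audio signal."""
    x = np.asarray(x, dtype=np.float64)
    return float(np.sqrt(np.mean(x ** 2)))

def harmonic_energy_ratio(harmonic_audio, noise_audio, eps=1e-8):
    """Fraction of total RMS energy coming from the harmonic synth.
    Values near 0 mean the model is relying almost entirely on noise."""
    h, n = rms(harmonic_audio), rms(noise_audio)
    return h / (h + n + eps)

# Toy check: a pure sine as 'harmonic' output vs. silence as 'noise'.
t = np.linspace(0, 1, 16000, endpoint=False)
print(harmonic_energy_ratio(0.5 * np.sin(2 * np.pi * 440 * t),
                            np.zeros_like(t)))  # ~1.0
```

If this ratio is near zero on your trained model's outputs, that confirms the harmonic synthesizer has been effectively switched off.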

@Reinliu
Author

Reinliu commented May 28, 2022

Hi Jesse, thank you for your message. Do you mean turning off the automatic adjustments during training? I took a look at the gin files but couldn't find any automatic adjustment settings there. From what I can see, the f0 confidence is zero when there is silence between bird chirps, while during chirps the confidence quickly rises from 0.2 up to 1, which makes sense, right?

My guess is that the onsets of the bird chirps differ across the dataset, so although the CREPE model is detecting some pitches, they occur at different time positions. This irregular timing and duration might have confused the autoencoder, so it tends to put more emphasis on the noise synth part instead. I listened to the NSynth dataset, and it seems every instrument note starts at the beginning of the clip and ends at the end, so it causes no trouble for the autoencoder. That's my intuition, although I'm not sure if this is actually what happens in training.
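The onset-variability hypothesis above is easy to quantify from the per-clip f0_confidence curves that the dataset pipeline already produces. A minimal sketch with synthetic confidence curves (`first_confident_frame` and the threshold are illustrative assumptions, not DDSP API):

```python
import numpy as np

def first_confident_frame(f0_confidence, threshold=0.5):
    """Index of the first frame whose CREPE confidence exceeds the
    threshold, i.e. an estimate of the chirp onset; -1 if none."""
    idx = np.flatnonzero(np.asarray(f0_confidence) > threshold)
    return int(idx[0]) if idx.size else -1

# Two synthetic 4 s clips at a 250 Hz frame rate (1000 frames each),
# with chirps starting at different times.
clip_a = np.concatenate([np.zeros(100), np.full(900, 0.9)])
clip_b = np.concatenate([np.zeros(600), np.full(400, 0.9)])
onsets = [first_confident_frame(c) for c in (clip_a, clip_b)]
print(onsets, np.std(onsets))  # [100, 600] 250.0
```

A large standard deviation of onset frames across the dataset would support the idea that chirps occur at inconsistent positions, unlike NSynth notes that all start at the beginning of the clip.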
