New ML yaml + changes to allow for Spectral Codec training with text context #14894
Signed-off-by: Jason <[email protected]>
Pull Request Overview
This PR introduces changes to support Spectral Codec training with text context in the Magpie-TTS model. The changes include modifications to codec model loading to disable loss modules during inference and variable reorganization to handle different codebook configurations.
- Codec model loading enhancement to disable SCL loss during inference for memory optimization
- Variable restructuring to distinguish between data and model codebook configurations
- New multilingual configuration file for Magpie-TTS with text conditioning support
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| nemo/collections/tts/models/magpietts.py | Modified codec loading logic and reorganized codebook variables to support different training scenarios |
| examples/tts/conf/magpietts/magpietts_multilingual_v2_lhotse.yaml | Added new configuration file for multilingual Magpie-TTS with text conditioning |
Signed-off-by: Jason <[email protected]>
Signed-off-by: blisc <[email protected]>
rlangman left a comment
Looks good to me.
# @blisc: Added a +1. If we send in exactly 882 samples, then a conv layer complains about padding.
# Adding 883 works. This occurs when we use text context during inference.
context_audio = torch.zeros(self.codec_model_samples_per_frame + 1, dtype=torch.float32)
I am not sure there is a clean way to handle reflect/replicate padding when the input is short. We could have each codec architecture define the minimum length it can handle, and then have pad_audio zero-pad to at least that length: https://github.com/blisc/NeMo/blob/magpietts_2503/nemo/collections/tts/models/audio_codec.py#L453
I agree, we should do that.
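A rough sketch of what that follow-up could look like, covering only the length arithmetic. Everything here is an assumption made for illustration: the helper name `zero_pad_amount`, the `min_samples` parameter, and the idea that padding also rounds up to a whole number of codec frames are not taken from the PR or from the current `pad_audio` code. The value 883 is simply the length reported to work in the diff comment above, not a verified minimum for any codec.

```python
def zero_pad_amount(num_samples: int, samples_per_frame: int, min_samples: int = 0) -> int:
    """Return how many zeros to append to the end of an audio signal.

    Hypothetical helper: the padded length is at least `min_samples` and is
    rounded up to a whole number of frames of `samples_per_frame` samples.
    """
    if num_samples < 0 or min_samples < 0:
        raise ValueError("sample counts must be non-negative")
    if samples_per_frame <= 0:
        raise ValueError("samples_per_frame must be positive")

    # Enforce the architecture-specific minimum first.
    target = max(num_samples, min_samples)
    # Ceiling division: number of frames needed to cover `target` samples.
    num_frames = -(-target // samples_per_frame)
    return num_frames * samples_per_frame - num_samples


# Example with the numbers from the diff comment (882 samples per frame,
# and an assumed minimum of 883 samples for this codec).
pad = zero_pad_amount(num_samples=882, samples_per_frame=882, min_samples=883)
print(pad)
```

With PyTorch, the returned amount could then be applied to the last dimension with `torch.nn.functional.pad(audio, (0, pad))`, which zero-pads by default. Keeping the minimum as a per-architecture value would let callers such as the text-context path pass exactly one frame of silence without needing the `+ 1` workaround.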