Conversation

@blisc (Collaborator) commented Oct 7, 2025

No description provided.

@github-actions github-actions bot added the TTS label Oct 7, 2025
@blisc blisc requested a review from subhankar-ghosh October 7, 2025 16:41
@blisc blisc added the Run CICD label Oct 7, 2025
@blisc blisc requested a review from Copilot October 7, 2025 16:41
Copilot AI (Contributor) left a comment
Pull Request Overview

This PR introduces changes to support Spectral Codec training with text context in the Magpie-TTS model. The changes include modifications to codec model loading to disable loss modules during inference and variable reorganization to handle different codebook configurations.

  • Codec model loading enhancement to disable SCL loss during inference for memory optimization
  • Variable restructuring to distinguish between data and model codebook configurations
  • New multilingual configuration file for Magpie-TTS with text conditioning support

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Files reviewed:

  • nemo/collections/tts/models/magpietts.py: modified codec loading logic and reorganized codebook variables to support different training scenarios
  • examples/tts/conf/magpietts/magpietts_multilingual_v2_lhotse.yaml: added a new configuration file for multilingual Magpie-TTS with text conditioning


@blisc blisc requested a review from rlangman October 10, 2025 18:00
@rlangman (Collaborator) left a comment
Looks good to me.

Comment on lines +553 to +555
# @blisc: Added a +1. If we send in exactly 882 samples, a conv layer complains about padding;
# 883 samples works. This occurs when we use text context during inference.
context_audio = torch.zeros(self.codec_model_samples_per_frame + 1, dtype=torch.float32)
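To illustrate why a conv layer can reject an input sized exactly at the frame boundary, here is a minimal pure-Python sketch of the constraint behind reflect padding (a hypothetical `reflect_pad` helper, not the NeMo code): reflect padding of width `p` mirrors samples about the edges without repeating them, so it needs at least `p + 1` input samples. An input that is one sample too short after the codec's internal downsampling triggers exactly this kind of error, which is consistent with the "882 fails, 883 works" behavior noted in the comment above.

```python
def reflect_pad(samples, pad):
    """Reflect-pad a 1-D sequence by `pad` on each side (hypothetical helper).

    Reflection mirrors samples about the edges without repeating the edge
    value itself, so it needs at least pad + 1 input samples to work.
    """
    if len(samples) < pad + 1:
        raise ValueError(
            f"reflect padding of {pad} requires at least {pad + 1} samples, "
            f"got {len(samples)}"
        )
    left = samples[1:pad + 1][::-1]
    right = samples[-pad - 1:-1][::-1]
    return left + samples + right


# Works when the input is long enough:
print(reflect_pad([1, 2, 3, 4], 2))  # [3, 2, 1, 2, 3, 4, 3, 2]

# An input that is too short fails, analogous to the conv padding error
# described in the code comment above:
try:
    reflect_pad([1, 2], 2)
except ValueError as e:
    print(e)
```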
Comment from a Collaborator:

I am not sure there is a clean way to handle reflect/replicate padding when the input is short. We could have each codec architecture define a minimum length it can handle, and then have pad_audio zero-pad to at least that length: https://github.com/blisc/NeMo/blob/magpietts_2503/nemo/collections/tts/models/audio_codec.py#L453

@blisc (Author) replied:

I agree, we should do that.
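The suggested fix could be sketched roughly as follows. This is a hedged sketch in plain Python, not the actual NeMo implementation: the names `pad_audio` and `min_input_samples` are assumptions standing in for a per-codec minimum that each architecture would declare.

```python
def pad_audio(audio, samples_per_frame, min_input_samples=0):
    """Zero-pad 1-D audio to a whole number of codec frames and to a
    codec-defined minimum length (hypothetical sketch, not the NeMo API).

    `min_input_samples` stands in for a minimum that each codec
    architecture could declare, as suggested in the review comment.
    """
    # Round up to a whole number of codec frames (at least one frame).
    n_frames = max(1, -(-len(audio) // samples_per_frame))
    target = n_frames * samples_per_frame
    # If the codec needs a longer minimum input, pad further while
    # keeping the result frame-aligned.
    if target < min_input_samples:
        extra = -(-(min_input_samples - target) // samples_per_frame)
        target += extra * samples_per_frame
    return audio + [0.0] * (target - len(audio))
```

With a minimum like this, the `+ 1` workaround would no longer be needed: a context buffer of exactly one frame of samples would be padded up to the codec's declared minimum automatically.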

@blisc blisc merged commit 22be3f4 into NVIDIA-NeMo:magpietts_2508 Oct 21, 2025
63 checks passed
@blisc blisc deleted the magpietts_2508_jasondev0 branch October 21, 2025 18:16


3 participants