DIS-Vector - An Effective Low-Resource, Zero-Shot Approach for Controllable, End-to-End Voice Conversion and Cloning
Welcome to the DIS-Vector project! This repository presents an advanced low-resource, zero-shot voice conversion and cloning model that leverages disentangled embeddings, clustering techniques, and language-based similarity matching to achieve highly natural and controllable voice synthesis.
The DIS-Vector model introduces a novel approach to voice conversion by disentangling speech components (content, pitch, rhythm, and timbre) into separate embedding spaces, enabling fine-grained control over voice synthesis. Unlike traditional voice conversion models, DIS-Vector is capable of zero-shot voice cloning, meaning it can synthesize voices from unseen speakers and languages without requiring large-scale speaker-specific training data.
- Overview
- Dis-Vector Model Details
- Datasets
- Evaluation
- Clustering & Language Matching
- Results
- MOS Score Analysis
- Conclusion
The Dis-Vector model represents a significant advancement in voice conversion and synthesis by employing disentangled embeddings and clustering methodologies to precisely capture and transfer speaker characteristics. It introduces a novel language-based similarity approach and K-Means clustering for efficient speaker retrieval and closest language matching during inference.
- Disentangled Embeddings: Separate encoders for content, pitch, rhythm, and timbre.
- Zero-Shot Capabilities: Effective voice cloning and conversion across different languages.
- High-Quality Synthesis: Enhanced accuracy and flexibility in voice cloning.
- K-Means Clustering: Optimized speaker embedding retrieval for inference.
- Language-Based Similarity Matching: Determines the closest match from the embedding database to improve synthesis quality.
Explore our live demo here, showcasing the capabilities of the Dis-Vector model! This interactive demo lets you experience the voice conversion and synthesis features in real time. You can listen to synthesized audio samples generated by the model, highlighting its ability to accurately replicate and transform speaker characteristics across various languages and voices.
The Dis-Vector model consists of several key components that work together to achieve effective voice conversion and synthesis:
- Architecture: The model employs a multi-encoder architecture, with dedicated encoders for each feature type (a minimal code sketch follows this list):
  - Content Encoder: Captures linguistic content and phonetic characteristics.
  - Pitch Encoder: Extracts pitch-related features to ensure accurate pitch reproduction.
  - Rhythm Encoder: Analyzes rhythmic patterns and timing to preserve the original speech flow.
  - Timbre Encoder: Captures the unique vocal qualities of the speaker, allowing for more natural-sounding outputs.
- Disentangled Embeddings: The model produces a 512-dimensional embedding vector, organized as follows:
  - 256 elements for content features
  - 128 elements for pitch features
  - 64 elements for rhythm features
  - 64 elements for timbre features
- Zero-Shot Capability: The Dis-Vector model demonstrates remarkable zero-shot performance, enabling voice cloning and conversion across different languages without needing extensive training data for each target voice.
- Feature Transfer: The model facilitates the transfer of individual features from the source voice to the target voice, allowing for customizable voice synthesis while retaining the essence of the original speech.
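Below is a minimal PyTorch sketch of this multi-encoder design. Only the 256/128/64/64 split is taken from the description above; the layer types, input features, and sizes are illustrative assumptions rather than the actual implementation.

```python
import torch
import torch.nn as nn

class DisVectorEncoder(nn.Module):
    """Illustrative multi-encoder producing a 512-dim disentangled embedding.
    Layer choices are placeholders; only the 256/128/64/64 split follows the description."""
    def __init__(self, n_mels=80):
        super().__init__()
        # One lightweight GRU per speech component (architectures are assumptions).
        self.content_enc = nn.GRU(n_mels, 256, batch_first=True)
        self.pitch_enc = nn.GRU(1, 128, batch_first=True)      # takes an F0 contour
        self.rhythm_enc = nn.GRU(n_mels, 64, batch_first=True)
        self.timbre_enc = nn.GRU(n_mels, 64, batch_first=True)

    def forward(self, mel, f0):
        # Use each encoder's final hidden state as a fixed-length summary.
        _, c = self.content_enc(mel)
        _, p = self.pitch_enc(f0.unsqueeze(-1))
        _, r = self.rhythm_enc(mel)
        _, t = self.timbre_enc(mel)
        # 256 (content) + 128 (pitch) + 64 (rhythm) + 64 (timbre) = 512
        return torch.cat([c[-1], p[-1], r[-1], t[-1]], dim=-1)

enc = DisVectorEncoder()
emb = enc(torch.randn(2, 200, 80), torch.randn(2, 200))             # (batch, 512)
content, pitch, rhythm, timbre = emb.split([256, 128, 64, 64], -1)  # recover the components
```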
Integrating VITS with DIS-Vector enhances its capabilities by leveraging disentangled embeddings of speech components (content, pitch, rhythm, and timbre). DIS-Vector provides fine-grained control over these components, enabling high-quality voice conversion and zero-shot voice cloning. This integration empowers VITS to generate speech in new voices, adapting to different speakers and languages without the need for speaker-specific training data, offering more flexibility and realism in synthetic speech generation.
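As a rough illustration of this conditioning, the sketch below uses a hypothetical `VoiceDecoder` as a stand-in for the VITS decoder (its interface is invented for illustration and is not the real VITS API). Zero-shot cloning then amounts to swapping only the timbre slice of the source embedding with the target speaker's before decoding.

```python
import torch
import torch.nn as nn

class VoiceDecoder(nn.Module):
    """Hypothetical stand-in for the VITS decoder, conditioned on a 512-dim Dis-Vector embedding."""
    def __init__(self, n_mels=80, frames=200):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(),
                                 nn.Linear(1024, n_mels * frames))
        self.n_mels, self.frames = n_mels, frames

    def forward(self, emb):
        # Map the conditioning embedding to a fixed-length mel-spectrogram (toy example).
        return self.net(emb).view(-1, self.frames, self.n_mels)

def clone_voice(decoder, source_emb, target_emb):
    """Keep source content/pitch/rhythm, take timbre from the target speaker."""
    src = source_emb.split([256, 128, 64, 64], dim=-1)
    tgt = target_emb.split([256, 128, 64, 64], dim=-1)
    mixed = torch.cat([src[0], src[1], src[2], tgt[3]], dim=-1)  # swap only the timbre slice
    return decoder(mixed)
```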
A speech signal s(t) is decomposed into four distinct components:
- C(t) (Content): Represents linguistic information.
- P(t) (Pitch): Corresponds to the fundamental frequency F_0.
- R(t) (Rhythm): Captures duration and timing patterns.
- T(t) (Timbre): Defines speaker identity characteristics.
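Viewed this way, synthesis amounts to recombining the four components, e.g. s(t) ≈ D(C(t), P(t), R(t), T(t)), where D denotes the decoder (VITS in this setup); the exact functional form shown here is only an illustrative shorthand.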
The following loss functions are utilized in the model:
- Mean Squared Error (MSE) Loss: Minimizes the difference between predicted and actual values for continuous speech components such as pitch and timbre, and is also applied as an overall reconstruction loss to ensure the model accurately reconstructs these components.
- Kullback-Leibler (KL) Divergence Loss: Measures the difference between two probability distributions, often used for speaker similarity matching and for ensuring that the embeddings align with the desired distributions.
- Disentanglement Loss: Ensures that the learned embeddings for each speech component (content, pitch, rhythm, timbre) remain distinct and non-interfering, contributing to the overall performance of the model.

To optimize the separation of speech components into distinct embedding spaces, the total loss function is defined as:

L_total = L_content + L_pitch + L_rhythm + L_timbre
Where:
- L_content ensures linguistic consistency.
- L_pitch preserves fundamental frequency information.
- L_rhythm maintains speech timing.
- L_timbre preserves speaker identity characteristics.
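As a rough PyTorch sketch of how these terms might be combined (the equal weighting, the standard-normal KL prior, and the cross-covariance form of the disentanglement term are assumptions, not details taken from the model description):

```python
import itertools
import torch
import torch.nn.functional as F

def cross_cov_penalty(a, b):
    """Squared cross-covariance between two embedding groups (an assumed disentanglement term)."""
    a = a - a.mean(dim=0)
    b = b - b.mean(dim=0)
    return (a.T @ b / a.shape[0]).pow(2).mean()

def dis_vector_loss(pred, target, emb, mu, logvar, weights=(1.0, 1.0, 1.0)):
    """Illustrative combination of the three loss types; weights and exact forms are assumptions."""
    w_mse, w_kl, w_dis = weights

    # 1. MSE reconstruction loss over continuous targets (pitch, timbre, overall reconstruction).
    l_mse = F.mse_loss(pred, target)

    # 2. KL divergence pulling the embedding posterior toward a standard normal prior.
    l_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    # 3. Disentanglement loss: keep content / pitch / rhythm / timbre sub-embeddings uncorrelated.
    parts = emb.split([256, 128, 64, 64], dim=-1)
    l_dis = sum(cross_cov_penalty(a, b) for a, b in itertools.combinations(parts, 2))

    return w_mse * l_mse + w_kl * l_kl + w_dis * l_dis
```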
- Features recordings from speakers of various Indian languages (English, Hindi, Kannada, Telugu, Bengali).
- Approximately 1 hour of speech data with around 400 utterances per language.
- Includes recordings from multiple speakers with different accents and regional variations.
- Utilizes data from 6 male and 2 female speakers, each providing approximately 1 hour of speech.
Quantitative analysis measures the performance of the Dis-Vector model using distance metrics and statistical measures.
- Pitch Testing: Evaluates pitch variations using Pitch Error Rate (PER).
- Rhythm Testing: Assesses rhythmic patterns with Rhythm Error Rate (RER).
- Timbre Testing: Analyzes vocal qualities using Timbre Error Rate (TER).
- Content Testing: Ensures content accuracy using Content Preservation Rate (CPR).
- Cosine Similarity: Evaluates feature transfer and voice synthesis.
- Similarity scores for pitch, rhythm, timbre, and content help measure synthesis accuracy.
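A small sketch of the cosine-similarity part of this evaluation, assuming 512-dimensional Dis-Vector embeddings with the 256/128/64/64 layout described earlier (function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

SPLIT = [256, 128, 64, 64]                       # content, pitch, rhythm, timbre
NAMES = ["content", "pitch", "rhythm", "timbre"]

def component_similarities(emb_a, emb_b):
    """Per-component cosine similarity between two 512-dim Dis-Vector embeddings."""
    return {
        name: F.cosine_similarity(a, b, dim=-1).item()
        for name, a, b in zip(NAMES, emb_a.split(SPLIT, -1), emb_b.split(SPLIT, -1))
    }

# e.g. compare a synthesized utterance's embedding against the target speaker's embedding
print(component_similarities(torch.randn(512), torch.randn(512)))
```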
Dis-Vector utilizes a language-annotated speaker embedding database, where each speaker is mapped to a distinct feature representation based on their timbre and prosody characteristics. To enable efficient cross-speaker and cross-language voice conversion, we apply K-Means clustering on these high-dimensional embeddings. This clustering process helps to:
- Group speakers based on intrinsic vocal attributes such as pitch, intonation, and articulation patterns.
- Enable zero-shot voice conversion by leveraging cluster-based matching, even for unseen speakers.
- Assign cluster centroids as representative embeddings, allowing the system to select the closest match for synthesis.
- Improve generalization and adaptation by ensuring robust speaker variation capture while maintaining speaker identity.
By organizing the embedding space into well-defined clusters, Dis-Vector ensures a more structured and interpretable representation of speaker embeddings, enhancing the quality and accuracy of voice conversion.
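A minimal scikit-learn sketch of this clustering step; the number of clusters and the embedding file name are placeholders, not values from the model description:

```python
import numpy as np
from sklearn.cluster import KMeans

# (num_speakers, 512) matrix of Dis-Vector speaker embeddings; loading path is a placeholder
speaker_embeddings = np.load("speaker_embeddings.npy")

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(speaker_embeddings)

cluster_ids = kmeans.labels_           # cluster assignment per speaker
centroids = kmeans.cluster_centers_    # representative embedding per cluster
```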
During inference, the model selects the most suitable speaker embedding by computing cosine similarity between the target speaker's embedding and the pre-clustered speaker embeddings in the database. This method prioritizes selecting a linguistically similar speaker, leading to:
- Better prosody preservation, as speakers from the same linguistic background share similar pitch and rhythm structures.
- Accurate voice adaptation, ensuring that even when a target speaker's language is unseen during training, the system can infer the best match.
- Efficient feature transfer, allowing for natural-sounding synthesis without distorting speaker identity.
The language-based similarity approach refines the voice conversion process by focusing on both speaker similarity and linguistic consistency, ensuring the most natural and high-quality voice generation.
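The selection step can be sketched as a simple cosine-similarity search over the speaker database (the function name and array shapes are assumptions):

```python
import numpy as np

def select_reference_speaker(target_emb, speaker_embeddings, speaker_ids):
    """Pick the database speaker whose embedding is closest (by cosine similarity) to the target."""
    norms = np.linalg.norm(speaker_embeddings, axis=1) * np.linalg.norm(target_emb)
    sims = speaker_embeddings @ target_emb / np.clip(norms, 1e-8, None)
    best = int(np.argmax(sims))
    return speaker_ids[best], float(sims[best])
```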
To further enhance cross-lingual voice adaptation, Dis-Vector integrates a nearest language matching strategy. Given a target speaker's embedding, the system performs the following steps:
- Determine the closest linguistic cluster by measuring the embedding distance to pre-computed cluster centroids.
- Apply a threshold-based similarity measure to ensure the closest linguistic match is selected.
- If a direct match is unavailable, the system chooses a linguistically nearest neighbor based on phonetic and prosody similarities.
This technique ensures:
- Minimal loss in speech naturalness by selecting speakers with the most similar phonetic structures.
- Improved speaker adaptation, even in cases where the target speaker's language is underrepresented in the dataset.
- Scalability for zero-shot voice conversion, allowing seamless expansion with new speakers and languages.
By leveraging this clustering-based framework, Dis-Vector significantly improves the accuracy and efficiency of voice conversion in multilingual and low-resource language settings, making it a robust solution for global voice synthesis applications.
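A hedged sketch of the threshold-plus-fallback logic; the threshold value and the use of the pitch/rhythm sub-embeddings as a proxy for phonetic and prosody similarity are assumptions:

```python
import numpy as np

SIM_THRESHOLD = 0.75              # assumed cutoff; not specified in the model description
PROSODY_SLICE = slice(256, 448)   # pitch (128) + rhythm (64) dims in the 256/128/64/64 layout

def cosine(matrix, vector):
    return matrix @ vector / (np.linalg.norm(matrix, axis=-1) * np.linalg.norm(vector) + 1e-8)

def match_language_cluster(target_emb, centroids, cluster_langs):
    """Closest-language selection with a threshold check and a prosody-only fallback."""
    sims = cosine(centroids, target_emb)
    best = int(np.argmax(sims))
    if sims[best] >= SIM_THRESHOLD:
        return cluster_langs[best]
    # No confident direct match: compare only the pitch/rhythm sub-embeddings instead.
    sims = cosine(centroids[:, PROSODY_SLICE], target_emb[PROSODY_SLICE])
    return cluster_langs[int(np.argmax(sims))]
```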
The results of our evaluation showcase the efficacy of the Dis-Vector model compared to traditional models.
| Source Language (Gender) | Target Language (Gender) | MOS Score |
|---|---|---|
| English Male | English Female | 3.8 |
| Hindi Female | Hindi Male | 3.7 |
| Source Language (Gender) | Target Language (Gender) | MOS Score |
|---|---|---|
| English Male | Hindi Female | 3.9 |
| Hindi Female | Telugu Male | 3.7 |
| Source Lang. | Target Lang. | MOS (LIMMITS Baseline) | MOS (DIS-Vector) |
|---|---|---|---|
| English | English Female | 3.5 | 3.9 |
| Hindi | Hindi Female | 3.4 | 3.7 |
| Language (Gender) | SpeechSplit2 MOS Score | DIS-Vector MOS Score |
|---|---|---|
| English Male | 3.4 | 3.8 |
| English Female | 3.5 | 3.9 |
The Dis-Vector model's zero-shot capabilities, enhanced by clustering and similarity-based speaker retrieval, enable effective voice cloning and conversion across different languages. It sets a new benchmark for high-quality, customizable voice synthesis.
For more details, refer to our documentation!