---
layout: post
title: "GBO notes: i-vectors and x-vectors"
tags:
mathjax: true
---
In this note, we will review the two most popular speaker embedding extraction methods, namely i-vectors and x-vectors. But first, it would be useful to quickly recap generative and discriminative models.
Suppose we have some observed variables $X$ and some target variables $Y$.
A generative model is a model of the joint probability $P(X, Y)$, while a discriminative model directly models the conditional probability $P(Y \mid X)$ of the targets given the observations.
Neural network models trained for classification tasks are discriminative models that predict logits which are then converted into a distribution over the targets using softmax.
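For concreteness, here is a minimal sketch of the softmax operation that turns logits into a distribution over targets (the logit values are made up for illustration):

```python
import numpy as np

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

logits = np.array([2.0, 0.5, -1.0])  # raw scores for 3 hypothetical classes
print(softmax(logits))               # non-negative, sums to 1
```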
We basically need a way to characterize the variability in acoustic features resulting from speaker and channel differences. Suppose the utterances are represented by a sequence of feature vectors (usually MFCCs). A naive approach would be to average over the whole sequence; but this would also capture things like lexical content, phonetic variation, and so on. What we actually want is to measure how the features of the target utterance differ from those of a general set of utterances, in the hope that this difference removes any effects arising from phonetic/lexical content.
This is done with the help of Gaussian mixture models (GMMs). First, we pool together a large set of utterances and compute their features. We then assume that all the feature vectors are generated i.i.d. from a single GMM with a fixed number of components (usually 2048). We use the EM algorithm to learn the parameters of this GMM, and call it the Universal Background Model (UBM).
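As an illustration, fitting a UBM amounts to running EM on the pooled features. A toy sketch using scikit-learn's `GaussianMixture` (with random data standing in for real MFCCs, and far fewer components than a real system):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Pooled frame-level features from many utterances: (num_frames, feat_dim).
# Random data here; a real system would pool MFCCs from hours of speech.
features = np.random.randn(10000, 20)

# Real UBMs use ~2048 diagonal-covariance components; 16 keeps this toy fast.
ubm = GaussianMixture(n_components=16, covariance_type="diag", max_iter=50)
ubm.fit(features)  # runs the EM algorithm
```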
Suppose we now concatenate the means of all the UBM components into a giant vector (called a "supervector"), and denote it as $\mathbf{m}$. The i-vector model assumes that the supervector $\mathbf{M}$ for a given utterance is generated as

$$ \mathbf{M} = \mathbf{m} + \mathbf{T}\mathbf{w}, $$

where $\mathbf{m}$ is the speaker- and channel-independent UBM supervector, $\mathbf{T}$ is a low-rank "total variability" matrix whose columns span the directions of speaker and channel variability, and $\mathbf{w}$ is a latent vector with a standard normal prior, $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
The parameters of the model (i.e., the matrix $\mathbf{T}$) are estimated with the EM algorithm on the background set.
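To make the generative story concrete, here is a small sketch of how an utterance supervector would be produced under this model (toy dimensions, random placeholder parameters):

```python
import numpy as np

C, F, R = 16, 20, 10            # num. components, feature dim, i-vector dim (toy)
m = np.random.randn(C * F)      # UBM supervector (placeholder values)
T = np.random.randn(C * F, R)   # low-rank total variability matrix
w = np.random.randn(R)          # latent vector drawn from N(0, I)

M = m + T @ w                   # utterance-specific supervector
```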
At inference time, the i-vector of an utterance is computed as the MAP point estimate of the latent variable $\mathbf{w}$ given the utterance's features and the model parameters. Under the Gaussian assumptions this is the posterior mean, which has the closed form

$$ \mathbf{w}^\ast = \left( \mathbf{I} + \mathbf{T}^\top \boldsymbol{\Sigma}^{-1} \mathbf{N} \mathbf{T} \right)^{-1} \mathbf{T}^\top \boldsymbol{\Sigma}^{-1} \tilde{\mathbf{F}}, $$

where $\mathbf{N}$ and $\tilde{\mathbf{F}}$ are the zeroth-order and centered first-order Baum-Welch statistics of the utterance with respect to the UBM, and $\boldsymbol{\Sigma}$ is the (block-diagonal) covariance matrix of the UBM.
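Assuming diagonal UBM covariances (so $\boldsymbol{\Sigma}^{-1}$ can be stored as a vector), the extraction formula above takes only a few lines; `extract_ivector` below is a hypothetical helper sketching the computation, not a reference implementation:

```python
import numpy as np

def extract_ivector(T, Sigma_inv, N, F_tilde):
    """Posterior-mean (MAP) estimate of w under M = m + Tw.

    T         : (C*F, R) total variability matrix
    Sigma_inv : (C*F,) inverse diagonal UBM covariances, stacked
    N         : (C*F,) zeroth-order stats, each component count repeated F times
    F_tilde   : (C*F,) centered first-order stats
    """
    R = T.shape[1]
    TtSi = T.T * Sigma_inv                    # T' Sigma^{-1}
    precision = np.eye(R) + (TtSi * N) @ T    # I + T' Sigma^{-1} N T
    return np.linalg.solve(precision, TtSi @ F_tilde)
```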
While i-vectors are mathematically attractive and useful in the sense that they do not need a labeled training set, they are computationally expensive. Training the UBM and the total variability matrix is time-consuming, and since the model captures the total variability (i.e., both speaker and channel), it is hard to tease apart the speaker-specific information.
Deep neural network based speaker embeddings have become popular as a discriminatively-trained alternative to i-vectors. Given a sequence of feature vectors, we use a neural network to obtain an utterance-level representation, which is then fed into a softmax layer that estimates a distribution over a large set of speakers. The parameters of the network are trained using gradients of a classification-style loss.
Conventionally, the x-vector neural network consists of TDNN layers (essentially 1-D dilated convolutions) at the bottom, which operate at the frame level, followed by a statistics pooling layer. This layer aggregates over the whole input sequence and computes its mean and standard deviation. These utterance-level statistics are concatenated and passed to further hidden layers that operate at the segment level, producing the final representation that is fed to the softmax.
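A minimal PyTorch sketch of such an architecture is given below. The layer sizes roughly follow the commonly cited recipe (five frame-level layers, a 1500-dimensional pre-pooling output), but the exact dimensions, dilations, and layer names here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class XVectorNet(nn.Module):
    def __init__(self, feat_dim=30, embed_dim=512, num_speakers=1000):
        super().__init__()
        # Frame-level TDNN layers: 1-D convolutions with increasing dilation.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        # Segment-level layers after statistics pooling (mean + std -> 3000).
        self.segment1 = nn.Linear(2 * 1500, embed_dim)
        self.segment2 = nn.Linear(embed_dim, embed_dim)
        self.output = nn.Linear(embed_dim, num_speakers)

    def forward(self, x):
        # x: (batch, feat_dim, num_frames)
        h = self.frame_layers(x)
        # Statistics pooling: mean and std over the time axis.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        embedding = self.segment1(stats)           # the "x-vector"
        h = torch.relu(self.segment2(torch.relu(embedding)))
        return self.output(h)                      # logits over training speakers
```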
At the time of inference, the output layer is discarded and the segment-level representation immediately preceding it is used as the embedding, also called an x-vector.
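With the sketch above, embedding extraction would simply run the frame-level and pooling computation and stop at the first segment-level layer (again assuming the hypothetical layer names from before):

```python
net = XVectorNet()
net.eval()
feats = torch.randn(1, 30, 300)  # one utterance: 300 frames of 30-dim features
with torch.no_grad():
    h = net.frame_layers(feats)
    stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
    xvector = net.segment1(stats)  # (1, 512) speaker embedding
```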