---
layout: post
title: "Theory of Deep Learning: Generative Models"
tags:
mathjax: true
---
Till now, in this series based on the ICML 2018 tutorial on "Toward a Theory for Deep Learning" by Prof. Sanjeev Arora, we have limited our discussion to the theory of supervised discriminative neural models, i.e., those models which learn the conditional probability $P(y \vert X)$ of a label $y$ given an input $X$.
We now turn our attention towards the theory of unsupervised learning and generative models, with special emphasis on variational autoencoders and generative adversarial networks (GANs). But first, what is unsupervised learning?
The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.
Evidently, unsupervised learning is much more abstract than its supervised counterpart. In the latter, our objective was essentially to find a function that approximates the mapping from inputs to outputs underlying the data distribution; here, there are no outputs at all.
Why is learning structure important? Creating large annotated datasets is an expensive task, and may even be infeasible for some problems, such as parsing, which require significant domain knowledge. Consider the well-studied problem of image classification. The largest dataset for this problem, ImageNet, contains 14 million images with about 20,000 distinct output labels. However, the number of images freely available online far exceeds 14 million, which means that we can probably learn something from them as well. This kind of transfer learning is the most important motivation for unsupervised learning.
For instance, while training a machine translation model, obtaining a parallel corpus may be difficult, but we always have access to monolingual text corpora in different languages. If we then try to learn some underlying structure present in these languages, it can assist the downstream translation task. In fact, recent advances in [transfer learning for NLP]({% post_url 2018-06-15-transfer-learning-nlp %}) have empirically proven that huge performance gains are possible using such a technique.
Representation learning is perhaps the most widely studied aspect of unsupervised learning. A "good representation" often means one which disentangles factors of variation, i.e., each coordinate in the representation corresponds to one meaningful factor of variation. For example, if we consider word embeddings, an ideal vector representing a word would depict a different feature of the word along each dimension. However, this is easier said than done, since learning representations requires an objective function, and it is still unknown how to translate these notions of "good representation" into training criteria. For this reason, representation learning is often criticized for getting too much attention for transfer learning. The essence of the criticism, taken from this post by Ferenc Huszár, is this:
If we identified transfer learning as the primary task representation learning is supposed to solve, are we actually sure that representation learning is the way to solve it? One can argue that there may be many ways to transfer information from some dataset over to a novel task. Learning a representation and transferring that is just one approach. Meta-learning, for example, might provide another approach.
In the discussion so far, we have blindly assumed that the data indeed contains structures that can be learnt. This is not an oversight; it is actually based on the manifold assumption which we will discuss next.
A manifold is a topological space that locally resembles Euclidean space near each point.
This means that globally, a manifold may not be a Euclidean space. The only requirement for an $n$-dimensional manifold is that each of its points has a neighborhood which is homeomorphic to a Euclidean space. Let us unpack the terms used in this requirement:

- A neighborhood of a point $p$ in $X$ is a set $V \subset X$ which contains an open set $U$ containing $p$, i.e., $p$ must be in the interior of $V$.
- A function $f: X \rightarrow Y$ between two topological spaces $X$ and $Y$ is called a homeomorphism if it has the following properties:
    - $f$ is a bijection,
    - $f$ is continuous,
    - $f^{-1}$ is continuous.
- A Euclidean space is a topological space such that
    - it is in 2 or 3 dimensions and obeys the Euclidean postulates, or
    - it is in any dimension such that points are given by coordinates and distances satisfy the Euclidean metric.
Note that the dimension of a manifold may not always be the same as the dimension of the space in which the manifold is embedded. Dimension here simply means the degrees of freedom of the underlying process that generated the manifold. As such, lines and curves, even if embedded in two- or three-dimensional space, are one-dimensional manifolds.
With this definition in place, we can now state the manifold assumption. It hypothesizes that the intrinsic dimensionality of the data is much smaller than that of the ambient space in which the data is embedded. This means that if we have some data in a high-dimensional space $\mathbb{R}^n$, it actually lies on (or near) a manifold of dimension $d \ll n$ embedded in that space.
It is easy to see that the manifold assumption is, as the name suggests, just an assumption, and does not hold universally. Otherwise, applying the assumption consecutively, we would be able to represent any high-dimensional data using a one-dimensional manifold, which, of course, is not possible.
The task of manifold learning is modeled as approximating the joint probability density of the data, which is concentrated near this low-dimensional manifold. Deep models are particularly well suited to this task, for two reasons.
- Deep models promote reuse of features. We have already seen in the previous post that depth is analogous to composition whereas width is analogous to addition, and composition offers more representational capability than addition for the same number of parameters.
- Deep models are conjectured to lead to progressively more abstract features at higher levels of representation. An example of this is the commonly known phenomenon in training deep convolutional networks on image data, where the first few layers learn lines, blobs, and other local features, while higher layers learn more abstract features. The pooling mechanism makes this abstraction explicit.
Deep learning models often face some flak for being purely intuition-based. Variational autoencoders (VAEs) are the practitioner's answer to such criticisms, since they are rooted in the theory of Bayesian inference and also perform well empirically. In this section, we will look at the theory behind VAEs.
First, we formalize the notion of the "code" that we mentioned earlier using the concept of a latent variable. These are variables that are not directly observed but are inferred from the observable variables. For instance, if the model is drawing a picture of an MNIST digit, it would make sense to first have a variable choose a digit from $0$ to $9$, and only then draw strokes consistent with that choice; the chosen digit is a latent variable.
Formally, suppose we have a vector of latent variables $z$ in some high-dimensional space $\mathcal{Z}$, which we can sample according to a probability density function $P(z)$ defined over $\mathcal{Z}$.
Now, since we have no idea how to check whether randomly generated images are "like" our dataset, we use the notion of maximum likelihood: if the model is likely to produce training set samples, then it is also likely to produce similar samples, and unlikely to produce dissimilar ones. With this assumption, we want to maximize the probability of each $X$ in the training set under the entire generative process, $$ P(X) = \int P(X \vert z) P(z) \, dz. $$
In VAEs, we usually have the following generative process:

- Sample $z$ from some known distribution.
- Feed $z$ into some parametrized function to get $X$.
- Tune the parameters of the function such that the generated $X$ resemble those in the dataset.
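The three steps above can be sketched in a few lines of numpy. This is only a toy illustration: `decoder` and its parameters `W`, `b` are stand-ins for a real neural network and its weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder(z, W, b):
    # Stand-in for the parametrized function (a neural net in practice):
    # maps a latent sample z to a data-space sample X.
    return np.tanh(z @ W + b)

# Step 1: sample z from a known distribution (standard normal).
z = rng.standard_normal((5, 2))          # 5 latent samples, 2-D latent space

# Step 2: feed z through the parametrized function to get X.
W = rng.standard_normal((2, 4))          # parameters we would tune in step 3
b = np.zeros(4)
X = decoder(z, W, b)                     # 5 generated samples in 4-D data space

print(X.shape)  # (5, 4)
```

Step 3, tuning `W` and `b`, is the subject of the rest of this section.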
In this process, two questions arise:
How do we define the latent variables $z$, i.e., decide what information they encode?

VAEs simply sample $z$ from a standard normal distribution $\mathcal{N}(0, I)$. This is sufficient because any distribution can be generated by mapping normally distributed variables through a sufficiently complicated function, and the parametrized function can learn such a mapping in its first few layers.
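A quick numerical illustration of this claim, in the spirit of the ring example from Doersch's VAE tutorial: pushing 2-D standard normal samples through a simple deterministic function yields a distribution concentrated on a ring, which no Gaussian density could produce.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample from a 2-D standard normal...
z = rng.standard_normal((100000, 2))

# ...and push it through the deterministic map g(z) = z/10 + z/||z||.
# The image distribution is concentrated on a thin ring of radius ~1.1.
norms = np.linalg.norm(z, axis=1, keepdims=True)
x = z / 10 + z / norms

radii = np.linalg.norm(x, axis=1)
print(radii.mean(), radii.std())   # mean ~1.13, std ~0.07
```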
How do we deal with the integral over $z$ in $P(X)$?

We need to understand that the space of $z$ values is huge: for most $z$, $P(X\vert z)$ will be nearly zero, so naively sampling $z$ from the prior gives a hopelessly inefficient estimate of $P(X)$. Instead, we introduce a new function $Q(z\vert X)$ which, given an $X$, produces a distribution over the $z$ values that are likely to have generated it. We then need to do two things:

- relate $P(X)$ with $\mathbb{E}_{z\sim Q}P(X\vert z)$, and
- estimate $\mathbb{E}_{z\sim Q}P(X\vert z)$.
For the first, we use the KL-divergence (that we saw in the previous post) between the probability distribution estimated by $Q$ and the true posterior $P(z\vert X)$, and perform some simple algebraic manipulation:
$$ \begin{align} & \mathcal{D}_{KL}[Q(z|X)||P(z|X)] = \mathbb{E}_{z\sim Q}[\log Q(z|X) - \log P(z|X)] \\ &= \mathbb{E}_{z\sim Q}\left[ \log Q(z|X) - \log \frac{P(X|z)P(z)}{P(X)} \right] \\ &= \mathbb{E}_{z\sim Q} [ \log Q(z|X) - \log P(X|z) - \log P(z) ] + \log P(X) \\ \Rightarrow \ & \log P(X) - \mathcal{D}_{KL}[Q(z|X)||P(z|X)] = \mathbb{E}_{z\sim Q}[\log P(X|z)] - \mathcal{D}_{KL}[Q(z|X)||P(z)] \end{align} $$
In the LHS of the above equation, we have an expression that we want to maximize: we want $\log P(X)$ to be high, and we want the (non-negative) KL divergence between $Q(z|X)$ and the true posterior $P(z|X)$ to be low. The RHS is therefore a tractable lower bound on $\log P(X)$ that we can optimize instead.
Now we are just left with finding some way to optimize the RHS of the equation. For this, we will have to choose some model for $Q(z|X)$. The usual choice is $Q(z|X) = \mathcal{N}(z \mid \mu(X), \Sigma(X))$, where $\mu$ and $\Sigma$ are deterministic functions learned from the data, and $\Sigma$ is constrained to be diagonal. With this choice, the second term on the RHS is a KL divergence between two Gaussians, which has a closed form.
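If $Q(z|X)$ is chosen to be Gaussian with diagonal covariance (the usual choice), its KL divergence against the prior $\mathcal{N}(0, I)$ reduces to $\frac{1}{2}\sum_i \left( \sigma_i^2 + \mu_i^2 - 1 - \log \sigma_i^2 \right)$. A minimal numpy sketch of this well-known closed form:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) )
    # = 0.5 * sum_i ( sigma_i^2 + mu_i^2 - 1 - log sigma_i^2 )
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

print(kl_to_standard_normal(np.zeros(3), np.zeros(3)))  # 0.0: Q equals the prior
print(kl_to_standard_normal(np.ones(3), np.zeros(3)))   # 1.5: Q shifted away
```

Parametrizing the log-variance rather than $\Sigma$ itself keeps the variance positive without any constraint, which is why implementations commonly do it this way.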
To estimate the first term on the RHS, we just compute the term for one sample of $z \sim Q(z|X)$ and treat it as an approximation of the expectation; this is exactly the kind of noisy estimate that stochastic gradient descent already handles.
This can be represented in the form of a feedforward network by the figure on the left below.
There is, however, a caveat. The network is not trainable using backpropagation because the red box is a stochastic step, which means that it is not differentiable. To solve this problem, we use the reparametrization trick as follows.
After this trick, we get the final network as shown on the right in the above figure. Furthermore, we must have $z = \mu(X) + \Sigma^{1/2}(X) \, \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$, so that $z$ is a deterministic (and hence differentiable) function of $\mu$ and $\Sigma$.
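A minimal sketch of the reparametrization trick, assuming a Gaussian $Q$ with diagonal covariance (so $\Sigma^{1/2}$ is just the element-wise standard deviation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(mu, sigma):
    # Reparametrization trick: instead of sampling z ~ N(mu, sigma^2)
    # directly (a stochastic node that blocks backpropagation), sample
    # eps ~ N(0, I) as an *input* to the network and compute z as a
    # deterministic function of mu and sigma, so gradients can flow
    # through both.
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

mu = np.array([2.0, -1.0])
sigma = np.array([0.5, 0.1])
z = np.stack([sample_z(mu, sigma) for _ in range(20000)])
print(z.mean(axis=0))   # ~[ 2.0, -1.0]
print(z.std(axis=0))    # ~[ 0.5,  0.1]
```

The sampled $z$ still has the desired distribution, but the randomness now enters only through $\epsilon$, which is treated as a fixed input during backpropagation.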
Although VAEs have strong theoretical support, they do not work very well in practice, especially in problems such as face generation. This is because the loss function used for training is a log-likelihood, which ultimately leads to fuzzy face images that each have a high match with several training images at once, rather than a sharp match with any single one.
GANs were proposed in 2014, and have become immensely popular in computer vision ever since. They are motivated by game theory, and I will not get into the details here since the tutorial by Ian Goodfellow is an excellent resource for the same.
Since the distribution learnt by the generator depends upon the discriminative process, an important issue with GANs is that of mode collapse. The problem is that since the discriminator only learns from a few samples, it may be unable to teach the generator to produce samples with sufficient diversity, and the generator's distribution collapses onto a few modes of the data distribution.
In this section, I will discuss three results from two important papers from Arora et al. which deal with mode collapse in GANs.
- Generalization and equilibrium in generative adversarial nets
- Do GANs learn the distribution? Some theory and empirics
For all our discussions in this section, we will consider the Wasserstein GAN objective instead of the usual minimax objective, since it is arguably more intuitive. It is as follows:
$$ J = \lvert \mathbb{E}_{x\sim \mathcal{P}_{real}}[D(x)] - \mathbb{E}_{x\sim \mathcal{P}_{synth}}[D(x)] \rvert, $$
where $D$ is the discriminator, $\mathcal{P}_{real}$ is the real data distribution, and $\mathcal{P}_{synth}$ is the distribution of the generator's samples. The discriminator tries to maximize $J$, while the generator tries to minimize it.
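For a fixed discriminator, this objective is straightforward to estimate from samples. A toy 1-D sketch, with `np.tanh` standing in for a trained (Lipschitz) discriminator network:

```python
import numpy as np

rng = np.random.default_rng(0)

def wgan_objective(D, real, synth):
    # | E_{x ~ P_real}[D(x)] - E_{x ~ P_synth}[D(x)] |, from samples
    return abs(D(real).mean() - D(synth).mean())

D = np.tanh   # toy 1-D "discriminator" (a Lipschitz neural net in practice)

real = rng.normal(1.0, 1.0, size=20000)    # data distribution
good = rng.normal(1.0, 1.0, size=20000)    # generator that matches it
bad = rng.normal(-1.0, 1.0, size=20000)    # generator that misses it

print(wgan_objective(D, real, good))   # near 0: D cannot tell them apart
print(wgan_objective(D, real, bad))    # large: D separates them easily
```

A good generator is one for which even the best discriminator cannot push $J$ far from zero.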
If the discriminator size is $n$, then there exists a generator supported on $\mathcal{O}(n\log n)$ images which wins against all possible discriminators.
This means that if we have a discriminator of size $n$, a generator that simply memorizes a set of $\mathcal{O}(n\log n)$ training images is already enough to fool it, even though such a generator has clearly not learned the true distribution.
Proof sketch: Suppose $\tilde{\mu}$ is the empirical distribution obtained by sampling $m$ images from the real distribution $\mu$. We will show that, with high probability, no discriminator of size $n$ can distinguish $\tilde{\mu}$ from $\mu$ by more than $\epsilon$.
In the paper, the authors have defined a notion of distance between two distributions, the neural net distance, which is the best value of the objective $J$ achievable by any discriminator in the given class.
Here, we don't really know the true error, but we can use this distance measure to the same effect. If the size of the discriminator is $n$, then $m = \mathcal{O}(n\log n / \epsilon^2)$ samples suffice to make the neural net distance between $\mu$ and $\tilde{\mu}$ at most $\epsilon$, as follows.
First we approximate the parameter space of the discriminator with a finite net: a set of representative parameter vectors $\nu$ such that every discriminator is close to one with parameters in the set. The number of such representatives is exponential in $n$, but finite.
For each discriminator $D_{\nu}$ in the net, a standard concentration (Hoeffding-type) bound gives

$$ P\left[ \lvert \mathbb{E}_{x\sim \mu}[D_{\nu}(x)] - \mathbb{E}_{x\sim \tilde{\mu}}[D_{\nu}(x)] \rvert \geq \frac{\epsilon}{4} \right] \leq 2\exp \left( -\frac{\epsilon^2 m}{2} \right). $$
Taking a union bound over all discriminators in the net, with high probability we have, for every $\nu$ in the net,

$$ \lvert \mathbb{E}_{x\sim \mu}[D_{\nu}(x)] - \mathbb{E}_{x\sim \tilde{\mu}}[D_{\nu}(x)] \rvert \leq \frac{\epsilon}{2}. $$
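We can sanity-check this kind of concentration numerically. A rough simulation with a fixed sigmoid "discriminator" and $\mu = \mathcal{N}(0, 1)$; note that the exact constants in the bound depend on assumptions about the range of $D$ that I am glossing over here, so the empirical failure frequency should simply come out below the bound.

```python
import numpy as np

rng = np.random.default_rng(0)

D = lambda x: 1 / (1 + np.exp(-x))   # fixed "discriminator" with range (0, 1)

# Ground-truth expectation E_{x~mu}[D(x)] for mu = N(0, 1), estimated once
# from a very large sample.
true_mean = D(rng.standard_normal(1_000_000)).mean()

eps, m, trials = 0.4, 50, 2000
deviations = np.array([
    abs(D(rng.standard_normal(m)).mean() - true_mean)  # |E_mu[D] - E_mu~[D]|
    for _ in range(trials)
])

freq = (deviations >= eps / 4).mean()   # empirical failure probability
bound = 2 * np.exp(-eps**2 * m / 2)     # the bound from the text
print(freq, bound)   # freq comes out (much) smaller than bound
```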
We can prove a similar upper bound for the discriminators not in the net, since each of them is close to some representative in the net. Combining the two bounds shows that no discriminator of size $n$ achieves objective more than $\epsilon$, which completes the proof.
For GANs to be successful, they must find an equilibrium in the G-D game where the generator wins. In the context of the minimax equation, this means that switching min and max in the objective should not change the value of the game. In the paper, the authors prove the following existence result for an approximate equilibrium.
If a generator net is able to generate a Gaussian distribution, then there exists an $\epsilon$-approximate equilibrium in which the generator has capacity $\mathcal{O}(n\log n / \epsilon^2)$.
The proof of this result relies on a classical result in statistics, which says that any probability distribution can be approximated arbitrarily well by an infinite mixture of Gaussians. For this, we just need to take the standard Gaussian input to the generator and map it to the appropriate mixture component; a finite sub-mixture of $\mathcal{O}(n\log n / \epsilon^2)$ components then suffices to fool every discriminator of size $n$ up to $\epsilon$.
We have already seen that GAN training can appear successful even when the generator has not learnt a good enough distribution, provided the discriminator is small. But suppose we take a really large discriminator and then train our GAN to a minimum. How do we then make sure that the generator distribution is good? It could well be the case that the generator has simply memorized the training data, in which case the discriminator cannot make a better guess than random. Researchers have proposed several qualitative checks to test this:
- Check the similarity of each generated image to the nearest image in the training set.
- If the seed formed by interpolating two seeds $s_1$ and $s_2$ that produce realistic images also produces realistic images, then the learnt distribution probably contains many realistic images.
- Check for semantically important directions in latent space, which cause predictable changes in the generated image.
We will now see a new empirical measure for the support size of the trained distribution, based on the Birthday Paradox.
The birthday paradox
In a room of just 23 people, there's a 50% chance of finding 2 people who share their birthday.
To see why, refer to this post. It is a simple problem of permutations and combinations, followed by using the approximation $1 - x \approx e^{-x}$ for small $x$.
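The exact probability is also easy to compute directly; a small sketch:

```python
import numpy as np

def p_shared_birthday(k, days=365):
    # P(at least two of k people share a birthday): complement of all k
    # birthdays being distinct,
    # 1 - (365/365) * (364/365) * ... * ((365-k+1)/365).
    return 1 - np.prod(1 - np.arange(k) / days)

print(p_shared_birthday(23))   # ~0.507: first crosses 50% at 23 people
print(p_shared_birthday(22))   # ~0.476: still below 50% at 22
```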
Since $23 \approx \sqrt{365}$, the paradox suggests a more general principle: a uniform distribution over $N$ items is likely to produce a duplicate within about $\sqrt{N}$ samples.
Suppose we have a probability distribution that is roughly uniform over a support of size $N$. Then a batch of about $\sqrt{N}$ samples drawn from it is quite likely to contain at least one pair of duplicates, i.e., a collision.
Now, suppose we have empirically found that the smallest batch size at which collisions regularly occur is $s$. Then, reversing the argument above, the support size of the distribution is roughly $s^2$.
This gives an upper bound on the support size of the distribution learned by the generator.
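A toy simulation of the test. Here, uniform sampling over `support_size` distinct items stands in for the generator's distribution over images, and exact equality of items stands in for the visual duplicate check used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def collision_rate(support_size, batch_size, trials=2000):
    # Fraction of batches containing at least one duplicate item.
    hits = 0
    for _ in range(trials):
        batch = rng.integers(0, support_size, size=batch_size)
        hits += len(np.unique(batch)) < batch_size
    return hits / trials

# With support N, batches of size ~sqrt(N) already collide at a constant
# rate, while much smaller batches almost never do. The batch size s at
# which collisions become common therefore estimates the support as ~s^2.
N = 400
rate_sqrt = collision_rate(N, batch_size=20)   # sqrt(N) samples
rate_small = collision_rate(N, batch_size=5)   # far fewer samples
print(rate_sqrt, rate_small)   # collisions common vs. rare
```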
Generative models are definitely very promising, especially with the recent interest in transfer learning with unsupervised pretraining. While I have tried to explain the recent insights into GANs as best as possible, it is not possible to cover every detail of the proofs in an overview post. Even so, I hope I have been able to at least give a flavor of how veterans in the field approach theoretical guarantees.