8.2 Variational Autoencoders
The use of a recognition distribution for every observation,
The moments,
Replacing the moments of the recognition distribution with deep neural networks is the point of departure for variational autoencoders [27, 41]. Note well that they are not limited to Gaussian recognition distributions (Eq. 8.12). However, we start with this case because it is a natural extension of the approach to sparse coding just discussed; it will likewise make use of the identities Eqs. 8.3 and 8.4. We subsequently generalize.
The M step.
How does this change to the parameterization of the recognition model alter the optimization?
Since the recognition distribution is still normal, we can still use Eq. 8.1 for the joint relative entropy, and Eqs. 8.3 and 8.4 for its gradients.
Beginning with the parameters
Likewise, for the covariance, we follow Eq. 8.4, although for notational simplicity we write the derivative with respect to only a single parameter,
Computing the Hessian
Now, for sufficiently expressive generative models, the conditional expectations under the recognition distribution will in general be difficult to carry out (although cf. Eqs. 8.10 and 8.11). Instead we might replace the conditional expectation with a sample average, since it is trivial to draw samples from Eq. 8.12. For example, the generative model could be yet more normal distributions,
$$
p(\bm{x};\bm{\theta}) = \mathcal{N}\!\left(\bm{\mu}_{\bm{x}},\, \bm{\Sigma}_{\bm{x}}\right), \qquad
p(\bm{y}\mid\bm{x};\bm{\theta}) = \mathcal{N}\!\left(\bm{\mu}_{\bm{\theta}}(\bm{x}),\, \bm{\Sigma}_{\bm{\theta}}(\bm{x})\right)
\tag{8.15}
$$
but with the mean and covariance of the emission, like the recognition distribution, flexible parameterized functions—indeed, neural networks—operating on their “inputs,” in this case
with the emission energy
To optimize these parameters, we can descend these gradients—but we defer discussion of precisely how this is carried out until after deriving the E step.
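To make this parameterization concrete, here is a minimal sketch of such "moment networks" in PyTorch. The class names, layer sizes, and the choices of a diagonal recognition covariance and an isotropic emission covariance are illustrative assumptions, not prescriptions of the text.

```python
import torch
import torch.nn as nn

class RecognitionNet(nn.Module):
    """Maps an observation y to the mean and (diagonal) covariance of the
    recognition distribution over the latent variables x."""
    def __init__(self, dim_y, dim_x, dim_h=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(dim_y, dim_h), nn.ReLU())
        self.mean = nn.Linear(dim_h, dim_x)
        self.log_var = nn.Linear(dim_h, dim_x)   # log-variance keeps the variance positive

    def forward(self, y):
        h = self.trunk(y)
        return self.mean(h), self.log_var(h).exp()

class EmissionNet(nn.Module):
    """Maps a latent sample x to the mean and (isotropic) covariance of the
    Gaussian emission (cf. Eq. 8.15)."""
    def __init__(self, dim_x, dim_y, dim_h=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(dim_x, dim_h), nn.ReLU())
        self.mean = nn.Linear(dim_h, dim_y)
        self.log_var = nn.Parameter(torch.zeros(1))  # one shared variance parameter

    def forward(self, x):
        h = self.trunk(x)
        return self.mean(h), self.log_var.exp()
```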
The E step.
Stochastic backpropagation.
Computing the gradients on the left-hand sides of Eqs. 8.16 and 8.17 evidently requires the gradients appearing on the right-hand sides.
The gradients of
This suggests that the E and M steps be simultaneous rather than alternating.
In particular, a set of
The gradient is therefore evaluated by initializing a backward pass through the emission energy Eq. 8.18 at the samples
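In code, one such forward-then-backward pass might be organized as follows. The recognition moments (mean_x, var_x) are assumed to have been computed by a network like the one sketched above, and emission_net is assumed to map a latent sample to the emission mean and variance; all names are illustrative.

```python
import torch

def emission_energy(y, mean_y, var_y):
    """Gaussian emission energy, i.e. -log p(y|x) up to additive constants."""
    return 0.5 * (((y - mean_y) ** 2) / var_y + torch.log(var_y)).sum()

def stochastic_backward(y, mean_x, var_x, emission_net):
    """Draw one sample from the recognition distribution, evaluate the emission
    energy at it, and backpropagate.  The generative-parameter gradients
    accumulate in emission_net's .grad fields; the gradient with respect to the
    sample itself (returned here) is one of the quantities appearing on the
    right-hand sides of Eqs. 8.16 and 8.17."""
    # detach(): no gradient flows to the recognition parameters through the
    # sample; those gradients come instead from the Gaussian identities.
    x = (mean_x + var_x.sqrt() * torch.randn_like(mean_x)).detach().requires_grad_(True)
    mean_y, var_y = emission_net(x)
    energy = emission_energy(y, mean_y, var_y)
    energy.backward()
    return x.grad
```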
A probabilistic autoencoder.
We observed above that this optimization procedure suggests thinking of the generative and recognition models together as a single, feedforward neural network (Fig. LABEL:fig:VAE). To emphasize this intuition, let us again reorganize the joint relative entropy:
The second term,
We have previously (Section 6.3.2) interpreted the first term as the sum of a “code cost,”
It is important to realize that we do not expect this relative entropy to vanish in the optimal model, which will embody a compromise between this regularizer and the reconstruction cost (cf. Section 6.3.2).
Instead, the joint relative entropy
This equation tells us something intuitive: Since the left-hand side is non-negative, so must the right-hand side be; or in other words, the optimal reconstruction cost never exceeds the entropy of the data distribution. (Returning to our previous example of a biased coin, we find that the optimal model should correctly identify heads-vs.-tails at least as frequently as the bias of the coin.)
In the simplest VAEs, the generative priors are not parameterized,
However, the recognition model makes additional demands on the generative prior: A simple calculation shows that if the recognition model,
In brief, for optimal generative and recognition models,
must hold. To understand this requirement, we yet again rewrite the joint relative entropy, Eq. 8.20, this time in terms of this relative entropy on “priors” [17] (Exercise LABEL:ex:JREELBOsurgery):
The penalty on information transmission prevents the reconstruction loss from being driven to zero, and vice versa. But the relative entropy in this equation—unlike the relative entropy in Eq. 8.20—does vanish in the optimal model.
8.2.1 Monte Carlo gradient estimators
In our discussion thus far of probabilistic autoencoders, we have considered only Gaussian generative (Eq. 8.15) and recognition (Eq. 8.12) models. To see what other possibilities are available, let us pause to appraise our situation.
Our aim throughout these last few chapters has been to fit models to observed, complex distributions of data,
However, fitting the joint distribution of observed and latent variables is not equivalent to fitting the marginal, since the former can be achieved in part simply by making inference under the model more deterministic.
Ideally, one would just additionally penalize deterministic posteriors, but computing
We therefore opted instead to separate the inferential and generative functions, introducing a second model,
Since we no longer need to invert the generative model, it can be almost arbitrarily complex.
But what about the recognition model,
where
In deciding how to carry out this derivative, we are concerned with the constraints that the usual desiderata for estimators—consistency, unbiasedness, low variance—impose on the class of tractable recognition distributions and on the dimensionality of the latent space. Up to this point, we have considered a single pair of estimators, Eqs. 8.16 and 8.17, based on the identities in Eqs. 8.13 and 8.14. But these identities hold only for normal recognition distributions.
The score-function gradient estimator.
Arguably the simplest and most general gradient estimator can be derived by passing the derivative directly into the integral defined by the expectation:
where we have used Leibniz’s rule on the first line. (This is generally licit for differentiable recognition distributions; for more precise conditions, see [33].)
The point of the second line is to turn the expression back into an expectation under
The score-function estimator is consistent and unbiased [33], but the variance scales poorly with latent dimension. This can be seen (e.g.) by considering the gradient estimator in the case of factorial recognition distributions:
Clearly, the variance of this estimator scales with the number of terms,
On the other hand, the score-function estimator makes only very weak assumptions about
Because the expected score is always zero (see Section B.2), this alteration does not change the expected value of the estimator, but can potentially lower its variance. More sophisticated control variates are also possible.
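For concreteness, here is a toy score-function estimator with a constant baseline serving as the control variate. The Gaussian recognition distribution and the cost f are purely illustrative.

```python
import numpy as np

def score_function_grad(f, phi, n_samples=10_000, baseline=0.0):
    """Estimate d/dphi E_{x ~ N(phi, I)}[f(x)] via the log-derivative trick.
    Subtracting a constant baseline leaves the mean of the estimator unchanged
    (the expected score is zero) but can reduce its variance."""
    rng = np.random.default_rng(0)
    x = phi + rng.standard_normal((n_samples, phi.size))  # samples from the recognition dist.
    score = x - phi                                        # grad_phi log N(x; phi, I)
    weights = f(x) - baseline
    return (weights[:, None] * score).mean(axis=0)

# Example: f(x) = sum_k x_k^2, whose true gradient at phi is 2*phi.
# A baseline near E[f(x)] is a common (crude) choice of control variate.
phi = np.array([1.0, -2.0, 0.5])
grad_hat = score_function_grad(lambda x: (x**2).sum(axis=1), phi, baseline=8.0)
```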
The pathwise gradient estimator.
Still, since we anticipate working with high-dimensional latent spaces, it behooves us to find an estimator whose variance is independent of this dimension. Here we exploit the fact that many continuous random variables with highly expressive distributions can be written as transformations of random variables with simpler, “base” distributions. This will allow us to separate parameter dependence (in the transformation) from the sampling procedure (for the base random variable). For example, any computationally tractable inverse CDF can be used to transform a uniformly distributed random variable into a random variable with the corresponding PDF:
Similarly, random variables with distributions parameterized by location and scale can frequently be expressed as affine functions of random variables with the “standard” version of the distribution:
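In code, the two constructions just described might look as follows; the exponential rate and the Gaussian location and scale are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inverse CDF: if U ~ Uniform(0,1), then -log(1 - U)/rate ~ Exponential(rate).
rate = 2.0
u = rng.uniform(size=10_000)
x_exp = -np.log1p(-u) / rate          # the parameter (rate) enters only via the transform

# Location-scale: if Z ~ N(0,1), then mu + sigma*Z ~ N(mu, sigma^2).
mu, sigma = 1.5, 0.3
z = rng.standard_normal(10_000)
x_norm = mu + sigma * z               # the parameters (mu, sigma) enter only via the transform
```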
More ambitiously, as long as the “path”
Indeed, the two preceding examples qualify as special cases of this technique. More subtle variations still on this theme allow for sampling from yet more distributions [27, 41, 33].
In all such cases, the function
Crucially, under this formulation, the gradient passes into the cost function itself. Therefore, dimensions of the latent space that have little effect on the cost are guaranteed to have little effect on the estimator itself—in contrast to the score-function estimator. This is what allows low-variance estimates to be made even in high-dimensional spaces. The computational cost also turns out to be identical to that of the score-function estimator [33].
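A minimal sketch of the resulting pathwise estimator, using automatic differentiation and the illustrative cost f(x) = Σ_k x_k²:

```python
import torch

mu = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)
log_sigma = torch.zeros(3, requires_grad=True)

eps = torch.randn(10_000, 3)                 # base samples: parameter-free
x = mu + log_sigma.exp() * eps               # the "path": parameters enter only here
cost = (x ** 2).sum(dim=1).mean()            # Monte Carlo estimate of E[f(x)]
cost.backward()                              # pathwise gradients; mu.grad is close to 2*mu
```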
Nomenclature.
This class of density estimator, with neural-network-parameterized generative and recognition models, has come to be known as the variational autoencoder [27]. We have seen how the model resembles the classical autoencoder. As for the term “variational,” it is something of a misnomer; we shall see shortly how that term entered the literature (Section 8.4).
The “pathwise gradient estimator” that made this model practical has for its part come to be known as the “reparameterization trick.” I have followed the terminology of Mohamed and colleagues [33].
8.2.2 Examples
With this model, we have made considerable progress toward the goal of generating high-quality samples from complex distributions. …
Independent Gaussian random variables
Let us first proceed with the example we have been considering so far, Eqs. 8.12 and 8.15, but for simplicity impose a few more restrictions. In particular, we insist that the generative covariance be isotropic, and the recognition covariance matrix be diagonal:
$$
\bm{\Sigma}_{\bm{\theta}}(\bm{x}) = \sigma^2\mathbf{I}, \qquad
\bm{\Sigma}_{\bm{\phi}}(\bm{y}) = \operatorname{diag}\!\left(\bm{\sigma}^2_{\bm{\phi}}(\bm{y})\right)
\tag{8.26}
$$
Then the emission energy, Eq. 8.18, and its derivatives simplify to
These gradients can be accumulated with backpropagation through the entire “autoencoder.” Parameters are then updated so as to descend the (joint) relative-entropy gradients, Eqs. 8.16, 8.17, and 8.19, which require the expectations of the above gradients under the recognition distribution. These can be approximated with sample averages. Notice that no pathwise gradient estimator is required.
We have avoided the pathwise gradient estimator by employing the identities Eqs. 8.3 and 8.4. However, it is useful to consider the more generic approach. In particular, we will simply derive the loss function, since in modern software packages, this is handed directly to an automatic differentiator. Let us take the perspective of an autoencoder, i.e. the last line in Eq. 8.20.
Anticipating what follows, we first calculate the log partition functions:
Then the KL regularizer is
In moving to the third line we twice used the identity Eq. B.13. Notice that this obviated the need for a sample average under the recognition distribution. Note also that the KL regularizer never depends on the generative-model parameters.
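Under these restrictions, and assuming (as in the simplest VAEs) an unparameterized standard-normal prior, the end result is the familiar closed form (the subscript $\bm{\phi}$ marking the recognition-model parameters is illustrative notation):

$$
\operatorname{KL}\!\left[\mathcal{N}\!\left(\bm{\mu}_{\bm{\phi}}(\bm{y}),\, \operatorname{diag}\!\left(\bm{\sigma}^2_{\bm{\phi}}(\bm{y})\right)\right) \,\middle\|\, \mathcal{N}(\bm{0},\mathbf{I})\right]
= \frac{1}{2}\sum_{k}\left(\sigma_{\phi,k}^2(\bm{y}) + \mu_{\phi,k}^2(\bm{y}) - 1 - \log \sigma_{\phi,k}^2(\bm{y})\right).
$$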
The reconstruction error is a cross entropy:
We have reparameterized the average on the last line in anticipation of differentiating with respect to the parameters of
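Putting the two terms together, here is a hedged sketch of the per-observation loss, assuming a standard-normal prior, the restrictions of Eq. 8.26, and moment networks like those sketched earlier; a single reparameterized sample stands in for the average on the last line.

```python
import torch

def vae_loss(y, recog_net, emis_net):
    """One-sample estimate of the autoencoder loss (up to terms independent of
    the parameters) for the diagonal/isotropic Gaussian case of Eq. 8.26."""
    mu_x, var_x = recog_net(y)                           # moments of the recognition dist.

    # KL regularizer against N(0, I), in closed form: no sample average needed,
    # and no dependence on the generative-model parameters.
    kl = 0.5 * (var_x + mu_x**2 - 1.0 - torch.log(var_x)).sum(dim=-1)

    # Reconstruction cross entropy: one reparameterized sample x = mu + sigma*eps.
    x = mu_x + var_x.sqrt() * torch.randn_like(mu_x)
    mu_y, var_y = emis_net(x)
    recon = 0.5 * ((y - mu_y)**2 / var_y + torch.log(var_y)).sum(dim=-1)

    return (kl + recon).mean()
```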
Discrete VAEs with the Gumbel-softmax trick
So far we have considered VAEs with Gaussian recognition distributions (Eq. 8.12), but what if we want discrete latent variables (as in e.g. the GMM)? As it stands, the pathwise gradient estimator will not work, because any function that discretizes will be flat almost everywhere. Any gradient that must pass through this function will be zero.
The most obvious workaround is to relax the discretization into some real-valued function, and we consider this first, although in a more subtle variation. Less obviously, we can employ a discretization that still passes real-valued information….
We begin simply with the relaxation itself and then bring in the VAE at the end.
Suppose we want to draw categorical samples—which we interpret as one-hot vectors,
then to draw a uniformly distributed random variable,
However, now suppose we want to differentiate through the sampling operation. The inverse cumulative mass function is flat almost everywhere, so the derivative gives us no information. Perhaps we could smooth the CDF, although it is not obvious how we ought to. (The two papers that proposed the Gumbel-softmax reparameterization do not even consider this possibility [32, 19].) Alternatively, perhaps there are other ways to sample from Eq. 8.27.
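For reference, the standard (non-differentiable) sampler just described, with the inverse cumulative mass function implemented as a search over the cumulative sums:

```python
import numpy as np

def sample_categorical_one_hot(pi, rng=np.random.default_rng(0)):
    """Draw a one-hot categorical sample by pushing a uniform variate through
    the (stepwise, hence flat-almost-everywhere) inverse CMF."""
    u = rng.uniform()                        # base sample
    k = np.searchsorted(np.cumsum(pi), u)    # first bin whose CMF value exceeds u
    one_hot = np.zeros_like(pi)
    one_hot[k] = 1.0
    return one_hot

sample_categorical_one_hot(np.array([0.2, 0.5, 0.3]))
```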
Gumbel perturbations.
Consider a set of
Equivalently, these random variables are produced by adding independent, zero-mean, Gumbel perturbations,
Distribution of the argmax.
If we know that
Then
To convert this conditional probability into the marginal probability that
In words, the probabilities of each Gumbel variate
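Whatever the algebraic details, the resulting recipe (perturb the log probabilities with independent Gumbel noise and take the argmax) is easy to check numerically; note that a common mean shift of the perturbations would not change the argmax.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.5, 0.3])
n = 100_000

g = -np.log(-np.log(rng.uniform(size=(n, pi.size))))   # standard Gumbel samples
k = np.argmax(np.log(pi) + g, axis=1)                  # perturb-and-argmax
print(np.bincount(k, minlength=pi.size) / n)           # approximately [0.2, 0.5, 0.3]
```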
Relaxing the categorical distribution.
Unfortunately, the final step of the sampling procedure—the argmax—is not differentiable, so it seems we have made no progress over the standard sampling technique.
On the other hand, it is more or less clear how to approximate the argmax—with a soft(arg)max, whose softness (
The softmax function is evidently differentiable:
For very soft functions (
If we interpret Eq. 8.29 not as an approximation but as the definition of
…
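Concretely, a minimal sampler for this relaxation might look as follows; the temperature value is illustrative.

```python
import numpy as np

def gumbel_softmax_sample(log_pi, temperature, rng=np.random.default_rng(0)):
    """Replace the hard argmax of the perturbed log-probabilities with a
    temperature-controlled softmax: differentiable, and nearly one-hot for
    small temperatures."""
    g = -np.log(-np.log(rng.uniform(size=log_pi.shape)))   # standard Gumbel noise
    z = (log_pi + g) / temperature
    z = z - z.max()                                        # for numerical stability
    e = np.exp(z)
    return e / e.sum()

gumbel_softmax_sample(np.log([0.2, 0.5, 0.3]), temperature=0.5)
```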
Using the Gumbel-softmax trick in a loss function.
Let the loss be an average over some function of
Now making the approximation,
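A hedged sketch of how such a relaxed loss can be differentiated in practice; the probabilities, per-category costs, and temperature are all illustrative.

```python
import torch
import torch.nn.functional as F

logits = torch.log(torch.tensor([0.2, 0.5, 0.3])).requires_grad_(True)
values = torch.tensor([1.0, -2.0, 3.0])                  # an arbitrary per-category cost
temperature = 0.5

g = -torch.log(-torch.log(torch.rand(10_000, 3)))        # Gumbel noise
x = F.softmax((logits + g) / temperature, dim=-1)        # relaxed, nearly one-hot samples
loss = (x * values).sum(dim=-1).mean()                   # average of the relaxed loss
loss.backward()   # logits.grad: a pathwise gradient of the relaxed (approximate) objective
```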