8.2 Variational Autoencoders
The use of a separate recognition distribution for every observation will clearly fail if the number of observations is too large. An alternative is to encourage the recognition model to use a single set of parameters for all data, by specifying a single, parameter-dependent mapping from the observations to the moments of the recognition distribution. For example, sticking with Gaussian recognition distributions, we can use
The moments, the mean and covariance, are now interpreted as mappings or functions rather than parameters. (It will now be slightly simpler to work with covariance rather than precision matrices.) For example, the functions can be deep neural networks, and so in some sense arbitrarily flexible (see Section 5.1.3). For sufficiently large networks, this flexibility could include (depending on the ratio of parameters to observations) the ability to store a separate mean and covariance matrix for every datum, and therefore to mimic Eq. 8.2, but ideally it does not. Limiting the flexibility of the functions encourages the computational cost of learning to be “amortized” across observations, and the recognition model to provide good inferences for new observations.
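For concreteness, here is a minimal sketch of such an amortized recognition model in PyTorch: a single network maps each observation to the mean and covariance of its Gaussian recognition distribution. The layer sizes, names, and the restriction to a diagonal covariance are illustrative assumptions, not anything prescribed here.

```python
# A minimal sketch of an amortized recognition model: one network maps each
# observation y to the moments of its Gaussian recognition distribution.
# Layer sizes and names are illustrative; the covariance is taken to be diagonal.
import torch
import torch.nn as nn

class RecognitionModel(nn.Module):
    def __init__(self, dim_y=784, dim_z=20, dim_hidden=400):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(dim_y, dim_hidden), nn.ReLU())
        self.mean = nn.Linear(dim_hidden, dim_z)       # mean as a function of y
        self.log_var = nn.Linear(dim_hidden, dim_z)    # log of the diagonal covariance

    def forward(self, y):
        h = self.trunk(y)
        return self.mean(h), self.log_var(h).exp()     # (mean, diagonal covariance)
```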
Replacing the moments of the recognition distribution with deep neural networks is the point of departure for variational autoencoders [27, 41]. Note well that they are not limited to Gaussian recognition distributions (Eq. 8.12). However, we start with this case because it is a natural extension of the approach to sparse coding just discussed; it will likewise make use of the identities Eqs. 8.3 and 8.4. We subsequently generalize.
The M step.
How does this change to the parameterization of the recognition model change the optimization? Since the recognition distribution is still normal, we can still use Eq. 8.1 for the joint relative entropy, and Eqs. 8.3 and 8.4 for its gradients. Beginning with the parameters of the mean function:
Likewise, for the covariance, we follow Eq. 8.4, although for notational simplicity we write the derivative with respect to only a single parameter,
Computing the Hessian naïvely is expensive, but there are clever, cheaper alternatives [41].
Now, for sufficiently expressive generative models, the conditional expectations under the recognition distribution will in general be difficult to carry out (although cf. Eqs. 8.10 and 8.11). Instead we might replace the conditional expectation with a sample average, since it is trivial to draw samples from Eq. 8.12. For example, the generative model could be yet more normal distributions,
$$p(\boldsymbol{z}) = \mathcal{N}\!\left(\boldsymbol{0},\ \mathbf{I}\right), \qquad p(\boldsymbol{y} \mid \boldsymbol{z}) = \mathcal{N}\!\left(\boldsymbol{\mu}(\boldsymbol{z}),\ \boldsymbol{\Sigma}(\boldsymbol{z})\right) \tag{8.15}$$
but with the mean and covariance of the emission, like those of the recognition distribution, flexible parameterized functions (indeed, neural networks) operating on their “inputs,” in this case the latent variables. Then Eqs. 8.13 and 8.14 simplify slightly [41]:
with the emission energy equal to the energy of the Gaussian distribution on the right in Eq. 8.15,
To optimize these parameters, we can descend these gradients, but we defer discussion of precisely how this is carried out until after deriving the E step.
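As a rough illustration of the Gaussian emission just described, the following sketch (again PyTorch, with illustrative names and sizes, and a diagonal covariance for simplicity) computes the emission energy, i.e., the negative log-probability of an observation under a Gaussian whose mean and covariance are produced by neural networks operating on the latent variables.

```python
import math
import torch
import torch.nn as nn

class GenerativeModel(nn.Module):
    """Gaussian emission whose mean and (diagonal) covariance are neural networks of z."""
    def __init__(self, dim_z=20, dim_y=784, dim_hidden=400):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(dim_z, dim_hidden), nn.ReLU())
        self.mean = nn.Linear(dim_hidden, dim_y)
        self.log_var = nn.Linear(dim_hidden, dim_y)

    def emission_energy(self, y, z):
        """Negative log-probability of y under a Gaussian with moments mean(z), var(z)."""
        h = self.trunk(z)
        mu, log_var = self.mean(h), self.log_var(h)
        return 0.5 * ((y - mu).pow(2) / log_var.exp()
                      + log_var + math.log(2 * math.pi)).sum(dim=-1)
```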
The E step.
Stochastic backpropagation.
Computing the gradients on the left-hand sides of Eqs. 8.16 and 8.17 evidently requires acquiring the gradients appearing on the right-hand sides. The gradients of the recognition mean and covariance functions, for their part, obviously pass backwards through the recognition neural networks. But the gradients of the emission energy with respect to the latent variables must pass backwards through the generative neural networks, the emission mean and covariance functions, for which the latent variables are an input. Along this backward-propagating computation, we will compute precisely the gradient quantities needed for the E step, Eq. 8.19.
This suggests that the E and M steps be simultaneous rather than alternating. In particular, a set of observations can be passed “upward” through the recognition distribution, Eq. 8.12, yielding sample latent vectors. (It might seem that we need multiple latent samples for every observation, and indeed we do need roughly quadratically more samples to estimate the joint distribution of observations and latent variables than to estimate the distribution of observations alone. But it is more efficient to draw that many distinct values of each variable than to pair many latent samples with each of a smaller set of observations; see Fig. LABEL:fig:XXX.) These samples are then passed on through the generative model to produce values for the emission mean and covariance. Notice that this entire process looks like a forward pass through a single neural network that maps each observation to itself, i.e., an autoencoder; or, more precisely, to a mean and variance for that sample, encoding the network’s “best guess” for that sample and its (inverse) confidence in that guess.
The gradient is therefore evaluated by initializing a backward pass through the emission energy, Eq. 8.18, at the sampled latent vectors, working backwards through the emission mean and covariance networks toward their “inputs.” The gradient at each layer of these neural networks is evaluated at the values computed under the preceding forward pass. Averaging these gradients over all samples yields the right-hand side of Eq. 8.19, the “E-step” gradients. But rather than stop at the parameters of the first layer, the gradient computation is continued into the inputs themselves, the latent samples. The chain of backward computations then proceeds through the parameters of the recognition distribution according to Eqs. 8.16 and 8.17, finally terminating at the first layers of the recognition mean and covariance networks. Averaging these gradients over the samples yields the “M-step” gradients for updating the recognition distribution.
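Schematically, one training step then amounts to a single forward and backward pass through the composed “autoencoder.” The sketch below is only illustrative: it assumes the two hypothetical modules above, omits the KL regularizer discussed next, and, for simplicity, draws the latent sample with the pathwise reparameterization introduced later (Section 8.2.1) rather than applying the Gaussian identities Eqs. 8.3 and 8.4 directly.

```python
import torch

def training_step(recognition, generative, y, optimizer):
    # Forward pass: observations -> recognition moments -> one latent sample each
    # -> emission energy (the "reconstruction" part of the loss).
    mu_z, var_z = recognition(y)
    z = mu_z + var_z.sqrt() * torch.randn_like(mu_z)   # reparameterized sample of z
    loss = generative.emission_energy(y, z).mean()     # KL regularizer omitted here

    # Backward pass: gradients flow through the generative networks, into z, and
    # on into the recognition networks; one optimizer step updates all parameters.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A single optimizer constructed over the concatenated parameter lists of the two modules, e.g. `torch.optim.Adam(list(recognition.parameters()) + list(generative.parameters()))`, then updates the generative and recognition parameters simultaneously, as described above.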
A probabilistic autoencoder.
We observed above that this optimization procedure suggests thinking of the generative and recognition models together as a single, feedforward neural network (Fig. LABEL:fig:VAE). To emphasize this intuition, let us again reorganize the joint relative entropy:
The second term can be interpreted as a reconstruction cost (see Section 6.3.2): the number of bits (if a base-2 logarithm is used) required to encode the observations under the generative model, given latent states inferred with the recognition model. It measures the “lossiness” of the autoencoder. By itself, this term enforces a discriminative rather than a generative cost. Note that, without further constraint, the reconstruction cost could be either greater or less than the marginal entropy of the data; i.e., the final two terms could together be either positive or negative. (For example, let the observation be the flip of a biased coin, whose entropy is less than 1 bit. A model could simply copy through the observations, for a reconstruction cost of 0 bits; or predict heads and tails with equal probability, for a reconstruction cost of 1 bit.)
We have previously (Section 6.3.2) interpreted the first term as the sum of a “code cost,” required to encode the latent states inferred by the recognition model, and a regularizer that prevents the optimization from decreasing reconstruction and code costs merely by making recognition more confident. Alternatively, the first term can be interpreted [27] as a regularizer that encourages the recognition posterior to resemble the generative prior. Intuitively, optimizing reconstruction cost obliges the emission density to “play well” with the recognition model (as well as to fit the observations); but to complete the generative model, we also need it to play well with the generative prior! This requires the recognition model and the generative prior to resemble each other. Indeed, the regularization that enforces this resemblance is the only place that the generative prior enters the loss.
It is important to realize that we do not expect this relative entropy to vanish in the optimal model, which will embody a compromise between this regularizer and the reconstruction cost (cf. Section 6.3.2). Instead, the joint relative entropy vanishes at the optimum, so the most we can say is
This equation tells us something intuitive: Since the left-hand side is non-negative, so must the right-hand side be; in other words, the optimal reconstruction cost never exceeds the entropy of the data distribution. (Returning to our previous example of a biased coin, we find that the optimal model should correctly identify heads vs. tails at least as frequently as the bias of the coin.)
In the simplest VAEs, the generative prior is not parameterized, as in Eq. 8.15, in which case the KL regularizer pushes the recognition model toward the generative prior but not vice versa. For the generative model considered in isolation, such a truly structureless prior distribution is desirable. The ideal latent variables are akin to the fundamental particles of physics: they should be as independent as possible, since any remaining structure is something that still needs to be explained. We prefer to let the emission capture all structure.
However, the recognition model makes additional demands on the generative prior: A simple calculation shows that if the recognition model matches the generative posterior, and the generative model matches the data distribution, then the generative prior must match the “aggregated posterior,”
In brief, for optimal generative and recognition models,
must hold. To understand this requirement, we yet again rewrite the joint relative entropy, Eq. 8.20, this time in terms of this relative entropy on “priors” [17]:
The penalty on information transmission prevents the reconstruction loss from being driven to zero, and vice versa. But the relative entropy in this equation—unlike the relative entropy in Eq. 8.20—does vanish in the optimal model.
8.2.1 Monte Carlo gradient estimators
In our discussion thus far of probabilistic autoencoders, we have considered only Gaussian generative (Eq. 8.15) and recognition (Eq. 8.12) models. To see what other possibilities are available, let us pause to appraise our situation.
Our aim throughout these last few chapters has been to fit models to observed, complex distributions of data, but under the implicit guiding hypothesis that the complexity is a consequence of some kind of marginalization. Accordingly, we have attempted to “undo” the marginalization by constructing latent-variable generative models with relatively simple joint distributions, but which, when marginalized, yield complex distributions over the observations.
However, fitting the joint distribution of observed and latent variables is not equivalent to fitting the marginal, since the former can be achieved in part simply by making inference under the model more deterministic. Ideally, one would just additionally penalize deterministic posteriors, but computing the model posterior (typically) requires computing the marginal distribution, which by our original hypothesis is too complicated to work with.
We therefore opted instead to separate the inferential and generative functions, introducing a second, recognition model for the former. The generative-model joint distribution is consequently fit by minimizing the joint relative entropy. This, likewise, can be improved without improving the marginal fit, merely by reducing posterior entropy, in this case that of the recognition model, so a regularizer is still required. But the crucial difference is that this regularizer does not depend on the intractable generative posterior distribution.
Since we no longer need to invert the generative model, it can be almost arbitrarily complex. But what about the recognition model? The key constraint here is that the joint relative entropy is an expectation under this distribution, and we also need to compute gradients with respect to its parameters. The joint relative entropy is computed from the joint cross entropy and the recognition entropy (ignoring the constant data entropy). Let us assume that the latter can be computed in closed form (it can for most distributions), and focus on the former, the joint cross entropy, since in this term the recognition parameters occur only in the expectation (the surprisal depends only on the generative parameters). Therefore, we need to compute
The fundamental problem is that for functions that are at all complicated in the latent variables (for example, in generative models employing neural networks, like Eq. 8.15), the expectations will not be available in closed form. Instead, we shall need to make use of sample averages, i.e., Monte Carlo estimates of this gradient.
In deciding how to carry out this derivative, we are concerned with the constraints that the usual desiderata for estimators—consistency, unbiasedness, low variance—impose on the class of tractable recognition distributions and on the dimensionality of the latent space. Up to this point, we have considered a single pair of estimators, Eqs. 8.16 and 8.17, based on the identities in Eqs. 8.13 and 8.14. But these identities hold only for normal recognition distributions.
The score-function gradient estimator.
Arguably the simplest and most general gradient estimator can be derived by passing the derivative directly into the integral defined by the expectation:
where we have used Leibniz’s rule on the first line. (This is generally licit for differentiable recognition distributions; for more precise conditions, see [33].) The point of the second line is to turn the expression back into an expectation under the recognition distribution, in order to allow for a Monte Carlo estimate. The derivative on the final line is known as the score function, from which this gradient estimator inherits its name.
The score-function estimator is consistent and unbiased [33], but the variance scales poorly with latent dimension. This can be seen (e.g.) by considering the gradient estimator in the case of factorial recognition distributions:
Clearly, the variance of this estimator scales with the number of terms in the sum, even though many of these dimensions may have little or no relationship with the cost. Therefore, in problems with large latent spaces, a prohibitively large number of samples must be drawn at each gradient step [27, 37].
On the other hand, the score-function estimator makes only very weak assumptions about the recognition distribution and the cost (the generative-model surprisal); and the cost of computing it scales favorably with the dimensionality of the parameters and the cost of evaluating the surprisal [33]. Furthermore, variance can often be reduced with the method of “control variates”; e.g., using as our estimator
Because the expected score is always zero (see Section B.2), this alteration does not change the expected value of the estimator, but can potentially lower its variance. More sophisticated control variates are also possible.
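A small numerical sketch of the score-function estimator, with an optional scalar baseline as a control variate; the quadratic cost, the diagonal Gaussian recognition distribution, and all dimensions are illustrative choices, not anything specific to this section.

```python
# Score-function (REINFORCE) estimator for the gradient with respect to the mean
# of a diagonal Gaussian, with an optional scalar baseline as control variate.
import numpy as np

rng = np.random.default_rng(0)

def cost(z):                                   # stand-in for the generative-model surprisal
    return np.sum(z**2, axis=-1)

def score_function_grad(mu, sigma, n_samples=1000, baseline=0.0):
    z = mu + sigma * rng.standard_normal((n_samples, mu.size))
    score = (z - mu) / sigma**2                # gradient of log q(z; mu, sigma) w.r.t. mu
    return ((cost(z) - baseline)[:, None] * score).mean(axis=0)

mu, sigma = np.ones(10), 1.0                   # the true gradient is 2*mu
print(score_function_grad(mu, sigma))                        # noisy estimate
print(score_function_grad(mu, sigma, baseline=cost(mu)))     # typically less noisy
```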
The pathwise gradient estimator.
Still, since we anticipate working with high-dimensional latent spaces, it behooves us to find an estimator whose variance is independent of this dimension. Here we exploit the fact that many continuous random variables with highly expressive distributions can be written as transformations of random variables with simpler, “base” distributions. This will allow us to separate parameter dependence (in the transformation) from the sampling procedure (for the base random variable). For example, any computationally tractable inverse CDF can be used to transform a uniformly distributed random variable into a random variable with the corresponding PDF:
Similarly, random variables with distributions parameterized by location and scale can frequently be expressed as affine functions of random variables with the “standard” version of the distribution:
More ambitiously, as long as the “path” from auxiliary variables to latent variables is invertible, we can construct complex distributions from simple ones with the change-of-variables formula:
Indeed, the two preceding examples qualify as special cases of this technique. More subtle variations still on this theme allow for sampling from yet more distributions [27, 41, 33].
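The following sketch illustrates the first two transformations numerically: an exponential variate drawn via its inverse CDF, and a Gaussian variate drawn as an affine function of a standard normal. The particular distributions and rates are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inverse-CDF transform: if U ~ Uniform(0, 1), then -log(1 - U)/rate ~ Exponential(rate).
u = rng.uniform(size=100_000)
rate = 2.0
z_exponential = -np.log1p(-u) / rate

# Location-scale transform: mu + sigma * eps, with eps ~ N(0, 1), is N(mu, sigma^2).
mu, sigma = 1.5, 0.3
z_gaussian = mu + sigma * rng.standard_normal(100_000)

print(z_exponential.mean(), 1 / rate)   # both approximately 0.5
print(z_gaussian.std(), sigma)          # both approximately 0.3
```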
In all such cases, the transformation (notice that we have additionally allowed it to depend on the parameters) can be used in conjunction with the “law of the unconscious statistician” (LotUS) to remove the parameters from the expectation in Eq. 8.22:
Crucially, under this formulation, the gradient passes into the cost function itself. Therefore, dimensions of the latent space that have little effect on the cost are guaranteed to have little effect on the estimator itself—in contrast to the score-function estimator. This is what allows low-variance estimates to be made even in high-dimensional spaces. The computational cost also turns out to be identical to that of the score-function estimator [33].
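For comparison with the score-function sketch above, here is the pathwise estimator for the same illustrative Gaussian example; note that the gradient of the cost itself now appears in the estimator.

```python
# Pathwise ("reparameterization") estimator for the same illustrative example as
# the score-function sketch above: cost(z) = sum(z^2), z ~ N(mu, sigma^2 I).
import numpy as np

rng = np.random.default_rng(0)

def cost_grad(z):                      # gradient of the illustrative cost w.r.t. z
    return 2.0 * z

def pathwise_grad(mu, sigma, n_samples=1000):
    eps = rng.standard_normal((n_samples, mu.size))
    z = mu + sigma * eps               # z = g(eps; mu, sigma): the parameters enter the path
    return cost_grad(z).mean(axis=0)   # d cost/d mu = (d cost/d z)(d z/d mu) = cost_grad(z)

mu, sigma = np.ones(10), 1.0
print(pathwise_grad(mu, sigma))        # close to the true gradient 2*mu, with low variance
```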
Nomenclature.
This class of density estimator, with neural-network-parameterized generative and recognition models, has come to be known as the variational autoencoder [27]. We have seen how the model resembles the classical autoencoder. As for the term “variational,” it is something of a misnomer; we shall see shortly how that term entered the literature (Section 8.4).
The “pathwise gradient estimator” that made this model practical has for its part come to be known as the “reparameterization trick.” I have followed the terminology of Mohamed and colleagues [33].
8.2.2 Examples
With this model, we have made considerable progress toward the goal of generating high-quality samples from complex distributions. …
Independent Gaussian random variables
Let us first proceed with the example we have been considering so far, Eqs. 8.12 and 8.15, but for simplicity impose a few more restrictions. In particular, we insist that the generative covariance be isotropic, and the recognition covariance matrix be diagonal:
$$\boldsymbol{\Sigma}_{\text{emission}}(\boldsymbol{z}) = \sigma^2\,\mathbf{I}, \qquad \boldsymbol{\Sigma}_{\text{recognition}}(\boldsymbol{y}) = \operatorname{diag}\!\left(\boldsymbol{\sigma}^2(\boldsymbol{y})\right) \tag{8.26}$$
Then the emission energy, Eq. 8.18, and its derivatives simplify to
These gradients can be accumulated with backpropagation through the entire “autoencoder.” Parameters are then updated so as to descend the (joint) relative-entropy gradients, Eqs. 8.16, 8.17, and 8.19, which require the expectations of the above gradients under the recognition distribution. These can be approximated with sample averages. Notice that no pathwise gradient estimator is required.
We have avoided the pathwise gradient estimator by employing the identities Eqs. 8.3 and 8.4. However, it is useful to consider the more generic approach. In particular, we will simply derive the loss function, since in modern software packages, this is handed directly to an automatic differentiator. Let us take the perspective of an autoencoder, i.e. the last line in Eq. 8.20.
Anticipating, we calculate the log partition functions:
Then the KL regularizer is
In moving to the third line we twice used the identity Eq. B.13. Notice that this obviated the need for a sample average under the recognition distribution. Note also that the KL regularizer never depends on the generative-model parameters.
The reconstruction error is a cross entropy:
We have reparameterized the average on the last line in anticipation of differentiating with respect to the parameters of the recognition mean and covariance functions.
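Assembled into code, the loss for this restricted model is the closed-form KL regularizer plus a one-sample, reparameterized estimate of the reconstruction term. The sketch below is a standard PyTorch rendering under the stated restrictions (diagonal recognition covariance, isotropic emission covariance with illustrative variance `sigma2`); `recognition` is assumed to return the recognition mean and diagonal variance, and `emission_mean` to map latent samples to the emission mean (e.g., the mean head of the hypothetical generative module above).

```python
import math
import torch

def vae_loss(y, recognition, emission_mean, sigma2=1.0):
    """One-sample estimate of the reconstruction error plus the closed-form KL regularizer."""
    mu, var = recognition(y)                       # moments of the recognition distribution
    z = mu + var.sqrt() * torch.randn_like(mu)     # reparameterized latent sample
    y_hat = emission_mean(z)                       # mean of the isotropic Gaussian emission

    # Reconstruction (cross-entropy) term for an isotropic Gaussian emission:
    recon = 0.5 * ((y - y_hat).pow(2).sum(-1) / sigma2
                   + y.shape[-1] * math.log(2 * math.pi * sigma2))

    # KL( N(mu, diag(var)) || N(0, I) ), available in closed form:
    kl = 0.5 * (var + mu.pow(2) - 1.0 - var.log()).sum(-1)

    return (recon + kl).mean()
```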
Discrete VAEs with the Gumbel-Softmax trick
So far we have considered VAEs with Gaussian recognition distributions (Eq. 8.12), but what if we want discrete latent variables (as in e.g. the GMM)? As it stands, the pathwise gradient estimator will not work, because any function that discretizes will be flat almost everywhere. Any gradient that must pass through this function will be zero.
The most obvious workaround is to relax the discretization into some real-valued function, and we consider this first, although in a more subtle variation. Less obviously, we can employ a discretization that still passes real-valued information….
We begin simply with the relaxation itself and then bring in the VAE at the end. Suppose we want to draw categorical samples, which we interpret as one-hot vectors, according to a vector of real-valued natural parameters. The standard way to do this is first to exponentiate (to make positive) and normalize the vector of natural parameters,
then to draw a uniformly distributed random variable; and finally to select the category at which the cumulative mass (under Eq. 8.27) first reaches that value. (Picturesquely, one can imagine dividing up the unit interval into bins with widths given by the class probabilities. The bin into which the uniform variate falls determines the category.)
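In code, this standard procedure might look like the following sketch (NumPy, with illustrative names).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_categorical(eta):
    """Draw a one-hot sample from Eq. 8.27 by inverting the cumulative mass function."""
    probs = np.exp(eta - eta.max())             # exponentiate (stably) ...
    probs /= probs.sum()                        # ... and normalize the natural parameters
    u = rng.uniform()                           # uniform variate on (0, 1)
    k = np.searchsorted(np.cumsum(probs), u)    # first bin whose cumulative mass reaches u
    one_hot = np.zeros_like(probs)
    one_hot[min(k, probs.size - 1)] = 1.0       # guard against round-off in the last bin
    return one_hot

print(sample_categorical(np.array([0.5, 1.0, -2.0])))
```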
However, now suppose we want to differentiate through the sampling operation. The inverse cumulative mass function is flat almost everywhere, so the derivative gives us no information. Perhaps we could smooth the CDF, although it is not obvious how we ought to. (The two papers that proposed the Gumbel-softmax reparameterization do not even consider this possibility [32, 19].) Alternatively, perhaps there are other ways to sample from Eq. 8.27.
Gumbel perturbations.
Consider a set of independent, Gumbel-distributed random variables, each with its own mean. The cumulative-distribution and probability-density functions of each such random variable are, respectively,
Equivalently, these random variables are produced by adding independent, zero-mean Gumbel perturbations to our set of means.
Distribution of the argmax.
If we condition on the value of one of these Gumbel random variables, then the probability that it is the largest is the probability that all the other variables are smaller than this value. In anticipation of what follows, we interpret the output of the argmax as a one-hot vector, and assign a symbol to this vector-valued function:
Then
To convert this conditional probability into the marginal probability that this variable is the largest, we simply multiply by its probability density and integrate over all of its possible values:
In words, the probabilities of each Gumbel variate being the largest are given by the soft(arg)max function of their means. Eq. 8.28 matches Eq. 8.27, so we have arrived at an alternative method for drawing categorical samples: “perturb” the vector of natural parameters with independent, zero-mean Gumbel noise and then select the index of its largest element.
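A small sketch of this alternative, “Gumbel-max” procedure, together with a quick empirical check that the argmax frequencies match the softmax probabilities of Eq. 8.28; names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gumbel_max(eta):
    """Perturb the natural parameters with zero-mean Gumbel noise; take the argmax."""
    gumbel = -np.log(-np.log(rng.uniform(size=eta.shape))) - np.euler_gamma
    one_hot = np.zeros_like(eta)
    one_hot[np.argmax(eta + gumbel)] = 1.0
    return one_hot

# Empirical check: argmax frequencies match the softmax of the natural parameters (Eq. 8.28).
eta = np.array([0.5, 1.0, -2.0])
samples = np.array([sample_gumbel_max(eta) for _ in range(100_000)])
print(samples.mean(axis=0))
print(np.exp(eta) / np.exp(eta).sum())
```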
Relaxing the categorical distribution.
Unfortunately, the final step of the sampling procedure, the argmax, is not differentiable, so it seems we have made no progress over the standard sampling technique. On the other hand, it is more or less clear how to approximate the argmax: with a soft(arg)max, whose softness (a temperature parameter) we can titrate:
The softmax function is evidently differentiable:
For very soft functions (high temperatures), the gradient magnitudes are all manageable, indeed less than 1, but the elements of the output become nearly equal, no matter the values of the natural parameters, and the approximation to categorical random variables is bad. In this very soft limit, we are very nearly approximating the sample by its mean, although because the natural parameters are perturbed by Gumbel noise before passing through the softmax, this is not quite the case, and samples will vary even for fixed parameters. For very hard functions (temperatures near zero), the output approaches a one-hot vector, that is, an actual categorical sample. But in this same limit, the gradient becomes unbounded. In practice, then, the softness is typically decreased (starting from 1) over the course of learning.
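A sketch of the relaxed sampler, Eq. 8.29, with the softness expressed as the usual temperature parameter (large values soft, small values hard); as elsewhere, the names and example parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(eta, temperature):
    """Soft categorical sample, Eq. 8.29: Gumbel-perturb the natural parameters,
    then pass them through a softmax at the given temperature."""
    gumbel = -np.log(-np.log(rng.uniform(size=eta.shape))) - np.euler_gamma
    logits = (eta + gumbel) / temperature
    logits -= logits.max()                      # for numerical stability
    z = np.exp(logits)
    return z / z.sum()

eta = np.array([0.5, 1.0, -2.0])
print(gumbel_softmax(eta, temperature=5.0))     # soft: elements nearly equal
print(gumbel_softmax(eta, temperature=0.1))     # hard: nearly one-hot
```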
If we interpret Eq. 8.29 not as an approximation but as the definition of the sample, then we can no longer say that the sample is categorically distributed (except in the zero-temperature limit, which we never reach). Instead, it has a novel distribution, which its inventors call the “concrete” (a portmanteau of “continuous” and “discrete”) [32] or “Gumbel-softmax” [19] distribution. We will simply refer to it as the soft categorical distribution. What is the probability density for this distribution?
…
Using the Gumbel-softmax trick in a loss function.
Let the loss be an average over some function of the (soft) categorical samples, and let the natural parameters themselves be functions of some other parameters. First, we reparameterize to allow the derivative to pass into the average:
Now making the approximation,