10.2 EFH-like models
We return now to the undirected models of Chapter 3.
Recall from our introduction to latent-variable density estimation above, Section 6.2, one of the motivations for introducing latent variables into our density-estimation problem: they can provide flexibility in modeling the observed data.
Directed graphical models, and therefore expectation-maximization, are appropriate when we have furthermore a notion of how these latent variables are (marginally) distributed, and how the observations are conditionally distributed given a setting of the latent variables; that is, when we have some idea of both the prior and the emission distributions.
In a way, however, this is a misleading description of our assumptions.
For the directed generative models of the kind lately discussed, our construction of
In any case, we do specify a form for
Is the price of the trade worthwhile? Learning in the EFH and related models will require inference, naturally, but it will also require an expectation under the intractable model joint distribution. This expectation can be replaced with a sample average, but generating samples is, as we have just noted, equally difficult. However, after deriving the exact learning procedure (Section 10.2.1), we shall consider an alternative, approximate learning procedure that obviates (or at least reduces) the need for prolonged Gibbs sampling (Section 10.2.2), allowing us to have our cake and eat it, too. The EFH also enjoys an even more remarkable property: it can be composed hierarchically, with a guarantee that doing so improves learning (more precisely, it lowers a bound). We shall explore such models, called “deep belief networks” (DBNs), in Section 10.2.3.
In what follows, we derive learning rules as generally as possible, usually for Boltzmann machines tout court, for which they can often be expressed elegantly, if not solved simply.
10.2.1 Learning with exponential-family harmoniums
Relative-entropy minimization in energy-based models.
We start with joint distributions of a rather general form known as a Boltzmann distribution.³
³ The negative energy was originally referred to as the “harmony” [45], and omitting the negative sign does indeed frequently make the formulae prettier.
But the base measure has already spoken for
This is the generic form of distributions parameterized by undirected graphical models, although in this case we have made no assumptions about the underlying graph structure—indeed, we have not yet even distinguished the roles of the latent and the observed variables.
The goal, as usual, is to minimize the relative entropy between the distribution of the observed data and the model’s marginal distribution over them,
where the average is under the “hybrid” joint distribution
Here we are in roughly the opposite situation: the joint distribution is not tractable, due to the normalizer (the partition function).
In words, the gradient is equal to the difference between the expected energy gradient under two different distributions: the “hybrid joint distribution,” constructed by combining the data distribution (over the observed variables) with the model’s posterior distribution (over the latent variables, given the observed ones); and the model’s own joint distribution.
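In a generic notation (not necessarily the one used in Eqs. 10.15 and 10.16), with observed variables $\boldsymbol{y}$, latent variables $\boldsymbol{x}$, data distribution $p_{\text{data}}(\boldsymbol{y})$, and a Boltzmann model $p(\boldsymbol{y},\boldsymbol{x};\boldsymbol{\theta}) \propto \exp\{-E(\boldsymbol{y},\boldsymbol{x};\boldsymbol{\theta})\}$, this verbal description corresponds to
\[
\frac{\partial}{\partial\boldsymbol{\theta}}\,
\mathrm{D}_{\mathrm{KL}}\!\left[p_{\text{data}}(\boldsymbol{y}) \,\middle\|\, p(\boldsymbol{y};\boldsymbol{\theta})\right]
= \mathbb{E}_{p_{\text{data}}(\boldsymbol{y})\,p(\boldsymbol{x}\mid\boldsymbol{y};\boldsymbol{\theta})}\!\left[\frac{\partial E}{\partial\boldsymbol{\theta}}\right]
- \mathbb{E}_{p(\boldsymbol{y},\boldsymbol{x};\boldsymbol{\theta})}\!\left[\frac{\partial E}{\partial\boldsymbol{\theta}}\right],
\]
with the overall sign of each term depending, of course, on whether one works with the energy or the harmony.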
KL-minimization in the harmonium.
Now we make an assumption about the roles of the latent and the observed variables: conditioned on either one of these two sets (or “layers”), the variables of the other are mutually independent, with exponential-family distributions.
In this case, the joint distribution will take the form of a Boltzmann distribution, Eq. 10.14, with energy
The parameter gradients are therefore just the elegant
(See Section B.1 in the appendices for a refresher on derivatives with respect to matrices. For brevity, from here on, we omit the arguments of the sufficient statistics.) Substituting these energy gradients into the gradient of the loss function (Eqs. 10.15 and 10.16) yields
Evidently, learning is complete when the expected sufficient statistics of the model and data (or, more precisely, the model and the hybrid joint distribution)
match.
However, the elegance of these equations disguises a great difficulty, alluded to at the outset:
The first average in each pair is under the joint distribution, which we can compute only up to the intractable normalizer.
This certainly rules out setting these derivatives to zero and solving for the parameters in closed form.
Although we can use gradient descent instead, this doesn’t relieve us of the need to evaluate, at least approximately, these averages.
One obvious possibility is Gibbs sampling, and indeed the conditional distributions Eq. 10.17 are generally chosen so as to make block sampling possible—sample the entire vector of latent variables conditioned on the observed ones, then the entire vector of observed variables conditioned on the latent ones, and so on.
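For concreteness, here is a minimal sketch of such block Gibbs sampling in the simplest special case, a Bernoulli-Bernoulli harmonium (i.e., a restricted Boltzmann machine). The names (`W`, `b_v`, `b_h`) and the logistic conditionals are illustrative assumptions, not the chapter’s notation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def block_gibbs(W, b_v, b_h, v0, n_steps):
    """Alternately sample the entire hidden layer given the visible layer, and
    vice versa; possible because the harmonium has no intralayer connections.
    W: (n_visible, n_hidden); v0: (batch, n_visible); n_steps >= 1."""
    v = v0.copy()
    for _ in range(n_steps):
        h = rng.binomial(1, sigmoid(v @ W + b_h))    # p(h | v), all units at once
        v = rng.binomial(1, sigmoid(h @ W.T + b_v))  # p(v | h), all units at once
    return v, h
```

Samples drawn this way (after sufficient burn-in and thinning) can stand in for the intractable model averages above.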
10.2.2 The method of contrastive divergence
We reiterate the major limitation to the learning algorithm just described.
We can see from Eq. 10.19 that traveling the gradient of the relative entropy requires expectations under the model joint or marginals (in the positive terms).
Since this integral or sum usually cannot be computed analytically, the only obvious way to take it is via Gibbs sampling from the model.
Although EFHs are designed to facilitate such a sampling procedure—the lack of intralayer connections makes it possible to sample all of the hidden variables conditioned on the visible ones, and vice versa—it will still generally take many steps of sampling to “burn in” the network, and then even more for subsequent “thinning” to get independent samples.
And then there is a second problem: a procedure based on sampling will introduce variance into our estimate of the gradient; and furthermore, this variance depends on the current value of the parameters.
Here is an alternative that appears, at first glance, to be very crude [14]. We recall the reason that Gibbs sampling requires a long burn-in period: our choice of initializing vector, being arbitrary, may have very low probability under the model. The burn-in travels down the Markov chain, away from this initialization and towards the “thick” parts of the distribution we aim to sample. What if we had a better way to initialize the chain? In particular, suppose we initialize it at samples from the data distribution, and then take only a few steps along the Markov chain. Certainly, late in training, when the model distribution resembles the data distribution, this is a sensible procedure. But early in training, data samples are unlikely to have high probability under the model. Still, perhaps this “bias”—toward the regions of observation space with high probability under the data distribution—is a small price to pay for the reduction in the variance that would otherwise accumulate over many steps of Gibbs sampling.
To try to make this more precise, we write down a mathematical expression for our approximation [13].
We replace the expectation under the model distribution in Eq. 10.16 (for Boltzmann machines generally; or, for EFHs in particular, in Eq. 10.19) with averaging under a distribution we call
At the end of this Markov chain lies the model distribution; hence we can say
where in the last expression we have emphasized that samples from this distribution are easy to come by: one simply initializes the Gibbs chain at samples from the data and takes a few steps.
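In the generic notation used above, and writing $p^{0}$ for the data-initialized (“hybrid”) joint, $p^{n}$ for the joint distribution after $n$ steps of block Gibbs sampling started at $p^{0}$, and $p^{\infty}$ for the model distribution (one common convention, not necessarily the chapter’s), the approximation is
\[
\frac{\partial \mathcal{L}}{\partial\boldsymbol{\theta}}
= \mathbb{E}_{p^{0}}\!\left[\frac{\partial E}{\partial\boldsymbol{\theta}}\right]
- \mathbb{E}_{p^{\infty}}\!\left[\frac{\partial E}{\partial\boldsymbol{\theta}}\right]
\;\approx\;
\mathbb{E}_{p^{0}}\!\left[\frac{\partial E}{\partial\boldsymbol{\theta}}\right]
- \mathbb{E}_{p^{n}}\!\left[\frac{\partial E}{\partial\boldsymbol{\theta}}\right],
\]
with the second average estimated from the few-step samples just described.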
Contrastive divergence for energy-based models.
From here we try to work backwards: what loss function might yield this gradient? Hinton proposes “contrastive divergence,” i.e., the difference between the original relative entropy and a different one:
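In the same notation as the sketch above (with $p^{0}$ the data distribution, $p^{n}$ the distribution after $n$ Gibbs steps, and $p^{\infty}$ the model distribution), Hinton’s contrastive divergence can be written
\[
\mathrm{CD}_{n} \;:=\;
\mathrm{D}_{\mathrm{KL}}\!\left[p^{0} \,\middle\|\, p^{\infty}\right]
- \mathrm{D}_{\mathrm{KL}}\!\left[p^{n} \,\middle\|\, p^{\infty}\right].
\]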
This is an interesting quantity.
First of all, it is positive.
Why?
Because each step of the Markov chain can only bring its distribution closer (in relative entropy) to the chain’s stationary distribution, which is the model distribution; hence the second relative entropy can be no larger than the first.
We might take this as a specific instance of a more general procedure: if the gradient of a function is difficult, replace it with a different function with the same minimum but an easier gradient. But is the gradient of the contrastive divergence easier to evaluate than the gradient of the (standard) relative entropy?
We have already worked out the gradient of the first relative entropy, Eqs. 10.15 and 10.16, but we break the computation into steps to reuse portions for the derivative of the second divergence. For brevity, we suppress the arguments:
On the third line we have substituted for the fourth term by exploiting the analogy with the second term (which we worked out in Eq. 10.16 above).
The final line is seemingly quite close to the desired gradient, Eq. 10.20.
Is this close enough?
Hinton reports that extensive simulation seems to indicate that this last term is small in practice, and that in any case its inclusion rarely changes the direction of the approximate gradient (Eq. 10.20) by more than 90 degrees; that is, neglecting it seldom makes the approximate gradient point in the wrong direction [13].
And inefficiencies introduced by ignoring this term can perhaps be alleviated by choosing an n greater than one, i.e., by taking more steps of Gibbs sampling.
Contrastive divergence for the harmonium.
The derivation in Eq. 10.22 is general to models based on the Boltzmann distribution.
Applying the contrastive-divergence approximate learning rule, Eq. 10.20, to exponential-family harmoniums just means substituting in the harmonium’s energy gradients, that is, the relevant (products of) sufficient statistics.
Thought of as a neural network, we can say that each sample vector drives activity in the layer above, which reciprocally drives activity in the layer below, which drives activity in the layer above; after which “synaptic strengths” change proportional to pairwise correlations (average products) between the pre- and post-synaptic units, with the initial contributions being Hebbian and the final contributions being anti-Hebbian.
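As a sketch, here is what this CD-1 update might look like for the Bernoulli special case, reusing the `sigmoid` helper and shapes from the earlier Gibbs-sampling sketch (the choice to use probabilities rather than samples in some of the statistics is a common convention, not a prescription of the text):

```python
import numpy as np

def cd1_update(W, b_v, b_h, v_data, lr=0.01, rng=np.random.default_rng(0)):
    """One contrastive-divergence (CD-1) update for a Bernoulli harmonium.
    v_data: (batch, n_visible) array of data samples."""
    # positive ("Hebbian") phase: hidden activity driven by the data
    p_h_data = sigmoid(v_data @ W + b_h)
    h_data = rng.binomial(1, p_h_data)
    # one block-Gibbs step: reconstruct the visibles, then re-infer the hiddens
    v_recon = rng.binomial(1, sigmoid(h_data @ W.T + b_v))
    p_h_recon = sigmoid(v_recon @ W + b_h)
    # negative ("anti-Hebbian") phase: subtract the reconstruction statistics
    batch_size = v_data.shape[0]
    W += lr * (v_data.T @ p_h_data - v_recon.T @ p_h_recon) / batch_size
    b_v += lr * (v_data - v_recon).mean(axis=0)
    b_h += lr * (p_h_data - p_h_recon).mean(axis=0)
    return W, b_v, b_h
```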
For CD
Some examples.
10.2.3 Learning with deep belief networks
We summarize our progress in learning in EFH-like models.
Since we will shortly need to introduce subscripts to the random variables, we avoid clutter from here on by setting aside contrastive divergence and working out the results in terms of the original loss function (see Eq. 10.15)—allowing us to dispense with superscripts.
Here, then, we return to using
To recapitulate: the EFH represents, in some sense, a generative model more “agnostic” than the directed ones learned with EM: rather than committing to a prior distribution of latent variables, it commits only to the two conditional distributions, of each layer given the other, and lets the marginal distributions fall out of their interaction.
Supposing even that our learning procedure does minimize the relative entropy of the data distribution to the model marginal, there is no guarantee that this minimum is small: the EFH may simply be too inflexible to match the data. We should like, then, a procedure for improving such a model without discarding what it has already learned.
From marginal to joint relative entropy.
Our derivation of just such a procedure begins with an idea: perhaps only certain parts of the trained, but insufficient, EFH want improvement.
Specifically, we shall commit ourselves to using the emission distribution it has learned, and seek to improve only the prior distribution over the latent variables.
So we turn for inspiration to expectation-maximization (EM), the learning procedure for directed graphical models, which is in some sense what our problem has become, since we now have a parameterized emission distribution in search of a better prior.
With these definitions in hand, the joint relative entropy can be written (cf. Eq. 6.6) as
The final line reminds us that we are proposing to optimize only the parameters of the prior distribution over the latent variables (the emission being held fixed).
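In generic notation, with $q(\boldsymbol{x}\mid\boldsymbol{y})$ the fixed recognition distribution inherited from the trained EFH, the decomposition in question is the standard one:
\[
\mathrm{D}_{\mathrm{KL}}\!\left[p_{\text{data}}(\boldsymbol{y})\,q(\boldsymbol{x}\mid\boldsymbol{y}) \,\middle\|\, p(\boldsymbol{y},\boldsymbol{x};\boldsymbol{\theta})\right]
= \mathrm{D}_{\mathrm{KL}}\!\left[p_{\text{data}}(\boldsymbol{y}) \,\middle\|\, p(\boldsymbol{y};\boldsymbol{\theta})\right]
+ \mathbb{E}_{p_{\text{data}}(\boldsymbol{y})}\!\left[\mathrm{D}_{\mathrm{KL}}\!\left[q(\boldsymbol{x}\mid\boldsymbol{y}) \,\middle\|\, p(\boldsymbol{x}\mid\boldsymbol{y};\boldsymbol{\theta})\right]\right],
\]
so the joint relative entropy upper-bounds the marginal relative entropy, with the gap given by the mismatch between the recognition distribution and the model’s posterior.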
Unrolling the EFH.
Now we make a very clever observation [13].
Consider the “hybrid” graphical model in Fig. LABEL:fig:DBNb.
[Margin note: insert figure of (a) EFH, and (b) inverted EFH with one directed connection “hanging” off.]
Samples from this model can be generated by, first, prolonged Gibbs sampling from the “inverted” EFH (notice the labels), followed by a single sample from the bottommost
layer via the directed connections.
But if the weight matrix defining these connections is the transpose of the EFH’s weight matrix, then samples from this model are identical to samples from the visible layer of a standard EFH, Fig. LABEL:fig:DBNa!
Likewise, the recognition distributions of the models in Figs. LABEL:fig:DBNa and LABEL:fig:DBNb are identical.
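A toy illustration of the first of these claims, again in the Bernoulli special case and with illustrative names: the two samplers below execute the same Markov chain, differing only in whether the final top-down pass is described as half a Gibbs step inside the EFH or as a single draw through the hybrid model’s directed, transposed-weight connections.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_efh_visibles(W, b_v, b_h, v0, n_steps):
    """Prolonged Gibbs sampling in the EFH; return the final visible-layer sample."""
    v = v0
    for _ in range(n_steps):
        h = rng.binomial(1, sigmoid(v @ W + b_h))
        v = rng.binomial(1, sigmoid(h @ W.T + b_v))
    return v

def sample_unrolled(W, b_v, b_h, h0, n_steps):
    """Prolonged Gibbs sampling in the 'inverted' EFH (over the hidden layer and a
    copy of the visible layer), then one draw through the directed connections,
    whose weight matrix is the transpose of the EFH's."""
    h = h0
    for _ in range(n_steps):
        v_copy = rng.binomial(1, sigmoid(h @ W.T + b_v))
        h = rng.binomial(1, sigmoid(v_copy @ W + b_h))
    return rng.binomial(1, sigmoid(h @ W.T + b_v))  # single top-down, directed sample
```

After sufficient burn-in, both functions return samples from the same marginal distribution over the visible layer.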
Suppose then we let the prior over
This closely resembles the M step of EM with invertible generative models, and the logic of the bound is the same.
As soon as
So much for the M step.
However, there is no subsequent E step, because there is no obvious way to bring the recognition model, inherited from the original EFH, back into agreement with the posterior distribution of the newly improved generative model.
Recursive application of the procedure.
The second stage of training just described amounts to training a second EFH—albeit on the empirical data as transformed by the first EFH’s recognition model, Eq. 10.24, rather than the data in their raw form.
Eq. 10.25 shows that this can likewise be written as minimizing a (marginal) relative entropy.
This observation suggests, correctly, that the entire procedure can be applied recursively.
After training the second (inverted) EFH, the MRE in Eq. 10.25 may not be zero.
Then a second directed layer can be “unrolled” from the EFH spool (Fig. LABEL:fig:DBNc), and a new top-level EFH trained on a new “data distribution”: the previous “data distribution,” as transformed by the second EFH’s recognition model.
Can we guarantee that this will continue to improve (or at least not degrade) the model’s fit to the observed data?
In fact we cannot, but we can guarantee something nearly as good:
Either the model fits the data better, i.e. we reduce the marginal relative entropy; or the recognition distribution becomes a better match to the corresponding posterior distribution, i.e. the bound becomes tighter.
To see this, we apply the argument, given in the previous section for the first “unrolled” EFH, to a second.
[Margin note: put in some figures for this; perhaps reuse the previous figures, along with a panel (c) with three sets of parameters, i.e. four layers.]
Decoupling the new EFH’s parameters from those of the EFH below it (that is, improving the prior over that EFH’s latent variables) proceeds by way of decreasing a JRE upper bound that is initially tight (analogous to Eq. 10.25). Now, decreasing this bound either improves the fit at this level, or it renders this level’s posterior distribution a better match to the corresponding (fixed) recognition distribution. After the initial improvement of the prior over the first layer of latent variables, then, subsequent improvements are guaranteed only to lower bounds (or bounds on bounds) on the fit to the observed data.
Picturesquely, we can describe the whole procedure as follows.
After training the first EFH, we freeze the way that “first-order features” give rise to data.
But we need more flexibility in the prior distribution of first-order features.
So we train a new EFH to generate them.
Our training procedure makes this second EFH more likely to generate features that either generate good data, or more closely resemble the features that the recognition model actually extracts from the data.
The procedure is only guaranteed to keep lowering bounds (or bounds on bounds, etc.) if the new EFHs are introduced with their parameters initialized to the previous EFH’s parameter values. In practice, however, this is almost always relaxed, because the new EFH is seldom even chosen to have the appropriate shape (same number of hidden units as the previous EFH’s number of visible units). This allows for very general models; and although not guaranteed to keep improving performance, the introduction of deeper EFHs can be roughly justified along similar lines: it allows for more flexible priors over increasingly complex features.
Summary.
We summarize the entire procedure for training deep belief networks with a pair of somewhat hairy definitions and a pair of learning rules. Pairs are required simply to maintain a connection to the notational distinction between latent and observed variables in the original EFH; conceptually, the training is the same at every layer.
[The pair of layer-wise definitions, followed by the pair of learning rules, Eq. 10.27.]
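Schematically, and under the same illustrative Bernoulli assumptions as the earlier sketches (with `cd1_update` from the CD-1 sketch of Section 10.2.2 standing in for whatever EFH trainer one prefers), the whole procedure is a short loop:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_efh(x, n_hid, n_updates=1000):
    """Illustrative stand-in for an EFH trainer: repeated CD-1 updates."""
    n_vis = x.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hid))
    b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)
    for _ in range(n_updates):
        W, b_v, b_h = cd1_update(W, b_v, b_h, x)
    return W, b_v, b_h

def train_dbn(data, layer_sizes):
    """Greedy, layer-wise training: fit an EFH to the current "data distribution,"
    then transform those data with its recognition model and train the next EFH
    (conceptually the same step at every layer)."""
    layers, x = [], data
    for n_hid in layer_sizes:
        W, b_v, b_h = train_efh(x, n_hid)
        layers.append((W, b_v, b_h))
        # the next layer's "data": samples from this layer's recognition model
        x = rng.binomial(1, sigmoid(x @ W + b_h)).astype(float)
    return layers
```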
Failure modes.
For an EFH to match the data distribution, its prior distribution (over the latent variables) must match the aggregated posterior, i.e. the data distribution as transformed by the recognition model.
However, the converse is not true, since the match of prior and aggregated posterior only requires that
which does not imply a unique solution for