10.2 EFH-like models

We return now to the undirected models of Chapter 3. Recall from our introduction to latent-variable density estimation above, Section 6.2, one of the motivations for introducing latent variables into our density-estimation problem: they can provide flexibility in modeling the observed data. Directed graphical models, and therefore expectation-maximization, are appropriate when we furthermore have a notion of how these latent variables are (marginally) distributed, and how the observations are conditionally distributed given a setting of the latent variables; that is, when we have some idea of both $\hat{p}(\bm{\hat{x}};\bm{\theta})$ and $\hat{p}(\bm{\hat{y}}|\bm{\hat{x}};\bm{\theta})$. In moving to models based on the architecture of the exponential-family harmonium (EFH), we trade an assumption about the form of $\hat{p}(\bm{\hat{x}};\bm{\theta})$ for one about the form of $\hat{p}(\bm{\hat{x}}|\bm{\hat{y}};\bm{\theta})$.

In a way, however, this is a misleading description of our assumptions. For directed generative models of the kind lately discussed, our construction of $\hat{p}(\bm{\hat{x}};\bm{\theta})$ (and, perhaps, $\hat{p}(\bm{\hat{y}}|\bm{\hat{x}};\bm{\theta})$) is often, though not always, motivated by some problem-specific semantic facts about the latent variables. In the (admittedly toy) problem in Fig. LABEL:fig:DE, for example, the prior is a Bernoulli distribution because the hidden variable is taken to be the outcome of a coin flip. By contrast, when adopting the harmonium as our latent-variable density estimator, we seldom have concrete ideas about the form of $\hat{p}(\bm{\hat{x}}|\bm{\hat{y}};\bm{\theta})$, and its selection is motivated mostly by mathematical convenience. Perhaps we require the hidden variables to have support in a certain set (e.g., the positive integers), but even this is uncommon. Most often, the individual elements of the vector of hidden units have no precise semantic content; rather, the vector as a whole represents no more than a kind of flexible set of parameters.
After learning, they can be interpreted as detectors of “features” of the data, but there is still usually no a priori reason to assume that these features should be distributed according to $\hat{p}(\bm{\hat{x}}|\bm{\hat{y}};\bm{\theta})$. That is one reason why the RBM is so much more common than other exponential-family harmoniums: a binary vector will generally be just as suitable for representing features as, say, a vector of positive integers. In this, EFHs have a strong kinship with deterministic neural networks.

In any case, we do specify a form for $\hat{p}(\bm{\hat{x}}|\bm{\hat{y}};\bm{\theta})$, at the expense of freedom to choose $\hat{p}(\bm{\hat{x}};\bm{\theta})$. Unfortunately, as discussed in Section 3.1, this method for making inference tractable (indeed, trivial) renders the joint distribution intractable. In particular, it is now the joint distribution, as well as the marginals over both $\bm{\hat{X}}$ and $\bm{\hat{Y}}$, that is specifiable only up to the normalizer (“partition function”). We therefore lose the ability to take (exact) expectations under these distributions, and are left with Monte Carlo techniques. But we have also undermined our ability to draw samples: in directed models, we simply sample from the prior distribution (which we have) and then use this sample to sample from the emission probability (which we also have). In the harmonium, it is necessary instead to Gibbs sample, which can at least be performed layer-wise rather than unit-wise, but which is nevertheless painfully slow. So whereas for directed models, drawing one sample vector from the joint distribution requires one pass through all the variables, in the EFH it may require (depending on the mixing rate of the Markov chain induced by the sampling procedure) scores to hundreds of passes.
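The contrast can be made concrete with a minimal sketch. For a directed model, one joint sample really is a single ancestral pass; the Bernoulli prior and Gaussian emission below are hypothetical choices, echoing the coin-flip example above:

```python
import numpy as np

rng = np.random.default_rng(0)

# One joint sample from a directed model costs a single ancestral pass:
# draw the latent from the prior, then the observation from the emission.
def ancestral_sample(pi, mu0, mu1, sigma):
    x = float(rng.random() < pi)              # latent ~ prior
    y = rng.normal(mu1 if x else mu0, sigma)  # observation ~ emission
    return x, y

x, y = ancestral_sample(pi=0.3, mu0=-1.0, mu1=1.0, sigma=0.5)
```

By contrast, a single joint sample from a harmonium requires many alternating conditional draws before the Markov chain mixes.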

Is the price of the trade worthwhile? Learning in the EFH and related models will require inference, naturally, but it will also require an expectation under the intractable model joint distribution. This expectation can be replaced with a sample average, but generating samples is, as we have just noted, equally difficult. However, after deriving the exact learning procedure (Section 10.2.1), we shall consider an alternative, approximate learning procedure, which obviates (or at least reduces) the need for prolonged Gibbs sampling (Section 10.2.2), allowing us to have our cake and eat it, too. The EFH also enjoys an even more remarkable property: it can be composed hierarchically, with a guarantee that doing so improves learning (more precisely, it lowers a bound). We shall explore such models, called “deep belief networks” (DBNs), in Section 10.2.3.

In what follows, we derive learning rules as generally as possible, usually for Boltzmann machines tout court, for which they can often be expressed elegantly, if not solved simply.

10.2.1 Learning with exponential-family harmoniums

Relative-entropy minimization in energy-based models.

We start with joint distributions of a rather general form known as a Boltzmann distribution:\footnote{The negative energy was originally referred to as the “harmony” [45], and omitting the negative sign does indeed frequently make the formulae prettier. But the base measure has already spoken for $h$, as entropy has for $H$ (the capital $\eta$ being indistinguishable from a capital $h$). So we work with energy, $E$, instead.}

\begin{split}
\hat{p}(\bm{\hat{x}},\bm{\hat{y}};\bm{\theta}) &= \frac{1}{Z(\bm{\theta})}\exp\{-E(\bm{\hat{x}},\bm{\hat{y}},\bm{\theta})\},\\
Z(\bm{\theta}) &= \sum_{\bm{\hat{x}}}\sum_{\bm{\hat{y}}}\exp\{-E(\bm{\hat{x}},\bm{\hat{y}},\bm{\theta})\}.
\end{split} \qquad (10.14)

This is the generic form of distributions parameterized by undirected graphical models, although in this case we have made no assumptions about the underlying graph structure—indeed, we have not yet even distinguished the roles of $\bm{\hat{X}}$ and $\bm{\hat{Y}}$. Critically, without further constraints on the energy, the normalizer, $Z(\bm{\theta})$, is intractable for sufficiently large $\bm{\hat{X}}$ and $\bm{\hat{Y}}$, because the number of summands in Eq. 10.14 is exponential in $\dim(\bm{\hat{X}}) + \dim(\bm{\hat{Y}})$. In some special cases of continuous random variables, the calculation of the normalizer may reduce to tractable integrals, but these cases are exceptional.
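The combinatorial problem can be seen directly by computing $Z(\bm{\theta})$ by brute force for a toy binary harmonium. The bilinear energy and parameter values below are hypothetical illustrations, not anything derived in the text:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy binary harmonium with energy
#     E(x, y) = -b_x . x - b_y . y - y . W . x,
# small enough that the normalizer can be summed exactly.
n_x, n_y = 3, 4
W = rng.normal(size=(n_y, n_x))
b_x = rng.normal(size=n_x)
b_y = rng.normal(size=n_y)

def energy(x, y):
    return -(b_x @ x) - (b_y @ y) - (y @ W @ x)

# Z sums over every joint configuration: 2^(n_x + n_y) summands. At even
# modest sizes (say 100 + 100 units) this loop would be astronomically long.
Z = sum(
    np.exp(-energy(np.array(x, dtype=float), np.array(y, dtype=float)))
    for x in itertools.product([0, 1], repeat=n_x)
    for y in itertools.product([0, 1], repeat=n_y)
)
n_summands = 2 ** (n_x + n_y)
```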

The goal, as usual, is to minimize the relative entropy between the distribution of observed data, $p(\bm{y})$, and that of model observations, $\hat{p}(\bm{\hat{y}};\bm{\theta})$. The method is to follow its gradient in parameter space. Recalling from Eqs. 4.6 and 6.3:

\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\operatorname{D_{KL}}\{p(\bm{Y})\,\|\,\hat{p}(\bm{Y};\bm{\theta})\} = \mathbb{E}_{\bm{\hat{X}},\bm{Y}}\!\left[-\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\log\hat{p}(\bm{\hat{X}},\bm{Y};\bm{\theta})\right], \qquad (10.15)

where the average is under the “hybrid” joint distribution $\hat{p}(\bm{\hat{x}}|\bm{y};\bm{\theta})\,p(\bm{y})$. This is the point at which we opted, above, to use EM, i.e., to abandon the expectation under $\hat{p}(\bm{\hat{x}}|\bm{\hat{y}};\bm{\theta})$. That was because we had in hand a tractable expression for the joint distribution, $\hat{p}(\bm{\hat{x}},\bm{\hat{y}};\bm{\theta})$, so that if the expectation could be taken under some separate “recognition model,” the problem would be as simple as fitting a fully observed model (with the additional requirement that this recognition model be kept close to the generative-model posterior).
Indeed, as we have lately emphasized, $\hat{p}(\bm{\hat{x}}|\bm{\hat{y}};\bm{\theta})$ is not always even available.

Here we are in roughly the opposite situation: the joint distribution is not tractable, due to the normalizer $Z(\bm{\theta})$ (which, note well, is a function of the parameters we wish to optimize); but the posterior distribution $\hat{p}(\bm{\hat{x}}|\bm{\hat{y}};\bm{\theta})$ is. We will therefore proceed by applying Eq. 10.15 directly to the Boltzmann distribution, Eq. 10.14, but expecting the intractable normalizer to block progress at some point:

\begin{split}
\mathbb{E}_{\bm{\hat{X}},\bm{Y}}\!\left[-\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\log\hat{p}(\bm{\hat{X}},\bm{Y};\bm{\theta})\right]
&= \mathbb{E}_{\bm{\hat{X}},\bm{Y}}\!\left[\frac{\mathrm{d}\log Z}{\mathrm{d}\bm{\theta}} + \frac{\mathrm{d}E}{\mathrm{d}\bm{\theta}}\right]\\
&= \frac{\mathrm{d}\log Z}{\mathrm{d}\bm{\theta}} + \mathbb{E}_{\bm{\hat{X}},\bm{Y}}\!\left[\frac{\mathrm{d}E}{\mathrm{d}\bm{\theta}}\right]\\
&= \frac{1}{Z}\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\sum_{\bm{\hat{x}}}\sum_{\bm{\hat{y}}}\exp\{-E(\bm{\hat{x}},\bm{\hat{y}},\bm{\theta})\} + \mathbb{E}_{\bm{\hat{X}},\bm{Y}}\!\left[\frac{\mathrm{d}E}{\mathrm{d}\bm{\theta}}\right]\\
&= -\sum_{\bm{\hat{x}}}\sum_{\bm{\hat{y}}}\frac{1}{Z}\exp\{-E(\bm{\hat{x}},\bm{\hat{y}},\bm{\theta})\}\frac{\mathrm{d}E}{\mathrm{d}\bm{\theta}} + \mathbb{E}_{\bm{\hat{X}},\bm{Y}}\!\left[\frac{\mathrm{d}E}{\mathrm{d}\bm{\theta}}\right]\\
&= \mathbb{E}_{\bm{\hat{X}},\bm{Y}}\!\left[\frac{\mathrm{d}E}{\mathrm{d}\bm{\theta}}\right] - \mathbb{E}_{\bm{\hat{X}},\bm{\hat{Y}}}\!\left[\frac{\mathrm{d}E}{\mathrm{d}\bm{\theta}}\right].
\end{split} \qquad (10.16)

In words, the gradient equals the difference between the expected energy gradients under two distributions: the “hybrid” joint distribution, constructed by combining the data distribution (over $\bm{Y}$) with the generative model’s posterior distribution (over $\bm{\hat{X}}$); and the complete generative model.
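The pivotal step in Eq. 10.16, trading the gradient of the log normalizer for a (negated) expected energy gradient, can be checked numerically on a toy model. The two-variable energy and coupling below are hypothetical:

```python
import numpy as np

# Check that d(log Z)/dw equals the negative expected energy gradient under
# the model, for a toy Boltzmann distribution over two binary variables with
# a single coupling parameter w:  E(x, y, w) = -w * x * y.
states = [(x, y) for x in (0, 1) for y in (0, 1)]
w = 0.7

def log_Z(w):
    return np.log(sum(np.exp(w * x * y) for x, y in states))

# Exact model expectation of dE/dw = -x*y, by enumeration.
Z = np.exp(log_Z(w))
expected_dE_dw = sum(np.exp(w * x * y) / Z * (-x * y) for x, y in states)

# Finite-difference derivative of log Z; it agrees with -expected_dE_dw
# up to the finite-difference error.
eps = 1e-6
dlogZ_dw = (log_Z(w + eps) - log_Z(w - eps)) / (2 * eps)
```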

KL-minimization in the harmonium.

Now we make an assumption about the roles of $\bm{\hat{X}}$ and $\bm{\hat{Y}}$. As we saw in Section 3.1, constraining the conditional distributions $\hat{p}(\bm{\hat{y}}|\bm{\hat{x}};\bm{\theta})$ and $\hat{p}(\bm{\hat{x}}|\bm{\hat{y}};\bm{\theta})$ to be (1) in exponential families and (2) consistent with each other implies that they take the form

\begin{split}
\hat{p}(\bm{\hat{x}}|\bm{\hat{y}};\bm{\theta}) &= h(\bm{\hat{x}})\exp\left\{\left(\bm{b}_{\hat{x}} + \mathbf{W}_{\hat{y}\hat{x}}\bm{U}(\bm{\hat{y}})\right)^{\text{T}}\bm{T}(\bm{\hat{x}}) - A(\bm{\eta}(\bm{\hat{y}}))\right\},\\
\hat{p}(\bm{\hat{y}}|\bm{\hat{x}};\bm{\theta}) &= k(\bm{\hat{y}})\exp\left\{\left(\bm{b}_{\hat{y}} + \mathbf{W}_{\hat{y}\hat{x}}^{\text{T}}\bm{T}(\bm{\hat{x}})\right)^{\text{T}}\bm{U}(\bm{\hat{y}}) - B(\bm{\zeta}(\bm{\hat{x}}))\right\}.
\end{split} \qquad (10.17)

In this case, the joint distribution will take the form of a Boltzmann distribution, Eq. 10.14, with energy

E(\bm{\hat{x}},\bm{\hat{y}},\bm{\theta}) = -\bm{b}_{\hat{y}}^{\text{T}}\bm{U}(\bm{\hat{y}}) - \bm{b}_{\hat{x}}^{\text{T}}\bm{T}(\bm{\hat{x}}) - \bm{U}(\bm{\hat{y}})^{\text{T}}\mathbf{W}_{\hat{y}\hat{x}}^{\text{T}}\bm{T}(\bm{\hat{x}}) - \log\left(h(\bm{\hat{x}})\,k(\bm{\hat{y}})\right). \qquad (10.18)

The parameter gradients are therefore just the elegant

\frac{\mathrm{d}E}{\mathrm{d}\mathbf{W}_{\hat{y}\hat{x}}} = -\bm{T}\bm{U}^{\text{T}}, \qquad
\frac{\mathrm{d}E}{\mathrm{d}\bm{b}_{\hat{x}}} = -\bm{T}, \qquad
\frac{\mathrm{d}E}{\mathrm{d}\bm{b}_{\hat{y}}^{\text{T}}} = -\bm{U}^{\text{T}}.

(See Section B.1 in the appendices for a refresher on derivatives with respect to matrices. For brevity, from here on, we omit the arguments of the sufficient statistics.) Substituting these energy gradients into the gradient of the loss function (Eqs. 10.15 and 10.16) yields

\begin{split}
\frac{\mathrm{d}\operatorname{D_{KL}}\{p(\bm{Y})\|\hat{p}(\bm{Y};\bm{\theta})\}}{\mathrm{d}\mathbf{W}_{\hat{y}\hat{x}}} &= \mathbb{E}_{\bm{\hat{X}},\bm{\hat{Y}}}\!\left[\bm{T}\bm{U}^{\text{T}}\right] - \mathbb{E}_{\bm{\hat{X}},\bm{Y}}\!\left[\bm{T}\bm{U}^{\text{T}}\right]\\
\frac{\mathrm{d}\operatorname{D_{KL}}\{p(\bm{Y})\|\hat{p}(\bm{Y};\bm{\theta})\}}{\mathrm{d}\bm{b}_{\hat{x}}} &= \mathbb{E}_{\bm{\hat{X}},\bm{\hat{Y}}}\!\left[\bm{T}\right] - \mathbb{E}_{\bm{\hat{X}},\bm{Y}}\!\left[\bm{T}\right]\\
\frac{\mathrm{d}\operatorname{D_{KL}}\{p(\bm{Y})\|\hat{p}(\bm{Y};\bm{\theta})\}}{\mathrm{d}\bm{b}_{\hat{y}}^{\text{T}}} &= \mathbb{E}_{\bm{\hat{X}},\bm{\hat{Y}}}\!\left[\bm{U}^{\text{T}}\right] - \mathbb{E}_{\bm{\hat{X}},\bm{Y}}\!\left[\bm{U}^{\text{T}}\right].
\end{split} \qquad (10.19)

Evidently, learning is complete when the expected sufficient statistics of the model and data (or, more precisely, of the model and the hybrid joint distribution) match. However, the elegance of these equations disguises a great difficulty, alluded to at the outset: the first average in each pair is under the model joint distribution, which we can compute only up to the intractable normalizer. This certainly rules out setting these derivatives to zero and solving for the parameters in closed form. Although we can use gradient descent instead, this does not relieve us of the need to evaluate, at least approximately, these averages. One obvious possibility is Gibbs sampling, and indeed the conditional distributions in Eq. 10.17 are generally chosen so as to make block sampling possible: sample the entire vector $\bm{\hat{x}}$ given $\bm{\hat{y}}$, and then the entire vector $\bm{\hat{y}}$ given $\bm{\hat{x}}$.
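Such block sampling is easy to sketch for the Bernoulli-Bernoulli case, where $\bm{T}(\bm{\hat{x}}) = \bm{\hat{x}}$ and $\bm{U}(\bm{\hat{y}}) = \bm{\hat{y}}$ and both conditionals in Eq. 10.17 factor into logistic units; the sizes and parameters below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical Bernoulli-Bernoulli harmonium; W couples the two layers.
n_x, n_y = 4, 6
W = rng.normal(scale=0.1, size=(n_x, n_y))
b_x, b_y = np.zeros(n_x), np.zeros(n_y)

def block_gibbs(y, n_steps=100):
    """Alternate whole-layer draws; after burn-in, (x, y) approximates a
    sample from the model joint distribution."""
    for _ in range(n_steps):
        x = (rng.random(n_x) < sigmoid(b_x + W @ y)).astype(float)
        y = (rng.random(n_y) < sigmoid(b_y + W.T @ x)).astype(float)
    return x, y

x_model, y_model = block_gibbs(rng.integers(0, 2, n_y).astype(float))
# Averaging np.outer(x_model, y_model), x_model, and y_model over many such
# draws estimates the intractable first terms of Eq. 10.19.
```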

10.2.2 The method of contrastive divergence

We reiterate the major limitation to the learning algorithm just described. We can see from Eq. 10.19 that traveling the gradient of the relative entropy requires expectations under the model joint or marginals (in the positive terms). Since this integral or sum usually cannot be computed analytically, the only obvious way to take it is via Gibbs sampling from the model. Although EFHs are designed to facilitate such a sampling procedure—the lack of intralayer connections makes it possible to sample all of the hidden variables conditioned on the visible ones, and vice versa—it will still generally take many steps of sampling to “burn in” the network, and then even more for subsequent “thinning” to get independent samples. And then there is a second problem: a procedure based on sampling will introduce variance into our estimate of the gradient; and furthermore, this variance depends on the current value of the parameters ($\bm{\theta}$). Hinton likens it to “a horizontal sheet of tin that is resonating in such a way that some parts have strong vertical oscillations and other parts are motionless. Sand scattered on the tin will accumulate in the motionless areas even though the time-averaged gradient is zero everywhere” [13].

Here is an alternative that appears, at first glance, to be very crude [14]. We recall the reason that Gibbs sampling requires a long burn-in period: our choice of initializing vector, being arbitrary, may have very low probability under the model. The burn-in travels down the Markov chain away from it towards the “thick” parts of the distribution we aim to sample. What if we had a better way to initialize the chain? In particular, suppose we initialize it at samples from the data distribution, and then take only a few steps along the Markov chain. Certainly, late in training, when the model distribution resembles the data distribution, this is a sensible procedure. But early in training, data samples are unlikely to have high probability under the model. Still, perhaps this “bias”—toward the regions of observation space with high probability under the data distribution—is a small price to pay for the reduction in variance that would have been accumulated through many steps of Gibbs sampling.

To try to make this more precise, we write down a mathematical expression for our approximation [13]. We replace the expectation under the model distribution in Eq. 10.16 (for Boltzmann machines generally; or, for EFHs in particular, in Eq. 10.19) with an average under a distribution we call $p^1(\bm{x}^1,\bm{y}^1;\bm{\theta})$. It is perhaps best understood in terms of sampling. Make four successive draws: first from the data distribution, $p^0(\bm{y}^0)$; then from the posterior, $\hat{p}(\bm{x}^0|\bm{y}^0;\bm{\theta})$; followed by the emission, $\hat{p}(\bm{y}^1|\bm{x}^0;\bm{\theta})$; and finally from the posterior again, $\hat{p}(\bm{x}^1|\bm{y}^1;\bm{\theta})$.
The first two draws are from $p^0(\bm{x}^0,\bm{y}^0;\bm{\theta})$; the second two are from $p^1(\bm{x}^1,\bm{y}^1;\bm{\theta})$. Under this notational convention, the “hybrid joint distribution,” $\hat{p}(\bm{x}^0|\bm{y}^0;\bm{\theta})\,p^0(\bm{y}^0)$, would be called $p^0(\bm{x}^0,\bm{y}^0;\bm{\theta})$: the distribution formed by “zero” steps of Gibbs sampling (more precisely, one half step).
Likewise, taking $n$ steps yields samples from $p^n(\bm{x}^n,\bm{y}^n;\bm{\theta})$.

At the end of this Markov chain lies the model distribution; hence we can say $p^\infty(\bm{x}^\infty,\bm{y}^\infty;\bm{\theta}) = \hat{p}(\bm{x}^\infty,\bm{y}^\infty;\bm{\theta})$ [13]. In fact, for the remainder of this section, we shall use $p^\infty$ for the model joint. Now substituting $p^1(\bm{x}^1,\bm{y}^1;\bm{\theta})$ for $p^\infty(\bm{x}^\infty,\bm{y}^\infty;\bm{\theta})$ in Eq. 10.16, and noting our reliance on sample averages:

\[
\frac{\mathop{}\!\mathrm{d}{\mathcal{L}}}{\mathop{}\!\mathrm{d}{\bm{\theta}}} \stackrel{?}{=}
\left\langle \frac{\mathop{}\!\mathrm{d}{E}}{\mathop{}\!\mathrm{d}{\bm{\theta}}} \right\rangle_{\bm{X}^{0},\bm{Y}^{0}}
- \left\langle \frac{\mathop{}\!\mathrm{d}{E}}{\mathop{}\!\mathrm{d}{\bm{\theta}}} \right\rangle_{\bm{X}^{1},\bm{Y}^{1}}
= \left\langle \frac{\mathop{}\!\mathrm{d}{E}}{\mathop{}\!\mathrm{d}{\bm{\theta}}}
- \left\langle \frac{\mathop{}\!\mathrm{d}{E}}{\mathop{}\!\mathrm{d}{\bm{\theta}}} \right\rangle_{\bm{X}^{1},\bm{Y}^{1} | \bm{X}^{0},\bm{Y}^{0}} \right\rangle_{\bm{X}^{0},\bm{Y}^{0}},
\tag{10.20}
\]

where in the last expression we have emphasized that sampling from $p^{1}(\bm{x}^{1},\bm{y}^{1};\bm{\theta})$ would typically be nested inside the sampling from $p^{0}(\bm{x}^{0},\bm{y}^{0};\bm{\theta})$. We have supposed that the learning rule in Eq. 10.20 is the gradient of some (unknown) loss function $\mathcal{L}$. This is clearly an attractive learning rule from the perspective of the number of required samples, but will it produce good density estimators?

Contrastive divergence for energy-based models.

From here we try to work backwards: what loss function might yield this gradient? Hinton proposes “contrastive divergence,” i.e., the difference between the original relative entropy and a different one:

\[
\mathcal{L}(\bm{\theta}) :=
\operatorname{\text{D}_{\text{KL}}}\!\left\{ p^{0}(\bm{Y}^{0}) \,\middle\|\, p^{\infty}(\bm{Y}^{0};\bm{\theta}) \right\}
- \operatorname{\text{D}_{\text{KL}}}\!\left\{ p^{n}(\bm{Y}^{n};\bm{\theta}) \,\middle\|\, p^{\infty}(\bm{Y}^{n};\bm{\theta}) \right\}.
\tag{10.21}
\]

This is an interesting quantity. First of all, it is non-negative. Why? Because the $n$-step distribution is “trapped” between $p^{0}$ and $p^{\infty}$. That is, Gibbs sampling down the Markov chain brings distributions closer to $p^{\infty}$, so “later” distributions diverge less from $p^{\infty}$ than “earlier” ones. The second term is therefore never greater than the first, and their difference never negative. Second, the contrastive divergence is zero at precisely the same points as the first term, i.e. the standard relative-entropy loss (as in Eq. 10.15). This can be seen by noting that the contrastive divergence can be zero only when the two divergences are equal, i.e. when $p^{n} \approx p^{0}$. But this implies that $p^{0}$ is an equilibrium distribution for a Markov chain whose transition operator is a single step of Gibbs sampling under the model. Further applications of this transition operator, i.e. additional steps of sampling from the model, cannot change this equilibrium distribution. So in this case, $p^{0}$ would be equal to $p^{m}$ for all $m$, and both divergences would vanish.
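The “trapping” argument is an instance of the data-processing inequality for relative entropy: one application of the model's transition operator leaves $p^{\infty}$ fixed and cannot increase the divergence from it. This is easy to check numerically on a made-up three-state transition operator (all numbers below are arbitrary illustrations):

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# A model transition operator (one "Gibbs step"); rows sum to 1.
# Its stationary distribution plays the role of p^infinity.
T = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
# stationary distribution: left eigenvector of T with eigenvalue 1
evals, evecs = np.linalg.eig(T.T)
p_inf = np.real(evecs[:, np.argmax(np.real(evals))])
p_inf = p_inf / p_inf.sum()

p0 = np.array([0.8, 0.1, 0.1])        # "data" distribution
p1 = p0 @ T                           # one-step distribution
cd = kl(p0, p_inf) - kl(p1, p_inf)    # contrastive divergence (n = 1)
assert cd >= 0.0                      # the one-step distribution is "trapped"
```

The same inequality holds for any number of steps $n$, since each step is a further application of the same operator.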

We might take this as a specific instance of a more general procedure: if the gradient of a loss function is difficult to compute, replace the loss with a different function that has the same minimum but an easier gradient. But is the gradient of the contrastive divergence easier to evaluate than the gradient of the (standard) relative entropy?

We have already worked out the gradient of the first relative entropy, Eqs. 10.15 and 10.16, but we break the computation into steps to reuse portions for the derivative of the second divergence. For brevity, we suppress the arguments:

\[
\begin{split}
\frac{\mathop{}\!\mathrm{d}{\mathcal{L}}}{\mathop{}\!\mathrm{d}{\bm{\theta}}}
&= \frac{\mathop{}\!\mathrm{d}{}}{\mathop{}\!\mathrm{d}{\bm{\theta}}}\left[\operatorname{\text{D}_{\text{KL}}}\left\{p^{0}\middle\|p^{\infty}\right\} - \operatorname{\text{D}_{\text{KL}}}\left\{p^{n}\middle\|p^{\infty}\right\}\right]\\
&= \frac{\partial \operatorname{\text{D}_{\text{KL}}}\left\{p^{0}\middle\|p^{\infty}\right\}}{\partial p^{0}}\frac{\mathop{}\!\mathrm{d}{p^{0}}}{\mathop{}\!\mathrm{d}{\bm{\theta}}}
+ \frac{\partial \operatorname{\text{D}_{\text{KL}}}\left\{p^{0}\middle\|p^{\infty}\right\}}{\partial p^{\infty}}\frac{\mathop{}\!\mathrm{d}{p^{\infty}}}{\mathop{}\!\mathrm{d}{\bm{\theta}}}
- \frac{\partial \operatorname{\text{D}_{\text{KL}}}\left\{p^{n}\middle\|p^{\infty}\right\}}{\partial p^{n}}\frac{\mathop{}\!\mathrm{d}{p^{n}}}{\mathop{}\!\mathrm{d}{\bm{\theta}}}
- \frac{\partial \operatorname{\text{D}_{\text{KL}}}\left\{p^{n}\middle\|p^{\infty}\right\}}{\partial p^{\infty}}\frac{\mathop{}\!\mathrm{d}{p^{\infty}}}{\mathop{}\!\mathrm{d}{\bm{\theta}}}\\
&= 0 + \left( \left\langle \frac{\mathop{}\!\mathrm{d}{E}}{\mathop{}\!\mathrm{d}{\bm{\theta}}} \right\rangle_{\bm{X}^{0},\bm{Y}^{0}} - \left\langle \frac{\mathop{}\!\mathrm{d}{E}}{\mathop{}\!\mathrm{d}{\bm{\theta}}} \right\rangle_{\bm{X}^{\infty},\bm{Y}^{\infty}} \right)
- \frac{\partial \operatorname{\text{D}_{\text{KL}}}\left\{p^{n}\middle\|p^{\infty}\right\}}{\partial p^{n}}\frac{\mathop{}\!\mathrm{d}{p^{n}}}{\mathop{}\!\mathrm{d}{\bm{\theta}}}
- \left( \left\langle \frac{\mathop{}\!\mathrm{d}{E}}{\mathop{}\!\mathrm{d}{\bm{\theta}}} \right\rangle_{\bm{X}^{n},\bm{Y}^{n}} - \left\langle \frac{\mathop{}\!\mathrm{d}{E}}{\mathop{}\!\mathrm{d}{\bm{\theta}}} \right\rangle_{\bm{X}^{\infty},\bm{Y}^{\infty}} \right)\\
&= \left\langle \frac{\mathop{}\!\mathrm{d}{E}}{\mathop{}\!\mathrm{d}{\bm{\theta}}} \right\rangle_{\bm{X}^{0},\bm{Y}^{0}}
- \left\langle \frac{\mathop{}\!\mathrm{d}{E}}{\mathop{}\!\mathrm{d}{\bm{\theta}}} \right\rangle_{\bm{X}^{n},\bm{Y}^{n}}
- \frac{\partial \operatorname{\text{D}_{\text{KL}}}\left\{p^{n}\middle\|p^{\infty}\right\}}{\partial p^{n}}\frac{\mathop{}\!\mathrm{d}{p^{n}}}{\mathop{}\!\mathrm{d}{\bm{\theta}}}.
\end{split}
\tag{10.22}
\]

On the third line we have substituted for the fourth term by exploiting the analogy with the second term (which we worked out in Eq. 10.16 above).

The final line is seemingly quite close to the desired gradient, Eq. 10.20. Is this close enough? Hinton reports that extensive simulation seems to indicate that this last term is small in practice, and that in any case its inclusion rarely changes the direction of the approximate version (Eq. 10.20) by more than 90 degrees; that is, neglecting it seldom makes the gradient point in the wrong direction [13]. And inefficiencies introduced by ignoring this term can perhaps be alleviated by choosing an $n$ that balances the cost of these inaccurate gradients against the computational costs of protracted Gibbs sampling. After all, since the final term must disappear as the number of these “contrastive-divergence steps” approaches infinity, as long as there is no reason to suspect a discontinuity in the Markov chain, there will be some finite number $n$ of CD steps that shrinks the nuisance term to an arbitrarily small size.

Contrastive divergence for the harmonium.

The derivation in Eq. 10.22 is general to models based on the Boltzmann distribution. Applying the contrastive-divergence approximate learning rule, Eq. 10.20, to exponential-family harmoniums just means substituting $p^{n}(\bm{x}^{n},\bm{y}^{n};\bm{\theta})$ for $\hat{p}(\bm{\hat{x}},\bm{\hat{y}};\bm{\theta})$ in Eq. 10.19. Here we write the one-step version, emphasizing the source of the samples from $p^{1}(\bm{x}^{1},\bm{y}^{1};\bm{\theta})$ by rearranging the averaging brackets as we did in Eq. 10.20. We also let the sufficient statistics be the vector random variables themselves, ${\bm{U}}({\bm{\hat{Y}}})={\bm{\hat{Y}}}$, ${\bm{T}}({\bm{\hat{X}}})={\bm{\hat{X}}}$, in order to facilitate a biological interpretation:

\[
\begin{split}
\Delta\mathbf{W}_{\hat{y}\hat{x}} &\propto \left\langle \bm{X}^{0}{\bm{Y}^{0}}^{\text{T}} - \left\langle \bm{X}^{1}{\bm{Y}^{1}}^{\text{T}} \right\rangle_{\bm{X}^{1},\bm{Y}^{1}|\bm{X}^{0},\bm{Y}^{0}} \right\rangle_{\bm{X}^{0},\bm{Y}^{0}},\\
\Delta\bm{b}_{\hat{x}} &\propto \left\langle \bm{X}^{0} - \left\langle \bm{X}^{1} \right\rangle_{\bm{X}^{1}|\bm{X}^{0},\bm{Y}^{0}} \right\rangle_{\bm{X}^{0},\bm{Y}^{0}},\\
\Delta\bm{b}_{\hat{y}} &\propto \left\langle \bm{Y}^{0} - \left\langle \bm{Y}^{1} \right\rangle_{\bm{Y}^{1}|\bm{X}^{0},\bm{Y}^{0}} \right\rangle_{\bm{X}^{0},\bm{Y}^{0}}.
\end{split}
\tag{10.23}
\]

Thought of as a neural network, this learning rule says that each sample vector drives activity in the layer above, which reciprocally drives activity in the layer below, which in turn drives activity in the layer above; after which the “synaptic strengths” change in proportion to pairwise correlations (average products) between the pre- and post-synaptic units, with the initial contributions being Hebbian and the final contributions anti-Hebbian. For CD$_{n}$, $n>1$, the reciprocal activity simply continues for longer before the synaptic changes are made.
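For the Bernoulli–Bernoulli case, the three updates of Eq. 10.23 can be sketched in a few lines of numpy. The batch layout and the use of conditional probabilities (rather than samples) for the innermost average are common implementation choices, assumed here rather than prescribed by the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_updates(Y0, W, bx, by, rng):
    """One CD-1 update for a Bernoulli-Bernoulli harmonium (cf. Eq. 10.23).
    Y0: (num_samples, num_visible) batch of data vectors."""
    # upward: sample latents from the posterior (the "Hebbian" phase)
    PX0 = sigmoid(Y0 @ W.T + bx)
    X0 = (rng.random(PX0.shape) < PX0).astype(float)
    # downward and upward again: one step of Gibbs sampling
    PY1 = sigmoid(X0 @ W + by)
    Y1 = (rng.random(PY1.shape) < PY1).astype(float)
    PX1 = sigmoid(Y1 @ W.T + bx)   # probabilities for the inner average
    # data-driven (Hebbian) term minus model-driven (anti-Hebbian) term
    dW = (X0.T @ Y0 - PX1.T @ Y1) / len(Y0)
    dbx = (X0 - PX1).mean(axis=0)
    dby = (Y0 - Y1).mean(axis=0)
    return dW, dbx, dby
```

The outer brackets of Eq. 10.23 become an average over the batch; the inner, conditional brackets become the one-step reconstruction statistics computed from each data vector.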

Some examples.

10.2.3 Learning with deep belief networks

We summarize our progress in learning in EFH-like models. Since we will shortly need to introduce subscripts to the random variables, we avoid clutter from here on by setting aside contrastive divergence and working out the results in terms of the original loss function (see Eq. 10.15), allowing us to dispense with superscripts. Here, then, we return to using $\hat{p}$ rather than $p^{\infty}$ for the model distribution.

To recapitulate: the EFH represents, in some sense, a generative model more “agnostic” than the directed ones learned with EM: rather than committing to a prior distribution over the latent variables, $\hat{p}(\bm{\hat{x}};\bm{\theta})$, it licenses a certain inference procedure for them, $\hat{p}(\bm{\hat{x}}|\bm{\hat{y}};\bm{\theta})$. The price we pay for the ease of inference is difficulty in generating samples from, or taking expectations under, the unconditional (joint or marginal) model distributions. Computing the gradients for latent-variable density estimation (Eq. 10.16) requires just such operations, so the price may appear to be steep. However, an alternative procedure which requires very few samples, the method of contrastive divergence (Eq. 10.20), works well in practice. It can be justified either as a crude approximation to the original learning procedure, or as the approximate descent of an objective function slightly different from, but having the same minimum as, the original.

Supposing even that our learning procedure does minimize the relative entropy between the data distribution, $p(\bm{y})$, and the model distribution, $\hat{p}(\bm{\hat{y}};\bm{\theta})$, that divergence may not reach zero. The statistics of the data may be too rich to be representable by an EFH, or at least (say) by one with a given number of hidden units. It would be nice to have a procedure for augmenting this trained EFH that is guaranteed to improve performance, i.e., to decrease the relative entropy.

From marginal to joint relative entropy.

Our derivation of just such a procedure begins with an idea: perhaps only certain parts of the trained, but insufficient, EFH want improvement. Specifically, we shall commit ourselves to using the emission distribution it has learned, $\hat{p}(\bm{\hat{y}}_0|\bm{\hat{x}}_0;\bm{\theta}_0)$, but try to replace the prior, $\hat{p}(\bm{\hat{x}}_0;\bm{\theta}_0)$. (Here we have numbered the parameters, and the corresponding random variables, in anticipation of the introduction of more.) Of course, the whole point of the EFH is that the prior is learned only implicitly, so it may seem an infelicitous candidate for replacement. In particular, if we substitute a new prior, $\hat{p}(\bm{\hat{x}}_1;\bm{\theta}_1)$, we shall seemingly be unable to avail ourselves of the EFH learning procedures, e.g. the contrastive-divergence parameter updates, Eq. 10.23.
These rules require the posterior, and although we have one consistent with $\hat{p}(\bm{\hat{x}}_0;\bm{\theta}_0)$ and $\hat{p}(\bm{\hat{y}}_0|\bm{\hat{x}}_0;\bm{\theta}_0)$, to wit $\hat{p}(\bm{\hat{x}}_0|\bm{\hat{y}}_0;\bm{\theta}_0)$, we do not have in hand one consistent with $\hat{p}(\bm{\hat{x}}_1;\bm{\theta}_1)$ and $\hat{p}(\bm{\hat{y}}_0|\bm{\hat{x}}_0;\bm{\theta}_0)$; call it $\hat{p}(\bm{\hat{x}}_1|\bm{\hat{y}}_0;\bm{\theta}_0,\bm{\theta}_1)$. Nor is it clear how we should get one (although the reader is invited to think about this).

So we turn for inspiration to expectation-maximization (EM), the learning procedure for directed graphical models, which is in some sense what our problem has become, since we now have a parameterized emission, $\hat{p}(\bm{\hat{y}}_0|\bm{\hat{x}}_0;\bm{\theta}_0)$, and prior, $\hat{p}(\bm{\hat{x}}_1;\bm{\theta}_1)$, even if the latter has yet to be specified. Recall from Section 6.3 that our problem may be simplified if we trade descent in the marginal relative entropy of the data and model distributions for descent in an upper bound, the joint relative entropy. The slack in the bound is taken up by the relative entropy between a “recognition model,” i.e. another posterior distribution, and the posterior implied by the generative model (recall Eq. 6.6). In the present case, let us select the original EFH's posterior, $\hat{p}(\bm{\hat{x}}_0|\bm{\hat{y}}_0;\bm{\theta}_0)$, as the recognition distribution.
As we have lately noted, abandoning this model's prior presumably spoils the validity of its posterior in the augmented model; that is, in general, $\hat{p}(\bm{\hat{x}}|\bm{\hat{y}}_0;\bm{\theta}_0,\bm{\theta}_1) \neq \hat{p}(\bm{\hat{x}}|\bm{\hat{y}}_0;\bm{\theta}_0)$. To emphasize that the right-hand side of the inequality (the original posterior) is a proxy for the left-hand side, we will subsequently refer to it as $\check{p}(\bm{\check{x}}_1|\bm{y}_0;\bm{\theta}_0)$, and to its “aggregated” form as

\[
\check{p}(\bm{\check{x}}_1;\bm{\theta}_0) := \sum_{\bm{y}_0} \check{p}(\bm{\check{x}}_1|\bm{y}_0;\bm{\theta}_0)\, p(\bm{y}_0).
\tag{10.24}
\]

With these definitions in hand, the joint relative entropy can be written (cf. Eq. 6.6) as

\[
\begin{split}
\mathcal{L}_{\text{JRE}}(\{\bm{\theta}_0,\bm{\theta}_1\},\bm{\theta}_0)
&= \operatorname{\text{D}_{\text{KL}}}\left\{ \check{p}(\bm{\check{X}}_1|\bm{Y}_0;\bm{\theta}_0)\, p(\bm{Y}_0) \,\middle\|\, \hat{p}(\bm{Y}_0|\bm{\check{X}}_1;\bm{\theta}_0)\, \hat{p}(\bm{\check{X}}_1;\bm{\theta}_1) \right\}\\
&= \operatorname{\text{D}_{\text{KL}}}\left\{ p(\bm{Y}_0) \,\middle\|\, \hat{p}(\bm{Y}_0;\bm{\theta}_0,\bm{\theta}_1) \right\}
+ \operatorname{\text{D}_{\text{KL}}}\left\{ \check{p}(\bm{\check{X}}_1|\bm{Y}_0;\bm{\theta}_0) \,\middle\|\, \hat{p}(\bm{\check{X}}_1|\bm{Y}_0;\bm{\theta}_0,\bm{\theta}_1) \right\}\\
&= \operatorname{\text{D}_{\text{KL}}}\left\{ \check{p}(\bm{\check{X}}_1;\bm{\theta}_0) \,\middle\|\, \hat{p}(\bm{\check{X}}_1;\bm{\theta}_1) \right\} + c.
\end{split}
\tag{10.25}
\]

The final line reminds us that we are proposing to optimize only the parameters $\bm{\theta}_1$ of the new prior, so most of the joint relative entropy is irrelevant. Does this help?
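The decomposition on the second line of Eq. 10.25, joint relative entropy equals marginal relative entropy plus posterior relative entropy, is easy to verify numerically for small discrete distributions. All of the distributions below are randomly generated stand-ins, introduced only for this check.

```python
import numpy as np

rng = np.random.default_rng(1)

def rand_dist(*shape):
    """A random conditional distribution, normalized over the last axis."""
    p = rng.random(shape)
    return p / p.sum(axis=-1, keepdims=True)

nx, ny = 3, 4
p_y = rand_dist(ny)                  # data distribution      p(y)
q_x_given_y = rand_dist(ny, nx)      # recognition model      q(x|y)
prior_x = rand_dist(nx)              # generative prior       p-hat(x)
emit_y_given_x = rand_dist(nx, ny)   # generative emission    p-hat(y|x)

# joint relative entropy (first line of Eq. 10.25)
q_joint = q_x_given_y * p_y[:, None]              # q(x|y) p(y), shape (ny, nx)
p_joint = (emit_y_given_x * prior_x[:, None]).T   # p-hat(y|x) p-hat(x)
jre = np.sum(q_joint * np.log(q_joint / p_joint))

# marginal plus posterior relative entropies (second line)
p_y_model = p_joint.sum(axis=1)                   # model marginal p-hat(y)
post = p_joint / p_y_model[:, None]               # model posterior p-hat(x|y)
mre = np.sum(p_y * np.log(p_y / p_y_model))
pre = np.sum(q_joint * np.log(q_x_given_y / post))
assert np.isclose(jre, mre + pre)
```

Since the posterior relative entropy is non-negative, the identity also exhibits the joint relative entropy as an upper bound on the marginal one, which is the property exploited in the text.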

Unrolling the EFH.

Now we make a very clever observation [13]. Consider the “hybrid” graphical model in Fig. LABEL:fig:DBNb.margin: Insert figure of (a) EFH, and (b) inverted EFH with one directed connection “hanging” off. Samples from this model can be generated by, first, prolonged Gibbs sampling from the “inverted” EFH (notice the labels), followed by a single sample from the bottommost layer via the directed connections. But if the weight matrix defining these connections is the transpose of the EFH's weight matrix, then samples from this model are identical to samples from the visible layer of a standard EFH, Fig. LABEL:fig:DBNa! Likewise, the recognition distributions of the models in Figs. LABEL:fig:DBNa and LABEL:fig:DBNb, $\check{p}(\bm{\check{x}}_1|\bm{y}_0;\bm{\theta}_0)$ and $\hat{p}(\bm{\hat{x}}_1|\bm{\hat{y}}_0;\bm{\theta}_0,\bm{\theta}_0)$, are identical.

Suppose then we let the prior over $\bm{\hat{X}}_1$ in the model we seek to learn, $\hat{p}(\bm{\hat{x}}_1;\bm{\theta}_1)$, be the marginal distribution over $\bm{\hat{X}}_1$ in a second, inverted EFH, as in Fig. LABEL:fig:DBNb. Moreover, let us initialize the parameters of this EFH at those of the first, fully trained EFH: $\bm{\theta}_1 \stackrel{\text{set}}{=} \bm{\theta}_0$. At this point in training, the posterior relative entropy on the second line of Eq. 10.25 is zero, so following the gradient of Eq. 10.25 decreases, at least at first, the relative entropy of interest.

This closely resembles the M step of EM with invertible generative models, and the logic of the bound is the same. As soon as $\bm{\theta}_1$ is changed, the bound on the marginal relative entropy (MRE) can loosen, because the posterior relative entropy (PRE) can increase (but never decrease below zero). But so far from being problematic, decreases in the joint relative entropy (JRE) that loosen the bound correspond to even greater decreases in the MRE. That is, if the PRE grows, it does so at the expense of the MRE; and if it doesn't (i.e., it stays at zero), then the MRE still decreases. It is true that subsequent movement along this gradient can decrease the PRE, and therefore allow the MRE to increase, all while lowering the JRE. But the MRE can never increase past its value at the beginning of the changes to $\bm{\theta}_1$, since there the bound is tight, and all changes lower this bound. In fine, the network with a “rolled out” directed layer can never be worse (have higher MRE) than its point of departure, the original EFH of Fig. LABEL:fig:DBNa. This is shown graphically in Fig. LABEL:fig:DBNupperbound.margin: Insert schematic plot of JRE upper bounding MRE.

So much for the M step. However, there is no subsequent E step, because there is no obvious way to bring the recognition model, $\check{p}(\bm{\check{x}}_1|\bm{y}_0;\bm{\theta}_0)$, back into agreement with the generative-model posterior, $\hat{p}(\bm{\hat{x}}_1|\bm{\hat{y}}_0;\bm{\theta}_0,\bm{\theta}_1)$. We shall employ instead a different iterative improvement scheme, consisting in some sense of a series of M steps.

Recursive application of the procedure.

The second stage of training just described amounts to training a second EFH, albeit on the empirical data as transformed by the first EFH's recognition model, Eq. 10.24, rather than on the data in their raw form. Eq. 10.25 shows that this can likewise be written as minimizing a (marginal) relative entropy. This observation suggests, what is true, that the entire procedure can be applied recursively. After training the second (inverted) EFH, the MRE in Eq. 10.25 may still not be zero. Then a second directed layer can be “unrolled” from the EFH spool (Fig. LABEL:fig:DBNc), and trained on a new “data distribution,” consisting of the “data distribution,” $\check{p}(\bm{\check{x}}_1;\bm{\theta}_0)$, used to train its predecessor, transformed by its predecessor's recognition model, $\check{p}(\bm{\check{y}}_2|\bm{\check{x}}_1;\bm{\theta}_1)$, to wit $\sum_{\bm{\check{x}}_1} \check{p}(\bm{\check{y}}_2|\bm{\check{x}}_1;\bm{\theta}_1)\, \check{p}(\bm{\check{x}}_1;\bm{\theta}_0) =: \check{p}(\bm{\check{y}}_2;\bm{\theta}_0,\bm{\theta}_1)$.

Can we guarantee that this will continue to improve (or at least not worsen) the model's fit to the observed data? In fact we cannot, but we can guarantee something nearly as good: either the model fits the data better, i.e. we reduce $\operatorname{\text{D}_{\text{KL}}}\left\{ p(\bm{Y}_0) \,\middle\|\, \hat{p}(\bm{Y}_0;\bm{\theta}_0,\bm{\theta}_1,\bm{\theta}_2) \right\}$; or the generative-model posteriors move closer to the recognition models. And this guarantee extends to deeper layers. In short, under this greedy, layer-wise training, either the resulting “deep belief network” becomes a better generative model, or it becomes a better recognition model.

To see this, we apply the argument, given in the previous section for the first “unrolled” EFH, to a second. Decoupling $\bm{\theta}_2$ from $\bm{\theta}_1$ and descending the gradient of

\[
\mathrm{D}_{\mathrm{KL}}\left\{ \check{p}\left(\bm{\check{Y}}_2;\bm{\theta}_0,\bm{\theta}_1\right) \middle\| \hat{p}\left(\bm{\check{Y}}_2;\bm{\theta}_2\right) \right\}
\]

(that is, improving the prior over $\bm{\hat{Y}}_2$) will at least initially decrease

\[
\mathrm{D}_{\mathrm{KL}}\left\{ \check{p}\left(\bm{\check{X}}_1;\bm{\theta}_0\right) \middle\| \hat{p}\left(\bm{\check{X}}_1;\bm{\theta}_1,\bm{\theta}_2\right) \right\},
\]

by way of decreasing a JRE upper bound that is initially tight (analogous to Eq. 10.25). Now, $\hat{p}\left(\bm{\hat{x}}_1;\bm{\theta}_1,\bm{\theta}_2\right)$ is likewise a prior over the subsequent layer, $\bm{\hat{Y}}_0$, so decreasing this marginal relative entropy decreases a second JRE upper bound (Eq. 10.25), albeit in this case not one that is initially tight. Therefore, decreasing this MRE either decreases the MRE of the next layer,

\[
\mathrm{D}_{\mathrm{KL}}\left\{ p\left(\bm{Y}_0\right) \middle\| \hat{p}\left(\bm{Y}_0;\bm{\theta}_0,\bm{\theta}_1,\bm{\theta}_2\right) \right\},
\]

or it renders this level's posterior distribution, $\hat{p}\left(\bm{\hat{x}}_1\middle|\bm{\hat{y}}_0;\bm{\theta}_0,\bm{\theta}_1,\bm{\theta}_2\right)$, a better match to the recognition model, $\check{p}\left(\bm{\check{x}}_1\middle|\bm{y}_0;\bm{\theta}_0\right)$. And note that the former MRE is the one we actually care about: the deep belief network's fit to the observed data.

After the initial improvement of the prior over $\bm{\hat{Y}}_2$, the first JRE bound can loosen, after which not all improvements in this prior translate into improvements in the prior over $\bm{\hat{X}}_1$. But (1) any other improvement comes from making $\hat{p}\left(\bm{\hat{y}}_2\middle|\bm{\hat{x}}_1;\bm{\theta}_1,\bm{\theta}_2\right)$ more consistent with $\check{p}\left(\bm{\check{y}}_2\middle|\bm{\check{x}}_1;\bm{\theta}_1\right)$; and (2) the prior over $\bm{\hat{X}}_1$, $\hat{p}\left(\bm{\hat{x}}_1;\bm{\theta}_1,\bm{\theta}_2\right)$, can never get worse than it was before $\bm{\theta}_2$ was decoupled from $\bm{\theta}_1$. Extending the entire argument to deeper belief networks is straightforward.

Picturesquely, we can describe the whole procedure as follows. After training the first EFH, we freeze the way that “first-order features” give rise to data. But we need more flexibility in the prior distribution of first-order features. So we train a new EFH to generate them. Our training procedure makes this second EFH more likely to generate features that either generate good data, $\bm{\hat{Y}}$, or are more consistent with the recognition model being used to generate the “training features.” This is akin to an M step in EM. When this learning process stalls out, we should like to tighten up the bound and begin again, as in the E step of EM, which would require somehow making the recognition distribution more like the posterior under our “hybrid” model. Instead, we simply repeat the procedure of replacing the prior, this time over “second-order features,” with another EFH. Although learning in this third EFH cannot be claimed to tighten the original bound, it does impose a tight bound on this bound: improvement in the generation of second-order features, the hidden variables for the first-order features, is guaranteed to improve first-order features, at least initially. Whether this improvement comes by way of improving data generation, or merely the bound on it, is not typically obvious.

The procedure is only guaranteed to keep lowering bounds (or bounds on bounds, etc.) if each new EFH is introduced with its parameters initialized to the previous EFH's parameter values. In practice, however, this constraint is almost always relaxed, since the new EFH is seldom even chosen to have the appropriate shape (the same number of hidden units as the previous EFH has visible units). This allows for very general models; and although not guaranteed to keep improving performance, the introduction of deeper EFHs can be roughly justified along similar lines: it allows for more flexible priors over increasingly complex features.
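In code, the greedy, layer-wise procedure is compact. The following is a minimal sketch, assuming the simplest EFH, a harmonium with Bernoulli units on both layers (an RBM), trained with one-step contrastive divergence rather than the exact gradient of Eq. 10.27; the class and function names are illustrative, not from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample(p):
    """Draw Bernoulli samples with success probabilities p."""
    return (rng.random(p.shape) < p).astype(float)

class BernoulliEFH:
    """A harmonium with Bernoulli visible and hidden units (an RBM)."""
    def __init__(self, n_vis, n_hid):
        self.W = 0.01 * rng.standard_normal((n_vis, n_hid))
        self.b = np.zeros(n_vis)   # visible biases
        self.c = np.zeros(n_hid)   # hidden biases

    def recognize(self, v):        # recognition model: p(h = 1 | v)
        return sigmoid(v @ self.W + self.c)

    def generate(self, h):         # generative conditional: p(v = 1 | h)
        return sigmoid(h @ self.W.T + self.b)

    def cd1(self, v0, lr=0.05):
        """One parameter update via one-step contrastive divergence."""
        ph0 = self.recognize(v0)
        v1 = sample(self.generate(sample(ph0)))
        ph1 = self.recognize(v1)
        n = len(v0)
        self.W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
        self.b += lr * (v0 - v1).mean(axis=0)
        self.c += lr * (ph0 - ph1).mean(axis=0)

def train_dbn(data, hidden_sizes, n_updates=200):
    """Greedy, layer-wise training: each new harmonium is fit to the
    previous layer's recognition-model transform of its training data."""
    layers, x = [], data
    for n_hid in hidden_sizes:
        efh = BernoulliEFH(x.shape[1], n_hid)
        for _ in range(n_updates):
            efh.cd1(x)
        layers.append(efh)
        x = sample(efh.recognize(x))   # the new "data distribution"
    return layers
```

The final line of the loop is the crux: the next layer never sees the raw data, only their transform under the recognition models trained so far.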

Summary.

We summarize the entire procedure for training deep belief networks with a pair of somewhat hairy definitions and a pair of learning rules. Pairs are required simply to maintain a connection to the notational distinction between latent and observed variables in the original EFH; conceptually, the training is the same at every layer.

\begin{align*}
\check{p}\left(\bm{\check{y}}_m;\bm{\theta}_0,\ldots,\bm{\theta}_{m-1}\right) &\coloneqq \sum_{\bm{\check{x}}_{m-1}} \check{p}\left(\bm{\check{y}}_m\middle|\bm{\check{x}}_{m-1};\bm{\theta}_{m-1}\right) \check{p}\left(\bm{\check{x}}_{m-1};\bm{\theta}_0,\ldots,\bm{\theta}_{m-2}\right) && \text{for $m\geq 1$ even}\\
\check{p}\left(\bm{\check{x}}_m;\bm{\theta}_0,\ldots,\bm{\theta}_{m-1}\right) &\coloneqq \sum_{\bm{\check{y}}_{m-1}} \check{p}\left(\bm{\check{x}}_m\middle|\bm{\check{y}}_{m-1};\bm{\theta}_{m-1}\right) \check{p}\left(\bm{\check{y}}_{m-1};\bm{\theta}_0,\ldots,\bm{\theta}_{m-2}\right) && \text{for $m\geq 1$ odd}.
\end{align*}

Then:

\begin{align}
\Delta\bm{\theta}_m &\propto -\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}_m} \mathrm{D}_{\mathrm{KL}}\left\{ \check{p}\left(\bm{\check{Y}}_m;\bm{\theta}_0,\ldots,\bm{\theta}_{m-1}\right) \middle\| \hat{p}\left(\bm{\check{Y}}_m;\bm{\theta}_m\right) \right\} && \text{for $m\geq 1$ even}\nonumber\\
\Delta\bm{\theta}_m &\propto -\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}_m} \mathrm{D}_{\mathrm{KL}}\left\{ \check{p}\left(\bm{\check{X}}_m;\bm{\theta}_0,\ldots,\bm{\theta}_{m-1}\right) \middle\| \hat{p}\left(\bm{\check{X}}_m;\bm{\theta}_m\right) \right\} && \text{for $m\geq 1$ odd}. \tag{10.27}
\end{align}
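For small, discrete state spaces, the recursive “data distributions” defined above can be computed exactly: each is just a conditional-probability matrix applied to its predecessor. A minimal numerical sketch (the distributions here are arbitrary stand-ins, not trained models):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_conditional(n_out, n_in):
    """A random conditional distribution; column j is p(out | in = j)."""
    M = rng.random((n_out, n_in))
    return M / M.sum(axis=0)

p_y0 = np.array([0.5, 0.3, 0.2])   # the data distribution p(y0)
rec0 = random_conditional(4, 3)    # recognition model p(x1 | y0; theta0)
rec1 = random_conditional(2, 4)    # recognition model p(y2 | x1; theta1)

# The recursion, written as matrix-vector products:
p_x1 = rec0 @ p_y0   # "data distribution" for the second EFH (m = 1, odd)
p_y2 = rec1 @ p_x1   # "data distribution" for the third EFH  (m = 2, even)
```

Each marginalization preserves normalization, so every layer's “data distribution” is again a proper distribution, ready to serve as the target of the next learning rule.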

Failure modes.

For an EFH to match the data distribution, its prior distribution (over $\bm{\hat{X}}$) must match the “aggregated” posterior:

\[
\begin{split}
\check{p}\left(\bm{\check{x}}_1;\bm{\theta}_0\right) &= \sum_{\bm{y}_0} \check{p}\left(\bm{\check{x}}_1\middle|\bm{y}_0;\bm{\theta}_0\right) p\left(\bm{y}_0\right)\\
\hat{p}\left(\bm{\hat{y}}_0;\bm{\theta}_0\right) = p\left(\bm{y}_0\right) \implies &= \sum_{\bm{y}_0} \hat{p}\left(\bm{\hat{x}}_0\middle|\bm{y}_0;\bm{\theta}_0\right) \hat{p}\left(\bm{y}_0;\bm{\theta}_0\right) = \hat{p}\left(\bm{\hat{x}}_0;\bm{\theta}_0\right).
\end{split}
\]

However, the converse is not true, since the match of prior and aggregated posterior only requires that

\[
\sum_{\bm{y}_0} \left[ p\left(\bm{y}_0\right) - \hat{p}\left(\bm{y}_0;\bm{\theta}_0\right) \right] \hat{p}\left(\bm{\hat{x}}_0\middle|\bm{y}_0;\bm{\theta}_0\right) = 0,
\]

which does not imply a unique solution for $\hat{p}\left(\bm{\hat{y}}_0;\bm{\theta}_0\right)$. Now suppose that, after training, the first EFH in a DBN satisfies this equation. Then the second EFH, trained on its aggregated posterior, will learn only to match the first EFH; indeed, if it is initialized at $\bm{\theta}_1 \stackrel{\text{set}}{=} \bm{\theta}_0$, the loss will start at a global minimum and the second EFH will not be updated at all. This occurs even if the first EFH did not match the data distribution, in which case the addition of more layers will not improve the model fit.
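The degeneracy is easy to exhibit numerically. In the extreme case where the generative conditional renders $\bm{\hat{X}}_0$ independent of $\bm{\hat{Y}}_0$, every prior $\hat{p}\left(\bm{\hat{y}}_0;\bm{\theta}_0\right)$ satisfies the matching condition, however badly it misses $p\left(\bm{y}_0\right)$. A contrived sketch (the numbers are arbitrary):

```python
import numpy as np

# A generative conditional p(x0 | y0) with identical columns: under this
# (pathological) model, x0 is independent of y0.
cond = np.tile(np.array([[0.7], [0.3]]), (1, 3))   # shape (2 x-states, 3 y-states)

p_data  = np.array([0.6, 0.3, 0.1])   # the true p(y0)
p_model = np.array([1/3, 1/3, 1/3])   # the model prior -- badly mismatched

# The matching condition sum_y [p(y) - p_model(y)] p(x|y) = 0 holds anyway,
# because each row of cond multiplies a difference vector that sums to zero:
residual = cond @ (p_data - p_model)
```

Since the residual vanishes, a second EFH trained on the aggregated posterior sees nothing left to correct, even though the prior over $\bm{y}_0$ is wrong.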


10.2.4 Dynamical models: the TRBM, RTRBM, and rEFH