Chapter 9 Learning with Reparameterizations

9.0.1 A duality between generative and discriminative learning

As we have seen, in generative models, expressive power typically comes at the price of ease of inference (and vice versa). So far we have explored three different strategies for this trade-off: severely limit expressive power to conjugate or pseudo-conjugate prior distributions in order to allow for exact inference (Chapter 7); or allow arbitrarily expressive generative models, but then approximate inference with a separate recognition model that is arbitrary but highly parameterized (Sections 8.1 and 8.2), or a generic homogenizer (Section 8.3), or correct only up to simplifying but erroneous independence assumptions (Section 8.4). Now we introduce a fourth strategy: let the latent variables of the model be related to the observed variables by an invertible (and therefore deterministic) transformation. This makes the model marginal, $\hat{p}(\bm{\hat{y}};\bm{\theta})$, computable in closed form with the standard rules of calculus for changing variables under an integral. Consequently, the marginal relative entropy, $\text{D}_{\text{KL}}\{p(\bm{Y}) \,\|\, \hat{p}(\bm{Y};\bm{\theta})\}$, can be descended directly, rather than indirectly via the joint relative entropy, and EM is not needed.
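To spell out why direct descent is possible (a standard identity, writing $\text{H}_{p}[\bm{Y}]$ for the entropy of the data distribution): the marginal relative entropy differs from the average negative log-likelihood only by a term that does not depend on $\bm{\theta}$,
\[
\text{D}_{\text{KL}}\{p(\bm{Y}) \,\|\, \hat{p}(\bm{Y};\bm{\theta})\}
= \mathbb{E}_{\bm{Y}\sim p}\!\left[\log p(\bm{Y}) - \log \hat{p}(\bm{Y};\bm{\theta})\right]
= -\text{H}_{p}[\bm{Y}] - \mathbb{E}_{\bm{Y}\sim p}\!\left[\log \hat{p}(\bm{Y};\bm{\theta})\right],
\]
so its gradient with respect to $\bm{\theta}$ can be estimated from samples of the data alone, provided $\log\hat{p}(\bm{\hat{y}};\bm{\theta})$ is available in closed form.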

Below we explore two such models in the usual way, starting with simple linear transformations and then moving to more complicated functions. But let us begin with a more abstract formulation, in order to make contact with the unsupervised, discriminative learning problems of Section 5.2. In that section, we related observations $\bm{\hat{Y}}$ to (hypothetical) “latent variables” $\bm{\hat{Z}}$ via a deterministic, invertible transformation, Eq. 5.25. Here we shall make a similar specification, although to emphasize that this is a generative model, we write the observations as a function of the latent variables, $\bm{\hat{X}}$, rather than vice versa. Still, to exhibit the relationship with the discriminative model, we denote this function as the inverse of some “recognition” function, $\bm{d}(\bm{y},\bm{\theta})$:

\[
\bm{d}^{-1}(\bm{\hat{x}},\bm{\theta}). \tag{9.1}
\]

For generative models, it is also necessary to specify a prior distribution over the latent variables. Although the discriminative model makes no such specification, its training objective—maximization of the latent (“output”) entropy—will tend to produce independent outputs. Therefore, we shall assume that the generative model’s latent variables are independent—and, for now, nothing else:

\[
\hat{p}_{\bm{\hat{X}}}(\bm{\hat{x}}) \;=\; \prod_{k=1}^{K} \hat{p}_{\hat{X}_k}(\hat{x}_k) \;=\; \prod_{k=1}^{K} \frac{\partial \psi_k}{\partial \hat{x}_k}(\hat{x}_k). \tag{9.2}
\]

In the second equality, we have simply expressed the probability distributions in terms of the cumulative distribution functions (CDFs), $\psi_k(\hat{x}_k)$, or more precisely their derivatives, in anticipation of the results below.
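For example (an illustrative choice, not yet required by the model), if each $\psi_k$ is the logistic CDF, then the implied prior on $\hat{X}_k$ is the logistic density:
\[
\psi_k(\hat{x}_k) = \frac{1}{1 + e^{-\hat{x}_k}}
\quad\Longrightarrow\quad
\frac{\partial \psi_k}{\partial \hat{x}_k}(\hat{x}_k) = \psi_k(\hat{x}_k)\bigl(1 - \psi_k(\hat{x}_k)\bigr).
\]
Any differentiable, strictly increasing function that runs from 0 to 1 could play the same role.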

In the discriminative models of Section 5.2, rather than being specified explicitly, the distribution of latent variables was inherited from the data distribution, $p(\bm{y})$, via the change-of-variables formula. Here something like the reverse obtains: The distribution of the (model) observations, $\hat{p}(\bm{\hat{y}};\bm{\theta})$, is inherited, via the deterministic transformation Eq. 9.1 and the change-of-variables formula, from the distribution of latent variables, Eq. 9.2:

\[
\hat{p}(\bm{\hat{y}};\bm{\theta})
\;=\; \hat{p}_{\bm{\hat{X}}}\!\left(\bm{d}(\bm{\hat{y}},\bm{\theta});\bm{\theta}\right)
\left\lvert \frac{\partial\bm{d}}{\partial\bm{y}^{\text{T}}}(\bm{\hat{y}},\bm{\theta}) \right\rvert
\;=\; \prod_{k=1}^{K} \frac{\partial\psi_k}{\partial\hat{x}_k}\!\left(d_k(\bm{\hat{y}},\bm{\theta})\right)
\left\lvert \frac{\partial\bm{d}}{\partial\bm{y}^{\text{T}}}(\bm{\hat{y}},\bm{\theta}) \right\rvert. \tag{9.3}
\]
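To make Eq. 9.3 concrete, here is a minimal sketch in Python/NumPy, assuming (purely for illustration) a linear recognition function $\bm{d}(\bm{y},\bm{\theta}) = \mathbf{W}\bm{y}$ and the logistic CDFs mentioned above; the function names are hypothetical.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log of the logistic CDF, log psi(x).
    return -np.logaddexp(0.0, -x)

def log_marginal(y, W):
    """Log of the model marginal, Eq. 9.3, for a linear recognition
    function d(y, W) = W y with logistic-CDF priors psi_k, for which
    log(d psi/d x) = log_sigmoid(x) + log_sigmoid(-x)."""
    x = W @ y                                # "latent" coordinates d(y, W)
    log_prior = np.sum(log_sigmoid(x) + log_sigmoid(-x))
    _, log_abs_det = np.linalg.slogdet(W)    # log |d d/d y^T| = log |det W|
    return log_prior + log_abs_det
```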

We have recovered Eq. 5.29, but from a different model [3]. Both models use the same map between observations and “latent” variables, given by Eq. 9.1. But the discriminative model proceeds to pass the outputs of this map, $\bm{\check{x}}$, through squashing functions, $\bm{\psi}(\bm{\check{x}})$, whereas the generative model instead interprets $\bm{\psi}(\bm{\hat{x}})$ as the (prior) CDFs of those variables, $\bm{\hat{x}}$. Thus the “outputs” $\bm{\check{z}}$ of the discriminative model are the latent variables of the generative model passed through their own CDFs. If we define $\bm{\hat{z}} := \bm{\psi}(\bm{\hat{x}})$ for the generative model, then we can say that the generative $\bm{\hat{Z}}$ are distributed independently (because $\bm{\hat{X}}$ are independent and $\bm{\psi}(\cdot)$ acts elementwise) and uniformly (because the CDFs exactly flatten the distribution of $\bm{\hat{X}}$). This is consistent with the discriminative objective: maximizing the entropy of $\bm{\check{Z}}$ will likewise tend to distribute them independently and uniformly.
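This reading also suggests how to sample from the generative model. A sketch, continuing the linear and logistic assumptions of the previous snippet (again illustrative only): draw $\bm{\hat{z}}$ independently and uniformly, invert the CDFs to obtain $\bm{\hat{x}}$, and invert the recognition function to obtain $\bm{\hat{y}}$.

```python
def sample_observations(W, num_samples, seed=0):
    """Ancestral sampling from the generative model: uniform z_hat,
    latent x_hat = psi^{-1}(z_hat), observation y_hat = d^{-1}(x_hat, W)."""
    rng = np.random.default_rng(seed)
    K = W.shape[0]
    z = rng.uniform(size=(num_samples, K))   # independent, uniform Z_hat
    x = np.log(z) - np.log1p(-z)             # logit, i.e. psi^{-1}(z)
    y = np.linalg.solve(W, x.T).T            # d^{-1}(x, W) = W^{-1} x
    return y
```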

Thus, in the deterministic, invertible setting, maximizing mutual information through a discriminative model is equivalent to density estimation in a generative model. It is also useful to re-express the density-estimation problem in terms of $\bm{\hat{X}}$:

\[
-\mathcal{I}(\bm{Y};\bm{\check{Z}})
\;=\; \text{D}_{\text{KL}}\!\left\{ p(\bm{Y}) \,\middle\|\, \prod_{k=1}^{K} \frac{\partial\psi_k}{\partial x_k}\!\left(d_k(\bm{Y},\bm{\theta})\right) \left\lvert \frac{\partial\bm{d}}{\partial\bm{y}^{\text{T}}}(\bm{Y},\bm{\theta}) \right\rvert \right\}
\;=\; \text{D}_{\text{KL}}\!\left\{ \check{p}(\bm{X};\bm{\theta}) \,\middle\|\, \hat{p}_{\bm{\hat{X}}}(\bm{X}) \right\}, \tag{9.4}
\]

where $\check{p}(\bm{\check{x}};\bm{\theta})$ is the distribution induced by the recognition function applied to the observed data, $\bm{d}(\bm{Y},\bm{\theta})$. The first equality is Eq. 5.28, and the second can be derived simply by noting that relative entropy is invariant under reparameterization (in this case with $\bm{d}(\bm{y},\bm{\theta})$). In light of the second equality, and dropping the language of discriminative and generative models, we can summarize this approach to model fitting like this:

We require a reparameterization of the data, $\bm{\check{X}} = \bm{d}(\bm{Y},\bm{\theta})$, to be distributed close (in the KL sense) to some factorial prior distribution, $\prod_{k=1}^{K}\frac{\partial\psi_k}{\partial\hat{x}_k}(\hat{x}_k)$; or, equivalently, to be maximally entropic when “squashed” by $\psi_k(\check{x}_k)$ (for all $k$).
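As a closing sketch of this recipe in the linear/logistic special case (again an assumption made only for illustration; more general transformations are taken up below), gradient ascent on the average log-likelihood of Eq. 9.3, equivalently descent of the relative entropy in Eq. 9.4 up to a constant, reduces to the following update.

```python
def fit_linear_map(Y, num_steps=2000, learning_rate=0.01):
    """Fit W by batch gradient ascent on the average of Eq. 9.3 over the
    rows of Y (shape N x K), with logistic-CDF priors.  For that choice,
    d/dx log psi'(x) = 1 - 2*sigma(x), and d/dW log|det W| = inv(W).T."""
    N, K = Y.shape
    W = np.eye(K)
    for _ in range(num_steps):
        X = Y @ W.T                              # x_n = d(y_n, W) = W y_n
        sigma = 1.0 / (1.0 + np.exp(-X))
        grad = (1.0 - 2.0 * sigma).T @ Y / N + np.linalg.inv(W).T
        W += learning_rate * grad
    return W
```

When the rows of Y are linear mixtures of independent, suitably non-Gaussian sources, the fitted W tends to recover the unmixing transformation up to permutation and scaling of its rows, which is the classical blind-source-separation reading of the linear model.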