5.2 Unsupervised learning
So far we have considered discriminative models in which the “outputs,” $\boldsymbol{Y}$, have been observed. Indeed, it might seem rather quixotic to try to learn a discriminative model purely from its inputs, $\boldsymbol{X}$. Nevertheless, let us consider the intuitive objective of maximizing information transmission [1], and see where it leads us.
5.2.1 “InfoMax” in deterministic, invertible models
To motivate the objective, consider the deterministic input-output relationship
$$\hat{\boldsymbol{y}} = \boldsymbol{\sigma}\big(\boldsymbol{f}(\boldsymbol{x};\boldsymbol{\theta})\big). \qquad (5.25)$$
Here the “final” output $\hat{\boldsymbol{y}}$ is computed by passing each element of the vector $\boldsymbol{f}(\boldsymbol{x};\boldsymbol{\theta})$ through some sort of invertible, element-wise “squashing” function $\sigma$ (cf. a layer of a neural network, Eq. 5.17), i.e. a monotonic function from the reals to some bounded interval on the real line:
$$\sigma: \mathbb{R} \to (a,b), \qquad a < b.$$
We will justify the need for such functions below. Now, when there are fewer outputs than inputs, changing $\boldsymbol{\theta}$ so as to maximize mutual information between inputs and outputs will yield a function that recodes its inputs more efficiently. However, we begin with a seemingly more naïve, but mathematically tractable, alternative: Let the number of outputs equal the number of inputs, and furthermore let $\boldsymbol{f}$ be invertible (in its first argument), so that $\boldsymbol{x} = \boldsymbol{f}^{-1}\big(\boldsymbol{\sigma}^{-1}(\hat{\boldsymbol{y}});\boldsymbol{\theta}\big)$. This may seem somewhat perverse, since invertible transformations are automatically information preserving. However, recall that mutual information nevertheless depends on $\boldsymbol{\theta}$:
$$\mathcal{I}\big[\boldsymbol{X},\hat{\boldsymbol{Y}};\boldsymbol{\theta}\big] = \mathcal{H}\big[\hat{\boldsymbol{Y}};\boldsymbol{\theta}\big] - \mathcal{H}\big[\hat{\boldsymbol{Y}}\,\big|\,\boldsymbol{X};\boldsymbol{\theta}\big] = \mathcal{H}\big[\hat{\boldsymbol{Y}};\boldsymbol{\theta}\big]. \qquad (5.27)$$
That is, although the conditional entropy $\mathcal{H}[\hat{\boldsymbol{Y}}|\boldsymbol{X};\boldsymbol{\theta}]$ is zero for all values of $\boldsymbol{\theta}$ (under the assumption that $\boldsymbol{f}$ doesn’t lose its invertibility), the output entropy $\mathcal{H}[\hat{\boldsymbol{Y}};\boldsymbol{\theta}]$ still depends on $\boldsymbol{\theta}$. In fine, for invertible $\boldsymbol{f}$, maximizing mutual information amounts to maximizing output entropy.8 [Footnote 8: If there were noise in the map from $\boldsymbol{X}$ to $\hat{\boldsymbol{Y}}$, this would add an extra term to the gradient, and the goal of maximizing output entropy would have to be balanced against noise-proofing the transmission.] Since a vector random variable is maximally entropic only if its components are statistically independent, this suggests that the “InfoMax” criterion can underwrite a form of independent-components analysis††margin: independent-components analysis. That is, maximizing input-output mutual information or output entropy will “unmix” the inputs into their independent components.
Now we can say why the squashing function is necessary. If the input-output map were unbounded, the output entropy could be driven arbitrarily large. For concreteness, and without loss of generality, we can let $\sigma$ be an increasing function, with its range the interval $(0,1)$. Now let us try to re-express the objective in Eq. 5.27. More precisely, we consider its negation, $-\mathcal{H}[\hat{\boldsymbol{Y}};\boldsymbol{\theta}]$, for consistency with the standard objectives of this book, which are losses. First we note that, although we have not specified a distribution for $\hat{\boldsymbol{Y}}$, one is inherited directly from the data distribution $p(\boldsymbol{x})$ via the deterministic relationship in Eq. 5.25. In particular, since the relationship is (by assumption) invertible, the two distributions are related by the standard change-of-variables formula, which we apply on the third line:
$$\begin{aligned}
-\mathcal{H}\big[\hat{\boldsymbol{Y}};\boldsymbol{\theta}\big]
&= \mathbb{E}_{\hat{\boldsymbol{Y}}}\!\left[\log p_{\hat{\boldsymbol{Y}}}\big(\hat{\boldsymbol{y}};\boldsymbol{\theta}\big)\right]\\
&= \mathbb{E}_{\boldsymbol{X}}\!\left[\log p_{\hat{\boldsymbol{Y}}}\big(\boldsymbol{\sigma}(\boldsymbol{f}(\boldsymbol{x};\boldsymbol{\theta}));\boldsymbol{\theta}\big)\right]\\
&= \mathbb{E}_{\boldsymbol{X}}\!\left[\log p(\boldsymbol{x}) - \log\left|\det\frac{\partial \hat{\boldsymbol{y}}}{\partial \boldsymbol{x}}\right|\right]\\
&= \mathbb{E}_{\boldsymbol{X}}\!\left[\log p(\boldsymbol{x}) - \log\left|\det\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}}\right| - \sum_i \log \sigma'\big(f_i(\boldsymbol{x};\boldsymbol{\theta})\big)\right].
\end{aligned}$$
The last line follows because the squashing functions are assumed to act element-wise, so the determinant of their Jacobian is just the product of the diagonal elements. Evidently, maximizing mutual information through an invertible transformation is identical to minimizing our standard loss, the relative entropy between data and model distributions, with the latter equal to the Jacobian determinant of that transformation:
$$-\mathcal{H}\big[\hat{\boldsymbol{Y}};\boldsymbol{\theta}\big] = \mathbb{E}_{\boldsymbol{X}}\!\left[\log\frac{p(\boldsymbol{x})}{\hat{p}(\boldsymbol{x};\boldsymbol{\theta})}\right] = \mathrm{D}_{\mathrm{KL}}\!\left\{p(\boldsymbol{x})\,\middle\|\,\hat{p}(\boldsymbol{x};\boldsymbol{\theta})\right\}, \qquad \hat{p}(\boldsymbol{x};\boldsymbol{\theta}) := \left|\det\frac{\partial \hat{\boldsymbol{y}}}{\partial \boldsymbol{x}}\right|.$$
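To make the loss concrete in code, here is a minimal sketch (Python/NumPy) of a Monte Carlo estimate of it, for the special case of a linear map followed by a logistic squashing function; both of those choices, the function name, and the decision to drop the $\boldsymbol{\theta}$-independent term $\mathbb{E}_{\boldsymbol{X}}[\log p(\boldsymbol{x})]$ (i.e., to estimate the cross entropy rather than the relative entropy) are illustrative assumptions rather than prescriptions from the text:

```python
import numpy as np

def infomax_loss(W, X):
    """Monte Carlo estimate of the InfoMax loss for y-hat = sigma(W x),
    with sigma the logistic function. The model density over inputs is the
    Jacobian determinant of the map, so the (cross-entropy) loss is the
    negative mean log Jacobian determinant over the data.

    W : (K, K) parameter matrix.  X : (N, K) array of input samples (rows).
    """
    Z = X @ W.T                                  # f(x; W) = W x, one row per sample
    S = 1.0 / (1.0 + np.exp(-Z))                 # y-hat = sigma(W x)
    log_sigma_prime = np.log(S) + np.log1p(-S)   # log sigma'(z) = log[sigma(z)(1 - sigma(z))]
    _, logabsdetW = np.linalg.slogdet(W)         # log |det W|
    log_det_J = logabsdetW + log_sigma_prime.sum(axis=1)   # log |det d y-hat / d x|
    return -log_det_J.mean()
```

Minimizing this quantity in $\mathbf{W}$ (e.g., by gradient descent) is the same as maximizing the output entropy; the gradient for this special case is worked out next.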
The assimilation to density estimation can be completed by reinterpreting Eq. 5.25 as defining (via its inverse) the emission density of a generative model [3]. We defer this reinterpretation until Chapter 6, when we take up learning in generative models in earnest.
InfoMax independent-components analysis.
One interesting special case of the unsupervised, discriminative learning problem just described is so-called InfoMax ICA [1]. Here, the discriminative map $\boldsymbol{f}$ in Eq. 5.25 is defined to be a linear transformation by a full-rank square matrix $\mathbf{W}$:
$$\hat{\boldsymbol{y}} = \boldsymbol{\sigma}\big(\mathbf{W}\boldsymbol{x}\big). \qquad (5.30)$$
We leave the output nonlinearity $\sigma$ undefined for now, except to insist (as we have been) that it be invertible and bounded. The gradient (the same, up to sign, whether we differentiate the mutual information, output entropy, relative input entropy, or input cross entropy) is then
$$\frac{\mathrm{d}}{\mathrm{d}\mathbf{W}}\mathcal{H}\big[\hat{\boldsymbol{Y}};\mathbf{W}\big] = \mathbf{W}^{-\mathsf{T}} + \mathbb{E}_{\boldsymbol{X}}\!\left[\frac{\mathrm{d}}{\mathrm{d}\mathbf{W}}\sum_i \log \sigma'\big(\boldsymbol{w}_i^{\mathsf{T}}\boldsymbol{x}\big)\right] \qquad (5.31)$$
(recalling the derivative of the log determinant, $\frac{\mathrm{d}}{\mathrm{d}\mathbf{W}}\log\left|\det\mathbf{W}\right| = \mathbf{W}^{-\mathsf{T}}$). Here $\boldsymbol{w}_i^{\mathsf{T}}$ is the $i^{\text{th}}$ row of $\mathbf{W}$. For example, when $\sigma$ is the logistic function,
$$\sigma(z) = \frac{1}{1+\exp\{-z\}},$$
the gradient inside the expectation becomes
$$\frac{\mathrm{d}}{\mathrm{d}\mathbf{W}}\sum_i \log \sigma'\big(\boldsymbol{w}_i^{\mathsf{T}}\boldsymbol{x}\big) = \big(\boldsymbol{1} - 2\hat{\boldsymbol{y}}\big)\boldsymbol{x}^{\mathsf{T}}.$$
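As a quick numerical sanity check of this expression (not from the text; the array sizes, random seed, and tolerance below are arbitrary), one can compare it against a central finite-difference approximation:

```python
import numpy as np

def sum_log_sigma_prime(W, x):
    """sum_i log sigma'(w_i^T x) for the logistic squashing function."""
    z = W @ x
    s = 1.0 / (1.0 + np.exp(-z))
    return np.sum(np.log(s) + np.log1p(-s))

rng = np.random.default_rng(0)
K = 4
W = rng.standard_normal((K, K))
x = rng.standard_normal(K)

# Analytical gradient inside the expectation: (1 - 2 y-hat) x^T.
y_hat = 1.0 / (1.0 + np.exp(-(W @ x)))
grad_analytic = np.outer(1.0 - 2.0 * y_hat, x)

# Central finite differences, entry by entry.
eps = 1e-6
grad_fd = np.zeros_like(W)
for i in range(K):
    for j in range(K):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        grad_fd[i, j] = (sum_log_sigma_prime(Wp, x) - sum_log_sigma_prime(Wm, x)) / (2 * eps)

assert np.allclose(grad_analytic, grad_fd, atol=1e-5)
```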
Setting Eq. 5.31 to zero yields an interpretable, albeit implicit, equation for the optimal solution:
$$\mathbf{W}^{-\mathsf{T}} = \mathbb{E}_{\boldsymbol{X}}\!\left[\big(2\hat{\boldsymbol{y}} - \boldsymbol{1}\big)\boldsymbol{x}^{\mathsf{T}}\right] = \mathbb{E}_{\boldsymbol{X}}\!\left[\tanh\!\big(\tfrac{1}{2}\mathbf{W}\boldsymbol{x}\big)\,\boldsymbol{x}^{\mathsf{T}}\right],$$
where the hyperbolic tangent is applied element-wise. Thus when the arguments $\boldsymbol{w}_i^{\mathsf{T}}\boldsymbol{x}$ are small (for all $i$), and the hyperbolic tangent is therefore approximately an identity function (up to scale), $\mathbf{W}^{\mathsf{T}}\mathbf{W} \propto \mathbb{E}\big[\boldsymbol{x}\boldsymbol{x}^{\mathsf{T}}\big]^{-1}$, i.e. $\mathbf{W}$ is proportional to the whitening transformation.
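Finally, a minimal end-to-end sketch of InfoMax ICA by stochastic gradient ascent on the output entropy, using the gradient of Eq. 5.31 with the logistic nonlinearity; the Laplacian sources, mixing matrix, learning rate, batch size, and number of steps are illustrative assumptions, not prescriptions from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two independent, heavy-tailed (Laplacian) sources, linearly mixed.
N = 20_000
S = rng.laplace(size=(N, 2))                     # true, statistically independent sources
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                       # "unknown" mixing matrix
X = S @ A.T                                      # observed inputs, x = A s

# Stochastic gradient ascent on the output entropy of y-hat = sigma(W x).
W = np.eye(2)
eta, batch = 0.01, 64
for _ in range(5_000):
    xb = X[rng.integers(0, N, size=batch)]       # mini-batch of inputs
    yb = 1.0 / (1.0 + np.exp(-(xb @ W.T)))       # y-hat = sigma(W x)
    # Eq. 5.31 with logistic sigma:  W^{-T} + E[(1 - 2 y-hat) x^T]
    grad = np.linalg.inv(W).T + (1.0 - 2.0 * yb).T @ xb / batch
    W += eta * grad

# If the inputs have been unmixed, W A should be close to a scaled
# permutation matrix: each row of W recovers one source up to scale and sign.
print(np.round(W @ A, 2))
```

Convergence with this plain gradient can be slow; a smaller learning rate or more iterations may be needed for a clean unmixing.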
“Semi-supervised” clustering.
[[XXX]]