5.2 Unsupervised learning
So far we have considered discriminative models, in which the “outputs,” $\mathbf{Y}$, are observed along with the inputs, $\mathbf{X}$, and serve as the targets of the map to be learned. In unsupervised learning, by contrast, only the inputs are observed, and the objective for learning must be constructed from them alone.
5.2.1 “InfoMax” in deterministic, invertible models
To motivate the objective, consider the deterministic input-output relationship

$$\hat{\mathbf{y}} = \boldsymbol{\sigma}\bigl(\mathbf{g}(\mathbf{x};\boldsymbol{\theta})\bigr). \qquad (5.25)$$

Here the “final” output $\hat{\mathbf{y}}$ is produced by passing an intermediate output, $\mathbf{v} := \mathbf{g}(\mathbf{x};\boldsymbol{\theta})$, through a set of invertible, bounded, element-wise “squashing” functions, $\boldsymbol{\sigma}$. We will justify the need for such functions below.
Now, when there are fewer outputs than inputs, a deterministic map must in general discard information about the inputs, and maximizing the mutual information

$$I[\mathbf{X};\hat{\mathbf{Y}}] = H[\hat{\mathbf{Y}}] - H[\hat{\mathbf{Y}}|\mathbf{X}] \qquad (5.26)$$

determines which information the outputs retain. That is, although the conditional entropy is zero for all values of $\boldsymbol{\theta}$, the map being deterministic, the output entropy does vary with $\boldsymbol{\theta}$; so maximizing mutual information reduces to maximizing the entropy of the outputs. In what follows we consider maps with as many outputs as inputs, and require them to be invertible, in which case this entropy can be computed in closed form.
Now we can say why the squashing function is necessary. If the input-output map were unbounded, the output entropy could be driven arbitrarily large, simply by scaling up the map; with bounded outputs, by contrast, entropy maximization has a well-defined target, the uniform distribution over the (bounded) output space. For concreteness, and without loss of generality, we can let the squashing functions be logistic, confining the outputs to the unit hypercube. Then, since the map is invertible, the change-of-variables formula, $p_{\hat{Y}}(\hat{\mathbf{y}}) = p_X(\mathbf{x})\bigl|\det\partial\hat{\mathbf{y}}/\partial\mathbf{x}\bigr|^{-1}$, gives the output entropy as

$$H[\hat{\mathbf{Y}}] = H[\mathbf{X}] + \mathbb{E}_{\mathbf{X}}\!\left[\log\left|\det\frac{\partial\hat{\mathbf{y}}}{\partial\mathbf{x}}\right|\right] \qquad (5.27)$$
$$\phantom{H[\hat{\mathbf{Y}}]} = H[\mathbf{X}] + \mathbb{E}_{\mathbf{X}}\!\left[\log\left|\det\frac{\partial\mathbf{v}}{\partial\mathbf{x}}\right|\right] + \mathbb{E}_{\mathbf{X}}\!\left[\sum_i \log\sigma_i'(V_i)\right]. \qquad (5.28)$$

The last line follows because the squashing functions are assumed to act element-wise, so the determinant of $\partial\hat{\mathbf{y}}/\partial\mathbf{v}$ is just the product of the diagonal elements. Evidently, since $H[\mathbf{X}]$ does not depend on $\boldsymbol{\theta}$, maximizing mutual information through an invertible transformation is identical to minimizing our standard loss, the relative entropy between data and model distributions, with the latter equal to the Jacobian determinant of that transformation:

$$\operatorname*{argmax}_{\boldsymbol{\theta}}\; I[\mathbf{X};\hat{\mathbf{Y}}] = \operatorname*{argmin}_{\boldsymbol{\theta}}\; \mathrm{D}_{\mathrm{KL}}\!\left\{ p(\mathbf{X}) \,\middle\|\, \hat{p}(\mathbf{X};\boldsymbol{\theta}) \right\}, \qquad \hat{p}(\mathbf{x};\boldsymbol{\theta}) := \left|\det\frac{\partial\hat{\mathbf{y}}}{\partial\mathbf{x}}\right|. \qquad (5.29)$$

(The Jacobian determinant is indeed a normalized density: the map carries the input space invertibly onto the unit hypercube, so the determinant integrates to the hypercube's volume, namely one.)
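As a quick numerical illustration of Eq. 5.27, here is a minimal sketch, assuming a one-dimensional instance of Eq. 5.25 (logistic squashing of an affine map, with standard-normal inputs); the weight, bias, and histogram settings are arbitrary choices for illustration. It estimates the output entropy directly, from a histogram, and via the change-of-variables identity; the two estimates should approximately agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 1-D instance of Eq. 5.25: y = sigma(w*x + b), logistic sigma, x ~ N(0,1).
w, b = 2.0, 0.5
x = rng.standard_normal(1_000_000)
v = w * x + b
y = 1.0 / (1.0 + np.exp(-v))                 # logistic squashing function

# Direct estimate of H[Y]: histogram-based differential entropy.
counts, edges = np.histogram(y, bins=200, range=(0.0, 1.0))
p = counts / counts.sum()
width = np.diff(edges)
nz = p > 0
H_direct = -np.sum(p[nz] * np.log(p[nz] / width[nz]))

# Change-of-variables estimate: H[X] + E[log|dy/dx|], with
# dy/dx = sigma'(v) * w and sigma' = sigma * (1 - sigma).
H_x = 0.5 * np.log(2 * np.pi * np.e)         # entropy of N(0,1)
H_identity = H_x + np.mean(np.log(y * (1 - y) * abs(w)))

print(H_direct, H_identity)                  # should agree to about 1e-2
```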
The assimilation to density estimation can be completed by reinterpreting Eq. 5.25 as defining (via its inverse) the emission density of a generative model [3]. We defer this reinterpretation until Chapter 6, when we take up learning in generative models in earnest.
InfoMax independent-components analysis.
One interesting special case of the unsupervised, discriminative learning problem just described is so-called InfoMax ICA [1]. Here, the discriminative map in Eq. 5.25 is defined to be a linear transformation by a full-rank square matrix:

$$\hat{\mathbf{y}} = \boldsymbol{\sigma}(\mathbf{W}\mathbf{x}). \qquad (5.30)$$
We leave the output nonlinearity undefined for now, except to insist (as we have been) that it be invertible and bounded. The gradient, whether of the mutual information, output entropy, relative input entropy, or input cross entropy, is then

$$\frac{\mathrm{d}H[\hat{\mathbf{Y}}]}{\mathrm{d}\mathbf{W}} = \mathbf{W}^{-\mathrm{T}} + \mathbb{E}_{\mathbf{X}}\!\left[\boldsymbol{\psi}(\mathbf{W}\mathbf{X})\,\mathbf{X}^{\mathrm{T}}\right] \qquad (5.31)$$

(recalling that the derivative of the log determinant is $\partial\log|\det\mathbf{W}|/\partial\mathbf{W} = \mathbf{W}^{-\mathrm{T}}$). Here $\boldsymbol{\psi}$ acts element-wise, with $\psi_i(v_i) := \sigma_i''(v_i)/\sigma_i'(v_i)$. If, for concreteness, we choose the logistic function for the squashing functions, the gradient inside the expectation becomes

$$\boldsymbol{\psi}(\mathbf{W}\mathbf{x})\,\mathbf{x}^{\mathrm{T}} = (\mathbf{1} - 2\hat{\mathbf{y}})\,\mathbf{x}^{\mathrm{T}}.$$
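This form of $\boldsymbol{\psi}$ follows from the chain rule: for the logistic function,

$$\sigma' = \sigma(1-\sigma) \quad\implies\quad \sigma'' = \sigma'(1-2\sigma) \quad\implies\quad \psi = \frac{\sigma''}{\sigma'} = 1 - 2\sigma.$$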
Setting Eq. 5.31 to zero and right-multiplying by $\mathbf{W}^{\mathrm{T}}$ yields an interpretable, albeit implicit, equation for the optimal solution:

$$\mathbf{I} = -\mathbb{E}_{\mathbf{X}}\!\left[\boldsymbol{\psi}(\mathbf{V})\,\mathbf{V}^{\mathrm{T}}\right], \qquad \mathbf{V} := \mathbf{W}\mathbf{X},$$

where the off-diagonal zeros require the pre-squashed outputs to be “nonlinearly decorrelated,” a condition satisfied, for example, by independent outputs with symmetric distributions (for the logistic function, $\psi$ is odd, so both factors of $\mathbb{E}[\psi(V_i)]\,\mathbb{E}[V_j]$ vanish); the diagonal entries, meanwhile, fix the scale of each row of $\mathbf{W}$.
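To make the procedure concrete, here is a minimal sketch of InfoMax ICA by batch gradient ascent on Eq. 5.31, under the logistic squashing function. The source distribution, mixing matrix, learning rate, and iteration count are all illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed setup: two independent, super-Gaussian (Laplace) sources,
# mixed by a fixed, well-conditioned "unknown" matrix A.
N, T = 2, 20_000
S = rng.laplace(size=(N, T))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S                                   # observed data

# InfoMax ICA: ascend the gradient of Eq. 5.31, with psi(v) = 1 - 2*sigma(v)
# for the logistic squashing function.
W = np.eye(N)
lr = 0.02
for _ in range(5_000):
    V = W @ X
    Y = 1.0 / (1.0 + np.exp(-V))            # y-hat = sigma(W x)
    grad = np.linalg.inv(W).T + (1.0 - 2.0 * Y) @ X.T / T
    W += lr * grad

# At an optimum, W @ A should be close to a scaled permutation matrix,
# since ICA recovers sources only up to reordering and rescaling.
print(np.round(W @ A, 2))
```

In practice one usually ascends the “natural” gradient instead, $(\mathbf{I} + \boldsymbol{\psi}(\mathbf{V})\mathbf{V}^{\mathrm{T}})\mathbf{W}$, obtained by right-multiplying Eq. 5.31 by $\mathbf{W}^{\mathrm{T}}\mathbf{W}$; it avoids the matrix inversion and converges faster.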
“Semi-supervised” clustering.
[[XXX]]