9.1 InfoMax ICA, revisited

Historically, this equivalence was noted first [3] for a specific model, InfoMax ICA [1], which we first encountered in Section 5.2. Consider the very simple “generative model” in which the observations are related to the “latent” variables by a square, full-rank matrix:

\[
\bm{\hat{y}} = \mathbf{C}\bm{\hat{x}} = \mathbf{D}^{-1}\bm{\hat{x}}.
\]

Substituting this relationship (cf. Eq. 9.1) into Eq. 9.3, we see that the marginal distribution of the observed variables is

\[
\hat{p}\!\left(\bm{\hat{y}};\bm{\theta}\right) = \prod_{k=1}^{K} \frac{\partial\psi_{k}}{\partial\hat{x}_{k}}\!\left(\bm{d}_{k}^{\text{T}}\bm{\hat{y}}\right)\lvert\mathbf{D}\rvert, \tag{9.5}
\]

where again $\psi_{k}$ is the CDF of the corresponding “latent” variable, $\hat{X}_{k}$, and $\bm{d}_{k}^{\text{T}}$ is a row of $\mathbf{D}$. Clearly, fitting this marginal density follows the same gradient as in InfoMax ICA, Eq. 5.31:
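As a quick numerical sanity check (a hypothetical sketch; the matrix `D` and the grid limits below are arbitrary choices of mine, not from the text), Eq. 9.5 can be evaluated directly with logistic CDFs for the $\psi_{k}$, and the change-of-variables factor $\lvert\mathbf{D}\rvert$ ensures that the resulting density integrates to one:

```python
import numpy as np

# d(psi)/dx for a logistic CDF psi is the logistic density s(1 - s)
def logistic_pdf(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def marginal_density(y, D):
    # Eq. 9.5: p(y) = prod_k psi'_k(d_k^T y) * |det D|
    return np.prod(logistic_pdf(D @ y)) * abs(np.linalg.det(D))

D = np.array([[1.0, 0.5],        # an arbitrary full-rank 2x2 example
              [-0.3, 1.2]])
grid = np.linspace(-15.0, 15.0, 301)
h = grid[1] - grid[0]
# Riemann sum of the density over a 2-D grid; should be close to 1
total = sum(marginal_density(np.array([a, b]), D)
            for a in grid for b in grid) * h**2
```

The determinant factor is exactly what keeps the product of squashed-coordinate densities normalized under the linear map.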

\[
\frac{\mathrm{d}}{\mathrm{d}\mathbf{D}}\,\text{D}_{\text{KL}}\!\left\{ p(\bm{Y}) \,\middle\|\, \prod_{k=1}^{K}\frac{\partial\psi_{k}}{\partial\check{x}_{k}}\!\left(\bm{d}_{k}^{\text{T}}\bm{Y}\right)\lvert\mathbf{D}\rvert \right\} = -\mathbb{E}_{\bm{Y}}\!\left[\sum_{k=1}^{K}\frac{\mathrm{d}}{\mathrm{d}\mathbf{D}}\log\frac{\partial\psi_{k}}{\partial\check{x}_{k}}\!\left(\bm{d}_{k}^{\text{T}}\bm{Y}\right)\right] - \mathbf{D}^{-\text{T}}.
\]

That is, InfoMax ICA can be implemented as density estimation in a generative model whose latent variables are distributed independently, with CDFs given by $\bm{\psi}$ [3]; see Fig. 9.1.
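To make the gradient concrete, here is a minimal NumPy sketch (hypothetical; all function names are mine). For logistic $\psi_{k}$, $\frac{\mathrm{d}}{\mathrm{d}\bm{d}_{k}}\log\psi_{k}'(\bm{d}_{k}^{\text{T}}\bm{y}) = \big(1 - 2\sigma(\bm{d}_{k}^{\text{T}}\bm{y})\big)\bm{y}$, so the sample-based loss and its analytic gradient are:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_loglik(D, Y):
    # sample average of -log p_hat(y) under Eq. 9.5 with logistic psi_k;
    # log psi'(x) = log sigma(x) + log sigma(-x)
    X = D @ Y
    return -(np.log(sigmoid(X)) + np.log(sigmoid(-X))).sum(0).mean() \
           - np.log(abs(np.linalg.det(D)))

def grad_neg_loglik(D, Y):
    # the displayed gradient: -E[(1 - 2 sigma(DY)) Y^T] - D^{-T}
    X = D @ Y
    return -((1.0 - 2.0 * sigmoid(X)) @ Y.T) / Y.shape[1] \
           - np.linalg.inv(D).T

rng = np.random.default_rng(0)
Y = rng.laplace(size=(2, 500))               # toy super-Gaussian data
D = np.eye(2) + 0.1 * rng.standard_normal((2, 2))
G = grad_neg_loglik(D, Y)
```

A gradient-descent step $\mathbf{D} \leftarrow \mathbf{D} - \eta\,G$ then descends the KL divergence; the classical InfoMax updates amount to this (or its natural-gradient variant).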

[Figure 9.1 here: (A) the generative model, in which the latent variables $X_{1},\ldots,X_{K}$ are mapped through $\mathbf{C}$, with noise $N$, to the observations $Y_{1},\ldots,Y_{K}$; (B) the discriminative model, a one-layer network mapping the inputs $\hat{Y}_{1},\ldots,\hat{Y}_{K}$ through the weights $d_{kj}$ to the outputs $\hat{X}_{1},\ldots,\hat{X}_{K}$.]

Figure 9.1: “InfoMax” independent-components analysis. (A) The generative model. (B) The discriminative model. Note that $\mathbf{C} = \mathbf{D}^{-1}$.

But we haven’t specified $\bm{\psi}$! This omission may have seemed minor in the discriminative model, where the sigmoidal nonlinearities of neural networks are typically selected rather freely, but it is striking in a generative model. And indeed, the choice matters. Suppose we had let the sigmoidal function be the CDF of a Gaussian. Then since we are modeling the observations as linear functions of the latent variables, $\bm{\hat{Y}} = \mathbf{C}\bm{\hat{X}}$, their marginal distribution (Eq. 9.5) is clearly another mean-zero normal distribution, in particular $\mathcal{N}\big(\bm{0},\,(\mathbf{D}^{\text{T}}\mathbf{D})^{-1}\big)$ (Exercise LABEL:ex:ICAwithGaussianCDFsMarginal). Minimizing the loss in Eq. 9.4 then amounts merely to fitting the covariance of the observed data: $\mathbf{D}^{*} = \mathbb{E}\!\left[\bm{Y}\bm{Y}^{\text{T}}\right]^{-1/2}$. (This can also be shown by setting the gradient of Eq. 9.4, i.e. Eq. 5.31, to zero and solving for $\mathbf{D}$; Exercise LABEL:ex:ICAwithGaussianCDFsMinimizer.)
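This special case can be checked numerically (a sketch under my own naming; the covariance below is an arbitrary choice): with Gaussian $\psi_{k}$ the gradient reduces to $\mathbf{D}\bm{\Sigma} - \mathbf{D}^{-\text{T}}$, which vanishes at the symmetric whitening matrix $\bm{\Sigma}^{-1/2}$:

```python
import numpy as np

rng = np.random.default_rng(0)
# correlated Gaussian data with an arbitrary covariance
L = np.linalg.cholesky(np.array([[2.0, 0.5, 0.0],
                                 [0.5, 1.0, 0.3],
                                 [0.0, 0.3, 1.5]]))
Y = L @ rng.standard_normal((3, 2000))
Sigma = (Y @ Y.T) / Y.shape[1]                # sample covariance

# symmetric inverse square root: D* = Sigma^{-1/2}
evals, evecs = np.linalg.eigh(Sigma)
D_star = evecs @ np.diag(evals**-0.5) @ evecs.T

# the Gaussian-CDF gradient D Sigma - D^{-T} vanishes at D* ...
gradient = D_star @ Sigma - np.linalg.inv(D_star).T
# ... and D* merely whitens the data
cov_white = D_star @ Sigma @ D_star.T
```

So with Gaussian CDFs the “ICA” solution is nothing more than a whitening transformation.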

If the observations really are normal, then whitening them in this way would indeed render them independent (since for jointly Gaussian random variables, uncorrelatedness implies independence); but we do not need such an elaborate procedure to arrive at this conclusion! ICA is of interest precisely when the observations are not normal, in which case the optimal linear transformation cannot generally be stated a priori. Critically, squashing the data with the Gaussian CDF makes the outputs blind to higher-order correlations, so it is not a suitable nonlinearity in the cases of interest. In contrast, the (standard) logistic function is the CDF of a super-Gaussian (leptokurtic) distribution, so InfoMax ICA with logistic outputs will generally do more than decorrelate its inputs. This may seem remarkable, given the visually minor discrepancy between the Gaussian CDF and the logistic function (Fig. LABEL:fig:; B.A. Olshausen, personal communication). Now we see the advantage of the generative perspective, from which this difference is more salient; and at long last we can shed light on how to choose the feedforward nonlinearities, $\psi_{k}$, in InfoMax ICA.
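The contrast between the two CDFs can be exhibited numerically (a hypothetical sketch; the Laplace sources and the mixing matrix are my own choices): at the whitening solution, the Gaussian-CDF gradient vanishes, but the logistic-CDF gradient does not, so logistic InfoMax ICA keeps adapting beyond decorrelation:

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.laplace(size=(2, 20000))            # super-Gaussian sources
A = np.array([[1.0, 0.8],
              [0.2, 1.0]])                  # arbitrary mixing matrix
Y = A @ S

Sigma = (Y @ Y.T) / Y.shape[1]
evals, evecs = np.linalg.eigh(Sigma)
D_white = evecs @ np.diag(evals**-0.5) @ evecs.T   # whitening solution
X = D_white @ Y

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
n = Y.shape[1]
# Gaussian-CDF gradient (D Sigma - D^{-T}): zero at the whitener
grad_gauss = D_white @ Sigma - np.linalg.inv(D_white).T
# logistic-CDF gradient: still far from zero -- whitening is not enough
grad_logistic = -((1.0 - 2.0 * sigmoid(X)) @ Y.T) / n \
                - np.linalg.inv(D_white).T
```

The nonzero logistic gradient is driven precisely by the higher-order (super-Gaussian) structure that the Gaussian CDF cannot see.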