9.2 Nonlinear independent-component estimation

We now expand our view to nonlinear, albeit still invertible, transformations [6, 40, 7, 24]. In particular, consider a “generative function” $\bm{c}(\bm{\hat{x}},\bm{\theta})$ that consists of a series of invertible transformations. Once again, to emphasize that it is the inverse of a recognition or discriminative function, $\bm{d}(\bm{y},\bm{\theta})$, we write $\bm{c}$ as $\bm{d}^{-1}$:

$$\bm{\hat{y}} = \bm{c}(\bm{\hat{x}},\bm{\theta}) = \bm{d}^{-1}(\bm{\hat{x}},\bm{\theta}) = \bm{d}_{1}^{-1}\circ\bm{d}_{2}^{-1}\circ\cdots\circ\bm{d}_{L-1}^{-1}\circ\bm{d}_{L}^{-1}(\bm{\hat{x}},\bm{\theta}). \tag{9.6}$$
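To make the composition in Eq. 9.6 concrete: the recognition function applies the steps $\bm{d}_{1},\ldots,\bm{d}_{L}$ in order, and the generative function applies their inverses in the reverse order. A minimal sketch in Python/NumPy, in which the `Scale` layer is a hypothetical stand-in for the invertible steps:

```python
import numpy as np

class Scale:
    """A toy invertible step d_l: elementwise scaling (stand-in for something richer)."""
    def __init__(self, s):
        self.s = s
    def forward(self, x):          # d_l
        return self.s * x
    def inverse(self, x):          # d_l^{-1}
        return x / self.s

layers = [Scale(2.0), Scale(0.5), Scale(3.0)]      # d_1, d_2, d_3

def d(y):
    """Recognition function d = d_L ∘ ... ∘ d_1 (apply d_1 first)."""
    x = y
    for layer in layers:
        x = layer.forward(x)
    return x

def c(x_hat):
    """Generative function c = d^{-1} = d_1^{-1} ∘ ... ∘ d_L^{-1} (apply d_L^{-1} first)."""
    y = x_hat
    for layer in reversed(layers):
        y = layer.inverse(y)
    return y

y = np.array([1.0, -2.0, 0.5])
assert np.allclose(c(d(y)), y)     # the two directions invert each other
```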

This change of variables is called a flow [40]. Let us still assume a factorial prior, Eq. 9.2, and furthermore that it does not depend on any parameters. Since the transformations are (by assumption) invertible, the change-of-variables formula still applies. Therefore, Eq. 9.3 still holds, but the Jacobian determinant of composed functions becomes the product of the individual Jacobian determinants:

$$\hat{p}(\bm{\hat{y}};\bm{\theta}) = \left\lvert\frac{\partial\bm{\psi}}{\partial\bm{\hat{x}}^{\text{T}}}\bigl(\bm{d}(\bm{\hat{y}},\bm{\theta});\bm{\theta}\bigr)\right\rvert\left\lvert\frac{\partial\bm{d}}{\partial\bm{y}^{\text{T}}}(\bm{\hat{y}},\bm{\theta})\right\rvert = \prod_{l=0}^{L}\left\lvert\mathbf{J}_{l}(\bm{\hat{y}})\right\rvert, \tag{9.7}$$

with Jacobians given by

$$\mathbf{J}_{0}(\bm{\hat{y}}) = \frac{\partial\bm{\psi}}{\partial\bm{\hat{x}}^{\text{T}}}\bigl(\bm{d}_{L}\circ\cdots\circ\bm{d}_{1}(\bm{\hat{y}},\bm{\theta})\bigr), \qquad \mathbf{J}_{l}(\bm{\hat{y}}) = \frac{\partial\bm{d}_{l}}{\partial\bm{x}_{l}^{\text{T}}}\bigl(\bm{d}_{l-1}\circ\cdots\circ\bm{d}_{1}(\bm{\hat{y}},\bm{\theta})\bigr).$$

(For the sake of writing derivatives, we have named the argument of the $l^{\text{th}}$ function $\bm{x}_{l}$. This makes $\bm{x}_{1}=\bm{y}$.) The functions induced by multiplying the initial distribution (the determinant of the Jacobian $\mathbf{J}_{0}(\bm{\hat{y}})$ at left) by, in turn, the determinants of each of the $L$ Jacobians $\mathbf{J}_{l}$ at right are “automatically” normalized and positive, and consequently valid probability distributions. This sequence is accordingly called a normalizing flow [40].
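Numerically, Eq. 9.7 is most conveniently evaluated in the log domain: accumulate one $\log\lvert\mathbf{J}_{l}\rvert$ per step during the recognition pass, then add the log prior density (the derivative of $\bm{\psi}$) evaluated at $\bm{d}(\bm{\hat{y}},\bm{\theta})$. A minimal sketch, assuming toy elementwise-affine steps and a factorial standard-logistic prior (both illustrative choices):

```python
import numpy as np

# Toy flow: each step d_l is x -> a_l * x + b_l, so J_l = a_l * I.
steps = [(2.0, 0.1), (0.5, -0.3), (1.5, 0.0)]          # hypothetical (a_l, b_l)

def logistic_log_density(x):
    """Elementwise log dψ_k/dx_k for a standard-logistic prior (ψ_k is its CDF)."""
    return -x - 2.0 * np.log1p(np.exp(-x))

def log_p_hat(y):
    """log p̂(ŷ; θ) = Σ_{l=1}^{L} log|J_l| + log|J_0|   (Eq. 9.7 in log form)."""
    x, log_det = np.asarray(y, dtype=float), 0.0
    for a, b in steps:                                  # recognition pass d_1, ..., d_L
        log_det += x.size * np.log(abs(a))              # log|J_l| for a diagonal Jacobian
        x = a * x + b
    return log_det + np.sum(logistic_log_density(x))    # add log|J_0|: prior density at d(ŷ)

print(log_p_hat([0.3, -1.2]))
```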

Since the generative function $\bm{c}(\bm{\hat{x}},\bm{\theta})=\bm{d}^{-1}(\bm{\hat{x}},\bm{\theta})$ is invertible, we can certainly compute the arguments to the Jacobians. However, to keep the problem tractable, we also need to be able to compute the Jacobian determinants efficiently. Generically, this computation is cubic in the dimension of the data. This is intolerable, so we will generally limit the expressiveness of each $\bm{c}_{l}$ to achieve something more practical.

Perhaps the most obvious limitation is to require that the transformations be “volume preserving”; that is, to require that the Jacobian determinants are always unity [6]. This can be achieved, for example, by splitting a data vector into two parts and requiring (1) that, at any particular step $l$, the flow of only one of the parts may depend on the other (this ensures that the Jacobian is block triangular); and (2) that the flows of both parts depend on their own previous values only through an identity transformation (this ensures that the two diagonal blocks of the Jacobian are identity matrices). In equations,

$$\begin{bmatrix}\bm{x}^{a}_{l+1}\\ \bm{x}^{b}_{l+1}\end{bmatrix} = \begin{bmatrix}\bm{x}^{a}_{l}\\ \bm{x}^{b}_{l}+\bm{m}(\bm{x}^{a}_{l},\bm{\theta})\end{bmatrix} \implies \frac{\partial\bm{d}_{l}}{\partial\bm{x}_{l}^{\text{T}}} = \begin{bmatrix}\mathbf{I}&\mathbf{0}\\ \frac{\partial\bm{m}}{\partial{\bm{x}^{a}_{l}}^{\text{T}}}&\mathbf{I}\end{bmatrix} \implies \left\lvert\frac{\partial\bm{d}_{l}}{\partial\bm{x}_{l}^{\text{T}}}\right\rvert = 1.$$

… [[multiple layers of this]]
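A minimal sketch of a few such volume-preserving steps: the coupling function $\bm{m}$ can be arbitrarily nonlinear, the inverse is just a subtraction, and every $\log\lvert\mathbf{J}_{l}\rvert$ is exactly zero. The particular form of $\bm{m}$ and the alternation of which half is shifted are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
params = [rng.standard_normal((2, 2)) for _ in range(4)]   # hypothetical θ, one matrix per step

def m(x, W):
    """Coupling function m(·, θ); any (nonlinear) function is allowed."""
    return np.tanh(W @ x)

def forward(y):
    """Recognition pass through four additive-coupling steps on a 4-d vector."""
    x_a, x_b = y[:2].copy(), y[2:].copy()
    for l, W in enumerate(params):
        if l % 2 == 0:
            x_b = x_b + m(x_a, W)       # x^a passes through unchanged
        else:
            x_a = x_a + m(x_b, W)       # alternate which half gets shifted
        # each step has a unit Jacobian determinant, so log|J_l| = 0
    return np.concatenate([x_a, x_b])

def inverse(x):
    """Generative pass: undo the steps in reverse order by subtracting the same shifts."""
    x_a, x_b = x[:2].copy(), x[2:].copy()
    for l, W in reversed(list(enumerate(params))):
        if l % 2 == 0:
            x_b = x_b - m(x_a, W)
        else:
            x_a = x_a - m(x_b, W)
    return np.concatenate([x_a, x_b])

y = rng.standard_normal(4)
assert np.allclose(inverse(forward(y)), y)      # exactly invertible, however complicated m is
```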

Now our loss is, as usual, the relative entropy. With the “recognition functions” 𝒅l\bm{d}_{l} of Eq. 9.6 and the corresponding model density of Eq. 9.7, the relative entropy becomes

$$\operatorname{D_{\text{KL}}}\left\{p(\bm{Y})\,\middle\|\,\prod_{l=0}^{L}\left\lvert\mathbf{J}_{l}(\bm{Y})\right\rvert\right\} = \mathbb{E}_{\bm{Y}}\!\left[\log p(\bm{Y}) - \sum_{k=1}^{K}\log\!\left(\frac{\partial\psi_{k}}{\partial\hat{x}_{k}}(\bm{Y},\bm{\theta})\right) - \sum_{l=1}^{L}\log\left\lvert\frac{\partial\bm{d}_{l}}{\partial\bm{x}_{l}^{\text{T}}}(\bm{Y},\bm{\theta})\right\rvert\right]. \tag{9.8}$$

(For concision, the Jacobians are written as a function directly of 𝒀{\bm{Y}}.)
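Since the entropy term $\mathbb{E}_{\bm{Y}}[\log p(\bm{Y})]$ does not depend on $\bm{\theta}$, minimizing Eq. 9.8 is equivalent to minimizing the average negative log-likelihood of the data under the flow. A minimal sketch of fitting a one-parameter flow, $\bm{d}(\bm{y},\theta)=a\bm{y}$, to synthetic data by grid search over this objective (the data-generating scale and the logistic prior are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def logistic_log_density(x):
    """Elementwise log dψ_k/dx_k for a standard-logistic prior."""
    return -x - 2.0 * np.log1p(np.exp(-x))

def nll(a, batch):
    """Average negative log-likelihood of the flow d(y) = a*y: this is Eq. 9.8
    up to the θ-independent data-entropy term E[log p(Y)]."""
    x = a * batch                                     # recognition pass
    log_det = batch.shape[1] * np.log(abs(a))         # log|J_1| (the same for every sample)
    return -np.mean(np.sum(logistic_log_density(x), axis=1) + log_det)

# Fake data generated as Y = 3·X with X standard logistic, so the best
# recognition scale should be close to a = 1/3.
data = 3.0 * rng.logistic(size=(2000, 2))
grid = np.linspace(0.05, 1.0, 96)
a_best = grid[np.argmin([nll(a, data) for a in grid])]
print(a_best)                                         # ≈ 0.33
```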

The discriminative dual.

The model defined by Eq. 9.7, along with the loss in Eq. 9.8, has been called “nonlinear independent-components estimation” (NICE) [6]. To see if the name is apposite, we employ our discriminative/generative duality, reinterpreting the minimization of the relative entropy in Eq. 9.8 as a maximization of mutual information between the data, $\bm{Y}$, and a random variable $\bm{\hat{Z}}$ defined by the (reverse) flow in Eq. 9.6, $\bm{d}$, followed by (elementwise) transformation by the CDF of the prior. Can this still be thought of as an “unmixing” operation, as in InfoMax ICA? The question is particularly acute in the case where the prior is chosen to be normal, since (as we have just seen) ICA reduces to whitening in such circumstances.

In this case, the generative marginal given by the normalizing flow, Eq. 9.7, becomes

$$\hat{p}(\bm{\hat{y}};\bm{\theta}) = \mathcal{N}\bigl(\bm{d}(\bm{\hat{y}};\bm{\theta});\,\bm{0},\,\mathbf{I}\bigr)\,\left\lvert\frac{\partial\bm{d}}{\partial\bm{y}^{\text{T}}}(\bm{\hat{y}},\bm{\theta})\right\rvert.$$

Despite the appearance of a normal distribution in this expression, this marginal distribution is certainly not normal, even though the generative prior, $\hat{p}_{\bm{\hat{X}}}(\bm{\hat{x}};\bm{\theta})$, is. So fitting this $\hat{p}(\bm{\hat{y}};\bm{\theta})$ to the data will not in general merely fit their second-order statistics.
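One way to see this concretely is to push standard-normal samples $\bm{\hat{x}}$ through the generative direction $\bm{c}=\bm{d}^{-1}$ of a nonlinear flow and inspect a higher-order statistic of the result. A minimal sketch with a single, hypothetical additive-coupling step:

```python
import numpy as np

rng = np.random.default_rng(0)

def c(x_hat):
    """Generative direction of one additive-coupling step (d^{-1}): the second
    coordinate is shifted by a nonlinear function of the first."""
    x_a, x_b = x_hat[:, :1], x_hat[:, 1:]
    return np.concatenate([x_a, x_b + 3.0 * np.tanh(2.0 * x_a)], axis=1)

x_hat = rng.standard_normal((100_000, 2))      # samples from the normal prior
y_hat = c(x_hat)

# ŷ_2 = x̂_2 + 3·tanh(2·x̂_1); the additive term concentrates near ±3, so the
# marginal of ŷ_2 is bimodal rather than normal, with markedly negative excess kurtosis.
z = (y_hat[:, 1] - y_hat[:, 1].mean()) / y_hat[:, 1].std()
print("excess kurtosis of ŷ_2:", np.mean(z**4) - 3.0)      # well below 0; a normal would give ≈ 0
```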

[[Connection to HMC]]