5.1 Supervised Learning

In the classical notion of a discriminative model, the inputs and outputs are completely observed; that is, classically, learning in discriminative models is supervised. We shall nevertheless have occasion to explore unsupervised learning in discriminative models in the next section. Here, we describe supervised learning as an instance of the general approach laid out in Section 4.2, to wit, minimizing relative or cross entropy. Since we are modeling only the conditional distribution, this means the conditional entropies. But remember that an expectation is still taken under $p(\bm{y})$:

\begin{split}
\bm{\theta}^* &= \operatorname*{argmin}_{\bm{\theta}}\left\{\text{D}_{\text{KL}}\!\left\{p(\bm{X}|\bm{Y})\,\middle\|\,\check{p}(\bm{X}|\bm{Y})\right\}\right\} \\
&= \operatorname*{argmin}_{\bm{\theta}}\left\{\mathbb{E}_{\bm{X},\bm{Y}}\!\left[-\log\check{p}(\bm{X}|\bm{Y})\right]\right\}.
\end{split}   (5.1)

How we proceed from here depends on our answers to the two questions posed at the outset.

5.1.1 Linear regression

Here we begin our account of supervised learning with the simplest model: the (multivariate) normal distribution, whose mean depends linearly on the inputs:

\check{p}(\check{\bm{x}}|\bm{y}) = \tau^{-K/2}\,\bigl\lvert\bm{\Sigma}_{\check{x}|y}\bigr\rvert^{-1/2}\exp\left\{-\tfrac{1}{2}\left(\check{\bm{x}} - \mathbf{D}\bm{y}\right)^\text{T}\bm{\Sigma}_{\check{x}|y}^{-1}\left(\check{\bm{x}} - \mathbf{D}\bm{y}\right)\right\}.

Note that this could easily be extended to an affine function, $\bm{\mu}_{\check{x}|y} = \mathbf{D}\bm{y} + \bm{d}$, but the notation is simpler if we assume that both the inputs and outputs, $\bm{X}$ and $\bm{Y}$, have been centered, in which case $\bm{d}$ is otiose. Alternatively, a fixed value of 1 can be appended to the vector $\bm{y}$, and $\bm{d}$ appended as a final column on $\mathbf{D}$; after all, we will make no assumptions about the distribution of $\bm{Y}$.

To find $\mathbf{D}$, we differentiate the loss in Eq. 5.1:

\begin{split}
\frac{\mathrm{d}}{\mathrm{d}\mathbf{D}}\mathbb{E}_{\bm{X},\bm{Y}}\!\left[-\log\check{p}(\bm{X}|\bm{Y})\right] &= \mathbb{E}_{\bm{X},\bm{Y}}\!\left[\frac{\mathrm{d}}{\mathrm{d}\mathbf{D}}\,\tfrac{1}{2}\left(\bm{X}-\mathbf{D}\bm{Y}\right)^\text{T}\bm{\Sigma}_{\check{x}|y}^{-1}\left(\bm{X}-\mathbf{D}\bm{Y}\right)\right] \\
&= -\bm{\Sigma}_{\check{x}|y}^{-1}\mathbb{E}_{\bm{X},\bm{Y}}\!\left[\left(\bm{X}-\mathbf{D}\bm{Y}\right)\bm{Y}^\text{T}\right] \stackrel{\text{set}}{=} 0 \\
\implies \mathbf{D} &= \mathbb{E}_{\bm{X},\bm{Y}}\!\left[\bm{X}\bm{Y}^\text{T}\right]\mathbb{E}_{\bm{X},\bm{Y}}\!\left[\bm{Y}\bm{Y}^\text{T}\right]^{-1} \\
&\approx \left\langle\bm{X}\bm{Y}^\text{T}\right\rangle_{\bm{X},\bm{Y}}\left\langle\bm{Y}\bm{Y}^\text{T}\right\rangle_{\bm{Y}}^{-1},
\end{split}   (5.2)

where the move to a sample average in the final line reflects the fact that we have only samples from the data distribution. The final line is famous enough to have earned its own name, the normal equations margin: normal equations. Acquiring $\mathbf{D}$ through the normal equations is known as linear regression margin: linear regression. In fact, our optimization is not the only route to the normal equations, and to gain more intuition about linear regression we shall examine several of the others. But first we investigate a variation on the optimization just derived.
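For concreteness, here is a minimal numerical sketch of the final line of Eq. 5.2 (the synthetic data, dimensions, and variable names below are illustrative assumptions, not anything from the text):

```python
import numpy as np

# Sketch: estimate D from centered samples via the normal equations,
# D ≈ <X Y^T> <Y Y^T>^{-1}, with one sample per *column* (as in this book).
rng = np.random.default_rng(0)
M, K, N = 3, 2, 500                      # input dim, output dim, samples
D_true = rng.standard_normal((K, M))
Y = rng.standard_normal((M, N))          # centered inputs
X = D_true @ Y + 0.1 * rng.standard_normal((K, N))   # noisy outputs

Sxy = X @ Y.T / N                        # <X Y^T>
Syy = Y @ Y.T / N                        # <Y Y^T>
D_hat = np.linalg.solve(Syy.T, Sxy.T).T  # = Sxy @ inv(Syy), without an explicit inverse
print(np.allclose(D_hat, D_true, atol=0.1))   # True: recovers D_true
```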

Regularization.

The inverted term $\left\langle\bm{Y}\bm{Y}^\text{T}\right\rangle_{\bm{Y}}$ is a sum of outer products. Therefore the resulting matrix, although obviously square, is not invertible if there are fewer samples than dimensions of $\bm{Y}$, in which case a unique solution for $\mathbf{D}$ does not exist. In that case a pseudo-inverse can be used to solve Eq. 5.2. However, as we shall see, the standard pseudo-inverse is merely a special case of a more general and elegant solution to the problem of underdetermined margin: underdetermined normal equations.

Our point of departure is to interpret the rows of $\mathbf{D}$ as independent random vectors (recall that random vectors use bold italic capitals, whereas matrices are assigned bold Roman capitals), $\bm{D}_1^\text{T},\ldots,\bm{D}_K^\text{T}$, that are also marginally independent of $\bm{Y}$:

\check{p}(\bm{d}_1,\ldots,\bm{d}_K|\bm{y}) = \prod_{k=1}^{K}\check{p}(\bm{d}_k) = \frac{1}{Z_d}\exp\left\{-\sum_{k=1}^{K}E_k(\bm{d}_k)\right\}.   (5.3)

Then in place of the original conditional distribution, $\check{p}(\check{\bm{x}}|\bm{y})$, we use the augmented conditional $\check{p}(\check{\bm{x}},\bm{d}_1,\ldots,\bm{d}_K|\bm{y})$, a kind of posterior distribution over $\bm{D}_k$ (for all $k$):

\begin{split}
\frac{\mathrm{d}}{\mathrm{d}\mathbf{D}}\mathbb{E}_{\bm{X},\bm{Y}}\!\left[-\log\check{p}(\bm{X},\bm{d}_1,\ldots,\bm{d}_K|\bm{Y})\right] &= \frac{\mathrm{d}}{\mathrm{d}\mathbf{D}}\mathbb{E}_{\bm{X},\bm{Y}}\!\left[-\log\left(\check{p}(\bm{X}|\bm{Y},\bm{d}_1,\ldots,\bm{d}_K)\,\check{p}(\bm{d}_1,\ldots,\bm{d}_K|\bm{Y})\right)\right] \\
&= -\bm{\Sigma}_{\check{x}|y}^{-1}\mathbb{E}_{\bm{X},\bm{Y}}\!\left[\left(\bm{X}-\mathbf{D}\bm{Y}\right)\bm{Y}^\text{T}\right] + \frac{\mathrm{d}}{\mathrm{d}\mathbf{D}}\sum_{k=1}^{K}E_k(\bm{d}_k) \stackrel{\text{set}}{=} 0
\end{split}

For example, consider a zero-mean, Gaussian prior distribution over $\bm{D}_k$. Intuitively, this encodes our belief that the parameters are most likely to be zero, with the penalty for being non-zero growing quadratically with distance. In this case, the energy and its gradient are

E_k(\bm{d}_k) = \tfrac{1}{2}\bm{d}_k^\text{T}\bm{\Sigma}_d^{-1}\bm{d}_k, \qquad \frac{\mathrm{d}E_k}{\mathrm{d}\bm{d}_k^\text{T}}(\bm{d}_k) = \bm{d}_k^\text{T}\bm{\Sigma}_d^{-1}.   (5.4)

We have assumed the same covariance matrix for all $k$. For simplicity, we also let the emission noise be isotropic,

\bm{\Sigma}_{\check{x}|y} \stackrel{\text{set}}{=} \sigma^2_{\check{x}|y}\mathbf{I}.   (5.5)

Then we can solve for $\mathbf{D}$ in closed form:

\begin{split}
0 &= -\sigma^{-2}_{\check{x}|y}\,\mathbb{E}_{\bm{X},\bm{Y}}\!\left[\left(\bm{X}-\mathbf{D}\bm{Y}\right)\bm{Y}^\text{T}\right] + \mathbf{D}\bm{\Sigma}_d^{-1} \\
\implies \mathbf{D} &\approx \left\langle\bm{X}\bm{Y}^\text{T}\right\rangle_{\bm{X},\bm{Y}}\left(\left\langle\bm{Y}\bm{Y}^\text{T}\right\rangle_{\bm{Y}} + \sigma^2_{\check{x}|y}\bm{\Sigma}_d^{-1}\right)^{-1}.
\end{split}   (5.6)

Since the energy in Eq. 5.4 represents an $L^2$ norm, this is known as Tikhonov, ridge, or $L^2$-regularized regression margin: Tikhonov, ridge, or $L^2$-regularized regression. As long as the rank of $\bm{\Sigma}_d$ is full, so is the rank of the inverted term (adding positive definite matrices cannot reduce rank), so the normal equations are now solvable even for rank-deficient $\left\langle\bm{Y}\bm{Y}^\text{T}\right\rangle_{\bm{Y}}$. This term says, intuitively, that the solution ignores the inputs, $\bm{Y}$, in proportion as we are confident that their corresponding weights are zero (i.e., the larger $\bm{\Sigma}_d^{-1}$), but also in proportion to the noisiness of the input-output relationship, $\sigma^2_{\check{x}|y}$.
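As a sketch of Eq. 5.6 in the underdetermined regime (the prior covariance, noise level, and data below are illustrative choices):

```python
import numpy as np

# Sketch: ridge/Tikhonov solution when there are fewer samples than
# input dimensions, so <Y Y^T> alone is singular.
rng = np.random.default_rng(1)
M, K, N = 20, 2, 10                      # N < M: underdetermined
Y = rng.standard_normal((M, N))
X = rng.standard_normal((K, M)) @ Y + 0.1 * rng.standard_normal((K, N))

sigma2 = 0.1**2                          # assumed isotropic emission noise
Sigma_d_inv = np.eye(M)                  # assumed isotropic prior on rows of D
Sxy = X @ Y.T / N
Syy = Y @ Y.T / N                        # rank-deficient on its own
D_ridge = Sxy @ np.linalg.inv(Syy + sigma2 * Sigma_d_inv)
```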

Newton-Raphson.

The least-squares problem can be solved in one step because the resulting cost function is quadratic in $\mathbf{D}$. Still, for costs that are not quadratic, but nevertheless convex, the (celebrated) Newton-Raphson algorithm is guaranteed to find the unique global solution. In anticipation of encountering such costs (Section 5.1.2), we apply this method to linear regression.

The basic Newton-Raphson method aims to find the roots (zeros) of a scalar function of a scalar input, $g(\theta)$, as follows. From a given point in parameter space, $\theta^{(i)}$, move along the line tangent to $g$ at $\theta^{(i)}$ to the point where it intercepts the $\theta$ axis; call this point $\theta^{(i+1)}$ and iterate. For appropriately smooth functions, each root $\theta^*$, at which $g(\theta^*) = 0$, is guaranteed to be the convergent value of this procedure when initialized from some surrounding neighborhood. Mathematically, the algorithm amounts to enforcing

\text{slope} = \frac{\text{rise}}{\text{run}} \implies \frac{\mathrm{d}g}{\mathrm{d}\theta}(\theta^{(i)}) \stackrel{\text{set}}{=} \frac{g(\theta^{(i)}) - 0}{\theta^{(i)} - \theta^{(i+1)}} \implies \theta^{(i+1)} = \theta^{(i)} - g(\theta^{(i)})\Big/\frac{\mathrm{d}g}{\mathrm{d}\theta}(\theta^{(i)}).
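A minimal sketch of this scalar update (the example function $g(\theta) = \theta^2 - 2$ is chosen only for illustration):

```python
# Sketch: scalar Newton-Raphson root finding, theta <- theta - g(theta)/g'(theta).
def newton_raphson(g, dg, theta, n_steps=10):
    for _ in range(n_steps):
        theta = theta - g(theta) / dg(theta)
    return theta

root = newton_raphson(lambda t: t**2 - 2, lambda t: 2 * t, theta=1.0)
# root ≈ 1.41421..., the positive root of g
```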

The procedure is easily extended to vector-valued functions of vector inputs, as long as the vectors have the same length (otherwise the extension is more complicated). This is especially germane to our problem of finding extrema of a function $f(\bm{\theta})$, since the roots of the function $\bm{g} := f'$ are extrema of $f$. Replacing the scalar-valued functions above with their vector and matrix counterparts yields:

{\bm{\theta}^{(i+1)}}^\text{T} = {\bm{\theta}^{(i)}}^\text{T} - \frac{\partial f}{\partial\bm{\theta}^\text{T}}(\bm{\theta}^{(i)})\left(\frac{\partial^2 f}{\partial\bm{\theta}\,\partial\bm{\theta}^\text{T}}(\bm{\theta}^{(i)})\right)^{-1}.   (5.7)

In our case, the parameters are in the form of a matrix, $\mathbf{D}$, which would make the Hessian a tensor. To avoid this ugliness, let us work with one row of the matrix, $\bm{d}_k^\text{T}$, at a time. Let us also, for simplicity, continue with the assumption that the emission noise is isotropic, Eq. 5.5. Then from Eq. 5.2, the gradient and Hessian of the cross entropy $H$ are

\frac{\partial H}{\partial\bm{d}_k^\text{T}}(\bm{d}_k) = -\sigma^{-2}_{\check{x}|y}\,\mathbb{E}_{X_k,\bm{Y}}\!\left[\left(X_k - \bm{d}_k^\text{T}\bm{Y}\right)\bm{Y}^\text{T}\right], \qquad \frac{\partial^2 H}{\partial\bm{d}_k\,\partial\bm{d}_k^\text{T}}(\bm{d}_k) = \sigma^{-2}_{\check{x}|y}\,\mathbb{E}_{\bm{Y}}\!\left[\bm{Y}\bm{Y}^\text{T}\right].

Consequently, the Newton-Raphson update becomes

\begin{split}
{\bm{d}_k^{(i+1)}}^\text{T} &= {\bm{d}_k^{(i)}}^\text{T} + \mathbb{E}_{X_k,\bm{Y}}\!\left[\left(X_k - {\bm{d}_k^{(i)}}^\text{T}\bm{Y}\right)\bm{Y}^\text{T}\right]\mathbb{E}_{\bm{Y}}\!\left[\bm{Y}\bm{Y}^\text{T}\right]^{-1} \\
&= {\bm{d}_k^{(i)}}^\text{T} + \left(\mathbb{E}_{X_k,\bm{Y}}\!\left[X_k\bm{Y}^\text{T}\right] - {\bm{d}_k^{(i)}}^\text{T}\mathbb{E}_{\bm{Y}}\!\left[\bm{Y}\bm{Y}^\text{T}\right]\right)\mathbb{E}_{\bm{Y}}\!\left[\bm{Y}\bm{Y}^\text{T}\right]^{-1} \\
&= \mathbb{E}_{X_k,\bm{Y}}\!\left[X_k\bm{Y}^\text{T}\right]\mathbb{E}_{\bm{Y}}\!\left[\bm{Y}\bm{Y}^\text{T}\right]^{-1} \\
\implies \mathbf{D} &= \mathbb{E}_{\bm{X},\bm{Y}}\!\left[\bm{X}\bm{Y}^\text{T}\right]\mathbb{E}_{\bm{Y}}\!\left[\bm{Y}\bm{Y}^\text{T}\right]^{-1},
\end{split}

where on the last line we have simply collected up the updates for all rows. We see that for any starting point, the solution (which doesn’t depend on that point) is reached in one step, as expected. Thus the Newton-Raphson algorithm provides another route to the normal equations.

Moment matching under additive noise.

Linear regression is so ubiquitous that the assumption of Gaussian noise, although frequently justifiable by reference to the central limit theorem, may feel overly restrictive. Let us therefore retain the assumption of a linear map with additive, independent, zero-mean noise,

\check{\bm{X}} = \mathbf{D}\bm{Y} + \check{\bm{Z}},   (5.8)

but no longer require the noise, $\check{\bm{Z}}$, to be normally distributed. Now, without a probability model, it is no longer possible to minimize the relative entropy between data and model distributions. Instead, we will match their first two moments. More precisely, we shall require that

\mathbb{E}_{\check{\bm{X}}}\!\left[\check{\bm{X}}\right] \stackrel{\text{set}}{=} \left\langle\bm{X}\right\rangle_{\bm{X}}, \qquad \mathbb{E}_{\check{\bm{X}},\bm{Y}}\!\left[\check{\bm{X}}\bm{Y}^\text{T}\right] \stackrel{\text{set}}{=} \left\langle\bm{X}\bm{Y}^\text{T}\right\rangle_{\bm{X},\bm{Y}}.

Notice that we do not need to specify the expected outer product of the outputs. Using the fact that the noise is independent of the inputs and zero-mean, we see that

\mathbb{E}_{\check{\bm{X}}}\!\left[\check{\bm{X}}\check{\bm{X}}^\text{T}\right] = \mathbb{E}_{\bm{Y},\check{\bm{Z}}}\!\left[(\mathbf{D}\bm{Y}+\check{\bm{Z}})(\mathbf{D}\bm{Y}+\check{\bm{Z}})^\text{T}\right] = \mathbf{D}\,\mathbb{E}_{\bm{Y}}\!\left[\bm{Y}\bm{Y}^\text{T}\right]\mathbf{D}^\text{T} + \bm{\Sigma}_{\check{z}}.

So assuming $\left\langle\bm{X}\bm{X}^\text{T}\right\rangle_{\bm{X}}$ is "bigger" than $\mathbf{D}\left\langle\bm{Y}\bm{Y}^\text{T}\right\rangle_{\bm{Y}}\mathbf{D}^\text{T}$, the noise covariance $\bm{\Sigma}_{\check{z}}$ can always make up the difference.

Furthermore, as lately noted, equality of means can be achieved simply by centering the data. That leaves $\left\langle\bm{X}\bm{Y}^\text{T}\right\rangle_{\bm{X},\bm{Y}}$. Applying our "model" (Eq. 5.8), using the fact that the means are zero, and then setting the model expectations equal to the data expectations yields:

\begin{split}
\mathbb{E}_{\check{\bm{X}},\bm{Y}}\!\left[\check{\bm{X}}\bm{Y}^\text{T}\right] &= \mathbb{E}_{\bm{Y},\check{\bm{Z}}}\!\left[(\mathbf{D}\bm{Y}+\check{\bm{Z}})\bm{Y}^\text{T}\right] = \mathbf{D}\,\mathbb{E}_{\bm{Y}}\!\left[\bm{Y}\bm{Y}^\text{T}\right] \\
\implies \mathbf{D} &\approx \left\langle\bm{X}\bm{Y}^\text{T}\right\rangle_{\bm{X},\bm{Y}}\left\langle\bm{Y}\bm{Y}^\text{T}\right\rangle_{\bm{Y}}^{-1},
\end{split}

the normal equations.

In fine, assuming that the additive noise is Gaussian has the same net effect as fitting a kind of second-order approximation to the joint distribution. This should not be surprising, since the normal distribution is the most “agnostic” of all distributions that specify the first two moments.

The emission covariance.

Let us pause briefly to consider the meaning of, and whether we should be surprised by, the fact that $\bm{\Sigma}_{\check{x}|y}$ disappears from Eq. 5.2. It implies that use of the normal equations makes no assumptions about the "shape" (covariance) of the noise about points in the output space. For example, some outputs could be "more important" (lower-variance) than, or correlated with, others. These correspond respectively to differing values on the diagonal, and non-zero values in the off-diagonals, of the covariance matrix. The reason this makes no difference to the solution is that we have at our disposal a separate row in $\mathbf{D}$ for every output, $\check{X}_k$: we can safely adjust weights for one output dimension without affecting any other. margin: But not under regularization; explain this.

What we do not have is a separate set of parameters for every pair of samples, $\bm{x}_n, \bm{y}_n$. This caused no complications because the samples were assumed to be i.i.d. Nevertheless, the assumption of i.i.d. samples is not always appropriate. For example…. The data are then heteroscedastic margin: heteroscedastic (differently dispersed) rather than homoscedastic margin: homoscedastic (identically dispersed). Still, if the dependence between samples is only second-order (i.e., in the first-order correlations) and known, linear regression has an only slightly more complicated complete solution. It is most elegantly derived when the heteroscedastic data samples are represented explicitly.

Linear regression without probability distributions

Accordingly, let us assemble all the samples into two matrices (these are the transposes of the matrices used in most developments of linear regression, but they are consistent with all the other standard conventions used throughout this book):

\mathbf{X} := \begin{bmatrix}\bm{x}_1 & \cdots & \bm{x}_N\end{bmatrix}, \qquad \mathbf{Y} := \begin{bmatrix}\bm{y}_1 & \cdots & \bm{y}_N\end{bmatrix}.

In place of the preceding models, let us simply look for a solution to the linear equation

\mathbf{X} \stackrel{\text{set}}{=} \mathbf{D}\mathbf{Y}.   (5.9)

The equation has a unique solution if and only if $\mathbf{Y}$ is square (that is, there are precisely as many samples, $N$, as dimensions, $M$) and full-rank. When $\mathbf{Y}$ is "tall" (more dimensions than samples, $M > N$), the solution is underdetermined. We have lately addressed this with regularization. When $\mathbf{Y}$ is "fat" (more samples than dimensions, $N > M$), the solution is overdetermined: no matrix $\mathbf{D}$ satisfies Eq. 5.9, and a notion of "best fit" must be imposed in order to select just one of the many approximately satisfactory matrices.

Still, let us proceed somewhat naïvely and simply look for an obvious linear-algebraic solution. If $\mathbf{Y}$ is full-rank in addition to being fat, then the Gram matrix (in sample space) $\mathbf{Y}\mathbf{Y}^\text{T}$ is square ($M \times M$) and full-rank, and therefore invertible. Hence, starting from Eq. 5.9,

\mathbf{X}\mathbf{Y}^\text{T} = \mathbf{D}\mathbf{Y}\mathbf{Y}^\text{T} \implies \mathbf{X}\mathbf{Y}^\text{T}\left(\mathbf{Y}\mathbf{Y}^\text{T}\right)^{-1} = \mathbf{D}.

Again we have recovered the normal equations, this time apparently with even fewer assumptions. Lest the normal equations seem inevitable, let us recall that the choice to right-multiply by the matrix $\mathbf{Y}^\text{T}$ was, although perhaps obvious, also arbitrary. Note in particular that, under our assumption that $\mathbf{Y}$ is fat, choosing $\mathbf{D}$ according to the normal equations does not actually satisfy Eq. 5.9; indeed, no choice of $\mathbf{D}$ could, because the problem is overdetermined. What notion of "best fit" have we tacitly imposed in order to select one value for $\mathbf{D}$?

The answer can be read off the derivation in Eq. 5.2 above: the normal equations arise from squared-error penalties, and accordingly the procedure is sometimes called the method of least squares margin: method of least squares. Evidently the assumption of normally distributed errors and the least-squares penalty are two sides of the same coin. In the case of a linear map, when $\mathbf{Y}$ is full rank, the least-squares solution exists. And since the quadratic is convex, the solution is unique; that is, the normal equations yield the global minimizer of the least-squares penalty. As a "sanity check," we can also reassure ourselves that when Eq. 5.9 really does hold, i.e. when $\mathbf{Y}$ is invertible, the normal equations reduce to the unique solution, $\mathbf{D} = \mathbf{X}\mathbf{Y}^{-1}$.
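As a numerical check on this claim (a sketch; the data are synthetic), the normal equations agree with a generic least-squares solver applied to $\mathbf{D}\mathbf{Y} \approx \mathbf{X}$ when $\mathbf{Y}$ is fat and full-rank:

```python
import numpy as np

# Sketch: normal equations vs. NumPy's least-squares solver.
rng = np.random.default_rng(2)
M, K, N = 3, 2, 50                        # fat Y: N > M
Y = rng.standard_normal((M, N))
X = rng.standard_normal((K, M)) @ Y + 0.1 * rng.standard_normal((K, N))

D_normal = X @ Y.T @ np.linalg.inv(Y @ Y.T)
D_lstsq = np.linalg.lstsq(Y.T, X.T, rcond=None)[0].T   # solves Y^T D^T ≈ X^T
print(np.allclose(D_normal, D_lstsq))     # True
```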

Heteroscedastic data: weighted least-squares.

Now that we have examined linear regression thoroughly from the perspective of the data matrices $\mathbf{X}$ and $\mathbf{Y}$, let us complicate the problem along the lines suggested above. Suppose that the samples in these matrices are not i.i.d. In particular, let $\tilde{\bm{X}}_k$ and $\tilde{\bm{Y}}_m$ be random vectors of all samples, for dimensions $k$ and $m$ of the output and input (resp.), and suppose $\text{Cov}_{\tilde{\bm{X}}_k|\tilde{\bm{Y}}_m}\!\left[\tilde{\bm{X}}_k\middle|\tilde{\bm{Y}}_m\right] = \bm{\Upsilon}$ (fixed for all $k$ and $m$). Then we can decorrelate $\tilde{\bm{X}}_k$ simply by multiplying by $\bm{\Upsilon}^{-1/2}$. In terms of the data matrix $\mathbf{X}$, in which each $\tilde{\bm{x}}_k$ is a row, not a column, this amounts to right multiplication. To maintain the linear relationship in Eq. 5.9, we multiply both sides by $\bm{\Upsilon}^{-1/2}$. Replacing $\mathbf{X}$ with $\mathbf{X}\bm{\Upsilon}^{-1/2}$ and $\mathbf{Y}$ with $\mathbf{Y}\bm{\Upsilon}^{-1/2}$ turns the normal equations into

\mathbf{D} = \mathbf{X}\bm{\Upsilon}^{-1}\mathbf{Y}^\text{T}\left(\mathbf{Y}\bm{\Upsilon}^{-1}\mathbf{Y}^\text{T}\right)^{-1}.   (5.10)

This variation is known as weighted least squares margin: weighted least squares.
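A minimal sketch of Eq. 5.10 (the particular covariance $\bm{\Upsilon}$ below, with nearby samples correlated, is a made-up choice for illustration):

```python
import numpy as np

# Sketch: weighted least squares with a known sample-space covariance Upsilon.
rng = np.random.default_rng(3)
M, K, N = 3, 2, 40
Y = rng.standard_normal((M, N))
X = rng.standard_normal((K, M)) @ Y       # noiseless, for simplicity

# Hypothetical AR(1)-style correlation between samples.
Upsilon = 0.5 ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
U_inv = np.linalg.inv(Upsilon)

D_wls = X @ U_inv @ Y.T @ np.linalg.inv(Y @ U_inv @ Y.T)
```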

Geometric arguments.

[[XXX]]

5.1.2 Generalized linear models

Having explored rather thoroughly the simplest model, let us revisit the methodological choices posed at the beginning of the chapter. In particular, suppose we relax the assumption that the conditional be Gaussian, but retain the assumption that the data are combined linearly across dimensions. margin: Some discussion of finite sufficient statistics to motivate exponential families [[….Use exponential families for the sake of finite sufficient stats….]] This class, the exponential family, includes the multivariate Gaussian, but also many other familiar distributions (Poisson, multinomial, Gamma, Dirichlet, etc.) as special cases. On the one hand, for none of these other distributions is the cross entropy quadratic in the parameters (recall Eq. 5.2), so closed-form minimizations are not possible. On the other hand, the cross-entropy loss does retain convexity for any exponential-family distribution, so the Newton-Raphson algorithm lately derived is guaranteed to find the global optimum. As we shall see, each step of the algorithm can be rewritten as solving a weighted least-squares problem, as in Eq. 5.10, except that the weights change at every iteration.

Exponential families

General form:

\check{p}(\check{\bm{x}}) = h(\check{\bm{x}})\exp\left\{\bm{\eta}^\text{T}\bm{t}(\check{\bm{x}}) - A(\bm{\eta})\right\}.   (5.11)

Cross-entropy minimizing parameters:

\begin{split}
\frac{\mathrm{d}}{\mathrm{d}\bm{\eta}^\text{T}}\mathbb{E}_{\bm{X}}\!\left[-\log\check{p}(\bm{X})\right] &= -\mathbb{E}_{\bm{X}}\!\left[\bm{t}^\text{T}(\bm{X}) - \frac{\mathrm{d}A}{\mathrm{d}\bm{\eta}^\text{T}}(\bm{\eta})\right] \stackrel{\text{set}}{=} 0 \\
\implies \mathbb{E}_{\bm{X}}\!\left[\bm{t}(\bm{X})\right] &= \mathbb{E}_{\check{\bm{X}}}\!\left[\bm{t}(\check{\bm{X}})\right].
\end{split}

The optimal moment parameters occur when the gradient with respect to the natural parameters is zero….
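For instance (a standard special case, supplied here only as an illustration): writing a Bernoulli distribution in the form of Eq. 5.11, the sufficient statistic is $t(x) = x$ and the log-partition function is $A(\eta) = \log(1 + e^{\eta})$, so the condition above becomes

\left\langle X\right\rangle_{X} = \mathbb{E}_{\check{X}}\!\left[\check{X}\right] = \frac{\mathrm{d}A}{\mathrm{d}\eta}(\eta) = \frac{1}{1 + e^{-\eta}},

i.e., the optimal natural parameter is the log-odds of the sample mean.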

Generalized linear model, assuming canonical response function:

\check{p}(\check{\bm{x}}|\bm{y}) = h(\check{\bm{x}})\exp\left\{\frac{\bm{y}^\text{T}\mathbf{D}^\text{T}\check{\bm{x}} - A(\mathbf{D}\bm{y})}{\phi}\right\}   (5.12)

Gradient:

\begin{split}
\frac{\mathrm{d}}{\mathrm{d}\mathbf{D}}\mathbb{E}_{\bm{X},\bm{Y}}\!\left[-\log\check{p}(\bm{X}|\bm{Y})\right] &= -\mathbb{E}_{\bm{X},\bm{Y}}\!\left[\frac{1}{\phi}\left(\bm{X} - \frac{\mathrm{d}A}{\mathrm{d}\bm{\eta}}\right)\bm{Y}^\text{T}\right] \\
&= -\mathbb{E}_{\bm{Y}}\!\left[\frac{1}{\phi}\left(\mathbb{E}_{\bm{X}|\bm{Y}}\!\left[\bm{X}\middle|\bm{\hat{y}}\right] - \mathbb{E}_{\check{\bm{X}}|\check{\bm{Y}}}\!\left[\check{\bm{X}}\middle|\bm{\hat{y}}\right]\right)\bm{Y}^\text{T}\right] \\
&\approx -\frac{1}{\phi}\left\langle\left(\bm{X} - \bm{\mu}_{\check{x}|\check{y}}\right)\bm{Y}^\text{T}\right\rangle_{\bm{X},\bm{Y}}.
\end{split}   (5.13)

When $\bm{\mu}_{\check{x}|\check{y}}$ is a linear function of the parameters, $\mathbf{D}$, as for Gaussian noise, this gradient can simply be set to zero to solve for the optimal parameters.

When it is not, we need to adopt an iterative approach. Here we will use the Newton-Raphson algorithm discussed above. This requires the Hessian; for simplicity, we write it for scalar outputs only:

\frac{\mathrm{d}^2}{\mathrm{d}\bm{d}\,\mathrm{d}\bm{d}^\text{T}}\mathbb{E}_{X,\bm{Y}}\!\left[-\log\check{p}(X|\bm{Y})\right] = \mathbb{E}_{\bm{Y}}\!\left[\frac{1}{\phi}\text{Var}_{\check{X}|\check{\bm{Y}}}\!\left[\check{X}\middle|\bm{\hat{y}}\right]\bm{Y}\bm{Y}^\text{T}\right] \approx \frac{1}{\phi}\left\langle\sigma^2_{\check{x}|\check{y}}\,\bm{Y}\bm{Y}^\text{T}\right\rangle_{X,\bm{Y}}.   (5.14)

Then the Newton-Raphson update, known in this setting as iteratively reweighted least squares (IRLS), is

\begin{split}
{\bm{d}^{(i+1)}}^\text{T} &= {\bm{d}^{(i)}}^\text{T} + \left\langle\left(X - \mu^{(i)}_{\check{x}|\check{y}}\right)\bm{Y}^\text{T}\right\rangle_{X,\bm{Y}}\left\langle\sigma^2_{\check{x}|\check{y}}\,\bm{Y}\bm{Y}^\text{T}\right\rangle_{X,\bm{Y}}^{-1} \\
&= \left\langle\sigma^2_{\check{x}|\check{y}}\,{\bm{d}^{(i)}}^\text{T}\bm{Y}\bm{Y}^\text{T} + \left(X - \mu^{(i)}_{\check{x}|\check{y}}\right)\bm{Y}^\text{T}\right\rangle_{X,\bm{Y}}\left\langle\sigma^2_{\check{x}|\check{y}}\,\bm{Y}\bm{Y}^\text{T}\right\rangle_{X,\bm{Y}}^{-1} \\
&= \left\langle\left({\bm{d}^{(i)}}^\text{T}\bm{Y} + \frac{X - \mu^{(i)}_{\check{x}|\check{y}}}{\sigma^2_{\check{x}|\check{y}}}\right)\sigma^2_{\check{x}|\check{y}}\,\bm{Y}^\text{T}\right\rangle_{X,\bm{Y}}\left\langle\sigma^2_{\check{x}|\check{y}}\,\bm{Y}\bm{Y}^\text{T}\right\rangle_{X,\bm{Y}}^{-1}.
\end{split}   (5.15)

This is the solution to (i.e., the normal equations for) a weighted least-squares problem (cf. Eq. 5.10), but with the parenthetical quantity, rather than $X$, as the output variable. To generate some intuition about what precisely this quantity is, let us derive IRLS from a different perspective.

We return to the cross-entropy gradient, Eq. 5.13. The problem with setting this gradient to zero and solving for the optimal parameters is, as noted above, that $\mu_{\check{x}|\check{y}}$ is not generally a linear function of those parameters, $\mathbf{D}$. However, if we concede that we must take an iterative approach, we can perhaps approximate this function as a linear displacement from the previous value of the parameters. That is, letting $\lambda$ denote the inverse canonical link function, $\mu^{(i)}_{\check{x}|\check{y}} = \lambda({\bm{d}^{(i)}}^\text{T}\bm{y})$, we approximate

\lambda({\bm{d}^{(i+1)}}^\text{T}\bm{y}) \approx \lambda({\bm{d}^{(i)}}^\text{T}\bm{y}) + \frac{\partial\lambda}{\partial{\bm{d}^{(i)}}^\text{T}}\left(\bm{d}^{(i+1)} - \bm{d}^{(i)}\right) = \mu^{(i)}_{\check{x}|\check{y}} + {\sigma^2_{\check{x}|\check{y}}}^{(i)}\bm{y}^\text{T}\left(\bm{d}^{(i+1)} - \bm{d}^{(i)}\right).

Substituting this into the gradient, Eq. 5.13, we have

\begin{split}
\frac{\mathrm{d}}{\mathrm{d}\bm{d}^\text{T}}\mathbb{E}_{\bm{X},\bm{Y}}\!\left[-\log\check{p}(\bm{X}|\bm{Y})\right] &\approx -\frac{1}{\phi}\left\langle\left(X - \mu^{(i)}_{\check{x}|\check{y}} + {\sigma^2_{\check{x}|\check{y}}}^{(i)}\left(\bm{d}^{(i)} - \bm{d}^{(i+1)}\right)^\text{T}\bm{Y}\right)\bm{Y}^\text{T}\right\rangle_{X,\bm{Y}} \stackrel{\text{set}}{=} 0 \\
\implies {\bm{d}^{(i+1)}}^\text{T} &= \left\langle\left({\bm{d}^{(i)}}^\text{T}\bm{Y} + \frac{X - \mu^{(i)}_{\check{x}|\check{y}}}{{\sigma^2_{\check{x}|\check{y}}}^{(i)}}\right){\sigma^2_{\check{x}|\check{y}}}^{(i)}\bm{Y}^\text{T}\right\rangle_{X,\bm{Y}}\left\langle{\sigma^2_{\check{x}|\check{y}}}^{(i)}\bm{Y}\bm{Y}^\text{T}\right\rangle_{X,\bm{Y}}^{-1}.
\end{split}

This matches the update generated with Newton-Raphson, Eq. 5.15. Thus, IRLS can be seen as approximating the gradient as locally linear in the parameters; or, perhaps more intuitively, as approximating the loss (the cross entropy) as locally quadratic in these parameters. Accordingly, we seek a least-squares solution (because of the quadratic cost), but apply it iteratively (because the cost is an approximation). The loss differs at each iteration because the "outputs" change, but also because the data are reweighted by $\sigma^2_{\check{x}|\check{y}}$. This is a reweighting, rather than just a weighting, because ${\sigma^2_{\check{x}|\check{y}}}^{(i)}$ depends on the current parameters, $\bm{d}^{(i)}$.

Iteratively reweighted least squares

IRLS for the Bernoulli distribution: Logistic regression
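A minimal sketch of the Bernoulli instance of Eq. 5.15 (with the canonical logit link, the conditional mean is the logistic function of $\bm{d}^\text{T}\bm{Y}$ and the conditional variance is $\mu(1-\mu)$; the synthetic data and iteration count below are illustrative assumptions):

```python
import numpy as np

# Sketch: logistic regression fit by IRLS (Bernoulli noise, logit link).
rng = np.random.default_rng(6)
M, N = 3, 2000
Y = rng.standard_normal((M, N))
d_true = np.array([1.0, -2.0, 0.5])
X = rng.binomial(1, 1 / (1 + np.exp(-d_true @ Y)))   # binary outputs

d = np.zeros(M)
for _ in range(10):
    mu = 1 / (1 + np.exp(-(d @ Y)))       # per-sample mean
    s2 = mu * (1 - mu)                    # per-sample variance
    grad = (X - mu) @ Y.T / N             # <(X - mu) Y^T>
    hess = (Y * s2) @ Y.T / N             # <sigma^2 Y Y^T>
    d = d + np.linalg.solve(hess, grad)   # Newton/IRLS step
# d ≈ d_true
```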

IRLS for the Poisson Distribution

Having derived IRLS, we get a feel for the algorithm by applying it to a few cases, beginning with Poisson noise. For simplicity, we consider a single, scalar "output" or dependent variable, and the canonical link function, the natural logarithm. This makes the mean of the distribution an exponentiated (inverse canonical link) linear function of the "input" or independent variables, $\bm{Y}$, i.e. $\mathbb{E}_{\check{X}|\check{\bm{Y}}}\!\left[\check{X}\middle|\bm{Y}\right] = \exp\left\{\bm{d}^\text{T}\bm{Y}\right\}$. Recall that for Poisson random variables, the variance is equal to the mean. Inserting these values into the first line of Eq. 5.15 yields

{\bm{d}^{(i+1)}}^\text{T} = {\bm{d}^{(i)}}^\text{T} + \mathbb{E}_{X,\bm{Y}}\!\left[\left(X - \exp\left\{{\bm{d}^{(i)}}^\text{T}\bm{Y}\right\}\right)\bm{Y}^\text{T}\right]\mathbb{E}_{\bm{Y}}\!\left[\exp\left\{{\bm{d}^{(i)}}^\text{T}\bm{Y}\right\}\bm{Y}\bm{Y}^\text{T}\right]^{-1}.
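A minimal sketch of this update on synthetic data (the data, initialization, and fixed iteration count are illustrative choices):

```python
import numpy as np

# Sketch: Poisson regression with the canonical log link, fit by IRLS.
rng = np.random.default_rng(4)
M, N = 3, 1000
Y = rng.standard_normal((M, N))
d_true = np.array([0.5, -0.3, 0.2])
X = rng.poisson(np.exp(d_true @ Y))       # scalar count outputs

d = np.zeros(M)
for _ in range(20):
    mu = np.exp(d @ Y)                    # mean = variance for Poisson
    grad = (X - mu) @ Y.T / N             # <(X - mu) Y^T>
    hess = (Y * mu) @ Y.T / N             # <mu Y Y^T>
    d = d + np.linalg.solve(hess, grad)   # the update displayed above
# d ≈ d_true
```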

IRLS for the Gamma Distribution

margin: Update to new notation and perhaps rewrite.

We consider the case where the scale parameter is known, but the shape parameter is not. The reverse is much more commonly considered (it is the "gamma" GLiM built into Matlab's glmfit, for example), probably because the sufficient statistic associated with the scale parameter is just the sum over independent output samples, $\sum_n\check{x}_n$. For simplicity, we consider outputs of length one ($K=1$) only, which makes the parameters a (row) vector, $\bm{\theta}^\text{T}$. Now, the log-likelihood of the parameters under the gamma distribution is:

\begin{split}
\bm{\theta}^* &= \operatorname*{argmax}_{\bm{\theta}}\left\{\log\prod_{n=1}^{N} q(\check{x}_n|\check{\bm{y}}_n;\bm{\theta})\right\} \\
&= \operatorname*{argmax}_{\bm{\theta}}\left\{\sum_{n=1}^{N}\log q(\check{x}_n|\check{\bm{y}}_n;\bm{\theta})\right\} \\
&= \operatorname*{argmax}_{\bm{\theta}}\left\{\sum_{n=1}^{N}\left(\begin{bmatrix}\log\check{x}_n\\ \check{x}_n\end{bmatrix}^\text{T}\begin{bmatrix}\eta^n_k(\check{\bm{y}}_n,\bm{\theta})\\ \eta^n_\theta\end{bmatrix} - \log\Gamma\bigl(\eta^n_k(\check{\bm{y}}_n,\bm{\theta})+1\bigr) + \bigl(\eta^n_k(\check{\bm{y}}_n,\bm{\theta})+1\bigr)\log(-\eta^n_\theta)\right)\right\} \\
&= \operatorname*{argmax}_{\bm{\theta}}\left\{\sum_{n=1}^{N}\left(\begin{bmatrix}\log\check{x}_n\\ \check{x}_n\end{bmatrix}^\text{T}\begin{bmatrix}k_n(\check{\bm{y}}_n,\bm{\theta})-1\\ -1/\theta\end{bmatrix} - \log\Gamma\bigl(k_n(\check{\bm{y}}_n,\bm{\theta})\bigr) - k_n(\check{\bm{y}}_n,\bm{\theta})\log\theta\right)\right\}
\end{split}

Here we have made explicit that the only natural parameters depending on the inputs (and the parameters) are those associated with the shape parameter: $\eta^n_k = k_n - 1$. The scale parameter is assumed constant across all samples. Since $k_n$ must be positive, we define the link function as:

\eta^n_k = \exp\left\{\bm{\theta}^\text{T}\check{\bm{y}}_n\right\} - 1.

The trailing $-1$ only clutters the derivation, so we work instead with $k_n$, starting with its gradient with respect to the parameters:

k_n = \exp\left\{\bm{\theta}^\text{T}\check{\bm{y}}_n\right\} \implies \frac{\mathrm{d}k_n}{\mathrm{d}\bm{\theta}^\text{T}} = \exp\left\{\bm{\theta}^\text{T}\check{\bm{y}}_n\right\}\check{\bm{y}}_n^\text{T} = k_n\check{\bm{y}}_n^\text{T}.

We shall also need the first and second derivatives of the log-partition function with respect to $\eta^n_k$, i.e., the expected value and variance of $\log\check{x}_n$:

\begin{split}
\frac{\mathrm{d}A}{\mathrm{d}\eta^n_k} &= \frac{\mathrm{d}A}{\mathrm{d}k_n} = \psi^{(0)}(k_n) + \log\theta =: \mu_n, \\
\frac{\mathrm{d}\mu_n}{\mathrm{d}\eta^n_k} &= \frac{\mathrm{d}\mu_n}{\mathrm{d}k_n} = \psi^{(1)}(k_n) =: \sigma^2_n,
\end{split}   (5.16)

with $\psi^{(i)}(\cdot)$ the $i^\text{th}$-order polygamma function. Putting these pieces together, the gradient (with respect to the parameters) of the entire cost function above is

\frac{\partial L}{\partial\bm{\theta}^\text{T}} = \sum_{n=1}^{N}(\log\check{x}_n - \mu_n)\,k_n\,\check{\bm{y}}_n^\text{T} = \bm{v}^\text{T}\mathbf{Y}^\text{T},

where $v_n = (\log\check{x}_n - \mu_n)k_n$ and $\mathbf{Y}$ collects the inputs $\check{\bm{y}}_n$ into columns, as above.

The Newton-Raphson algorithm also requires the Hessian. Differentiating a second time we find

\begin{split}
\frac{\partial^2 L}{\partial\bm{\theta}\,\partial\bm{\theta}^\text{T}} &= \frac{\mathrm{d}}{\mathrm{d}\bm{\theta}^\text{T}}\sum_{n=1}^{N}\check{\bm{y}}_n(\log\check{x}_n - \mu_n)k_n \\
&= \sum_{n=1}^{N}\check{\bm{y}}_n\left(-\frac{\mathrm{d}\mu_n}{\mathrm{d}k_n}\frac{\mathrm{d}k_n}{\mathrm{d}\bm{\theta}^\text{T}}k_n + (\log\check{x}_n - \mu_n)\frac{\mathrm{d}k_n}{\mathrm{d}\bm{\theta}^\text{T}}\right) \\
&= \sum_{n=1}^{N}\check{\bm{y}}_n\left(-\sigma^2_n k_n + \log\check{x}_n - \mu_n\right)k_n\check{\bm{y}}_n^\text{T} \\
&= \mathbf{Y}\bm{\Lambda}\mathbf{Y}^\text{T},
\end{split}

with $\bm{\Lambda}$ a diagonal matrix with entries $\Lambda_{nn} = -k_n^2\sigma^2_n + k_n\log\check{x}_n - k_n\mu_n$. Notice, however, that if we use the expected Hessian, the term $\log\check{x}_n - \mu_n$ vanishes (since $\mu_n$ is precisely the expected value of $\log\check{x}_n$), leaving $\Lambda_{nn} = -k_n^2\sigma^2_n$.

Nomenclature

5.1.3 Artificial neural networks

In moving from linear regression to generalized linear models (GLiMs), we relaxed the assumption that the conditional distribution $\check{p}(\check{\bm{x}}|\bm{y})$ be Gaussian, in particular allowing any exponential-family distribution. But in GLiMs, the map from inputs $\bm{y}$ to the (moment) parameters of the outputs $\check{\bm{X}}$ is still "almost" linear; more precisely, linear followed by a pointwise, monotonic nonlinearity. Let us now relax this assumption as well, that is, allow for more expressive maps.

One convenient way to achieve this is to compose multiple such maps together:

\bm{\check{z}}_l = \begin{cases}\bm{\psi}_l\!\left(\mathbf{D}_l\bm{\check{z}}_{l-1} + \bm{d}_l\right) & l\in[1,L] \\ \bm{y} & l=0,\end{cases}   (5.17)

where $\bm{\psi}_l$ is a pointwise nonlinearity at every "layer" $l$, and the "bias" terms $\bm{d}_l$ have been included explicitly. The variables at the final layer, $\bm{\check{z}}_L$, are then the moment parameters for $\check{p}(\check{\bm{x}}|\bm{y})$.
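A minimal sketch of the forward pass in Eq. 5.17 (the layer sizes and the choice of tanh nonlinearities are illustrative assumptions):

```python
import numpy as np

# Sketch: compute z_1, ..., z_L from Eq. 5.17, caching every layer's
# activity for use by backprop later.
def forward(y, Ds, ds, psis):
    zs = [y]                               # z_0 = y
    for D, d, psi in zip(Ds, ds, psis):
        zs.append(psi(D @ zs[-1] + d))
    return zs                              # zs[-1]: the moment parameters

rng = np.random.default_rng(5)
sizes = [4, 8, 8, 1]                       # input, two hidden layers, output
Ds = [0.1 * rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
ds = [np.zeros(m) for m in sizes[1:]]
psis = [np.tanh, np.tanh, lambda a: a]     # identity at the output layer
zs = forward(rng.standard_normal(4), Ds, ds, psis)
```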

To see the increase in expressive power with layers, notice that $\bm{\check{z}}_1$ cannot include any "interaction" terms involving the product of two elements of $\bm{y}$, like $y_i y_j$, since $\bm{\psi}_1$ acts elementwise. But interaction terms can appear even by $L=2$, as seen in the simple special case

\bm{\check{z}}_1 = \log\!\left(\begin{bmatrix}y_1\\ y_2\end{bmatrix}\right), \qquad \check{z}_2 = \exp\left\{\begin{bmatrix}1 & 1\end{bmatrix}\bm{\check{z}}_1\right\},

which evidently yields $\check{z}_2 = y_1 y_2$.

In fact….margin: Universal function approximators; Cybenko 1989 etc.;

Backpropagation

It is another question how to set the parameters $\mathbf{D}_l$ and $\bm{d}_l$ at every layer in order to approximate any particular function. Part of the price of arbitrarily complex functions is non-convexity and expensive Hessian computations, and accordingly we set aside Newton-Raphson and focus on first-order methods. Computing the cross-entropy gradient with respect to all the parameters is conceptually straightforward, but to do so efficiently will require some punctilious bookkeeping. This is considerably simplified by differentiating with respect to matrices and vectors rather than scalars; the relevant rules are derived in Section B.1.

As throughout this chapter, we descend the gradient of the conditional cross entropy. But to focus on the new aspects of this process introduced by the composition of functions, let us ignore the specific distribution family and refer to the surprisal simply as $\mathcal{L}$:

\mathcal{L}(\check{\bm{x}},\bm{y}) := -\log\check{p}(\check{\bm{x}}|\bm{y})

With some foresight, let us begin with the gradients with respect to the biases. Interpreting $\mathcal{L}$ as a function of $\bm{\check{z}}_L$, and then applying the chain rule through Eq. 5.17 (recall that the chain rule for derivatives with respect to column vectors accumulates terms on the left, rather than the right; see Section B.1), we find the gradient at the final layer to be

\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{d}_L} = \mathbf{J}_L\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{\check{z}}_L} =: \bm{b}_L,

where $\mathbf{J}_L$ is the Jacobian of the nonlinearity, $\bm{\psi}_L$, at layer $L$. Because $\bm{\psi}_L$ acts elementwise on its vector argument, $\mathbf{J}_L = \mathbf{J}_L^\text{T}$ is diagonal. We have assigned a symbol to this gradient in anticipation of reusing it.

Proceeding to the penultimate layer, we find

\begin{split}
\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{d}_{L-1}} &= \mathbf{J}_{L-1}\mathbf{D}_L^\text{T}\mathbf{J}_L\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{\check{z}}_L} \\
&= \mathbf{J}_{L-1}\mathbf{D}_L^\text{T}\bm{b}_L =: \bm{b}_{L-1},
\end{split}

and at the third-last layer,

\begin{split}
\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{d}_{L-2}} &= \mathbf{J}_{L-2}\mathbf{D}_{L-1}^\text{T}\mathbf{J}_{L-1}\mathbf{D}_L^\text{T}\mathbf{J}_L\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{\check{z}}_L} \\
&= \mathbf{J}_{L-2}\mathbf{D}_{L-1}^\text{T}\bm{b}_{L-1} =: \bm{b}_{L-2}.
\end{split}

This pattern now repeats at subsequent layers. Hence the sequence of bias gradients can be computed with the backward recursion margin: implementation note: because $\mathbf{J}_l$ is diagonal, these matrix-vector products can be computed efficiently as elementwise vector products.

\bm{b}_l = \begin{cases}\mathbf{J}_l\dfrac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{\check{z}}_l} & l = L \\ \mathbf{J}_l\mathbf{D}_{l+1}^\text{T}\bm{b}_{l+1} & l = 1,\ldots,L-1.\end{cases}   (5.18)

Next, we take derivatives with respect to the weight matrices, beginning with the output layer (cf. Eq. B.7):

\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\mathbf{D}_L} = \mathbf{J}_L\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{\check{z}}_L}\bm{\check{z}}_{L-1}^\text{T} = \bm{b}_L\bm{\check{z}}_{L-1}^\text{T}.

Proceeding to the penultimate layer, we find:

\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\mathbf{D}_{L-1}} = \mathbf{J}_{L-1}\mathbf{D}_L^\text{T}\bm{b}_L\bm{\check{z}}_{L-2}^\text{T} = \bm{b}_{L-1}\bm{\check{z}}_{L-2}^\text{T}.

Proceeding backwards to earlier layers continues the pattern, so the weight gradients can be computed with the same backward recursion, Eq. 5.18, and the formula

\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\mathbf{D}_l} = \bm{b}_l\bm{\check{z}}_{l-1}^\text{T}, \qquad l = 1,\ldots,L.   (5.19)

Eqs. 5.18 and 5.19 define the backpropagation algorithm margin: backprop for a generic neural network, although applying it to any individual case requires specifying $\bm{\psi}_l$ and $\check{p}(\check{\bm{x}}|\bm{y})$, which we have left generic here. The weights and biases are updated simply by moving downhill along the gradients of the cross entropy, which are the expected values of the gradients just derived:

\frac{\mathrm{d}H}{\mathrm{d}\bm{\theta}} = \frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\mathbb{E}_{\bm{X},\bm{Y}}\!\left[\mathcal{L}(\bm{X},\bm{Y})\right] = \mathbb{E}_{\bm{X},\bm{Y}}\!\left[\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{\theta}}(\bm{X},\bm{Y})\right] \approx \left\langle\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{\theta}}(\bm{X},\bm{Y})\right\rangle_{\bm{X},\bm{Y}}.

Hence,

\Delta\mathbf{D}_l \propto -\left\langle\bm{b}_l\bm{\check{z}}_{l-1}^\text{T}\right\rangle_{\bm{X},\bm{Y}}, \qquad \Delta\bm{d}_l \propto -\left\langle\bm{b}_l\right\rangle_{\bm{X},\bm{Y}},   (5.20)

although more sophisticated forms of gradient descent are typically used in practice.
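Continuing the forward-pass sketch above, here is a minimal sketch of Eqs. 5.18-5.20, assuming a squared-error surprisal (a Gaussian output with identity covariance, so that $\mathrm{d}\mathcal{L}/\mathrm{d}\bm{\check{z}}_L = \bm{\check{z}}_L - \bm{x}$); the Jacobians are supplied as functions of each layer's output:

```python
import numpy as np

# Sketch: the backward recursion of Eq. 5.18 and the gradients of Eq. 5.19,
# for a single sample. dpsis[l](z) returns the diagonal of J_l expressed via
# the layer's output z (1 - z**2 for tanh, ones for the identity).
def backprop(x, zs, Ds, dpsis):
    b = dpsis[-1](zs[-1]) * (zs[-1] - x)              # b_L = J_L dL/dz_L
    grads_D, grads_d = [], []
    for l in reversed(range(len(Ds))):
        grads_D.insert(0, np.outer(b, zs[l]))         # dL/dD_l = b_l z_{l-1}^T
        grads_d.insert(0, b)                          # dL/dd_l = b_l
        if l > 0:
            b = dpsis[l - 1](zs[l]) * (Ds[l].T @ b)   # b_{l-1} = J_{l-1} D_l^T b_l
    return grads_D, grads_d

dpsis = [lambda z: 1 - z**2, lambda z: 1 - z**2, lambda z: np.ones_like(z)]
grads_D, grads_d = backprop(np.zeros(1), zs, Ds, dpsis)
# Averaging these per-sample gradients over data gives the updates in Eq. 5.20.
```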

Some examples.

Backpropagation through time (BPTT)

Suppose we have reason to believe that some of the structure of our network is “conserved” or repeated, as in Fig. LABEL:fig:XXX. For instance, we might imagine segments of the network to correspond to slices of time, in which case it is reasonable to treat every time slice as identical in structure, even though the inputs and outputs change over time. Or we might have reason to believe that certain spatial structures are repeated. In either case, the concrete meaning of the assumption is that the same parameters show up in multiple places in the network. The upshot for backprop is that a little care must be taken in distinguishing partial from total derivatives, after which the algorithm goes through as in the previous section.

To make this concrete, we consider a recurrent neural network (RNN) with inputs $\bm{Y}_{1},\ldots,\bm{Y}_{T} \sim p(\bm{y}_{1},\ldots,\bm{y}_{T})$ and ``hidden''-unit activities $\bm{\check{z}}_{t}$ recursively computed as

\begin{equation}\tag{5.21}
\bm{\check{z}}_{t} = \bm{\psi}(\bm{\check{z}}_{t-1},\bm{y}_{t},\bm{\theta}).
\end{equation}

For simplicity, we let the recurrent layer also be the output layer, i.e., the layer at which the loss function is evaluated, in which case the most general loss function can be expressed as $\mathcal{L}(\bm{\check{z}}_{0},\ldots,\bm{\check{z}}_{T},\bm{\theta})$.\footnote{We call this the loss function, but the complete loss is the average of this function under the data.} Notice that we let it depend on the hidden units at every time step, as well as directly on the parameters. To compute parameter changes, we take the (total) derivative of the loss function with respect to those parameters:

\begin{equation}\tag{5.22}
\begin{split}
\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{\theta}^{\text{T}}}
&= \frac{\partial\mathcal{L}}{\partial\bm{\theta}^{\text{T}}}
+ \sum_{t=0}^{T}\frac{\partial\mathcal{L}}{\partial\bm{\check{z}}_{t}^{\text{T}}}\frac{\mathrm{d}\bm{\check{z}}_{t}}{\mathrm{d}\bm{\theta}^{\text{T}}}\\
&= \frac{\partial\mathcal{L}}{\partial\bm{\theta}^{\text{T}}}
+ \sum_{t=0}^{T}\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{\check{z}}_{t}^{\text{T}}}\frac{\partial\bm{\check{z}}_{t}}{\partial\bm{\theta}^{\text{T}}}.
\end{split}
\end{equation}

The first equality is just standard calculus; the second follows because $\bm{\theta}$ occurs only in the final terms in the chains of derivatives.\footnote{In a common abuse of notation, I have written partial derivatives with respect to variables, rather than functions; otherwise the arguments would have to be included to disambiguate the derivatives of $\bm{\psi}$ at different moments in time.} Thus, one can either compute the total effect of $\bm{\theta}$ on each $\bm{\check{z}}_{t}$, and then compute the direct effects of the latter on $\mathcal{L}$; or one can compute the direct effect of $\bm{\theta}$ on each $\bm{\check{z}}_{t}$, and then compute the total effect of the latter on $\mathcal{L}$. margin: Exercise LABEL:ex:partialtotal
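
To see the two orderings at work in a case small enough to expand by hand, take a scalar network with $T = 2$, a fixed $\check{z}_0$, and no direct dependence of $\mathcal{L}$ on $\theta$; this is only an illustrative special case, not the general argument:

\[
\begin{aligned}
\check{z}_1 &= \psi(\check{z}_0,\theta), \qquad \check{z}_2 = \psi(\check{z}_1,\theta),\\
\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\theta}
&= \frac{\partial\mathcal{L}}{\partial\check{z}_1}\,\frac{\partial\psi}{\partial\theta}(\check{z}_0,\theta)
+ \frac{\partial\mathcal{L}}{\partial\check{z}_2}\left[\frac{\partial\psi}{\partial\check{z}}(\check{z}_1,\theta)\,\frac{\partial\psi}{\partial\theta}(\check{z}_0,\theta)
+ \frac{\partial\psi}{\partial\theta}(\check{z}_1,\theta)\right]\\
&= \left[\frac{\partial\mathcal{L}}{\partial\check{z}_1}
+ \frac{\partial\mathcal{L}}{\partial\check{z}_2}\,\frac{\partial\psi}{\partial\check{z}}(\check{z}_1,\theta)\right]\frac{\partial\psi}{\partial\theta}(\check{z}_0,\theta)
+ \frac{\partial\mathcal{L}}{\partial\check{z}_2}\,\frac{\partial\psi}{\partial\theta}(\check{z}_1,\theta).
\end{aligned}
\]

The first expansion groups terms by $\mathrm{d}\check{z}_t/\mathrm{d}\theta$ (the total effect of $\theta$ on each hidden state); the second regroups them by $\mathrm{d}\mathcal{L}/\mathrm{d}\check{z}_t$ (the total effect of each hidden state on the loss).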

Because of Eq. 5.21, the total effect on $\mathcal{L}$ of $\bm{\check{z}}_{t}$, the current hidden state, is its direct effect, plus the effects caused by its influence on future hidden states, $\bm{\check{z}}_{t+k}$, $k>0$. This can be expressed recursively as:

\begin{equation}\tag{5.23}
\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{\check{z}}_{t}^{\text{T}}}
= \frac{\partial\mathcal{L}}{\partial\bm{\check{z}}_{t}^{\text{T}}}
+ \frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{\check{z}}_{t+1}^{\text{T}}}\frac{\partial\bm{\check{z}}_{t+1}}{\partial\bm{\check{z}}_{t}^{\text{T}}}.
\end{equation}

The recursion is initialized at $\mathrm{d}\mathcal{L}/\mathrm{d}\bm{\check{z}}_{T}^{\text{T}} = \partial\mathcal{L}/\partial\bm{\check{z}}_{T}^{\text{T}}$ and run backwards in time.
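
In code, the generic recursion is short. The sketch below runs Eq. 5.23 backwards and then accumulates Eq. 5.22 (second form); the hypothetical callable arguments, whose names are ours, are assumed to return the required partial derivatives (as arrays, with row-vector gradients) evaluated along a stored forward trajectory.

\begin{verbatim}
def bptt_gradient(partial_L_z, partial_z_next_z, partial_z_theta,
                  partial_L_theta, T):
    """Generic backprop through time: the backward recursion of Eq. 5.23,
    followed by the accumulation of Eq. 5.22 (second form).

    partial_L_z(t)      -> row vector, direct derivative of L w.r.t. z_t
    partial_z_next_z(t) -> matrix, direct derivative of z_{t+1} w.r.t. z_t
    partial_z_theta(t)  -> matrix, direct derivative of z_t w.r.t. theta
    partial_L_theta()   -> row vector, direct derivative of L w.r.t. theta
    """
    # backward recursion for the total derivatives dL/dz_t
    dL_dz = [None] * (T + 1)
    dL_dz[T] = partial_L_z(T)                 # initialization at t = T
    for t in range(T - 1, -1, -1):
        dL_dz[t] = partial_L_z(t) + dL_dz[t + 1] @ partial_z_next_z(t)
    # total parameter gradient, Eq. 5.22
    grad = partial_L_theta()
    for t in range(T + 1):
        grad = grad + dL_dz[t] @ partial_z_theta(t)
    return grad
\end{verbatim}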

The backprop-through-time algorithm consists of Eqs. 5.22 and 5.23. Applying it to a particular model requires working out three partial derivatives, $\partial\bm{\check{z}}_{t+1}/\partial\bm{\check{z}}_{t}^{\text{T}}$, $\partial\mathcal{L}/\partial\bm{\check{z}}_{t}^{\text{T}}$, and $\partial\mathcal{L}/\partial\bm{\theta}^{\text{T}}$, for a particular choice of recurrent function and loss function. Using the standard recurrent function for a neural network (cf. Eq. 5.17):

\[
\bm{\check{z}}_{t} = \bm{\psi}(\mathbf{D}_{\check{z}\check{z}}\bm{\check{z}}_{t-1} + \mathbf{D}_{y\check{z}}\bm{y}_{t} + \bm{d}_{\check{z}}),
\]

and applying Eqs. 5.22 and 5.23 yields:

\begin{align*}
\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{\check{z}}_{t}^{\text{T}}}
&= \frac{\partial\mathcal{L}}{\partial\bm{\check{z}}_{t}^{\text{T}}}
+ \frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{\check{z}}_{t+1}^{\text{T}}}\mathbf{J}_{t+1}\mathbf{D}_{\check{z}\check{z}};\\
\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\mathbf{D}_{\check{z}\check{z}}}
&= \frac{\partial\mathcal{L}}{\partial\mathbf{D}_{\check{z}\check{z}}}
+ \sum_{t=0}^{T}\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{\check{z}}_{t}^{\text{T}}}\frac{\partial\bm{\check{z}}_{t}}{\partial\mathbf{D}_{\check{z}\check{z}}}
= \frac{\partial\mathcal{L}}{\partial\mathbf{D}_{\check{z}\check{z}}}
+ \sum_{t=0}^{T}\mathbf{J}_{t}\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{\check{z}}_{t}}\bm{\check{Z}}_{t-1}^{\text{T}}
= \frac{\partial\mathcal{L}}{\partial\mathbf{D}_{\check{z}\check{z}}}
+ \sum_{t=0}^{T}\bm{b}_{t}\bm{\check{Z}}_{t-1}^{\text{T}},\\
\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\mathbf{D}_{y\check{z}}}
&= \frac{\partial\mathcal{L}}{\partial\mathbf{D}_{y\check{z}}}
+ \sum_{t=0}^{T}\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{\check{z}}_{t}^{\text{T}}}\frac{\partial\bm{\check{z}}_{t}}{\partial\mathbf{D}_{y\check{z}}}
= \frac{\partial\mathcal{L}}{\partial\mathbf{D}_{y\check{z}}}
+ \sum_{t=0}^{T}\mathbf{J}_{t}\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{\check{z}}_{t}}\bm{Y}_{t}^{\text{T}}
= \frac{\partial\mathcal{L}}{\partial\mathbf{D}_{y\check{z}}}
+ \sum_{t=0}^{T}\bm{b}_{t}\bm{Y}_{t}^{\text{T}},\\
\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{d}_{\check{z}}^{\text{T}}}
&= \frac{\partial\mathcal{L}}{\partial\bm{d}_{\check{z}}^{\text{T}}}
+ \sum_{t=0}^{T}\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{\check{z}}_{t}^{\text{T}}}\frac{\partial\bm{\check{z}}_{t}}{\partial\bm{d}_{\check{z}}^{\text{T}}}
= \frac{\partial\mathcal{L}}{\partial\bm{d}_{\check{z}}^{\text{T}}}
+ \sum_{t=0}^{T}\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{\check{z}}_{t}^{\text{T}}}\mathbf{J}_{t}
= \frac{\partial\mathcal{L}}{\partial\bm{d}_{\check{z}}^{\text{T}}}
+ \sum_{t=0}^{T}\bm{b}_{t}^{\text{T}}.
\end{align*}

Note that the subscript on the Jacobian of the nonlinearity, $\mathbf{J}_{t} = \mathbf{J}_{t}^{\text{T}} = \bm{\psi}^{\prime}$, indicates the time index of its output. To establish the final equalities, we have used the definition

\begin{equation}\tag{5.24}
\bm{b}_{t} := \mathbf{J}_{t}\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\bm{\check{z}}_{t}}.
\end{equation}

Notice the subtle difference from the backprop signal defined in the non-recurrent case: there, the signal was defined to be the gradient with respect to the biases, which indeed turned out to be the right-hand side of Eq. 5.24. Here, by contrast, we define $\bm{b}_{t}$ directly in terms of the right-hand side, which in the recurrent case is but one contribution to the total bias gradient.
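
Putting the pieces together, here is a minimal NumPy sketch of BPTT for the standard recurrent function above, under some illustrative assumptions of ours: $\bm{\psi} = \tanh$ elementwise (so $\mathbf{J}_{t}$ is diagonal with entries $1 - \check{z}_{t}^{2}$), a per-step squared-error loss $\mathcal{L} = \sum_{t}\tfrac{1}{2}\lVert\bm{x}_{t} - \bm{\check{z}}_{t}\rVert^{2}$ with no direct dependence on the parameters, and a fixed initial state $\bm{\check{z}}_{0}$ (so the $t=0$ terms vanish and the sums run from $t=1$).

\begin{verbatim}
import numpy as np

def rnn_forward(ys, z0, D_zz, D_yz, d_z):
    """Roll out Eq. 5.21 with an elementwise tanh nonlinearity."""
    zs = [z0]
    for y in ys:                               # ys[t-1] = y_t, t = 1, ..., T
        zs.append(np.tanh(D_zz @ zs[-1] + D_yz @ y + d_z))
    return zs                                  # zs[t] = z_t, t = 0, ..., T

def bptt(ys, xs, zs, D_zz):
    """Gradients of L = sum_t 0.5*||x_t - z_t||^2 via Eqs. 5.23 and 5.24:
    dL/dz_t = (z_t - x_t) + D_zz^T b_{t+1}, with b_t = J_t dL/dz_t."""
    T = len(ys)
    n, m = len(zs[0]), len(ys[0])
    g_Dzz, g_Dyz, g_dz = np.zeros((n, n)), np.zeros((n, m)), np.zeros(n)
    b_next = np.zeros(n)                       # plays the role of b_{T+1} = 0
    for t in range(T, 0, -1):
        dL_dz = (zs[t] - xs[t - 1]) + D_zz.T @ b_next   # total dL/dz_t
        b_t = (1.0 - zs[t] ** 2) * dL_dz                # J_t dL/dz_t (Eq. 5.24)
        g_Dzz += np.outer(b_t, zs[t - 1])               # sum_t b_t z_{t-1}^T
        g_Dyz += np.outer(b_t, ys[t - 1])               # sum_t b_t y_t^T
        g_dz += b_t                                     # sum_t b_t
        b_next = b_t
    return g_Dzz, g_Dyz, g_dz
\end{verbatim}

A finite-difference check on small random matrices is a quick way to validate a sketch like this against the analytical gradients; as before, the gradients would be averaged over sequences before taking a descent step.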