B.2 Probability and Statistics

The exponential family and Generalized Linear Models (GLiMs)

Change of variables in probability densities

The score function

The score is defined as the gradient of the log-likelihood with respect to the parameters, $\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\log\hat{p}(\hat{\bm{y}};\bm{\theta})$. The mean of the score is zero:

\[
\begin{split}
\mathbb{E}_{\bm{Y}}\left[\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\log\hat{p}(\bm{Y};\bm{\theta})\right]
&= \int_{\bm{y}} \hat{p}(\bm{y};\bm{\theta})\,\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\log\hat{p}(\bm{y};\bm{\theta})\,\mathrm{d}\bm{y}\\
&= \int_{\bm{y}} \hat{p}(\bm{y};\bm{\theta})\,\frac{1}{\hat{p}(\bm{y};\bm{\theta})}\,\frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\hat{p}(\bm{y};\bm{\theta})\,\mathrm{d}\bm{y}\\
&= \int_{\bm{y}} \frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}\hat{p}(\bm{y};\bm{\theta})\,\mathrm{d}\bm{y}\\
&= \frac{\mathrm{d}}{\mathrm{d}\bm{\theta}} \int_{\bm{y}} \hat{p}(\bm{y};\bm{\theta})\,\mathrm{d}\bm{y}\\
&= \frac{\mathrm{d}}{\mathrm{d}\bm{\theta}}(1)\\
&= 0.
\end{split}
\]

The variance of the score is known as the Fisher information. Because the score has zero mean, the Fisher information is also the expected square (outer product) of the score.
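These two facts are easy to check numerically. Below is a minimal sketch, assuming a univariate Gaussian model with known variance (an illustrative choice, not one made in the text): by Monte Carlo, the score with respect to the mean averages to zero, and its variance matches the Gaussian's Fisher information, $1/\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0                       # true parameters (illustrative)
y = rng.normal(mu, sigma, size=1_000_000)

# Score w.r.t. the mean: d/dmu log N(y; mu, sigma^2) = (y - mu) / sigma^2.
score = (y - mu) / sigma**2

print(np.mean(score))                      # ~0: the score has zero mean
print(np.var(score), 1 / sigma**2)         # variance of the score ~ Fisher information
```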

The Fisher information for exponential-family random variables

This turns out to take a simple form. For a (vector) random variable $\bm{Y}$ and “parameters” $\bm{\theta}$ (which may themselves be random variables) with the exponential-family density

\[
p(\bm{y}|\bm{\theta}) = p(\bm{y}|\bm{\eta}) = h(\bm{y})\exp\left\{\bm{\eta}(\bm{\theta})^{\mathrm{T}}\bm{t}(\bm{y}) - A(\bm{\eta}(\bm{\theta}))\right\},
\]

the Fisher information is:

\[
\begin{split}
I(\bm{\theta}) &= -\mathbb{E}_{\bm{Y}|\bm{\theta}}\left[\frac{\partial^{2}}{\partial\bm{\theta}\,\partial\bm{\theta}^{\mathrm{T}}}\log p(\bm{Y}|\bm{\theta})\,\middle|\,\bm{\theta}\right]\\
&= -\mathbb{E}_{\bm{Y}|\bm{\theta}}\left[\frac{\partial^{2}}{\partial\bm{\theta}\,\partial\bm{\theta}^{\mathrm{T}}}\left[\bm{\eta}(\bm{\theta})^{\mathrm{T}}\bm{t}(\bm{Y}) - A(\bm{\eta}(\bm{\theta}))\right]\,\middle|\,\bm{\theta}\right]\\
&= -\mathbb{E}_{\bm{Y}|\bm{\theta}}\left[\sum_{i}\frac{\partial^{2}\eta_{i}}{\partial\bm{\theta}\,\partial\bm{\theta}^{\mathrm{T}}}t_{i}(\bm{Y}) - \frac{\partial\bm{\eta}^{\mathrm{T}}}{\partial\bm{\theta}}\frac{\partial^{2}A}{\partial\bm{\eta}\,\partial\bm{\eta}^{\mathrm{T}}}\frac{\partial\bm{\eta}}{\partial\bm{\theta}^{\mathrm{T}}} - \sum_{i}\frac{\partial^{2}\eta_{i}}{\partial\bm{\theta}\,\partial\bm{\theta}^{\mathrm{T}}}\frac{\partial A}{\partial\eta_{i}}\,\middle|\,\bm{\theta}\right]\\
&= \frac{\partial\bm{\eta}^{\mathrm{T}}}{\partial\bm{\theta}}\,\mathrm{Cov}_{\bm{Y}|\bm{\theta}}\left[\bm{t}(\bm{Y})\,\middle|\,\bm{\theta}\right]\frac{\partial\bm{\eta}}{\partial\bm{\theta}^{\mathrm{T}}},
\end{split}
\]

where in the last line we have used the fact that the derivatives of the log-normalizer are the cumulants of the sufficient statistics under the distribution: in particular, $\partial A/\partial\eta_i = \mathbb{E}\left[t_i(\bm{Y})\middle|\bm{\theta}\right]$, so the first and third sums cancel in expectation, and $\partial^2 A/\partial\bm{\eta}\,\partial\bm{\eta}^{\mathrm{T}} = \mathrm{Cov}\left[\bm{t}(\bm{Y})\middle|\bm{\theta}\right]$. A perhaps more interesting equivalent expression can be derived by noting that:

\[
\frac{\partial}{\partial\bm{\theta}}\mathbb{E}_{\bm{Y}|\bm{\theta}}\left[\bm{t}(\bm{Y})\,\middle|\,\bm{\theta}\right]
= \frac{\partial}{\partial\bm{\theta}}\frac{\partial A}{\partial\bm{\eta}^{\mathrm{T}}}
= \frac{\partial^{2}A}{\partial\bm{\eta}\,\partial\bm{\eta}^{\mathrm{T}}}\frac{\partial\bm{\eta}}{\partial\bm{\theta}^{\mathrm{T}}}
= \mathrm{Cov}_{\bm{Y}|\bm{\theta}}\left[\bm{t}(\bm{Y})\,\middle|\,\bm{\theta}\right]\frac{\partial\bm{\eta}}{\partial\bm{\theta}^{\mathrm{T}}}.
\]

Therefore,

\[
\left(\frac{\partial}{\partial\bm{\theta}}\mathbb{E}_{\bm{Y}|\bm{\theta}}\left[\bm{t}(\bm{Y})\,\middle|\,\bm{\theta}\right]\right)^{\mathrm{T}}
\mathrm{Cov}_{\bm{Y}|\bm{\theta}}\left[\bm{t}(\bm{Y})\,\middle|\,\bm{\theta}\right]^{-1}
\left(\frac{\partial}{\partial\bm{\theta}}\mathbb{E}_{\bm{Y}|\bm{\theta}}\left[\bm{t}(\bm{Y})\,\middle|\,\bm{\theta}\right]\right)
= \frac{\partial\bm{\eta}^{\mathrm{T}}}{\partial\bm{\theta}}\,\mathrm{Cov}_{\bm{Y}|\bm{\theta}}\left[\bm{t}(\bm{Y})\,\middle|\,\bm{\theta}\right]\frac{\partial\bm{\eta}}{\partial\bm{\theta}^{\mathrm{T}}}
= I(\bm{\theta}).
\tag{B.12}
\]
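As a concrete check of these identities, consider the standard Bernoulli example (a sketch not worked in the text): for a Bernoulli variable with mean $\theta$, the natural parameter is $\eta(\theta) = \log\frac{\theta}{1-\theta}$, the sufficient statistic is $t(y) = y$, with $\mathbb{E}\left[t\middle|\theta\right] = \theta$ and $\mathrm{Cov}\left[t\middle|\theta\right] = \theta(1-\theta)$. The snippet below computes the Fisher information three ways: directly from the negative expected second derivative of the log-likelihood, from the last line of the derivation above, and from Equation B.12; all three give $1/(\theta(1-\theta))$.

```python
theta = 0.3                      # Bernoulli mean parameter (illustrative choice)
var_t = theta * (1 - theta)      # Cov[t(Y)|theta] for t(y) = y

# Route 1: -E[d^2/dtheta^2 log p(Y; theta)], with
# d^2/dtheta^2 log p(y; theta) = -y/theta^2 - (1-y)/(1-theta)^2.
fisher_direct = theta / theta**2 + (1 - theta) / (1 - theta)**2

# Route 2: (d eta / d theta)^2 * Cov[t], with eta(theta) = log(theta/(1-theta)).
deta = 1.0 / (theta * (1 - theta))
fisher_eta = deta**2 * var_t

# Route 3 (Eq. B.12): (d E[t]/d theta)^2 / Cov[t]; here E[t|theta] = theta.
fisher_b12 = 1.0**2 / var_t

print(fisher_direct, fisher_eta, fisher_b12)   # all equal 1/(theta*(1-theta))
```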

Markov chains

Discrete random variables


Useful identities

Expectations of quadratic forms.

Consider a vector random variable $\bm{X}$ with mean $\bm{\mu}$ and covariance $\bm{\Sigma}$. We are interested in the expectation of a certain function of $\bm{X}$, namely $(\bm{b}-\mathbf{C}\bm{X})^{\mathrm{T}}\mathbf{A}(\bm{b}-\mathbf{C}\bm{X})$. This term can occur, for example, in the log probability of a Gaussian distribution about $\mathbf{C}\bm{X}$. To calculate the expectation, we define a new variable

\[
\bm{Z} := \mathbf{A}^{1/2}\left(\bm{b}-\mathbf{C}\bm{X}\right)
\]

and then employ the cyclic-permutation property of the matrix-trace operator:

\[
\begin{split}
\mathbb{E}_{\bm{X}}\left[(\bm{b}-\mathbf{C}\bm{X})^{\mathrm{T}}\mathbf{A}(\bm{b}-\mathbf{C}\bm{X})\right]
&= \mathbb{E}_{\bm{Z}}\left[\bm{Z}^{\mathrm{T}}\bm{Z}\right]\\
&= \mathbb{E}_{\bm{Z}}\left[\mathrm{tr}\left[\bm{Z}^{\mathrm{T}}\bm{Z}\right]\right]\\
&= \mathbb{E}_{\bm{Z}}\left[\mathrm{tr}\left[\bm{Z}\bm{Z}^{\mathrm{T}}\right]\right]\\
&= \mathrm{tr}\left[\mathbb{E}_{\bm{Z}}\left[\bm{Z}\bm{Z}^{\mathrm{T}}\right]\right]\\
&= \mathrm{tr}\left[\mathrm{Cov}_{\bm{Z}}\left[\bm{Z}\right] + \mathbb{E}_{\bm{Z}}\left[\bm{Z}\right]\mathbb{E}_{\bm{Z}}\left[\bm{Z}\right]^{\mathrm{T}}\right]\\
&= \mathrm{tr}\left[\mathbf{A}^{1/2}\mathbf{C}\bm{\Sigma}\mathbf{C}^{\mathrm{T}}\mathbf{A}^{\mathrm{T}/2} + \mathbf{A}^{1/2}(\bm{b}-\mathbf{C}\bm{\mu})(\bm{b}-\mathbf{C}\bm{\mu})^{\mathrm{T}}\mathbf{A}^{\mathrm{T}/2}\right]\\
&= \mathrm{tr}\left[\mathbf{A}\mathbf{C}\bm{\Sigma}\mathbf{C}^{\mathrm{T}}\right] + \mathrm{tr}\left[(\bm{b}-\mathbf{C}\bm{\mu})^{\mathrm{T}}\mathbf{A}(\bm{b}-\mathbf{C}\bm{\mu})\right]\\
&= \mathrm{tr}\left[\mathbf{A}\mathbf{C}\bm{\Sigma}\mathbf{C}^{\mathrm{T}}\right] + (\bm{b}-\mathbf{C}\bm{\mu})^{\mathrm{T}}\mathbf{A}(\bm{b}-\mathbf{C}\bm{\mu}).
\end{split}
\tag{B.13}
\]

Hence, the expected value of the quadratic function of $\bm{X}$ is the quadratic function evaluated at the expected value of $\bm{X}$, plus a “correction” term arising from the covariance of $\bm{X}$.
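A quick Monte Carlo sketch of Equation B.13 follows. The matrices are randomly generated, with $\mathbf{A}$ built symmetric positive semidefinite (as the derivation's use of $\mathbf{A}^{1/2}$ assumes); the sizes and the Gaussian sampling distribution for $\bm{X}$ are illustrative assumptions, since the identity depends only on the first two moments of $\bm{X}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 4                                   # dims of X and b (illustrative)

mu = rng.normal(size=n)
L = rng.normal(size=(n, n))
Sigma = L @ L.T                               # covariance of X, PSD by construction
b = rng.normal(size=m)
C = rng.normal(size=(m, n))
M = rng.normal(size=(m, m))
A = M @ M.T                                   # symmetric PSD, so A^{1/2} exists

# Sample X; any distribution with these moments would do, Gaussian is convenient.
X = rng.multivariate_normal(mu, Sigma, size=1_000_000)
R = b - X @ C.T                               # each row is (b - C x)^T
mc = np.einsum('ij,jk,ik->i', R, A, R).mean() # Monte Carlo E[(b-CX)^T A (b-CX)]

resid = b - C @ mu
closed = np.trace(A @ C @ Sigma @ C.T) + resid @ A @ resid
print(mc, closed)                             # agree up to Monte Carlo error
```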

Simulating Poisson random variates with mean less than 1.

motivation…

Consider the graphical model shown below. We want to show that the marginal probability of $\hat{Y}$ is that of a Poisson random variable with mean $\mu$—as long as $\mu < 1$. The derivation below carries out this marginalization. The third line follows because the probability of $\hat{Y}$ (the number of “successes”) is zero for any $\hat{y} > \hat{x}$, since $\hat{X}$ is the number of Bernoulli trials (it is impossible to have more successes than trials).

[Graphical model: node $\hat{X}$ with $\hat{p}(\hat{x};\bm{\theta}) = \text{Pois}(1)$, node $\hat{Y}$ with $\hat{p}(\hat{y}|\hat{x};\bm{\theta}) = \text{Bino}(\hat{x},\mu)$, enclosed in a plate over $N$ samples.]

\[
\begin{split}
\hat{p}(\hat{y};\bm{\theta}) &= \sum_{\hat{x}=0}^{\infty}\hat{p}(\hat{x};\bm{\theta})\,\hat{p}(\hat{y}|\hat{x};\bm{\theta})\\
&= \sum_{\hat{x}=0}^{\infty}\text{Pois}(\hat{x};1)\,\text{Bino}(\hat{y};\hat{x},\mu)\\
&= \sum_{\hat{x}=\hat{y}}^{\infty}\text{Pois}(\hat{x};1)\,\text{Bino}(\hat{y};\hat{x},\mu)\\
&= \sum_{\hat{x}=\hat{y}}^{\infty}\frac{e^{-1}}{\hat{x}!}\binom{\hat{x}}{\hat{y}}\mu^{\hat{y}}(1-\mu)^{\hat{x}-\hat{y}}\\
&= \sum_{\hat{x}=\hat{y}}^{\infty}\frac{e^{-1}}{\hat{x}!}\frac{\hat{x}!}{\hat{y}!\,(\hat{x}-\hat{y})!}\mu^{\hat{y}}(1-\mu)^{\hat{x}-\hat{y}}\\
&= \frac{e^{-1}\mu^{\hat{y}}}{\hat{y}!}\sum_{\hat{x}=\hat{y}}^{\infty}\frac{1}{(\hat{x}-\hat{y})!}(1-\mu)^{\hat{x}-\hat{y}}\\
&= \frac{e^{-1}\mu^{\hat{y}}}{\hat{y}!}\sum_{m=0}^{\infty}\frac{1}{m!}(1-\mu)^{m}\\
&= \frac{e^{-1}\mu^{\hat{y}}}{\hat{y}!}\,e^{1-\mu}
= \frac{e^{-\mu}\mu^{\hat{y}}}{\hat{y}!}
= \text{Pois}(\hat{y};\mu).
\end{split}
\]
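This result licenses a simple recipe for simulating $\text{Pois}(\mu)$ variates when $\mu < 1$: draw $\hat{X} \sim \text{Pois}(1)$ and then thin it with a binomial. A minimal sketch (the sample size and the value of $\mu$ are illustrative choices):

```python
import math
import numpy as np

rng = np.random.default_rng(2)
mu = 0.7                                  # target Poisson mean; requires mu < 1
n = 1_000_000

x = rng.poisson(1.0, size=n)              # X ~ Pois(1): number of Bernoulli trials
y = rng.binomial(x, mu)                   # Y | X ~ Bino(X, mu): thinned counts

# Compare empirical frequencies of Y with the Pois(mu) pmf.
for k in range(5):
    pmf = math.exp(-mu) * mu**k / math.factorial(k)
    print(k, np.mean(y == k), pmf)
```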