Notation

This author firmly holds the (seemingly unpopular) view that good notation makes mathematical texts much easier to understand. To be sure, bad notation is much easier to parse—indeed, unremarkable—when one has already mastered the concepts; but it can also mask deep underlying conceptual issues. I have attempted, although not everywhere with success, to use good notation in what follows.

symbol                                                   use
$X$, $Y$, $Z$                                            scalar random variables
$x$, $y$, $z$                                            scalar instantiations
$\bm{X}$, $\bm{Y}$, $\bm{Z}$                             vector random variables
$\bm{x}$, $\bm{y}$, $\bm{z}$                             vector instantiations
$\mathbf{A}, \mathbf{B}, \mathbf{C}, \mathbf{P}$, etc.   matrices
$\bm{\theta}$, $\bm{\phi}$                               (non-random) parameters
$\bm{\pi}$                                               vector of categorical probabilities ($\sum_{k}\pi_{k}=1$)
$\bm{\mu}$                                               mean (vector)
$\bm{\Sigma}$                                            covariance matrix

Basic symbols.

Basic notational conventions are for the most part standard. This book uses capital letters for random variables, lowercase for their instantiations, boldface italic font for vectors, and italic for scalar variables. The (generally Latin) letters for matrices are capitalized and bolded, but (unless random) in Roman font, and not necessarily from the front of the alphabet.

The set of all standard parameters (means, variances, and the like) of a distribution is generally denoted by a single vector, either $\bm{\theta}$ or $\bm{\phi}$ (or, in a pinch, some nearby Greek letter). But note well that in the context of Bayesian statistics their status as random variables is marked in the notation: $\bm{\Theta}$, $\bm{\Phi}$. The Greek letters $\bm{\pi}$, $\bm{\mu}$, and $\bm{\Sigma}$ are generally reserved for particular parameters: the vector of categorical probabilities, the mean (vector), and the covariance matrix, respectively. Note that we do not use $\pi$ for the transcendental constant; we use $\tau = 2\pi$ [11].

Arguments and variables

In this textbook, I distinguish notationally between the arguments of functions (on the one hand) and the variables at which a function might be evaluated (on the other). Why?

An ambiguity in argument binding.

In standard notation, a function might be defined with the expression

$$ f(x) \coloneqq x^{2}. \tag{1.1} $$

Although this usually causes no problems, note that $x$ does not indicate any particular value; or, to put it another way, the expression is completed by an implicit (omitted) $\forall x$. On the occasions that we do not intend universal quantification, then, problems can arise. For example, suppose we want to say that the unary function $f$ is identical to the binary function $g$ when its second argument is set to the value $y$ (or, alternatively, that such a value exists: $\exists y$). We could write

$$ f(x) = g(x, y), $$

but the fact that we are (mentally) to insert $\forall x$ but not $\forall y$ is not evident from the equation, but only from the surrounding verbal context.

There are several standard alternatives, but none is wholly satisfactory. We could include quantifiers only where there is ambiguity; but ambiguity is often in the eye of the beholder, and it is dangerous for a textbook to assume that an expression is perfectly transparent. We could simply include all quantifiers everywhere, but then equations with many arguments would be littered with $\forall$ symbols. Or again, we could use the mechanism of raised dots,

$$ f(\cdot) = g(\cdot, y), $$

although $y$ still violates the standard convention in not being bound by a universal quantifier, and this has to be extracted from the verbal context. But more fatally, this mechanism doesn't generalize well to functions of more variables:

$$ f(\cdot, \cdot) = g(\cdot, y, \cdot). $$

Which dots on the left correspond to which on the right?

Subscripts to the rescue?

Now, in a statistics textbook, the probability-mass function associated with a discrete random variable $X$ is usually written $p_{X}$ or (to emphasize that it is a function) $p_{X}(\cdot)$, and the probability of a particular observation $x$ correspondingly as $p_{X}(x)$. The subscript distinguishes this mass function from, say, the one associated with the random variable $Y$, namely $p_{Y}$. Conditional distributions, in turn, are written $p_{Y|X}$, and the value of a conditional distribution $p_{Y|X}(y|x)$. This might seem to be exactly the mechanism we seek for identifying the (universally quantified) arguments of functions. For example, consider this instance of Bayes' rule:

$$ p_{X|Y,Z}(\cdot \mid \cdot, z) = \frac{p_{Y|X,Z}(\cdot \mid \cdot, z)\, p_{X|Z}(\cdot \mid z)}{p_{Y|Z}(\cdot \mid z)}. \tag{1.2} $$

The convention for understanding it is that omitted arguments ($x$, $y$) are universally quantified, whereas included variables ($z$) have been bound to something in the enclosing context.
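For concreteness, and departing from the book's notation for just a moment, Eq. 1.2 with all arguments and quantifiers written out explicitly would read

$$ \forall x\ \forall y:\quad p_{X|Y,Z}(x \mid y, z) = \frac{p_{Y|X,Z}(y \mid x, z)\, p_{X|Z}(x \mid z)}{p_{Y|Z}(y \mid z)}, $$

with $z$ alone left to be fixed by the enclosing context.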

But this proposal, too, has problems. First of all, although the subscripts make it possible to infer which omitted arguments on the left correspond to which on the right, the dots themselves are just noise. For the reader who is not convinced by Eq. 1.2, I suggest

$$ p_{\bm{X}_1, \ldots, \bm{X}_T, \bm{Y}_1, \ldots, \bm{Y}_T}(\cdot, \cdot, \ldots, \cdot, \bm{y}_1, \ldots, \bm{y}_T) = \prod_{t=1}^{T} p_{\bm{X}_t|\bm{X}_{t-1}}(\cdot \mid \cdot)\, p_{\bm{Y}_t|\bm{X}_t}(\bm{y}_t \mid \cdot), \tag{1.3} $$

a partially evaluated function that we will encounter in Chapter 2. Second (and fatally), raised dots can't be used for variables occurring outside of the list of function arguments. For example, how are we to write Eq. 1.1? Certainly not

$$ f(\cdot) \coloneqq \cdot^{2}. \tag{1.4} $$

Gray arguments.

We can get a hold of the fundamental issue that we are grappling with here by distinguishing function arguments from variables. This is most intuitive in terms of a programming language. For example, in the following snippet of (Python) code,


    def quadratic(x):
        return (x - c)**2

x is an argument to the quadratic function, whereas c is a variable that is (presumably) bound somewhere in the enclosing scope. Critically, x is an argument both in the function declaration, def quadratic(x):, and in the function body, return (x - c)**2. A function can also be defined as a partially evaluated instance of another function:


    def shifted_quadratic(x, c):
        return (x - c)**2

    def centered_quadratic(x):
        return shifted_quadratic(x, 0)

Both x and c are arguments of shifted_quadratic, but centered_quadratic has only a single argument, x. It is analogous to the partially evaluated function exhibited in Eq. 1.3, whose only arguments are $\bm{x}_1, \ldots, \bm{x}_T$.
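The same construction can also be written with functools.partial, which makes the partial evaluation explicit; this is only an illustrative aside, not a notation used elsewhere in the book:

    from functools import partial

    def shifted_quadratic(x, c):
        return (x - c)**2

    # Pre-bind the second argument, c, to 0; the resulting function has a
    # single remaining argument, x, just like centered_quadratic above.
    centered_quadratic = partial(shifted_quadratic, c=0)

    assert centered_quadratic(3) == 9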

With some reservations, I have introduced a new notational convention in this book to mark this distinction between arguments and variables, employing a gray font color for the former. For example, Eq. 1.3 will be written as

$$ p({\color{gray}\bm{x}_1, \ldots, \bm{x}_T}, \bm{y}_1, \ldots, \bm{y}_T) = \prod_{t=1}^{T} p({\color{gray}\bm{x}_t} \mid {\color{gray}\bm{x}_{t-1}})\, p(\bm{y}_t \mid {\color{gray}\bm{x}_t}). \tag{1.5} $$

As in Eq. 1.3, the fact that the function is partially evaluated at $\bm{y}_1, \ldots, \bm{y}_T$ is indicated, but in this case with the standard (black) font.

This notational convention neatly solves the problems just discussed. That is, it makes clear which variables are universally quantified—namely, the arguments, in gray—without resorting to explicit quantification, verbal context, or subscripts and dots. Dispensing with the subscripts and dots is particularly appealing, and not only because the result is easier to read and generalizes better (recall Eq. 1.4), although these are the chief merits. The convention also provides an alternative mechanism for disambiguating probability-mass and -density functions from each other: by their (gray) arguments rather than by subscripts. Indeed, this is the standard device employed in the machine-learning literature—but without the distinction between arguments and variables that solves our main problem.

And then, finally, we will see below that this distinction is exceedingly useful for another purpose: distinguishing partial and total derivatives.

Probabilistic functions and functionals

Symbols for probability mass and density.

symbol                       use
$p$                          the data mass/density function
$\hat{p}$, $\check{p}$       the model mass/density functions

This text indiscriminately uses the same letter for probability-mass and probability-density functions: in both cases, the $p$ that is usual for mass functions. Further semantic content is, however, communicated by diacritics. In particular, $p$ is reserved for "the data distribution," i.e., the true source in the world of our samples, as opposed to a model. Often in the literature, but not in this book, the data distribution is taken to be a discrete set of points corresponding to a particular sample, that is, a collection of delta functions. Here, $p$ is interpreted to be a full-fledged distribution, known not in form but only through the samples that we have observed from it.

For model distributions we generally employ $\hat{p}$, although we shall also have occasion to use $\check{p}$ for certain of them.

Now, it is a fact from elementary probability theory [XXX] that a random variable carries with it a probability distribution. Conversely, it makes no sense to talk about two different probability distributions over the same random variable—although texts on machine learning routinely do, usually in the context of relative or cross entropy [GoodfellowXXX]. We will indeed often be interested in (e.g.) the relative entropy (KL divergence) of two distributions, $p$ and $\hat{p}$, but this text takes pains to note that these are distributions over different random variables, for example $Y$ and $\hat{Y}$, respectively. In general, the text marks random variables, their corresponding distributions, and the arguments of those distributions with the same diacritics; hence,

$$ Y \sim p({\color{gray}y}), \qquad \hat{Y} \sim \hat{p}({\color{gray}\hat{y}}; \bm{\theta}). $$

It may at first blush seem surprising, then, to see the relative entropy (KL divergence) expressed as

$$ \text{D}_{\text{KL}}\left\{ p(Y) \,\middle\|\, \hat{p}(Y; \bm{\theta}) \right\}, $$

that is, with $Y$ on both sides. But $Y$ is not an argument of these distributions; it is the variable at which they are being evaluated. Notice that despite the arguments not appearing in this expression, the two distributions are still distinguishable—by their diacritics.
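For reference, and writing out a standard definition rather than anything peculiar to this book, the expression unpacks as

$$ \text{D}_{\text{KL}}\left\{ p(Y) \,\middle\|\, \hat{p}(Y; \bm{\theta}) \right\} = \mathbb{E}_{Y}\!\left[ \log\frac{p(Y)}{\hat{p}(Y; \bm{\theta})} \right] = \int_{y} p(y) \log\frac{p(y)}{\hat{p}(y; \bm{\theta})} \mathop{}\!\mathrm{d}y, $$

an expectation under $p$, the distribution of $Y$.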

Still, the conventions are not bulletproof. Consider, for example, density functions for two different data distributions:

$$ X \sim p({\color{gray}x}), \qquad Y \sim p({\color{gray}y}). $$

These are distinguished not by any diacritics, but by their arguments. According to our convention, these arguments are generally listed (in gray), so it is usually possible to tell these two distributions apart. And even when considering evaluated density functions, we can typically disambiguate by our choice of letter for the observations: $p(x)$, $p(y)$. Occasionally, however, we will need to consider evaluating such functions at some other point, say $z$ or 1. Then we will be thrown back on one of the other, standard conventions: $p(Y=1)$, $p_{Y}(z)$, etc.

symbol                          use
$\mathbb{E}[\bm{X}]$            expectation of $\bm{X}$
$\langle \bm{X} \rangle$        sample average of $\bm{X}$
$\text{Var}[X]$                 variance of $X$
$\text{Cov}[\bm{X}]$            covariance matrix of $\bm{X}$
$\text{Cov}[\bm{X}, \bm{Y}]$    covariance between $\bm{X}$ and $\bm{Y}$

Expectation, covariance, and sample averages.

The symbol $\text{Cov}[\cdot]$ is used with a single argument to denote the operator that turns a random variable into a covariance matrix; but with two arguments, $\text{Cov}[\cdot, \cdot]$, to indicate the (cross) covariance between two random variables. Perhaps more idiosyncratically, angle brackets, $\langle \cdot \rangle$, are usually reserved for sample averages, as opposed to expectation values, although occasionally this stricture is relaxed.
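As a purely numerical illustration of the distinction (a NumPy sketch with an arbitrary Gaussian, not anything drawn from the book): the sample average $\langle \bm{X} \rangle$ estimates the expectation $\mathbb{E}[\bm{X}]$, and the sample covariance estimates $\text{Cov}[\bm{X}]$.

    import numpy as np

    rng = np.random.default_rng(0)

    # Draw N samples of a 2-dimensional random vector X ~ N(mu, Sigma),
    # with mu and Sigma chosen arbitrarily for illustration.
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    X = rng.multivariate_normal(mu, Sigma, size=100_000)

    # Sample average <X>: an estimate of the expectation E[X] = mu.
    print(X.mean(axis=0))

    # Sample covariance: an estimate of Cov[X] = Sigma.
    # (rowvar=False because samples are in rows, variables in columns.)
    print(np.cov(X, rowvar=False))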

The distribution with respect to which an expectation is taken will only occasionally be inferable from its argument, so we will typically resort to subscripts (the previous discussion notwithstanding). For example, we will write

$$ \mathbb{E}_{\bm{X}}\!\left[ -\log p(\bm{X}) \right] \qquad \mathbb{E}_{\bm{X}|\bm{Y}}\!\left[ -\log p(\bm{X} \mid \bm{y}) \,\middle|\, \bm{y} \right] \qquad \mathbb{E}_{\bm{X}|\bm{Y}}\!\left[ -\log p(\bm{X} \mid \bm{z}) \,\middle|\, \bm{z} \right] $$

Thus, e.g., the subscripts in the second and third examples tell us that the expectation is taken under the distribution $p({\color{gray}\bm{x}} \mid {\color{gray}\bm{y}})$. Of course, only the variable $\bm{X}$ is averaged out in these expressions; the conditioning variable is free to take on any value, which need not match the argument symbol (as in the third example).

Let us put together some of our conventions with an iterated expectation under $p({\color{gray}\bm{x}} \mid {\color{gray}\bm{y}})$ and $p({\color{gray}\bm{y}})$,

$$ \mathbb{E}_{\bm{Y}}\!\left[ \mathbb{E}_{\bm{X}|\bm{Y}}\!\left[ f(\bm{X}, \bm{Y}) \,\middle|\, \bm{Y} \right] \right] = \int_{\bm{y}} p(\bm{y}) \int_{\bm{x}} p(\bm{x} \mid \bm{y})\, f(\bm{x}, \bm{y}) \mathop{}\!\mathrm{d}\bm{x} \mathop{}\!\mathrm{d}\bm{y}. $$

There are a few things to notice. First of all, $\bm{x}$ and $\bm{y}$ do not appear in gray. This is because they are dummy variables, not arguments; or, to put it a different way, they are bound by the integral operators, not (implicitly) by universal quantifiers. (Accordingly, they do not appear outside of the integrals, e.g. on the other side of the equation.) Second, bear in mind that the symbol on the right side of the conditioning bar (here, $\bm{Y}$) need not match the conditioning variable in the subscript of the inner expectation (here also $\bm{Y}$); e.g.,

$$ \mathbb{E}_{\bm{Y}}\!\left[ \mathbb{E}_{\bm{\hat{X}}|\bm{\hat{Y}}}\!\left[ f(\bm{\hat{X}}, \bm{Y}) \,\middle|\, \bm{Y} \right] \right] = \int_{\bm{y}} p(\bm{y}) \int_{\bm{\hat{x}}} \hat{p}(\bm{\hat{x}} \mid \bm{y}; \bm{\theta})\, f(\bm{\hat{x}}, \bm{y}) \mathop{}\!\mathrm{d}\bm{\hat{x}} \mathop{}\!\mathrm{d}\bm{y}. $$

Thus, the subscripts to the conditional expectation tell us that it is taken with respect to $\hat{p}({\color{gray}\bm{\hat{x}}} \mid {\color{gray}\bm{\hat{y}}}; \bm{\theta})$, but we are not forbidden from filling the vacant argument ${\color{gray}\bm{\hat{y}}}$ with a different random variable, in this case $\bm{Y}$, and taking an expectation.
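A Monte Carlo reading of the last equation may make this concrete. The sketch below uses arbitrary Gaussian choices for $p({\color{gray}\bm{y}})$ and $\hat{p}({\color{gray}\bm{\hat{x}}} \mid {\color{gray}\bm{\hat{y}}}; \bm{\theta})$ and an arbitrary test function $f$, none of which come from the book: sample $\bm{y}$ from the data distribution, sample $\bm{\hat{x}}$ from the model conditional evaluated at that $\bm{y}$, and average $f$.

    import numpy as np

    rng = np.random.default_rng(1)

    def f(x_hat, y):
        # an arbitrary test function, chosen only for illustration
        return x_hat * y

    n = 200_000
    y = rng.normal(loc=0.0, scale=1.0, size=n)   # y ~ p(y), here N(0, 1)
    x_hat = rng.normal(loc=2.0 * y, scale=0.5)   # x_hat ~ p_hat(x_hat | y; theta), here N(2y, 0.25)

    # One draw of x_hat per draw of y, so a single sample average estimates
    # the nested expectation E_Y[ E_{X_hat|Y_hat}[ f(X_hat, Y) | Y ] ].
    print(np.mean(f(x_hat, y)))   # approaches E[2 Y^2] = 2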

Third, the “vector differentials” are to be interpreted as

$$ \mathop{}\!\mathrm{d}\bm{x} \coloneqq \mathop{}\!\mathrm{d}x_{1} \mathop{}\!\mathrm{d}x_{2} \cdots \mathop{}\!\mathrm{d}x_{K}, $$

and the integral as an iterated integral; hence:

$$ \int_{\bm{x}} f(\bm{x}) \mathop{}\!\mathrm{d}\bm{x} = \int_{x_{K}} \cdots \int_{x_{2}} \int_{x_{1}} f(\bm{x}) \mathop{}\!\mathrm{d}x_{1} \mathop{}\!\mathrm{d}x_{2} \cdots \mathop{}\!\mathrm{d}x_{K}. $$

The subscript to the integral tells us that it is to be taken over the entire support of the corresponding random variable.
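To see the iterated-integral reading in action, here is a sketch using SciPy's one-dimensional quadrature, nesting one quad call inside another, with a standard bivariate normal density as an arbitrary choice of integrand (none of this is from the book):

    import numpy as np
    from scipy.integrate import quad

    def f(x1, x2):
        # standard bivariate normal density (an arbitrary illustrative choice)
        return np.exp(-0.5 * (x1**2 + x2**2)) / (2.0 * np.pi)

    # Inner integral over x1 at fixed x2, then outer integral over x2,
    # each over the full support (-inf, inf).
    def inner(x2):
        return quad(lambda x1: f(x1, x2), -np.inf, np.inf)[0]

    total, _ = quad(inner, -np.inf, np.inf)
    print(total)   # ~1.0: a density integrates to one over its support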

Derivatives

[[The use of transposes in vector (and matrix) derivatives. The total derivative vs. partial derivatives. The “gradient” and the Hessian.]]