Notation
This author firmly holds the (seemingly unpopular) view that good notation makes mathematical texts much easier to understand. More precisely, bad notation is much easier to parse—indeed, unremarkable—when one has already mastered the concepts; it can also mask deep underlying conceptual issues. I have attempted, although not everywhere with success, to use good notation in what follows.
| symbol | use |
| --- | --- |
| $X$ | scalar random variables |
| $x$ | scalar instantiations |
| $\boldsymbol{X}$ | vector random variables |
| $\boldsymbol{x}$ | vector instantiations |
| $\mathbf{A}, \mathbf{B}, \ldots$ | matrices |
| $\boldsymbol{\theta}$ | (non-random) parameters |
| $\boldsymbol{\pi}$ | vector of categorical probabilities ($\sum_k \pi_k = 1$) |
| $\boldsymbol{\mu}$ | mean (vector) |
| $\boldsymbol{\Sigma}$ | covariance matrix |
Basic symbols.
Basic notational conventions are for the most part standard. This book uses capital letters for random variables, lowercase for their instantiations, boldface italic font for vectors, and italic for scalar variables. The (generally Latin) letters for matrices are capitalized and bolded, but (unless random) in Roman font, and not necessarily from the front of the alphabet.
The set of all standard parameters (means, variances, and the like) of a distribution is generally denoted as a single vector, with either $\boldsymbol{\theta}$ or $\boldsymbol{\phi}$.
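For concreteness, the following minimal LaTeX sketch shows one way these font conventions might be implemented; the macro names are my own inventions, not the book's source:

```latex
\documentclass{article}
\usepackage{amsmath}  % for \boldsymbol

% Hypothetical macros implementing the conventions above:
\newcommand{\vecvar}[1]{\boldsymbol{#1}}  % vectors: boldface italic
\newcommand{\matvar}[1]{\mathbf{#1}}      % matrices: boldface roman

\begin{document}
Scalar random variable $X$ and instantiation $x$;
vector random variable $\vecvar{X}$ and instantiation $\vecvar{x}$;
matrix $\matvar{W}$; parameter vector $\vecvar{\theta}$.
\end{document}
```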
Arguments and variables
In this textbook, I distinguish notationally between the arguments of functions (on the one hand) and variables, at which a function might be evaluated (on the other). Why?
An ambiguity in argument binding.
In standard notation, a function might be defined with the expression
$$f(x) = (x - c)^2. \tag{1.1}$$
Although this usually causes no problems, note that nothing in Eq. 1.1 itself tells us that it holds for all values of $x$, but only for a single, contextually determined value of $c$; the fact that we are (mentally) to insert a universal quantifier over $x$, but not over $c$, is established only by convention and context.
There are several standard alternatives, but none is wholly satisfactory.
We could include all quantifiers whenever there is ambiguity—but ambiguity is often in the eye of the beholder, and it is dangerous for a textbook to assume that an expression is perfectly transparent.
We could simply include all quantifiers, but equations with many arguments would be littered with them: $\forall x\, \forall y\, \forall z\, \ldots$. Or we could omit the arguments altogether, replacing each with a raised dot, as in
$$p(\cdot \mid \cdot) = \frac{p(\cdot \mid \cdot)\, p(\cdot)}{p(\cdot)},$$
although this notation raises a new question: which dots on the left correspond to which on the right?
Subscripts to the rescue?
Now, in a statistics textbook, the probability-mass function associated with a discrete random variable $X$ is subscripted with (the capital letter denoting) that random variable, so that Bayes' theorem, for example, can be written
$$p_{X|Y}(\cdot \mid \cdot) = \frac{p_{Y|X}(\cdot \mid \cdot)\, p_X(\cdot)}{p_Y(\cdot)}. \tag{1.2}$$
The convention for understanding it is that omitted arguments (the raised dots) correspond to the random variables named in the subscripts.
But this proposal, too, has problems. First of all, although the subscripts make it possible to infer which omitted arguments on the left correspond to which on the right, the dots themselves are just noise. For the reader who is not convinced by Eq. 1.2, I suggest
$$p_{\boldsymbol{X}|\boldsymbol{Y}}(\cdot \mid \boldsymbol{y}_0), \tag{1.3}$$
a partially evaluated function that we will encounter in Chapter 2. Second (what is fatal), raised dots can't be used for variables occurring outside of the list of function arguments. For example, how are we to write Eq. 1.1?—certainly not
$$f(\cdot) = (\cdot - c)^2. \tag{1.4}$$
Gray arguments.
We can get a hold of the fundamental issue that we are grappling with here by distinguishing function arguments from variables. This is most intuitive in terms of a programming language. For example, in the following snippet of (Python) code,
```python
def quadratic(x):
    return (x - c)**2
```
x is an argument to the quadratic function, whereas c is a variable that is (presumably) bound somewhere in the enclosing scope. Critically, x is an argument both in the function declaration, def quadratic(x):, and in the function body, return (x - c)**2. A function can also be defined as a partially evaluated instance of another function:
```python
def shifted_quadratic(x, c):
    return (x - c)**2

def centered_quadratic(x):
    return shifted_quadratic(x, 0)
```
Both x and c are arguments of shifted_quadratic, but centered_quadratic has only a single argument, x.
It is analogous to the partially evaluated function exhibited in Eq. 1.3, whose only arguments are the (possible) values of $\boldsymbol{X}$.
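The same distinction can be exhibited with Python's own tools. The following sketch is my own illustration, not from the text: it uses functools.partial to build the analogue of a partially evaluated function, and shows that the free variable c in quadratic is looked up in the enclosing scope only when the function is called.

```python
from functools import partial

def shifted_quadratic(x, c):
    return (x - c)**2

# Partial evaluation: fix c = 0, leaving a function of x alone,
# just like centered_quadratic above.
centered = partial(shifted_quadratic, c=0)
assert centered(3) == 9

# A free variable, by contrast, is resolved in the enclosing scope
# at call time, not at definition time.
def quadratic(x):
    return (x - c)**2

c = 1
assert quadratic(3) == 4   # uses c = 1
c = 2
assert quadratic(3) == 1   # rebinding c changes the result
```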
With some reservations, I have introduced a new notational convention in this book to mark this distinction between arguments and variables, employing a gray font color for the former. For example, Eq. 1.3 will be written as
$$p_{\boldsymbol{X}|\boldsymbol{Y}}(\textcolor{gray}{\boldsymbol{x}} \mid \boldsymbol{y}_0). \tag{1.5}$$
As in Eq. 1.3, the fact that the function is partially evaluated at $\boldsymbol{y}_0$ is communicated by the black (rather than gray) font.
This notational convention neatly solves the problems just discussed. That is, it makes clear which variables are universally quantified—namely, the arguments, in gray—without resorting to explicit quantification, verbal context, or subscripts and dots. This last point is particularly appealing, and not only because the gray arguments are easier to read and generalize better (recall Eq. 1.4), although these are their chief merits. The convention also provides an alternative mechanism for disambiguating probability-mass and -density functions from each other: by their (gray) arguments rather than by subscripts. Indeed, this is the standard device employed in the machine-learning literature—but without the distinction between arguments and variables that solves our main problem.
And then, finally, we will see below that this distinction is exceedingly useful for another purpose: distinguishing partial and total derivatives.
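For readers who wish to reproduce the gray-argument convention in their own notes, here is a minimal LaTeX sketch; the xcolor-based implementation and the macro name \grayarg are my assumptions, not the book's actual source:

```latex
\documentclass{article}
\usepackage{amsmath}  % for \boldsymbol
\usepackage{xcolor}   % provides \textcolor (assumed implementation choice)

% Mark a function *argument*, as distinct from a variable bound in the
% surrounding context (macro name is hypothetical).
\newcommand{\grayarg}[1]{\textcolor{gray}{#1}}

\begin{document}
% The argument x is gray; the evaluation point y_0 is black.
\[
  p_{\boldsymbol{X}|\boldsymbol{Y}}(\grayarg{\boldsymbol{x}} \mid \boldsymbol{y}_0)
\]
\end{document}
```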
Probabilistic functions and functionals
Symbols for probability mass and density.
| symbol | use |
| --- | --- |
| $p$ | the data mass/density function |
| $\hat{p}$ | the model mass/density functions |
This text indiscriminately uses the same letter for probability-mass and probability-density functions, in both cases the usual (for mass functions) $p$. For model distributions we generally employ $\hat{p}$.
Now, it is a fact from elementary probability theory [XXX] that a random variable carries with it a probability distribution.
Conversely, it makes no sense to talk about two different probability distributions over the same random variable—although texts on machine learning routinely do, usually in the context of relative or cross entropy [GoodfellowXXX].
We will indeed often be interested in (e.g.) the relative entropy (KL divergence) of two distributions, but these will be the distributions of two different random variables: a data variable $\boldsymbol{Y}$, with density $p$, and its model counterpart $\hat{\boldsymbol{Y}}$, with density $\hat{p}$. It may at first blush seem surprising, then, to see the relative entropy (KL divergence) expressed as
$$\mathrm{D}_{\mathrm{KL}}\!\left\{ p \,\middle\|\, \hat{p} \right\} = \int_{\boldsymbol{y}} p(\boldsymbol{y}) \log\frac{p(\boldsymbol{y})}{\hat{p}(\boldsymbol{y})} \, \mathrm{d}\boldsymbol{y};$$
that is, with the two densities evaluated at the same values $\boldsymbol{y}$. But no contradiction is involved: $p$ and $\hat{p}$ are simply two different functions, and nothing prevents us from evaluating both at a common point.
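To make the point concrete, here is a small numerical sketch (my own construction, with arbitrary numbers, not the book's): two distinct mass functions evaluated at the same support points.

```python
import math

# Two distinct mass functions over the same support {0, 1, 2}:
# p is the "data" distribution, p_hat the "model" distribution.
p     = [0.5, 0.3, 0.2]
p_hat = [0.4, 0.4, 0.2]

# D_KL{p || p_hat}: both functions are evaluated at the same points y,
# but they remain two different functions.
kl = sum(py * math.log(py / qy) for py, qy in zip(p, p_hat))
print(f"relative entropy: {kl:.4f} nats")
```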
Still, the conventions are not bulletproof. Consider for example density functions for two different data distributions,
$$p(\textcolor{gray}{\boldsymbol{x}}) \qquad\text{and}\qquad p(\textcolor{gray}{\boldsymbol{y}}).$$
These are distinguished not by any diacritics, but by their arguments.
According to our convention, these arguments are generally listed (in gray), so it is usually possible to tell these two distributions apart.
And even when considering evaluated density functions, we can typically disambiguate by our choice of letter for the observations: $p(\boldsymbol{x}_0)$ is (an evaluation of) the first density, $p(\boldsymbol{y}_0)$ the second.
| symbol | use |
| --- | --- |
| $\mathbb{E}[\boldsymbol{X}]$ | expectation of $\boldsymbol{X}$ |
| $\left\langle \boldsymbol{X} \right\rangle$ | sample average of $\boldsymbol{X}$ |
| $\mathrm{Var}[X]$ | variance of $X$ |
| $\mathrm{Cov}[\boldsymbol{X}]$ | covariance matrix of $\boldsymbol{X}$ |
| $\mathrm{Cov}[X, Y]$ | covariance between $X$ and $Y$ |
Expectation, covariance, and sample averages.
The symbol $\mathbb{E}[\cdot]$ denotes an expectation, that is, an average taken under a probability distribution; angle brackets, $\langle \cdot \rangle$, denote an average taken under a sample.
The distribution with respect to which an expectation is taken will only occasionally be inferable from its argument, so we will typically resort to subscripts (the previous discussion notwithstanding). For example, we will write
$$\mathbb{E}[\boldsymbol{X}], \qquad \mathbb{E}_{\boldsymbol{X}}\!\left[f(\boldsymbol{X})\right], \qquad \mathbb{E}_{\boldsymbol{X}|\boldsymbol{Y}}\!\left[f(\boldsymbol{X}) \mid \textcolor{gray}{\boldsymbol{y}}\right].$$
Thus, e.g., the subscripts in the second and third examples tell us that the expectation is taken under the distribution $p_{\boldsymbol{X}}$ and the conditional distribution $p_{\boldsymbol{X}|\boldsymbol{Y}}$, respectively.
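To illustrate the difference between an expectation and a sample average, here is a short sketch (my own example; the distribution is an arbitrary choice): the sample average $\langle X \rangle$ of draws merely approximates the expectation $\mathbb{E}[X]$, improving as the sample grows.

```python
import random

random.seed(0)

# E[X] for X ~ Exponential(rate=2) is exactly 1/2.
expectation = 1 / 2

# <X>: the average of n samples, itself a random quantity that
# approximates (but does not equal) the expectation.
for n in (100, 10_000, 1_000_000):
    samples = [random.expovariate(2) for _ in range(n)]
    sample_average = sum(samples) / n
    print(n, round(sample_average, 4), "vs E[X] =", expectation)
```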
Let us put together some of our conventions with an iterated expectation under the distributions $p_{\boldsymbol{Y}}$ and $p_{\boldsymbol{X}|\boldsymbol{Y}}$:
$$\mathbb{E}_{\boldsymbol{Y}}\!\left[\mathbb{E}_{\boldsymbol{X}|\boldsymbol{Y}}\!\left[\boldsymbol{X} \mid \boldsymbol{Y}\right]\right] = \int_{\boldsymbol{y}} \left(\int_{\boldsymbol{x}} \boldsymbol{x}\, p_{\boldsymbol{X}|\boldsymbol{Y}}(\boldsymbol{x} \mid \boldsymbol{y})\, \mathrm{d}\boldsymbol{x}\right) p_{\boldsymbol{Y}}(\boldsymbol{y})\, \mathrm{d}\boldsymbol{y}.$$
There are a few things to notice. First of all, the inner expectation, $\mathbb{E}_{\boldsymbol{X}|\boldsymbol{Y}}[\boldsymbol{X} \mid \boldsymbol{Y}]$, is a function of the random variable $\boldsymbol{Y}$, and is therefore itself a random variable; it is this quantity that the outer expectation averages. Second, the distributions under which the expectations are taken are indicated by the subscripts. Thus, the subscripts to the conditional expectation tell us that it is taken with respect to $p_{\boldsymbol{X}|\boldsymbol{Y}}$, rather than (say) the joint distribution of $\boldsymbol{X}$ and $\boldsymbol{Y}$. Third, the "vector differentials" are to be interpreted as
$$\mathrm{d}\boldsymbol{x} = \mathrm{d}x_1\, \mathrm{d}x_2 \cdots \mathrm{d}x_N,$$
and the integral as an iterated integral; hence:
$$\int_{\boldsymbol{x}} \boldsymbol{x}\, p_{\boldsymbol{X}|\boldsymbol{Y}}(\boldsymbol{x} \mid \boldsymbol{y})\, \mathrm{d}\boldsymbol{x} = \int_{x_1}\!\cdots\!\int_{x_N} \boldsymbol{x}\, p_{\boldsymbol{X}|\boldsymbol{Y}}(\boldsymbol{x} \mid \boldsymbol{y})\, \mathrm{d}x_N \cdots \mathrm{d}x_1.$$
The subscript to the integral tells us that it is to be taken over the entire support of the corresponding random variable.
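The structure of the iterated expectation can also be checked numerically. The following sketch is my own construction, with an arbitrary two-stage model chosen purely for illustration:

```python
import random

random.seed(1)

# A simple two-stage model (arbitrary choice for illustration):
#   Y ~ Uniform(0, 1);   X | Y = y ~ Normal(mean=y, std=1).
# Then E_{X|Y}[X | y] = y, so E_Y[ E_{X|Y}[X | Y] ] = E[Y] = 1/2.

N = 200_000
total = 0.0
for _ in range(N):
    y = random.random()        # draw y from p_Y
    x = random.gauss(y, 1.0)   # draw x from p_{X|Y}(. | y)
    total += x                 # averaging x over both draws
print("estimate of E_Y[E_{X|Y}[X|Y]]:", round(total / N, 3), "(exact: 0.5)")
```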
Derivatives
[[The use of transposes in vector (and matrix) derivatives. The total derivative vs. partial derivatives. The “gradient” and the Hessian.]]