5.1 Supervised Learning
In the classical notion of a discriminative model, the inputs and outputs are completely observed.
That is, classically, learning in discriminative models is supervised.
We shall nevertheless have occasion to explore unsupervised learning in discriminative models—in the next section.
Here, we describe supervised learning as an instance of the general approach laid out in Section 4.2, to wit, minimizing relative or cross entropy.
Since we are only modeling the conditional distribution, this means conditional relative or cross entropies.
But remember that an expectation is still taken under the data distribution over the inputs.
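In generic notation (chosen here for illustration, and not necessarily matching the symbols used elsewhere in the book), the objective can be written so as to make this explicit: the conditional cross entropy is still averaged over inputs drawn from the data distribution,

$$\mathcal{L}(\boldsymbol{\theta}) \;=\; \mathbb{E}_{p_{\text{data}}(\boldsymbol{x})}\Big[\mathbb{E}_{p_{\text{data}}(\boldsymbol{y}|\boldsymbol{x})}\big[-\log p_{\boldsymbol{\theta}}(\boldsymbol{y}|\boldsymbol{x})\big]\Big] \;\approx\; -\frac{1}{N}\sum_{n=1}^{N}\log p_{\boldsymbol{\theta}}(\boldsymbol{y}_n|\boldsymbol{x}_n),$$

with the sample average standing in for the unavailable data distribution.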
How we proceed from here depends on our answers to the two questions posed at the outset.
5.1.1 Linear regression
Here we begin our account of supervised learning with the simplest model: the (multivariate) normal distribution, whose mean depends linearly on the inputs:
Note that this could easily be extended to an affine function,
To find
where the move to a sample average in the final line reflects the fact that we have only samples from the data distribution.
The final line is famous enough to have earned its own name, the normal equations††margin: normal equations.
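As a concrete (if minimal) numerical sketch, with samples stored as the columns of the input and output arrays and with illustrative variable names not taken from the text, the normal equations can be solved directly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: inputs X (n_in x N) and outputs Y (n_out x N),
# with samples stored as columns.
n_in, n_out, N = 3, 2, 500
X = rng.standard_normal((n_in, N))
W_true = rng.standard_normal((n_out, n_in))
Y = W_true @ X + 0.1 * rng.standard_normal((n_out, N))

# Normal equations: W (X X^T) = Y X^T, solved for W.
W_hat = np.linalg.solve(X @ X.T, X @ Y.T).T

print(np.allclose(W_hat, W_true, atol=0.05))   # recovers the generating map
```

In numerical practice one would more often call a least-squares routine (e.g. np.linalg.lstsq) rather than form the product X X^T explicitly, since the latter squares the condition number; the sketch above simply mirrors the normal equations as written.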
Acquiring
Regularization.
The inverted term
Our point of departure is to interpret the rows of
Then in place of the original conditional distribution,
For example, consider a zero-mean, Gaussian prior distribution over each row of the weight matrix,

$$p(\boldsymbol{w}) = \mathcal{N}\!\left(\boldsymbol{0},\, \boldsymbol{\Sigma}_{\boldsymbol{w}}\right) \propto \exp\!\left\{-\tfrac{1}{2}\,\boldsymbol{w}^{\mathrm{T}}\boldsymbol{\Sigma}_{\boldsymbol{w}}^{-1}\boldsymbol{w}\right\}. \tag{5.4}$$
We have assumed the same covariance matrix for all
Then we can solve for
[[Since the energy in Eq. 5.4 represents an
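Under the further assumption that the prior covariance is isotropic, the regularized solution simply adds a scaled identity to the term being inverted, i.e. ridge regression. A minimal sketch (function and variable names are illustrative, not the text's):

```python
import numpy as np

def ridge_regression(X, Y, lam):
    """Regularized least squares for Y ~ W X, with samples in columns.

    lam plays the role of the (isotropic) prior precision, scaled by the
    noise variance; lam = 0 recovers the ordinary normal equations.
    """
    n_in = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(n_in), X @ Y.T).T
```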
Newton-Raphson.
The least-squares penalty can be solved in one step because the resulting cost function is quadratic in
The basic Newton-Raphson method aims to find the roots (zeros) of a scalar function of a scalar input,
The procedure is easily extended to vector-valued functions of vector inputs, as long as the vectors have the same length (otherwise the extension is more complicated).
This is especially germane to our problem of finding extrema of a function
In our case, the parameters are in the form of a matrix,
Consequently, the Newton-Raphson update becomes
where on the last line we have simply collected up the updates for all rows. We see that for any starting point, the solution (which doesn’t depend on that point) is reached in one step, as expected. Thus the Newton-Raphson algorithm provides another route to the normal equations.
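The one-step convergence is easy to verify numerically. Here is a sketch (illustrative names, a squared-error cost assumed) comparing a single Newton-Raphson step from a random initialization with the normal-equations solution:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out, N = 3, 2, 200
X = rng.standard_normal((n_in, N))
Y = rng.standard_normal((n_out, n_in)) @ X + 0.1 * rng.standard_normal((n_out, N))

# Quadratic (least-squares) cost in W: gradient and (constant) Hessian.
grad = lambda W: (W @ X - Y) @ X.T / N   # d/dW of 0.5*mean ||Y - W X||^2
hess = X @ X.T / N                       # Hessian w.r.t. each row of W

# One Newton-Raphson step from an arbitrary starting point.
W0 = rng.standard_normal((n_out, n_in))
W1 = W0 - np.linalg.solve(hess, grad(W0).T).T

# The step lands on the normal-equations solution regardless of W0.
W_ne = np.linalg.solve(X @ X.T, X @ Y.T).T
print(np.allclose(W1, W_ne))
```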
Moment matching under additive noise.
Linear regression is so ubiquitous that the assumption of Gaussian noise, although frequently justifiable by reference to the central limit theorem, may feel overly restrictive. Let us therefore retain the assumption of a linear map with additive, independent, zero-mean noise,
but no longer require the noise,
Notice that we do not need to specify the expected outer product of the outputs. Using the fact that the noise is independent of the inputs and zero-mean, we see that
So assuming
Furthermore, as lately noted, equality of means can be achieved simply by centering the data.
That leaves
the normal equations.
In fine, assuming that the additive noise is Gaussian has the same net effect as fitting a kind of second-order approximation to the joint distribution. This should not be surprising, since the normal distribution is the most “agnostic” of all distributions that specify the first two moments.
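The claim can also be checked numerically. In the sketch below (illustrative names; the additive noise is uniform rather than Gaussian, and zero-mean by construction so no centering is needed), the moment-matched estimate coincides with the least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_out, N = 3, 2, 1000
X = rng.standard_normal((n_in, N))
# Additive noise that is independent and zero-mean but NOT Gaussian.
noise = rng.uniform(-0.5, 0.5, size=(n_out, N))
W_true = rng.standard_normal((n_out, n_in))
Y = W_true @ X + noise

# Moment matching: W = E[y x^T] (E[x x^T])^{-1}, estimated from samples.
Sigma_yx = Y @ X.T / N
Sigma_xx = X @ X.T / N
W_mm = Sigma_yx @ np.linalg.inv(Sigma_xx)

# Identical (up to floating point) to the normal-equations solution.
W_ne = np.linalg.solve(X @ X.T, X @ Y.T).T
print(np.allclose(W_mm, W_ne))
```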
The emission covariance.
Let us pause briefly to consider the meaning of, and whether we should be surprised by, the fact that
What we do not have is a separate set of parameters for every pair of samples,
Linear regression without probability distributions
Accordingly, let us assemble all the samples into two matrices:††margin: These are the transposes of the matrices used in most developments of linear regression, but they are consistent with all the other standard conventions used throughout this book.
In place of the preceding models, let us simply look for a solution to the linear equation
The equation has a unique solution if and only if
Still, let us proceed somewhat naïvely and simply look for an obvious linear-algebraic solution.
If
Again we have recovered the normal equations, this time apparently with even fewer assumptions.
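Numerically, the pseudoinverse route and the normal equations agree whenever the relevant matrix is invertible; a quick sketch (illustrative names):

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_out, N = 3, 2, 100
X = rng.standard_normal((n_in, N))
Y = rng.standard_normal((n_out, N))   # arbitrary outputs; no model assumed

# Right "division" by X via the Moore-Penrose pseudoinverse ...
W_pinv = Y @ np.linalg.pinv(X)

# ... agrees with the normal-equations solution when X X^T is invertible.
W_ne = np.linalg.solve(X @ X.T, X @ Y.T).T
print(np.allclose(W_pinv, W_ne))
```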
Lest the normal equations seem inevitable, let us recall that the choice to right multiply by the matrix
The answer can be read off the derivation in Eq. 5.2 above:
The normal equations arise from squared-error penalties, and accordingly the procedure is sometimes called the method of least squares††margin: method of least squares.
Evidently the assumption of normally distributed errors and the least-squares penalty are two sides of the same coin.
In the case of a linear map, when
Heteroscedastic data: weighted least-squares.
Now that we have examined linear regression thoroughly from the perspective of the data matrices
This variation is known as weighted least squares††margin: weighted least squares.
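A sketch of the corresponding estimator (illustrative names; the per-sample weights would typically be inverse noise variances):

```python
import numpy as np

def weighted_least_squares(X, Y, weights):
    """Weighted least-squares fit of Y ~ W X (samples in columns).

    `weights` is a length-N vector of per-sample weights, e.g. inverse
    noise variances for heteroscedastic data.
    """
    Xw = X * weights   # scale each column (sample) by its weight
    return np.linalg.solve(Xw @ X.T, Xw @ Y.T).T
```

Setting all the weights equal recovers the ordinary normal equations.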
Geometric arguments.
[[XXX]]
5.1.2 Generalized linear models
Having explored rather thoroughly the simplest model, let us revisit the methodological choices posed at the beginning of the chapter. In particular, suppose we relax the assumption that the conditional be Gaussian, but retain the assumption that the data are combined linearly across dimensions.††margin: Some discussion of finite sufficient statistics to motivate exponential families [[….Use exponential families for the sake of finite sufficient stats….]] This class includes the multivariate Gaussian, but also many other familiar distributions—Poisson, multinomial, Gamma, Dirichlet, etc.—as special cases. On the one hand, for none of these other distributions is the cross entropy quadratic in the parameters (recall Eq. 5.2), so closed-form minimizations are not possible. On the other hand, the cross-entropy loss does retain convexity for any exponential-family distribution, so the Newton-Raphson algorithm lately derived is guaranteed to find the global optimum. As we shall see, each step of the algorithm can be rewritten as solving a weighted least-squares problem, as in Eq. 5.10, except that the weights change at every iteration.
Exponential families
General form:
Cross-entropy minimizing parameters:
The optimal moment parameters occur when the gradient with respect to the natural parameters is zero….
Generalized linear model, assuming canonical response function:
Gradient:
When
When it is not, we need to adopt an iterative approach. Here we will use the Newton-Raphson algorithm discussed above. This requires the Hessian; for simplicity, we write it for scalar outputs only:
Then the IRLS update is
This is the solution to (i.e., the normal equations for) a weighted least-squares problem—cf. Eq. 5.10—but with the parenthetical quantity, rather than
We return to the cross-entropy gradient, Eq. 5.13.
The problem with setting this gradient to zero and solving for the optimal parameters is, as noted above, that
Substituting this into the gradient, Eq. 5.13, we have
This matches the update generated with Newton-Raphson, Eq. 5.15.
Thus, IRLS can be seen as approximating the gradient as locally linear in the parameters; or, perhaps more intuitively, as approximating the loss (the cross entropy) as locally quadratic in these parameters.
Accordingly, we seek a least-squares solution (because of the quadratic cost), but apply it iteratively (because the cost is an approximation).
The loss differs at each iteration because the “outputs” change, but also because the data are reweighted by
Iteratively reweighted least squares
IRLS for the Bernoulli distribution: Logistic regression
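Since the Bernoulli distribution with its canonical (logistic) response function is the classic special case, here is a sketch of the resulting IRLS iteration (illustrative names; the tiny ridge term is added purely for numerical stability and is not part of the derivation):

```python
import numpy as np

def irls_logistic(X, y, n_iters=25):
    """IRLS / Newton-Raphson for logistic regression (a sketch).

    X: (n_in, N) inputs, samples in columns; y: (N,) binary outputs in {0, 1}.
    Returns the weight vector w such that p(y=1|x) = sigmoid(w @ x).
    """
    n_in, N = X.shape
    w = np.zeros(n_in)
    for _ in range(n_iters):
        mu = 1.0 / (1.0 + np.exp(-(w @ X)))   # predicted means
        s = mu * (1.0 - mu)                   # per-sample weights (variances)
        # Working "outputs" of the weighted least-squares subproblem.
        z = w @ X + (y - mu) / np.maximum(s, 1e-12)
        # Solve the reweighted normal equations: (X S X^T) w = X S z.
        Xs = X * s
        w = np.linalg.solve(Xs @ X.T + 1e-12 * np.eye(n_in), Xs @ z)
    return w

# Example: recover a known weight vector from simulated data.
rng = np.random.default_rng(4)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((3, 2000))
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-(w_true @ X)))).astype(float)
print(irls_logistic(X, y))   # approximately w_true
```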
IRLS for the Poisson Distribution
Having derived IRLS, we get a feel for the algorithm by applying it to a few cases, beginning with Poisson noise.
For simplicity, we consider a single, scalar “output” or dependent variable; and the canonical link function, the natural logarithm.
This makes the mean of the distribution an exponentiated (inverse canonical link) linear function of the “input” or independent variables,
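A sketch of the corresponding IRLS iteration for the Poisson case (illustrative names; the weights and working outputs now involve the exponential mean):

```python
import numpy as np

def irls_poisson(X, y, n_iters=25):
    """IRLS for Poisson regression with the canonical (log) link (a sketch).

    X: (n_in, N) inputs, samples in columns; y: (N,) nonnegative counts.
    The conditional mean is modeled as exp(w @ x).
    """
    n_in, N = X.shape
    w = np.zeros(n_in)
    for _ in range(n_iters):
        mu = np.exp(w @ X)                    # predicted means
        s = mu                                # weights = variances under Poisson
        z = w @ X + (y - mu) / np.maximum(s, 1e-12)
        Xs = X * s
        w = np.linalg.solve(Xs @ X.T, Xs @ z)
    return w
```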
IRLS for the Gamma Distribution
††margin: Update to new notation and perhaps rewrite. We consider the case where the scale parameter is known, but the shape parameter is not.
The reverse is much more commonly considered (it is the “gamma” GLiM built into Matlab’s glmfit, for example), probably because the sufficient statistic for the shape parameter is just the sum over independent output samples,
Here we have made explicit that the only natural parameters depending on the inputs—and the parameters—are those associated with the shape parameter:
The trailing 1 only clutters the derivation, so we work instead with
We shall also need the first and second derivatives of the log-partition function with respect to
with
where
The Newton-Raphson algorithm also requires the Hessian. Differentiating a second time we find
with
Nomenclature
5.1.3 Artificial neural networks
In moving from linear regression to generalized linear models (GLiMs), we relaxed the assumption that the conditional distribution
One convenient way to achieve this is to compose multiple such maps together:
where
To see the increase in expressive power with layers, notice that
which evidently yields
In fact….††margin: Universal function approximators; Cybenko 1989 etc.;
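The composition itself is straightforward to write down. A sketch of the forward pass (illustrative names; the choice of tanh and of a linear output layer are assumptions made for concreteness):

```python
import numpy as np

def mlp_forward(x, weights, biases, nonlinearity=np.tanh):
    """Forward pass of a fully connected network: a composition of affine
    maps and elementwise nonlinearities.

    weights: list of matrices [W_1, ..., W_L]; biases: list of vectors.
    The final layer is left linear so that it can parameterize, e.g., the
    mean of a Gaussian or the natural parameters of another family.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = nonlinearity(W @ h + b)
    return weights[-1] @ h + biases[-1]
```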
Backpropagation
It is another question how to set the parameters
As throughout this chapter, we descend the gradient of the conditional cross-entropy.
But to focus on the new aspects of this process introduced by the composition of functions, let us ignore the specific distribution family and refer to the surprisal simply as
With some foresight, let us begin with the gradients with respect to the biases.
Interpreting
where
Proceeding to the penultimate layer, we find
and at the third-last layer,
This pattern now repeats at subsequent layers.
Hence the sequence of bias gradients can be computed with the backward recursion††margin:
implementation note: Because
Next, we take derivatives with respect to the weight matrices, beginning with the output layer (cf. Eq. B.7):
Proceeding to the penultimate layer, we find:
Proceeding backwards to earlier layers continues the pattern, so the weight gradients can be computed with the same backward recursion, Eq. 5.18, and the formula
Eqs. 5.18 and 5.19 define the backpropagation algorithm††margin: backprop for a generic neural network, although applying it to any individual case requires specifying
Hence, the parameters can be fit by descending this gradient with some step size $\eta$,

$$\mathbf{W}_l \leftarrow \mathbf{W}_l - \eta\,\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\mathbf{W}_l}, \qquad \boldsymbol{b}_l \leftarrow \boldsymbol{b}_l - \eta\,\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\boldsymbol{b}_l}, \tag{5.20}$$
although more sophisticated forms of gradient descent are typically used in practice.
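Putting the pieces together, here is a sketch of one backprop-plus-gradient-descent step for a small network with tanh hidden layers and a squared-error loss (i.e., Gaussian surprisal up to constants); the names and the choice of nonlinearity are illustrative rather than the text's:

```python
import numpy as np

def backprop_step(x, y, Ws, bs, eta=0.1):
    """One gradient-descent step on a squared-error loss for a tanh network,
    computed via the backward recursion.  x, y are 1-D arrays (one sample).
    """
    # Forward pass, caching each layer's activations.
    hs = [x]
    for W, b in zip(Ws[:-1], bs[:-1]):
        hs.append(np.tanh(W @ hs[-1] + b))
    y_hat = Ws[-1] @ hs[-1] + bs[-1]           # linear output layer

    # Backward recursion: delta is the gradient w.r.t. each layer's bias.
    delta = y_hat - y                          # output-layer error signal
    grads_W, grads_b = [], []
    for l in reversed(range(len(Ws))):
        grads_W.insert(0, np.outer(delta, hs[l]))
        grads_b.insert(0, delta)
        if l > 0:
            # Propagate through the weights and the tanh derivative.
            delta = (Ws[l].T @ delta) * (1.0 - hs[l] ** 2)

    # Plain gradient descent on all parameters.
    Ws = [W - eta * g for W, g in zip(Ws, grads_W)]
    bs = [b - eta * g for b, g in zip(bs, grads_b)]
    return Ws, bs
```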
Some examples.
Backpropagation through time (BPTT)
Suppose we have reason to believe that some of the structure of our network is “conserved” or repeated, as in Fig. LABEL:fig:XXX. For instance, we might imagine segments of the network to correspond to slices of time, in which case it is reasonable to treat every time slice as identical in structure, even though the inputs and outputs change over time. Or we might have reason to believe that certain spatial structures are repeated. In either case, the concrete meaning of the assumption is that the same parameters show up in multiple places in the network. The upshot for backprop is that a little care must be taken in distinguishing partial from total derivatives, after which the algorithm goes through as in the previous section.
To make this concrete, we consider a recurrent neural network (RNN)††margin: recurrent neural network (RNN) with inputs
For simplicity, we let the recurrent layer also be the output layer—i.e., the layer at which the loss function is evaluated—in which case the most general loss function can be expressed as
The first equality is just standard calculus; the second follows because
Because of Eq. 5.21, the total effect on
The recursion is initialized at
The backprop-through time algorithm consists of Eqs. 5.22 and 5.23.
Applying it to a particular model requires working out three partial derivatives,
and applying Eqs. 5.22 and 5.23 yields:
Note that the subscript on the Jacobian of the nonlinearity,
Notice the subtle difference with the backprop signal defined in the non-recurrent case: there
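For concreteness, here is a sketch of BPTT for a minimal vanilla RNN whose hidden layer is also the output layer, with a squared-error loss at every time step (the notation and the tanh nonlinearity are assumptions, not the book's):

```python
import numpy as np

def bptt_gradients(xs, ys, W, U, b):
    """Backprop through time for a minimal RNN:

        h_t = tanh(W h_{t-1} + U x_t + b),   L = 0.5 * sum_t ||h_t - y_t||^2.

    xs, ys: lists of 1-D input and target arrays, one per time step.
    Returns the gradients of L with respect to W, U, and b.
    """
    T = len(xs)
    n = b.shape[0]
    hs = [np.zeros(n)]                         # h_0 = 0
    for t in range(T):
        hs.append(np.tanh(W @ hs[-1] + U @ xs[t] + b))

    dW, dU, db = np.zeros_like(W), np.zeros_like(U), np.zeros_like(b)
    delta = np.zeros(n)                        # contribution from future steps
    for t in reversed(range(T)):
        h_t, h_prev = hs[t + 1], hs[t]
        # Total derivative dL/dh_t = direct (loss) term + propagated term.
        delta = (h_t - ys[t]) + delta
        # Gradient w.r.t. the pre-activation at time t.
        dz = delta * (1.0 - h_t ** 2)
        # Accumulate: the SAME parameters appear at every time step.
        dW += np.outer(dz, h_prev)
        dU += np.outer(dz, xs[t])
        db += dz
        # What flows back to h_{t-1} through the recurrence.
        delta = W.T @ dz
    return dW, dU, db
```

The accumulation of dW, dU, and db across time steps is where the shared parameters make themselves felt: each time slice contributes a partial derivative, and the total derivative is their sum.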