5.1 Supervised Learning
In the classical notion of a discriminative model, the inputs and outputs are completely observed. That is, classically, learning in discriminative models is supervised. We shall nevertheless have occasion to explore unsupervised learning in discriminative models—in the next section. Here, we describe supervised learning as an instance of the general approach laid out in Section 4.2, to wit, minimizing relative or cross entropy. Since we are only modeling the conditional distribution, this means minimizing the conditional relative or cross entropies. But remember that an expectation is still taken under :
How we proceed from here depends on our answers to the two questions posed at the outset.
5.1.1 Linear regression
Here we begin our account of supervised learning with the simplest model: the (multivariate) normal distribution, whose mean depends linearly on the inputs:
Note that this could easily be extended to an affine function, , but the notation is simpler if we assume that both the inputs and outputs, and , have been centered, in which case is otiose. Alternatively, a fixed value of 1 can be appended to the vector , and appended as a final column on —after all, we will make no assumptions about the distribution of .
To find , we differentiate the loss in Eq. 5.1:
where the move to a sample average in the final line reflects the fact that we have only samples from the data distribution. The final line is famous enough to have earned its own name, the normal equations††margin: normal equations . Acquiring through the normal equations is known as linear regression††margin: linear regression . In fact, our optimization is not the only route to the normal equations, and to gain more intuition about linear regression we shall examine several of these. But first we investigate a variation on the optimization just derived.
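To make the result concrete, the normal equations are easy to exercise numerically. The following is a minimal NumPy sketch, not from the text: the synthetic data, the variable names, and the samples-as-columns convention (anticipating the data matrices introduced later in this section) are my own choices.

    import numpy as np

    rng = np.random.default_rng(0)
    Dx, Dy, N = 5, 2, 200
    X = rng.standard_normal((Dx, N))              # inputs, one sample per column
    W_true = rng.standard_normal((Dy, Dx))
    Y = W_true @ X + 0.1 * rng.standard_normal((Dy, N))

    # Center inputs and outputs so that no bias term is needed.
    X -= X.mean(axis=1, keepdims=True)
    Y -= Y.mean(axis=1, keepdims=True)

    # Normal equations: W = (sum_n y_n x_n^T)(sum_n x_n x_n^T)^{-1}.
    W_hat = Y @ X.T @ np.linalg.inv(X @ X.T)

    # Numerically, a linear solve is preferable to an explicit inverse.
    W_stable = np.linalg.solve(X @ X.T, X @ Y.T).T
    assert np.allclose(W_hat, W_stable)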
Regularization.
The inverted term is a sum of outer products. Therefore the resulting matrix, although obviously square, is not invertible if there are fewer samples than dimensions of , in which case a unique solution for does not exist. In that case a pseudo-inverse can be used to solve Eq. 5.2. However, as we shall see, the standard pseudo-inverse is merely a special case of a more general and elegant solution to the problem of underdetermined††margin: underdetermined normal equations.
Our point of departure is to interpret the rows of as independent random vectors (recall that random vectors use bold italic capitals, whereas matrices are assigned bold Roman capitals), , that are also marginally independent of :
Then in place of the original conditional distribution, , we use the augmented conditional , a kind of posterior distribution over (for all ):
For example, consider a zero-mean, Gaussian prior distribution over . Intuitively, this encodes our belief that the parameters are most likely to be zero, with the penalty for being non-zero growing quadratically with distance. In this case, the energy and its gradient are
(5.4)
We have assumed the same covariance matrix for all . For simplicity, we also let the emission noise be isotropic,
Then we can solve for in closed-form:
[[Since the energy in Eq. 5.4 represents an L2 norm, this is known as Tikhonov, ridge, or L2-regularized regression††margin: Tikhonov, ridge, or L2-regularized regression .]] As long as the rank of is full, so is the rank of the inverted term (adding positive definite matrices cannot reduce rank), so the normal equations are now solvable even for rank-deficient . This term says, intuitively, that the solution ignores the inputs, , in proportion as we are confident that their corresponding weights are zero—i.e., the larger —but also in proportion to the noisiness of the input-output relationship, .
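As a hedged sketch (the function and argument names are mine), the regularized solution with an isotropic Gaussian prior of variance prior_var and isotropic emission noise of variance noise_var can be written in a few lines; note that it remains well defined even with fewer samples than input dimensions.

    import numpy as np

    def ridge_regression(X, Y, noise_var=1.0, prior_var=1.0):
        """MAP weights under a zero-mean, isotropic Gaussian prior.

        X : (Dx, N) centered inputs, one sample per column.
        Y : (Dy, N) centered outputs.
        Returns W with shape (Dy, Dx).
        """
        Dx = X.shape[0]
        lam = noise_var / prior_var      # larger noise or tighter prior => more shrinkage
        return np.linalg.solve(X @ X.T + lam * np.eye(Dx), X @ Y.T).T

    # Underdetermined case: more input dimensions than samples.
    rng = np.random.default_rng(1)
    X = rng.standard_normal((50, 10))    # Dx = 50, N = 10
    Y = rng.standard_normal((3, 10))     # Dy = 3
    W = ridge_regression(X, Y, noise_var=0.5, prior_var=2.0)
    print(W.shape)                       # (3, 50)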
Newton-Raphson.
Optimization under the least-squares penalty can be completed in one step because the resulting cost function is quadratic in . Still, for costs that are not quadratic, but nevertheless convex, the (celebrated) Newton-Raphson algorithm is guaranteed to find the unique global solution. In anticipation of encountering such costs (Section 5.1.2), we apply this method to linear regression.
The basic Newton-Raphson method aims to find the roots (zeros) of a scalar function of a scalar input, , as follows. From a given point in parameter space, , move along the local line tangent to to the point where it intercepts the axis; call this point and iterate. For appropriately smooth functions, each root , at which , is guaranteed to be the convergent value of this procedure when initialized from some surrounding neighborhood. Mathematically, the algorithm amounts to enforcing
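In code, this scalar root-finding iteration might look like the following minimal sketch (my own illustration), in which the tangent-line construction just described becomes the update theta_{i+1} = theta_i - f(theta_i)/f'(theta_i).

    def newton_raphson_scalar(f, fprime, theta0, n_iters=20, tol=1e-12):
        """Follow the tangent line at theta_i to its intercept with the axis."""
        theta = theta0
        for _ in range(n_iters):
            step = f(theta) / fprime(theta)
            theta = theta - step          # theta_{i+1} = theta_i - f(theta_i)/f'(theta_i)
            if abs(step) < tol:
                break
        return theta

    # Example: starting near 1, the iteration converges to the root sqrt(2).
    root = newton_raphson_scalar(lambda t: t**2 - 2.0, lambda t: 2.0*t, theta0=1.0)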
The procedure is easily extended to vector-valued functions of vector inputs, as long as the vectors have the same length (otherwise the extension is more complicated). This is especially germane to our problem of finding extrema of a function , since the roots of the function are extrema of . Replacing the scalar-valued functions above with their vector and matrix counterparts yields:
In our case, the parameters are in the form of a matrix, , which would make the Hessian a tensor. To avoid this ugliness, let us work with one row of the matrix, , at a time. Let us also, for simplicity, continue with the assumption that the emission noise is isotropic, Eq. 5.5. Then from Eq. 5.2, the gradient and Hessian of the cross entropy are
Consequently, the Newton-Raphson update becomes
where on the last line we have simply collected up the updates for all rows. We see that for any starting point, the solution (which doesn’t depend on that point) is reached in one step, as expected. Thus the Newton-Raphson algorithm provides another route to the normal equations.
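The one-step property is easy to verify numerically. In this hedged sketch (the synthetic data and names are mine), a single Newton-Raphson update from a random starting point lands exactly on the normal-equations solution for one row of the weight matrix.

    import numpy as np

    rng = np.random.default_rng(2)
    Dx, N = 4, 100
    X = rng.standard_normal((Dx, N))                 # inputs, one sample per column
    y = rng.standard_normal(Dx) @ X + 0.1 * rng.standard_normal(N)   # one output row

    w0 = rng.standard_normal(Dx)                     # arbitrary starting point
    grad = X @ (w0 @ X - y) / N                      # gradient of the squared-error cost
    hess = X @ X.T / N                               # Hessian (constant in the weights)

    w_newton = w0 - np.linalg.solve(hess, grad)      # one Newton-Raphson step
    w_normal = np.linalg.solve(X @ X.T, X @ y)       # normal equations
    assert np.allclose(w_newton, w_normal)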
Moment matching under additive noise.
Linear regression is so ubiquitous that the assumption of Gaussian noise, although frequently justifiable by reference to the central limit theorem, may feel overly restrictive. Let us therefore retain the assumption of a linear map with additive, independent, zero-mean noise,
but no longer require the noise, , to be normally distributed. Now, without a probability model, it is no longer possible to minimize the relative entropy between data and model distributions. Instead, we will match their first two moments. More precisely, we shall require that
Notice that we do not need to specify the expected outer product of the outputs. Using the fact that the noise is independent of the inputs and zero-mean, we see that
So assuming is “bigger” than , the noise covariance can always make up the difference.
Furthermore, as lately noted, equality of means can be achieved simply by centering the data. That leaves . Applying our “model” (Eq. 5.8), using the fact that the means are zero, and then setting the model expectations equal to the data expectations, yields:
the normal equations.
In fine, assuming that the additive noise is Gaussian has the same net effect as fitting a kind of second-order approximation to the joint distribution. This should not be surprising, since the normal distribution is the most “agnostic” of all distributions that specify the first two moments.
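The equivalence can also be checked numerically. In the following hedged sketch (the choice of non-Gaussian noise and all names are mine), matching the first two moments of data generated with Laplace-distributed additive noise recovers exactly the same weights as the normal equations.

    import numpy as np

    rng = np.random.default_rng(3)
    Dx, Dy, N = 4, 2, 2000
    X = rng.standard_normal((Dx, N))                     # inputs, one sample per column
    Y = rng.standard_normal((Dy, Dx)) @ X + rng.laplace(size=(Dy, N))  # non-Gaussian noise

    X -= X.mean(axis=1, keepdims=True)                   # match first moments by centering
    Y -= Y.mean(axis=1, keepdims=True)

    # Match second moments: W = E[y x^T] E[x x^T]^{-1} (empirical expectations).
    W_moment = (Y @ X.T / N) @ np.linalg.inv(X @ X.T / N)

    # Identical to the normal equations.
    W_normal = np.linalg.solve(X @ X.T, X @ Y.T).T
    assert np.allclose(W_moment, W_normal)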
The emission covariance.
Let us pause briefly to consider the meaning of, and whether we should be surprised by, the fact that disappears from Eq. 5.2. It implies that use of the normal equations makes no assumptions about the “shape” (covariance) of the noise about points in the output space. For example, some outputs could be “more important” (lower-variance) than, or correlated with, others. These correspond respectively to differing values on the diagonal, and non-zero values in the off-diagonals, of the covariance matrix. The reason this makes no difference to the solution is that we have at our disposal a separate row in for every output, : we can safely adjust weights for one output dimension without affecting any other. ††margin: But not under regularization; explain this.
What we do not have is a separate set of parameters for every pair of samples, . This caused no complications because the samples were assumed to be i.i.d. Nevertheless, the assumption of i.i.d. samples is not always appropriate. For example…. The data are then heteroscedastic††margin: heteroscedastic (differently dispersed) rather than homoscedastic††margin: homoscedastic (identically dispersed). Still, if the dependence between samples is only second-order—i.e., in the first-order correlations—and known, the linear regression has a complete solution that is only slightly more complicated. It is most elegantly derived when the heteroscedastic data samples are represented explicitly.
Linear regression without probability distributions
Accordingly, let us assemble all the samples into two matrices (these are the transposes of the matrices used in most developments of linear regression, but they are consistent with all the other standard conventions used throughout this book):
In place of the preceding models, let us simply look for a solution to the linear equation
The equation has a unique solution if and only if is square—that is, there are precisely as many samples () as dimensions ()—and full-rank. When is “tall” (more dimensions than samples, ), the solution is underdetermined. We have lately addressed this with regularization. When is “fat” (more samples than dimensions, ), the solution is overdetermined: no matrix satisfies Eq. 5.9, and a notion of “best fit” must be imposed in order to select just one of the many approximately satisfactory matrices.
Still, let us proceed somewhat naïvely and simply look for an obvious linear-algebraic solution. If is full-rank in addition to being fat, then the Gram matrix (in sample space) is square () and full-rank, and therefore invertible. Hence, starting from Eq. 5.9,
Again we have recovered the normal equations, this time apparently with even fewer assumptions. Lest the normal equations seem inevitable, let us recall that the choice to right multiply by the matrix was, although perhaps obvious, also arbitrary. Note in particular that, under our assumption that is fat, choosing according to the normal equations does not actually satisfy Eq. 5.9—indeed, no choice of could, because the problem is overdetermined. What notion of “best fit” have we tacitly imposed in order to select one value for ?
The answer can be read off the derivation in Eq. 5.2 above: The normal equations arise from squared-error penalties, and accordingly the procedure is sometimes called the method of least squares††margin: method of least squares . Evidently the assumption of normally distributed errors and the least-squares penalty are two sides of the same coin. In the case of a linear map, when is full rank, the least-squares solution exists. And since the quadratic is convex, the solution is unique; that is, the normal equations yield the global minimizer of the least-squares penalty. As a “sanity check,” we can also reassure ourselves that when Eq. 5.9 really does hold, i.e., when is invertible, the normal equations reduce to the unique solution, .
Heteroscedastic data: weighted least-squares.
Now that we have examined linear regression thoroughly from the perspective of the data matrices and , let us complicate the problem along the lines suggested above. Suppose that the samples in these matrices are not i.i.d. In particular, let and be random vectors of all samples, for dimensions and of the output and input (resp.), and suppose (fixed for all and ). Then we can decorrelate simply by multiplying by . In terms of the data matrix , in which each is a row, not a column, this amounts to right multiplication. To maintain the linear relationship in Eq. 5.9, we multiply both sides by . Replacing with and with turns the normal equations into
This variation is known as weighted least squares††margin: weighted least squares .
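A hedged sketch of the resulting estimator follows (the function and argument names are mine), with a known N-by-N covariance across samples shared by all output dimensions; when that covariance is the identity, the update reduces to the ordinary normal equations.

    import numpy as np

    def weighted_least_squares(X, Y, Sigma):
        """Weighted least squares for samples correlated across n.

        X     : (Dx, N) inputs, one sample per column.
        Y     : (Dy, N) outputs.
        Sigma : (N, N) known covariance across samples, shared by all dimensions.
        Returns W with shape (Dy, Dx).
        """
        Sigma_inv = np.linalg.inv(Sigma)
        return Y @ Sigma_inv @ X.T @ np.linalg.inv(X @ Sigma_inv @ X.T)

    # With Sigma = I, this is ordinary least squares.
    rng = np.random.default_rng(4)
    X, Y = rng.standard_normal((3, 40)), rng.standard_normal((2, 40))
    W_ols = weighted_least_squares(X, Y, np.eye(40))
    assert np.allclose(W_ols, np.linalg.solve(X @ X.T, X @ Y.T).T)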
Geometric arguments.
[[XXX]]
5.1.2 Generalized linear models
Having explored rather thoroughly the simplest model, let us revisit the methodological choices posed at the beginning of the chapter. In particular, suppose we relax the assumption that the conditional be Gaussian, but retain the assumption that the data are combined linearly across dimensions.††margin: Some discussion of finite sufficient statistics to motivate exponential families [[….Use exponential families for the sake of finite sufficient stats….]] This class includes the multivariate Gaussian, but also many other familiar distributions—Poisson, multinomial, Gamma, Dirichlet, etc.—as special cases. On the one hand, for none of these other distributions is the cross entropy quadratic in the parameters (recall Eq. 5.2), so closed-form minimizations are not possible. On the other hand, the cross-entropy loss does retain convexity for any exponential-family distribution, so the Newton-Raphson algorithm lately derived is guaranteed to find the global optimum. As we shall see, each step of the algorithm can be rewritten as solving a weighted least-squares problem, as in Eq. 5.10, except that the weights change at every iteration.
Exponential families
General form:
Cross-entropy minimizing parameters:
The optimal moment parameters occur when the gradient with respect to the natural parameters is zero….
Generalized linear model, assuming canonical response function:
Gradient:
When is a linear function of the parameters, , as for Gaussian noise, this gradient can simply be set to zero to solve for the optimal parameters.
When it is not, we need to adopt an iterative approach. Here we will use the Newton-Raphson algorithm discussed above. This requires the Hessian; for simplicity, we write it for scalar outputs only:
Then the IRLS update is
This is the solution to (i.e., the normal equations for) a weighted least-squares problem—cf. Eq. 5.10—but with the parenthetical quantity, rather than , as the output variable. To generate some intuition about what precisely this quantity is, let us derive IRLS from a different perspective.
We return to the cross-entropy gradient, Eq. 5.13. The problem with setting this gradient to zero and solving for the optimal parameters is, as noted above, that is not generally a linear function of those parameters, . However, if we concede that we must take an iterative approach, we can perhaps approximate this function as a linear displacement from the previous value of the parameters. That is, letting denote the inverse canonical link function, , we approximate
Substituting this into the gradient, Eq. 5.13, we have
This matches the update generated with Newton-Raphson, Eq. 5.15. Thus, IRLS can be seen as approximating the gradient as locally linear in the parameters; or, perhaps more intuitively, as approximating the loss (the cross entropy) as locally quadratic in these parameters. Accordingly, we seek a least-squares solution (because of the quadratic cost), but apply it iteratively (because the cost is an approximation). The loss differs at each iteration because the “outputs” change, but also because the data are reweighted by . This is a reweighting, rather than just a weighting, because depends on the current parameters, .
Iteratively reweighted least squares
IRLS for the Bernoulli distribution: Logistic regression
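For Bernoulli outputs with the canonical (logistic) link, the mean for sample n is the logistic function of the linear predictor and the variance is mu_n(1 - mu_n), so the IRLS update takes exactly the weighted-least-squares form derived above. The following is a hedged sketch (my own function and notation, not the text's):

    import numpy as np

    def irls_logistic(X, y, n_iters=25):
        """IRLS for logistic regression (Bernoulli noise, canonical logit link).

        X : (Dx, N) inputs, one sample per column.
        y : (N,) labels in {0, 1}.
        """
        w = np.zeros(X.shape[0])
        for _ in range(n_iters):
            eta = w @ X                            # linear predictor for every sample
            mu = 1.0 / (1.0 + np.exp(-eta))        # mean of the Bernoulli
            r = np.maximum(mu * (1.0 - mu), 1e-12) # variances = the (re)weights
            z = eta + (y - mu) / r                 # linearized "outputs" for this iteration
            XR = X * r                             # scales each column n by r_n
            w = np.linalg.solve(XR @ X.T, XR @ z)  # a weighted least-squares solve
        return w

Swapping in Poisson noise with its canonical log link changes only the two lines defining the mean and the weights (mu = np.exp(eta) and r = mu, since a Poisson's variance equals its mean), which is the case taken up next.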
IRLS for the Poisson Distribution
Having derived IRLS, we get a feel for the algorithm by applying it to a few cases, beginning with Poisson noise. For simplicity, we consider a single, scalar “output” or dependent variable; and the canonical link function, the natural logarithm. This makes the mean of the distribution an exponentiated (inverse canonical link) linear function of the “input” or independent variables, , i.e. . Recall that for Poisson random variables, the variance is equal to the mean. Inserting these values into the first line of Eq. 5.15 yields
IRLS for the Gamma Distribution
††margin: Update to new notation and perhaps rewrite.We consider the case where the scale parameter is known, but the shape parameter is not. The reverse is much more commonly considered (it is the “gamma” GLiM built into Matlab’s glmfit, for example), probably because the sufficient statistic for the shape parameter is just the sum over independent output samples, . For simplicity, we consider outputs of length one () only, which makes the parameters a (row) vector, . Now, the log-likelihood of the parameters under the gamma distribution is:
Here we have made explicit that the only natural parameters depending on the inputs—and the parameters—are those associated with the shape parameter: . The scale parameter is assumed constant across all samples. Since must be positive, we define the link function as:
The trailing 1 only clutters the derivation, so we work instead with , starting with its gradient with respect to the parameters:
We shall also need the first and second derivatives of the log-partition function with respect to , i.e., the expected value and variance of :
with the -order polygamma function. Putting these pieces together, the gradient (with respect to the parameters) of the entire cost function above is
where .
The Newton-Raphson algorithm also requires the Hessian. Differentiating a second time we find
with a diagonal matrix with entries . Notice, however, that if we use the expected Hessian, the term vanishes, leaving .
Nomenclature
5.1.3 Artificial neural networks
In moving from linear regression to generalized linear models (GLiMs), we relaxed the assumption that the conditional distribution be Gaussian, in particular allowing any exponential-family distribution. But in GLiMs, the map from inputs to the (moment) parameters of the outputs is still “almost” linear; more precisely, linear followed by a pointwise, monotonic nonlinearity. Let us now relax this assumption as well, that is, allow for more expressive maps.
One convenient way to achieve this is to compose multiple such maps together:
where is a pointwise nonlinearity at every “layer” , and the “bias” terms have been included explicitly. The variables at the final layer, , are then the moment parameters for .
To see the increase in expressive power with layers, notice that cannot include any “interaction” terms involving the product of two elements of , like , since acts elementwise. But interaction terms can appear even by , as seen in the simple special case
which evidently yields .
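To make the point concrete, here is one such construction (my own illustration, with a squaring nonlinearity; not necessarily the special case written above): two layers suffice to compute the product of two inputs, since x1*x2 = ((x1 + x2)**2 - (x1 - x2)**2) / 4.

    import numpy as np

    W1 = np.array([[1.0,  1.0],
                   [1.0, -1.0]])
    W2 = np.array([[0.25, -0.25]])

    def two_layer(x):
        z1 = (W1 @ x) ** 2        # layer 1: elementwise squaring nonlinearity
        return W2 @ z1            # layer 2: linear readout

    x = np.array([3.0, -2.0])
    print(two_layer(x))           # [-6.] == x1 * x2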
In fact….††margin: Universal function approximators; Cybenko 1989 etc.;
Backpropagation
It is another question how to set the parameters and at every layer in order to approximate any particular function. Part of the price of arbitrarily complex functions is non-convexity and expensive Hessian computations, and accordingly we set aside Newton-Raphson and focus on first-order methods. Computing the cross-entropy gradient with respect to all the parameters is conceptually straightforward, but to do so efficiently will require some punctilious bookkeeping. This is considerably simplified by differentiating with respect to matrices and vectors rather than scalars; the relevant rules are derived in Section B.1.
As throughout this chapter, we descend the gradient of the conditional cross-entropy. But to focus on the new aspects of this process introduced by the composition of functions, let us ignore the specific distribution family and refer to the surprisal simply as :
With some foresight, let us begin with the gradients with respect to the biases. Interpreting as a function of , and then applying the chain rule (recall that the chain rule for derivatives with respect to column vectors accumulates terms on the left, rather than the right; see Section B.1) through Eq. 5.17, we find the gradient at the final layer to be
where is the Jacobian of the nonlinearity, , at layer . Because acts elementwise on its vector argument, is diagonal. We have assigned a symbol to this gradient in anticipation of reusing it.
Proceeding to the penultimate layer, we find
and at the third-last layer,
This pattern now repeats at subsequent layers. Hence the sequence of bias gradients can be computed with the backward recursion††margin: implementation note: Because is diagonal, these matrix-vector products can be computed efficiently as elementwise vector products.
Next, we take derivatives with respect to the weight matrices, beginning with the output layer (cf. Eq. B.7):
Proceeding to the penultimate layer, we find:
Proceeding backwards to earlier layers continues the pattern, so the weight gradients can be computed with the same backward recursion, Eq. 5.18, and the formula
Eqs. 5.18 and 5.19 define the backpropagation algorithm††margin: backprop for a generic neural network, although applying it to any individual case requires specifying and , which we have left generic here. The weights and biases are updated simply by moving downhill along the gradients of the cross entropy, which are the expected values of the gradients just derived:
Hence,
(5.20)
although more sophisticated forms of gradient descent are typically used in practice.
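The whole algorithm fits in a few lines. The sketch below is mine: it assumes tanh nonlinearities and a squared-error surprisal (both left generic in the text), and it exploits the diagonal Jacobians by using elementwise products, as noted in the margin above.

    import numpy as np

    def forward(x, Ws, bs):
        """Forward pass z_l = tanh(W_l z_{l-1} + b_l); returns all activities."""
        zs, us = [x], []
        for W, b in zip(Ws, bs):
            us.append(W @ zs[-1] + b)
            zs.append(np.tanh(us[-1]))
        return zs, us

    def backprop(x, y, Ws, bs):
        """Gradients of the squared-error surprisal via the backward recursion."""
        zs, us = forward(x, Ws, bs)
        dLdz = zs[-1] - y                              # gradient at the output activities
        grads_W, grads_b = [], []
        for l in reversed(range(len(Ws))):
            delta = (1.0 - np.tanh(us[l]) ** 2) * dLdz # elementwise product with the diagonal Jacobian
            grads_b.insert(0, delta)                   # the bias gradient at layer l
            grads_W.insert(0, np.outer(delta, zs[l]))  # weight gradient: delta times z_{l-1}^T (cf. Eq. 5.19)
            dLdz = Ws[l].T @ delta                     # send the signal one layer back (cf. Eq. 5.18)
        return grads_W, grads_b

    # One step of gradient descent on a 2-3-1 network.
    rng = np.random.default_rng(5)
    Ws = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
    bs = [np.zeros(3), np.zeros(1)]
    gW, gb = backprop(np.array([0.5, -1.0]), np.array([0.2]), Ws, bs)
    Ws = [W - 0.1 * g for W, g in zip(Ws, gW)]
    bs = [b - 0.1 * g for b, g in zip(bs, gb)]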
Some examples.
Backpropagation through time (BPTT)
Suppose we have reason to believe that some of the structure of our network is “conserved” or repeated, as in Fig. LABEL:fig:XXX. For instance, we might imagine segments of the network to correspond to slices of time, in which case it is reasonable to treat every time slice as identical in structure, even though the inputs and outputs change over time. Or we might have reason to believe that certain spatial structures are repeated. In either case, the concrete meaning of the assumption is that the same parameters show up in multiple places in the network. The upshot for backprop is that a little care must be taken in distinguishing partial from total derivatives, after which the algorithm goes through as in the previous section.
To make this concrete, we consider a recurrent neural network (RNN)††margin: recurrent neural network (RNN) with inputs and “hidden”-unit activities recursively computed as
For simplicity, we let the recurrent layer also be the output layer—i.e., the layer at which the loss function is evaluated—in which case the most general loss function can be expressed as . (We call this the loss function, but the complete loss is the average of this function under the data.) Notice that we let it depend on the hidden units at every time step, as well as directly on the parameters. To compute parameter changes, we take the (total) derivative of the loss function with respect to those parameters:
The first equality is just standard calculus; the second follows because only occurs in the final terms in the chains of derivatives. (In a common abuse of notation, I have written partial derivatives with respect to variables, rather than functions—otherwise the arguments would have to be included to disambiguate the derivatives of at different moments in time.) Thus, one can either compute the total effect of on each , and then compute the direct effects of the latter on ; or one can compute the direct effect of on each , and then compute the total effect of the latter on .††margin: Exercise LABEL:ex:partialtotal
Because of Eq. 5.21, the total effect on of , the current hidden state, is its direct effect, plus the effects caused by its influence on future hidden states, . This can be expressed recursively as:
The recursion is initialized at and run backwards in time.
The backprop-through-time algorithm consists of Eqs. 5.22 and 5.23. Applying it to a particular model requires working out three partial derivatives, , , and , for a particular choice of recurrent function and loss function. Using the standard recurrent function for a neural network (cf. Eq. 5.17):
and applying Eqs. 5.22 and 5.23 yields:
Note that the subscript on the Jacobian of the nonlinearity, , indicates the time index of its output. To establish the final equalities, we have used the definition
Notice the subtle difference with the backprop signal defined in the non-recurrent case: there was defined to be the gradient with respect to the biases, which indeed turned out to be the right-hand side of Eq. 5.24. But here we define directly in terms of the right-hand side, which is in the recurrent case but one contribution to the total bias gradient.
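As with the feedforward case, a short sketch may help fix ideas. The code below is my own illustration: it assumes a tanh recurrence z_t = tanh(W z_{t-1} + U x_t + b) and a squared-error loss at every time step (the text keeps both the nonlinearity and the loss generic), accumulating the gradients of the shared parameters over time while running the recursion for the total derivatives backwards.

    import numpy as np

    def bptt(xs, ys, W, U, b):
        """Backprop through time for z_t = tanh(W z_{t-1} + U x_t + b).

        xs : (T, Dx) inputs; ys : (T, H) targets for the hidden/output units.
        Returns gradients of the summed squared-error loss w.r.t. W, U, b.
        """
        T, H = len(xs), b.shape[0]
        zs = [np.zeros(H)]
        for t in range(T):                              # forward pass
            zs.append(np.tanh(W @ zs[-1] + U @ xs[t] + b))

        gW, gU, gb = np.zeros_like(W), np.zeros_like(U), np.zeros_like(b)
        from_future = np.zeros(H)                       # W^T delta_{t+1}; zero at the last step
        for t in reversed(range(T)):                    # backward pass
            dLdz = (zs[t + 1] - ys[t]) + from_future    # total derivative: direct effect + future
            delta = (1.0 - zs[t + 1] ** 2) * dLdz       # multiply by the (diagonal) Jacobian
            gW += np.outer(delta, zs[t])                # the same W, U, b appear at every step,
            gU += np.outer(delta, xs[t])                #   so their gradients accumulate over time
            gb += delta
            from_future = W.T @ delta                   # effect of z_{t-1} routed through z_t
        return gW, gU, gb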