4.1 Learning as optimization
What does it mean for a machine, biological or artificial, to “learn”? Vaguely speaking, it means incrementally changing the values of “parameters” so as to improve performance on a particular task. These parameters may be the weights and biases of an artificial neural network, the statistical parameters (means, variances, etc.) of a graphical model, or the synaptic strengths of a biological neural network. The task may be to build a map from inputs to outputs (e.g., from photographs to labels of the objects in the photos), to be able to generate samples from a target probability distribution, or to maximize rewards in some environment.
Why do we ask only for “good” rather than “correct” performance?
In a very broad sense, statistical learning problems arise when no single solution is determined by the data.
The solution may be underdetermined or overdetermined (see Fig. LABEL:fig:linearFits), but in either case the resolution is to give up “right or wrong” in favor of “better or worse.”
For example, in the case of overdetermined problems, we specify how wrong a solution is by assigning a “cost” to mismatches that (typically) increases smoothly with distance from the correct solution; and then attempt to minimize this cost or loss.††margin: loss function
More concretely, the networks we have considered in previous chapters are described by equations that determine their behavior only up to the assignment of actual numbers to their parameters (e.g., their weights and biases).
Ideally, the machine should function well (produce small losses) for all input/output pairs (supervised learning) or all observations (unsupervised learning)—collectively, “the data”—but these data may make conflicting demands.
How are they to be adjudicated?
Generally, they are weighted by the frequency of their occurrence; that is, the gradient of the loss function is evaluated at each of the data, and these gradients are averaged.
Here a choice presents itself: how many of the data should be used for this average before a step is taken?
It might seem necessary to use all of them, but in fact it is neither necessary nor optimal: although subsets of the data are generally less representative of the true distribution (and therefore the true loss that will accrue on future, “test” data), computing the average gradient over a small subset (a “mini-batch”) is far cheaper, so steps can be taken far more often. Descending the loss with such noisy gradient estimates is known as stochastic gradient descent.
Stochastic gradient descent is also motivated from another perspective: For certain machines, it may be necessary to process the data as they come in—e.g. if, for whatever reason, it is impossible to store them. These “online” algorithms are in some sense more biologically plausible, since it seems unlikely that the brain stores multiple input/output pairs before changing synaptic weights.††margin: online vs. batch algorithms For the same reason, we are interested in such algorithms even when the gradient-zeroing equation can be solved analytically. Below we consider algorithms that rely on analytical solutions, batch gradient descent, and online gradient descent; but it is important to remember that those of the first type can usually be converted into those of the second and third types when required.
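To make these alternatives concrete, the following sketch (in Python, with a step size and sample sizes chosen only for illustration) contrasts the two extremes for linear regression under squared error: averaging the gradient over all of the data before each step, versus taking a step after every single datum as it arrives.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                  # inputs
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=500)    # noisy outputs

def grad(w, Xb, yb):
    # gradient of the mean squared error (1/N) sum_n (x_n . w - y_n)^2 w.r.t. w
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

eta = 0.1   # step size ("learning rate")

# Batch gradient descent: average the gradient over ALL the data before each step.
w_batch = np.zeros(3)
for _ in range(200):
    w_batch -= eta * grad(w_batch, X, y)

# Online (stochastic) gradient descent: take a step after every single datum.
w_online = np.zeros(3)
for i in range(len(y)):
    w_online -= eta * grad(w_online, X[i:i+1], y[i:i+1])

print(w_true, w_batch, w_online)   # both estimates should approach w_true

In this linear case the gradient-zeroing equation could of course be solved analytically as well; the point of the sketch is only the bookkeeping of when the averaging happens.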
Overfitting and regularization.
Any of these methods can fail, for the various reasons just adduced, to learn the data they have been trained on, but in fact they can succeed at this and still fail to have “learned” more generally. In particular, “overfitting††margin: overfitting ” of training data will create a model too brittle to accommodate new (i.e., test) data. The intuition behind overfitting can be brought out forcefully by imagining a machine so powerful (parameter-rich) that it can memorize all the training data ever given to it: it simply stores each input/output pair (or each input, for unsupervised learning) in its “memory.” This machine will commit no errors on the training data, and only errors on the test data. Whether or not the machine fails to “generalize” in this sense is related to the ratio of training data to parameters: roughly, one wants more of the former than the latter. ††margin: insert classic figure of a high-order polynomial fit to noisy linear data
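The classic example alluded to in the margin can be sketched in a few lines; the particular polynomial degrees, noise level, and sample sizes below are illustrative choices, not prescriptions.

import numpy as np

rng = np.random.default_rng(1)
x_train = rng.uniform(-1, 1, size=10)
x_test = rng.uniform(-1, 1, size=100)

def truth(x):
    return 2.0 * x + 1.0            # the underlying (linear) relationship

y_train = truth(x_train) + 0.3 * rng.normal(size=x_train.shape)
y_test = truth(x_test) + 0.3 * rng.normal(size=x_test.shape)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)            # least-squares fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train error {train_err:.3f}, test error {test_err:.3f}")

# As the degree approaches the number of training points, the training error falls
# toward zero while the test error grows: the extra parameters have memorized noise.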
The obvious and seemingly straightforward remedy for overfitting, then, is simply to make sure this ratio is large enough. Unfortunately, it is often difficult to know a priori what “large enough” is. Consider, for example, a supervised learning task for a discriminative model with vector inputs, like a multiple linear regression. Naïvely, each element of that vector deserves its own parameter. But one input element might be correlated strongly with another input element, or only weakly with the output, in both of which cases it deserves “less than one” parameter. When the input dependencies are linear (i.e., they are first-order correlations), the dimensionality of the data—and therefore the number of model parameters—can be reduced ahead of time by simple linear transformation (multiplication by a fat matrix). But generally, unraveling these dependencies is itself a formidable learning task—in fact, it is normally a large part of what we wanted the machine to learn for us in the first place.
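As a concrete, if idealized, illustration of such a reduction, the following sketch builds the “fat matrix” from the leading principal components of the data; the choice of principal components, and the dimensions involved, are assumptions made only for the example.

import numpy as np

rng = np.random.default_rng(2)
# 1000 five-dimensional inputs with only two underlying degrees of freedom:
latent = rng.normal(size=(1000, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(1000, 5))

Xc = X - X.mean(axis=0)
# The right singular vectors of the centered data give the principal directions.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:2]                        # a 2 x 5 ("fat") matrix
Z = Xc @ W.T                      # the reduced, two-dimensional inputs
print(X.shape, "->", Z.shape)     # (1000, 5) -> (1000, 2)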
One hypothetical way to handle our general uncertainty about the relative amounts of information in the data and the parameters would be to train a model with a rather large number of parameters and test for failure to generalize on held-out, “validation” data. In the case of failure, we could eliminate some of the parameters and repeat the process. But which parameters should we eliminate?
If we think of this “elimination” as setting to zero, a subtler alternative suggests itself.
Rather than simply removing parameters, one can instead “encourage” parameters to be zero.††margin: regularization
Essentially, one adds another demand to the objective of learning the training data: parameters are penalized for being far from zero, as well as for yielding bad performance on the training data.
Parameters that have little effect on the latter penalty will not be able to offset the former penalty, and consequently will be “pulled” closer to zero.
This automates and softens the elimination of parameters, although it does not by itself obviate the need for an iterative validation scheme:
We still need to decide how much “encouragement” to give each parameter, that is, the relative strength of the two penalties.
This is often set with a single scalar hyperparameter, which scales the penalty term and whose value is chosen by just such a validation scheme.
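A minimal sketch of this scheme, for penalized linear regression: the squared (L2) form of the penalty and the candidate values of the hyperparameter, here called lam, are illustrative assumptions rather than prescriptions of the text.

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 20))                    # many input dimensions...
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]                    # ...but only a few that matter
y = X @ w_true + 0.5 * rng.normal(size=60)

X_tr, y_tr = X[:40], y[:40]                      # training data
X_va, y_va = X[40:], y[40:]                      # held-out, "validation" data

def fit_penalized(X, y, lam):
    # minimizes ||X w - y||^2 + lam * ||w||^2, which has a closed-form solution
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for lam in (0.0, 0.1, 1.0, 10.0):
    w = fit_penalized(X_tr, y_tr, lam)
    val_err = np.mean((X_va @ w - y_va) ** 2)
    print(f"lam = {lam:5.1f}: validation error {val_err:.3f}")

One would expect an intermediate value of lam to give the lowest validation error here, since the training data are few relative to the parameters; the weakly informative dimensions are the ones pulled most strongly toward zero.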
The probabilistic interpretation of the loss.
Adding demands to an objective function corresponds, as usual, to the method of Lagrange multipliers, with the multiplier playing the role of the scalar hyperparameter just introduced. The penalty term itself can also be given a probabilistic interpretation, as reflecting a prior distribution over the parameters.
Perhaps unsurprisingly, the original loss (before the addition of Lagrange terms) can likewise be redescribed in terms of probabilities, and indeed the entire loss in terms of a joint distribution. This may not seem particularly compelling until we are faced with the problem of coming up with a loss function. In some cases, good losses are obvious enough—when positive and negative (real-valued) deviations are equally undesirable, a sensible loss is the average squared error; in others, perhaps less so. Consider, for example, trying to predict the running speed (not velocity) of an animal often at rest. Squared error, under which impossible “negative speeds” are penalized just like erroneous—but possible—positive speeds, is clearly problematic.
We shall see shortly how probabilities naturally give rise to losses. Here we note simply that losses generically range across all real numbers and should be decreased by learning, whereas probabilities must be positive real numbers and (intuitively) should be increased by learning. Therefore the natural transformation from probabilities to losses is the negative logarithm.
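A small numerical check makes this transformation concrete: the negative logarithm of a Gaussian probability recovers the squared error (up to constants), while the negative logarithm of a distribution supported only on nonnegative values, here an exponential chosen purely for illustration, yields a loss that never entertains negative speeds.

import numpy as np

y, y_hat = 2.0, 1.4            # an observed speed and a model's prediction
sigma = 1.0

# Gaussian: -log N(y; y_hat, sigma^2) = (y - y_hat)^2 / (2 sigma^2) + constant,
# so minimizing this loss in y_hat is exactly minimizing squared error.
neg_log_gauss = 0.5 * np.log(2 * np.pi * sigma**2) + (y - y_hat)**2 / (2 * sigma**2)

# A distribution on the nonnegative reals (an exponential with mean y_hat) assigns
# no probability at all to negative speeds, so its negative log never rewards them.
neg_log_exp = np.log(y_hat) + y / y_hat          # -log[(1/y_hat) exp(-y/y_hat)]

print(neg_log_gauss, neg_log_exp)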
The data distribution and the model distribution.
In Chapters 2 and 3, we discussed several probabilistic models, that is, particular families of joint distributions of random variables.
Let us refer to them simply as “model distributions.”
In the general setting that we now consider, we begin with a set of numerical samples from “the world.”
These could be readings from instruments, or tabulations made by humans, and so on.
The key assumption is that these samples come from a distribution, the “data distribution,” whose form is unknown to us.
Ideally, the learning algorithm will ultimately bring about the equality of model and data distributions, at least to a good approximation.
In many texts, the distinction between the data and model distributions is left implicit; here we shall be careful to keep the two separate.
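To make the setup concrete, here is a sketch in which “the world” happens to draw from a gamma distribution (which the learner never observes directly), while the model family is Gaussian; both choices are assumptions made only for the example.

import numpy as np

rng = np.random.default_rng(4)
# Samples from "the world" (the data distribution); the learner sees only these.
data = rng.gamma(shape=5.0, scale=2.0, size=5000)

# Model family: a Gaussian with parameters (mu, sigma). Maximizing the average
# log-probability of the samples (equivalently, minimizing the negative log loss)
# has a closed form: the sample mean and standard deviation.
mu_hat = data.mean()
sigma_hat = data.std()
print(mu_hat, sigma_hat)   # close to the gamma's mean (10) and std (about 4.47)

# The fitted model can now assign probabilities to new points and generate its own
# samples, though here it can never match the data distribution exactly, since the
# Gaussian family does not contain the gamma that generated the data.
model_samples = rng.normal(mu_hat, sigma_hat, size=5000)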