4.1 Learning as optimization

What does it mean for a machine, biological or artificial, to “learn”? Vaguely speaking, it means incrementally changing the values of “parameters” so as to improve performance on a particular task. These parameters may be the weights and biases of an artificial neural network, the statistical parameters (means, variances, etc.) of a graphical model, or the synaptic strengths of a biological neural network. The task may be to build a map from inputs to outputs (e.g., from photographs to labels of the objects in the photos), to be able to generate samples from a target probability distribution, or to maximize rewards in some environment.

Why do we ask only for “good” rather than “correct” performance? In a very broad sense, statistical learning problems arise when no single solution is determined by the data. The problem may be underdetermined or overdetermined (see Fig. LABEL:fig:linearFits), but in either case the resolution is to give up “right or wrong” in favor of “better or worse.” For example, in the case of overdetermined problems, we specify how wrong a solution is by assigning a “cost” to mismatches that (typically) increases smoothly with distance from the correct solution; we then attempt to minimize this cost or loss [margin: loss function ($\mathcal{L}$)]. Thus, learning problems are typically optimization problems.

More concretely, the networks we have considered in previous chapters are described by equations that determine their behavior up to the assignment of actual numbers to their parameters ($\bm{\theta}$). For some setting of the parameters, the loss is minimized (or objective maximized) and the task achieved—that is, as well as it can be by the machine in question. To find this setting, one applies standard tools from calculus: typically, one considers the derivative (gradient) of the loss function with respect to the parameters. The global minimum of the loss function is located where the derivative is zero (although not necessarily conversely), so ideally one can simply set $\mathrm{d}\mathcal{L}/\mathrm{d}\bm{\theta}$ to zero and solve for the parameters. The ideal is rarely attained in practice because the resulting equations are too difficult to be solved in closed form. Alternatively, then, a local, incremental approach is employed: parameters are changed in proportion to the local gradient; that is, the algorithm walks downhill along the loss’s surface in the space of parameters. It is the incremental character of this gradient descent [margin: gradient descent], and the increase in performance that accompanies it, that gives parameter acquisition the character of “learning.” Of course, gradient descent generally will find only those minima local [margin: local minimum] to the starting point. To some extent, this weakness can be compensated for by re-running the algorithm multiple times from randomly chosen initial parameter settings.
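To make the incremental picture concrete, here is a minimal sketch of gradient descent in Python/NumPy. The quadratic loss, the learning rate, and the random restarts are illustrative assumptions, not anything prescribed by the text.

```python
import numpy as np

def gradient_descent(grad, theta0, lr=0.1, n_steps=1000):
    """Walk downhill along the loss surface: theta <- theta - lr * dL/dtheta."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - lr * grad(theta)
    return theta

# Illustrative loss L(theta) = ||theta - 3||^2, whose gradient is 2 * (theta - 3).
grad = lambda theta: 2.0 * (theta - 3.0)

# Re-run from several random initializations as a hedge against local minima.
solutions = [gradient_descent(grad, np.random.randn(2)) for _ in range(5)]
```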

Ideally, the machine should function well (produce small losses) for all input/output pairs (supervised learning) or all observations (unsupervised learning)—collectively, “the data”—but these data may make conflicting demands. How are they to be adjudicated? Generally, they are weighted by the frequency of their occurrence; that is, the gradient of the loss function is evaluated under the data, and these gradients averaged. Here a choice presents itself: how many of the data should be used for this average before a step is taken? It might seem necessary to use all of them, but in fact it is neither necessary nor optimal: although subsets of the data are generally less representative of the true distribution (and therefore the true loss that will accrue on future, “test” data), computing the average of $\mathrm{d}\mathcal{L}/\mathrm{d}\bm{\theta}$ on a subset (or “batch”) will be less time consuming. Furthermore, even if all the data are used, the algorithm will not (except in very special cases) reach the local minimum in one step, so this time cost accumulates. Essentially, one takes better steps at the price of making them more slowly; or, at the other extreme (one datum per step), takes very badly chosen steps very quickly. The optimal batch size, which may depend on details of the implementation (e.g., matrix multiplications vs. loops) and hardware (e.g., GPUs), will generally fall between the extremes. Because of the randomness introduced by sub-sampling, this procedure is known as stochastic gradient descent [margin: stochastic gradient descent]—although the “full” data distribution itself consists of samples anyway: in some sense all gradient descent under sample distributions is stochastic.
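A minimal sketch of the batching logic, assuming a per-datum gradient function; the batch size, learning rate, and per-epoch reshuffling are illustrative choices rather than anything the text commits to.

```python
import numpy as np

def sgd(grad_per_datum, theta0, data, lr=0.01, batch_size=32, n_epochs=10):
    """Stochastic gradient descent: average dL/dtheta over a small batch, then step."""
    theta = np.asarray(theta0, dtype=float)
    n = len(data)
    for _ in range(n_epochs):
        order = np.random.permutation(n)                  # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            grads = [grad_per_datum(theta, d) for d in batch]
            theta = theta - lr * np.mean(grads, axis=0)   # batch_size=1 gives the "online" extreme
    return theta

# Usage: estimate the mean of some observations under squared error, L = (theta - y)^2.
ys = np.random.randn(1000) + 5.0
theta_hat = sgd(lambda th, y: 2.0 * (th - y), theta0=0.0, data=ys, batch_size=20)
```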

Stochastic gradient descent is also motivated from another perspective: For certain machines, it may be necessary to process the data as they come in—e.g. if, for whatever reason, it is impossible to store them. These “online” algorithms are in some sense more biologically plausible, since it seems unlikely that the brain stores multiple input/output pairs before changing synaptic weights. [margin: online vs. batch algorithms] For the same reason, we are interested in such algorithms even when the gradient-zeroing equation can be solved analytically. Below we consider algorithms that rely on analytical solutions, batch gradient descent, and online gradient descent; but it is important to remember that those of the first type can usually be converted into those of the second and third types when required.

Overfitting and regularization.

Any of these methods can fail, for the various reasons just adduced, to learn the data they have been trained on, but in fact they can succeed at this and still fail to have “learned” more generally. In particular, “overfitting” [margin: overfitting] of training data will create a model too brittle to accommodate new (i.e., test) data. The intuition behind overfitting can be brought out forcefully by imagining a machine so powerful (parameter-rich) that it can memorize all the training data ever given to it: it simply stores each input/output pair (or each input, for unsupervised learning) in its “memory.” This machine will commit no errors on the training data, and only errors on the test data. Whether or not the machine fails to “generalize” in this sense is related to the ratio of training data to parameters: roughly, one wants more of the former than the latter. [margin: insert classic figure of a high-order polynomial fit to noisy linear data]

The obvious and seemingly straightforward remedy for overfitting, then, is simply to make sure this ratio is large enough. Unfortunately, it is often difficult to know a priori what “large enough” is. Consider, for example, a supervised learning task for a discriminative model with vector inputs, like a multiple linear regression. Naïvely, each element of that vector deserves its own parameter. But one input element might be correlated strongly with another input element, or only weakly with the output, in both of which cases it deserves “less than one” parameter. When the input dependencies are linear (i.e., they are first-order correlations), the dimensionality of the data—and therefore the number of model parameters—can be reduced ahead of time by simple linear transformation (multiplication by a fat matrix). But generally, unraveling these dependencies is itself a formidable learning task—in fact, it is normally a large part of what we wanted the machine to learn for us in the first place.
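As an illustration of the linear case only, a common choice (not mandated by the text) is to project the inputs onto their leading principal directions before fitting any parameters; the sketch below assumes a data matrix with one input vector per row and an arbitrary target dimensionality.

```python
import numpy as np

def reduce_dimension(X, k):
    """Project n-dimensional inputs onto their k leading principal directions
    (multiplication by a fat, k-by-n matrix)."""
    Xc = X - X.mean(axis=0)                            # remove the mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k]                                         # fat matrix: k x n
    return Xc @ W.T, W

# Usage: inputs whose 10 coordinates are linear mixtures of 3 underlying factors.
latent = np.random.randn(500, 3)
X = latent @ np.random.randn(3, 10) + 0.01 * np.random.randn(500, 10)
Z, W = reduce_dimension(X, 3)    # Z has shape (500, 3)
```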

One hypothetical way to handle our general uncertainty about the ratio of information in the data and the parameters would be to train a model with a rather large number of parameters and test for failure to generalize on held-out, “validation” data. In the case of failure, we could eliminate some of the parameters and repeat the process. But which parameters should we eliminate?

If we think of this “elimination” as setting to zero, a subtler alternative suggests itself. Rather than simply removing parameters, one can instead “encourage” parameters to be zero. [margin: regularization] Essentially, one adds another demand to the objective of learning the training data: parameters are penalized for being far from zero, as well as for yielding bad performance on the training data. Parameters that have little effect on the latter penalty will not be able to offset the former penalty, and consequently will be “pulled” closer to zero. This automates and softens the elimination of parameters, although it does not by itself obviate the need for an iterative validation scheme: We still need to decide how much “encouragement” to give each parameter, that is, the relative strength of the two penalties. This is often set with a single scalar hyperparameter, $\lambda$, found by sweeping across possible values, training the model, and testing on the validation data.
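A minimal sketch of what the penalty and the sweep over $\lambda$ might look like, here using a squared (L2) penalty on a ridge-style linear regression; the closed-form solve and the candidate grid of $\lambda$ values are illustrative assumptions.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Minimize ||X w - y||^2 + lam * ||w||^2; the penalty pulls weights toward zero."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def validation_loss(w, X_val, y_val):
    return np.mean((X_val @ w - y_val) ** 2)

def sweep_lambda(X_train, y_train, X_val, y_val, lambdas=(0.0, 0.01, 0.1, 1.0, 10.0)):
    """Train at each candidate strength; keep whichever generalizes best to validation data."""
    fits = [(validation_loss(fit_ridge(X_train, y_train, lam), X_val, y_val), lam)
            for lam in lambdas]
    return min(fits)[1]
```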

The probabilistic interpretation of the loss.

Adding demands to an objective function corresponds, as usual, to the method of Lagrange multipliers, with multiplier $\lambda$. But it also has a statistical interpretation: the parameters have a prior distribution whose mode is zero. Something like the variance of that distribution roughly corresponds to (the inverse of) the Lagrange multiplier—broader distributions demand less strenuously that the parameters be zero. The family of the distribution corresponds to the form of the penalty function: Gaussian for squared deviation from zero, Laplace for absolute deviation, etc. We shall see examples of this “regularization” in the following chapters.
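To spell out the correspondence for the Gaussian case (a standard identity; the prior variance is written here as $\sigma^2$): the negative log of a zero-mean Gaussian prior on a parameter $\theta$ is, up to an additive constant,
\[
  -\log p(\theta)
  \;=\; -\log\!\left[\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{\theta^2}{2\sigma^2}\right)\right]
  \;=\; \frac{1}{2\sigma^2}\,\theta^2 + \text{const},
\]
so adding this term to the loss is the same as a squared penalty with strength $\lambda = 1/(2\sigma^2)$: broader priors (larger $\sigma^2$) correspond to smaller $\lambda$. A Laplace prior, for which $-\log p(\theta) = |\theta|/b + \text{const}$, yields the absolute-deviation penalty in the same way.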

Perhaps unsurprisingly, the original loss (before the addition of Lagrange terms) can likewise be redescribed in terms of probabilities, and indeed the entire loss in terms of a joint distribution. This may not seem particularly compelling until we are faced with the problem of coming up with a loss function. In some cases, good losses are obvious enough—when positive and negative (real-valued) deviations are equally undesirable, a sensible loss is the average squared error; in others, perhaps less so. Consider, for example, trying to predict the running speed (not velocity) of an animal often at rest. Squared error, under which impossible “negative speeds” are penalized just like erroneous—but possible—positive speeds, is clearly problematic.

We shall see shortly how probabilities naturally give rise to losses. Here we note simply that losses generically range across all real numbers and should be decreased by learning, whereas probabilities must be positive real numbers and (intuitively) should be increased by learning. Therefore the natural transformation from probabilities to losses is the negative logarithm.
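For example (a standard derivation; the observation noise variance is fixed at $\sigma^2$ for simplicity, and $f(\bm{x};\bm{\theta})$ denotes the model’s prediction): if the model says the output is Gaussian about its prediction, then
\[
  -\log \hat{p}(y \mid \bm{x}; \bm{\theta})
  \;=\; \frac{1}{2\sigma^2}\bigl(y - f(\bm{x};\bm{\theta})\bigr)^2 + \text{const},
\]
i.e., the negative log probability recovers the squared-error loss. For the running-speed example, choosing a distribution supported only on nonnegative values (a gamma or exponential distribution, say) and taking its negative logarithm would likewise yield a loss that never treats impossible negative predictions on a par with possible positive ones.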

The data distribution and the model distribution.

In Chapters 2 and 3, we discussed several probabilistic models, that is, particular families of joint distributions of random variables. Let us refer to them simply as $\hat{p}(\bm{\hat{y}};\bm{\theta})$, not yet specifying anything about the random variables $\bm{\hat{Y}}$. Recall that which distribution, as opposed to which family of distributions, a particular probabilistic model corresponds to is determined by the numerical values of its parameters, $\bm{\theta}$. For any particular setting of $\bm{\theta}$, we shall call $\hat{p}(\bm{\hat{y}};\bm{\theta})$ the model distribution [margin: the model distribution]. We also saw how to make inferences under such models: to compute marginal or conditional distributions over some subset of variables. But for these inferences to be interesting or useful, the model must be a good model of something.

In the general setting that we now consider, we begin with a set of numerical samples from “the world.” These could be readings from instruments, or tabulations made by humans, etc. The key assumption is that these samples come from a distribution, $p(\bm{y})$, which we shall call the data distribution [margin: the data distribution]. We never have direct access to this distribution, only the samples—indeed, in contrast to the model distribution, we do not even assume that it has a parametric form. [footnote: We take up in Section LABEL:sec:XXX the possibility of dispensing with equations for the model distribution, as well.] But with the notion of a data distribution, we can now make more precise what “learning” means for probabilistic models: Learning is the process of adjusting the parameters, $\bm{\theta}$, so as to make the model distribution, $\hat{p}(\bm{\hat{y}};\bm{\theta})$, more like the data distribution, $p(\bm{y})$.
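One standard way to make “more like” quantitative (anticipating the negative-log-probability losses above, though the text has not yet committed to this choice) is the KL divergence from the data distribution to the model distribution, whose minimization over $\bm{\theta}$ is equivalent to maximizing the expected log probability of the data under the model:
\[
  \operatorname{KL}\!\bigl[p(\bm{y}) \,\big\|\, \hat{p}(\bm{y};\bm{\theta})\bigr]
  \;=\; \mathbb{E}_{p(\bm{y})}\!\left[\log\frac{p(\bm{y})}{\hat{p}(\bm{y};\bm{\theta})}\right]
  \;=\; -\,\mathbb{E}_{p(\bm{y})}\!\bigl[\log \hat{p}(\bm{y};\bm{\theta})\bigr] + \text{const in } \bm{\theta},
\]
so descending the average negative log probability of samples from $p(\bm{y})$ moves the model distribution toward the data distribution, and the divergence reaches zero exactly when the two are equal.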

Ideally, the learning algorithm will ultimately bring about the equality of model and data distributions, $\hat{p}(\bm{\hat{y}};\bm{\theta}) = p(\bm{y})$, but notice that there are two distinct causes of failure: inadequacy of the algorithm, and inadequacy of the model. The second failure mode occurs when the data distribution $p(\bm{y})$ is not a member of the family of distributions specified by the model; that is, when there is no setting of $\bm{\theta}$ for which the equality holds. This underscores the importance of getting the model right, or at least close to right, or again flexible enough to accommodate a very broad family of distributions. On the other hand, these desiderata can conflict with the desideratum of an efficacious learning algorithm: more expressive models generally require more approximations during inference and learning. In other words, minimizing the damage from the two failure modes itself represents a kind of (meta)optimization. We explore this trade-off throughout this part of the book, generally moving from less expressive models with exact inference and learning algorithms to more expressive models that require approximations.

In many texts, the distinction between $p$ and $\hat{p}$ is not made, the model and data distributions are conflated, and one takes the goal of learning to be simply acquisition of the true parameters of the true model of the data. In this case, one makes reference only to a distribution called (say) $p(\bm{\hat{y}};\bm{\theta})$—in our view, a kind of chimera between the model and data distributions. In contrast, in this textbook, we insist on distinguishing the two—first and foremost because this is a more felicitous description of the actual position of our algorithm, and our nervous system, vis-à-vis “the world”: Except in special circumstances, it is not likely that our models—still less our brains—are recapitulating the physical laws that are ultimately responsible for the data. This distinction also allows for using “noise” in our models to accommodate mismatches between the model and the data, without committing us to a belief in randomness inherent in the data themselves.