10.1 Noise-Contrastive Objectives
Here we focus on (2) (we set aside the problem of generating samples). One possible solution is to design a loss function that is minimized only for an energy corresponding to a normalized distribution, i.e. for which $\int \exp\{-E_{\boldsymbol{\theta}}(\boldsymbol{y})\}\,\mathrm{d}\boldsymbol{y} = 1$. We will not constrain the energy itself; that is, there exist settings of the parameters for which $\int \exp\{-E_{\boldsymbol{\theta}}(\boldsymbol{y})\}\,\mathrm{d}\boldsymbol{y} \neq 1$. However, none of these settings minimizes the loss. What objectives have this property?
10.1.1 Noise-Contrastive Estimation
The basic intuition behind noise-contrastive estimation (NCE) [9] is that one such objective is distinguishing data from noise. More precisely, we let the task be to discriminate good or “positive” samples, drawn from the data distribution $p(\boldsymbol{y})$, from “negative” samples, drawn from a “noise” distribution, $p_{\text{n}}(\boldsymbol{y})$, by improving an unnormalized model for the “positive” data. The dual demands of minimizing both false alarms and misses will prevent the model from making its implicit normalizer either too small or too big (respectively). We choose the noise distribution ourselves, so we can (to some extent) control how hard this task is.
Mathematically, if $S$ is the Bernoulli random variable indicating from which of the two distributions the sample $\boldsymbol{Y}$ was drawn, the problem becomes that of minimizing the posterior cross entropy, $-\mathbb{E}\!\left[\log \hat{p}(S \mid \boldsymbol{Y}; \boldsymbol{\theta})\right]$. There is no reason to make negative or positive samples more common, so we let the prior probability of $S$ be uniform. Therefore the data distribution is
$$p(\boldsymbol{y}, s) = p(s)\,p(\boldsymbol{y} \mid s) = \tfrac{1}{2}\, p(\boldsymbol{y})^{s}\, p_{\text{n}}(\boldsymbol{y})^{1-s}. \tag{10.1}$$
We will give our generative model the same form, except that our model for the positive data will not be normalized. For notational symmetry between the model and the noise distribution, we also write the noise distribution in terms of an energy,
$$p_{\text{n}}(\boldsymbol{y}) = \exp\{-E_{\text{n}}(\boldsymbol{y})\}.$$
However, we define this energy such that the noise distribution is indeed normalized. Our generative model is then
$$\hat{p}(\boldsymbol{y}, s; \boldsymbol{\theta}) = \tfrac{1}{2}\, \exp\{-E_{\boldsymbol{\theta}}(\boldsymbol{y})\}^{s}\, \exp\{-E_{\text{n}}(\boldsymbol{y})\}^{1-s}.$$
Note well that $\exp\{-E_{\boldsymbol{\theta}}(\boldsymbol{y})\}$ is not normalized: at the beginning of training, at least, it will not integrate to 1. Nevertheless, if we ignore this and compute the posterior in the usual way with Bayes’ rule, we get a perfectly legitimate probability distribution. In particular, the posterior probability of an example being positive is
$$\hat{p}(s = 1 \mid \boldsymbol{y}; \boldsymbol{\theta}) = \frac{\exp\{-E_{\boldsymbol{\theta}}(\boldsymbol{y})\}}{\exp\{-E_{\boldsymbol{\theta}}(\boldsymbol{y})\} + \exp\{-E_{\text{n}}(\boldsymbol{y})\}} = \sigma\!\left(\Delta E_{\boldsymbol{\theta}}(\boldsymbol{y})\right), \tag{10.2}$$
with $\sigma(\cdot)$ the logistic function and $\Delta E_{\boldsymbol{\theta}}(\boldsymbol{y})$ the difference in energies:
$$\Delta E_{\boldsymbol{\theta}}(\boldsymbol{y}) := E_{\text{n}}(\boldsymbol{y}) - E_{\boldsymbol{\theta}}(\boldsymbol{y}). \tag{10.3}$$
The key result that makes NCE work is that the cross entropy of this posterior (see Eq. 10.4 below) is minimized only when $\exp\{-E_{\boldsymbol{\theta}}(\boldsymbol{y})\} = p(\boldsymbol{y})$, as opposed to $\exp\{-E_{\boldsymbol{\theta}}(\boldsymbol{y})\} = c\,p(\boldsymbol{y})$ for some constant $c$ [9]. (Technically, the proof requires the noise distribution to be supported wherever the data distribution is.) So we will not need to compute the normalizer, i.e. to integrate $\exp\{-E_{\boldsymbol{\theta}}(\boldsymbol{y})\}$. Intuitively, this happens because the model and noise energies always show up together and must balance. If the learned (implicit) normalizer is too small, for example if the model energy is smaller than the noise energy for most values of $\boldsymbol{y}$, then most negative samples will be assigned to the positive distribution. The reverse, also undesirable, holds when the implicit normalizer is too large. Both kinds of mistake increase the cross entropy.
Notice, however, that these mistakes will be less noticeable if the data and noise distributions are very different from each other, e.g. if the bulks of their probability masses lie far apart. In this case, the model could assign (say) overly high probability to the data (by making the normalizer too small) without making the noise samples particularly probable under the model. Technically, the normalized energy of the data distribution is guaranteed to be the unique minimizer of the loss based on the posterior in Eq. 10.2 (see below) as long as the noise distribution is supported wherever the data distribution is. But for finite training samples (the situation in which we usually find ourselves), the guarantee is voided. The problem would appear to be more acute for more expressive model distributions.
Quasi-generative learning.
The cross-entropy loss is the negative log of the posterior distribution (Eq. 10.2), averaged under the data (Eq. 10.1):
$$\mathcal{L}_{\text{NCE}}(\boldsymbol{\theta}) = -\mathbb{E}_{(\boldsymbol{Y}, S) \sim p}\!\left[\log \hat{p}(S \mid \boldsymbol{Y}; \boldsymbol{\theta})\right]. \tag{10.4}$$
This is evidently a discriminative problem, but with a twist. The canonical generative approach to binary classification is to model the generative distribution $p(\boldsymbol{y}, s)$ (like NCE); acquire the parameters by minimizing the joint cross entropy, $-\mathbb{E}[\log \hat{p}(\boldsymbol{Y}, S; \boldsymbol{\theta})]$ (unlike NCE); and then invert to $\hat{p}(s \mid \boldsymbol{y}; \boldsymbol{\theta})$ with Bayes’ rule. For Gaussian mixtures, this is known as linear/quadratic discriminant analysis (depending on whether the covariance is the same/different across classes). The canonical discriminative approach to binary classification is to model $p(s \mid \boldsymbol{y})$ directly (unlike NCE); and then minimize the posterior cross entropy (like NCE). This is logistic regression. NCE mixes the two methods: it models the generative distribution $\hat{p}(\boldsymbol{y}, s; \boldsymbol{\theta})$, but first inverts it with Bayes’ rule, and only then minimizes the discriminative (posterior) cross entropy, $-\mathbb{E}[\log \hat{p}(S \mid \boldsymbol{Y}; \boldsymbol{\theta})]$. In the classic case of a mixture of two Gaussians/binary classification, this would amount to learning the two (mean, covariance) pairs by minimizing the posterior cross entropy, as opposed to learning these parameters by minimizing the joint cross entropy (generative), or learning a separating hyperplane by minimizing the posterior cross entropy (discriminative).
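To make the mechanics concrete, here is a minimal NumPy sketch of the objective in Eq. 10.4 for a one-dimensional toy problem. It is not from the original text; all names are ours. The model is an unnormalized Gaussian whose additive energy constant `c` plays the role of the implicit (log) normalizer; the noise distribution is a standard normal. The grid search at the end is only meant to exhibit the key property claimed above: the loss is minimized near the value of `c` that actually normalizes the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Data" distribution: a Gaussian with mean 2.0 and std 1.0.
MU, SIGMA = 2.0, 1.0

def noise_energy(y):
    # Standard-normal noise distribution, written as a *normalized* energy:
    # p_n(y) = exp(-E_n(y)).
    return 0.5 * y**2 + 0.5 * np.log(2 * np.pi)

def model_energy(y, theta):
    # Unnormalized Gaussian model: theta = (mean, log_std, c), where c is a
    # free additive constant -- the implicit (log) normalizer NCE must learn.
    mean, log_std, c = theta
    return 0.5 * ((y - mean) / np.exp(log_std))**2 + c

def nce_loss(theta, y_pos, y_neg):
    # Posterior cross entropy (Eq. 10.4): positives should get label 1 and
    # negatives label 0, via the logistic of the energy difference (Eq. 10.2).
    def log_sigmoid(z):
        return -np.logaddexp(0.0, -z)
    delta_pos = noise_energy(y_pos) - model_energy(y_pos, theta)
    delta_neg = noise_energy(y_neg) - model_energy(y_neg, theta)
    return -(log_sigmoid(delta_pos).mean() + log_sigmoid(-delta_neg).mean()) / 2

# One batch of positive (data) and negative (noise) samples.
y_pos = rng.normal(MU, SIGMA, size=10_000)
y_neg = rng.normal(0.0, 1.0, size=10_000)

# Crude grid search over c with the correct mean/std, to show that the loss
# is minimized near the *normalizing* value, c = log(sqrt(2*pi)*SIGMA) ≈ 0.92.
for c in np.linspace(0.0, 2.0, 9):
    print(f"c = {c:4.2f}   loss = {nce_loss((MU, np.log(SIGMA), c), y_pos, y_neg):.4f}")
```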
[[Nice properties of the estimator….]]
10.1.2 InfoNCE
Van den Oord and colleagues propose to put NCE to a very different purpose [51]. Rather than attempting to learn a parametric form for the probability of observed samples, they aim to extract useful features from data. In order to do so, they introduce what amounts to four novel variations on NCE, which we discuss one at a time.
(1) Generalizing to multiple “examples.”
Suppose that the observation is not a single sample but a collection of $K$ “examples,” $(\boldsymbol{y}_1, \ldots, \boldsymbol{y}_K)$, precisely one of which is not noise. Then the goal is not to determine whether or not a sample is noise, but rather to determine which of the $K$ examples is not. This means that rather than use the model-noise energy difference (Eq. 10.3) directly to assign the example to the positive or negative class, as in NCE, we will compare energy differences to each other (with the softmax function).
In this setup, the latent variable is categorical (conceived as a one-hot vector $\boldsymbol{s}$) rather than Bernoulli, and the data distribution is:
$$p(\boldsymbol{y}_1, \ldots, \boldsymbol{y}_K, \boldsymbol{s}) = \frac{1}{K} \prod_{k=1}^{K} p(\boldsymbol{y}_k)^{s_k}\, p_{\text{n}}(\boldsymbol{y}_k)^{1 - s_k}. \tag{10.5}$$
Again we have set the prior uniform, since we have no reason to make any one of the $K$ elements more or less likely to be noise than any other. We emphasize that this is not a mixture model: a single sample contains $K$ “examples,” one positive and $K-1$ negative.
The generative model takes the same form, with the (unnormalized) model distribution $\exp\{-E_{\boldsymbol{\theta}}(\boldsymbol{y})\}$ taking the place of the data marginal $p(\boldsymbol{y})$. Writing it in terms of energies, we obtain
$$\hat{p}(\boldsymbol{y}_1, \ldots, \boldsymbol{y}_K, \boldsymbol{s}; \boldsymbol{\theta}) = \frac{1}{K} \prod_{k=1}^{K} \exp\{-E_{\boldsymbol{\theta}}(\boldsymbol{y}_k)\}^{s_k}\, \exp\{-E_{\text{n}}(\boldsymbol{y}_k)\}^{1 - s_k}.$$
Again we ignore the fact that the emission is unnormalized and simply compute a (normalized) posterior distribution with Bayes’ rule,
$$\hat{p}(s_k = 1 \mid \boldsymbol{y}_1, \ldots, \boldsymbol{y}_K; \boldsymbol{\theta}) = \frac{\exp\{\Delta E_{\boldsymbol{\theta}}(\boldsymbol{y}_k)\}}{\sum_{j=1}^{K} \exp\{\Delta E_{\boldsymbol{\theta}}(\boldsymbol{y}_j)\}}; \tag{10.6}$$
that is, the output of the softmax function applied to the vector of energy differences. Eq. 10.6 is evidently a kind of generalization of Eq. 10.2. (However, note that the multi-example version of NCE does not quite reduce to the single-example case even when $K = 2$. Eq. 10.2 can indeed be re-written with a softmax as in Eq. 10.6, with the first argument equal to $\Delta E_{\boldsymbol{\theta}}(\boldsymbol{y})$ and the second equal to 0. The latter reflects our indifferent prior, which provides no additional information. In the two-example version of the generalization under discussion, on the other hand, the second argument encodes the relative probability of the second example being data or noise. In short, deciding which of two samples is “real” is easier than deciding whether or not a single sample is.) Putting this together with the data distribution, we can write the conditional cross entropy as
$$\mathcal{L}_{\text{InfoNCE}}(\boldsymbol{\theta}) = -\mathbb{E}_{(\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_K, \boldsymbol{S}) \sim p}\!\left[\log \hat{p}(\boldsymbol{S} \mid \boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_K; \boldsymbol{\theta})\right] = -\mathbb{E}\!\left[\log \frac{\exp\{\Delta E_{\boldsymbol{\theta}}(\boldsymbol{Y}_{k^*})\}}{\sum_{j=1}^{K} \exp\{\Delta E_{\boldsymbol{\theta}}(\boldsymbol{Y}_j)\}}\right], \tag{10.7}$$
where $k^*$ denotes the index of the positive example.
In the final expression, we are selecting only that output of the softmax function which corresponds to the actual positive example (whose index, $k^*$, will of course differ from trial to trial).
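Computationally, Eqs. 10.6 and 10.7 amount to an ordinary softmax cross entropy applied to the energy differences. Here is a minimal NumPy sketch, with hypothetical names of our own choosing; `delta_E[i, k]` holds the noise-minus-model energy difference for example k of trial i:

```python
import numpy as np

def info_nce_loss(delta_E, positive_index):
    # delta_E:        shape (num_trials, K); energy differences of Eq. 10.3.
    # positive_index: shape (num_trials,); which of the K examples is data.
    # Log-softmax across the K examples (Eq. 10.6), computed stably.
    log_posterior = delta_E - np.logaddexp.reduce(delta_E, axis=1, keepdims=True)
    # Select the log-posterior of the actual positive example in each trial
    # and average (Eq. 10.7).
    rows = np.arange(delta_E.shape[0])
    return -log_posterior[rows, positive_index].mean()

# Tiny usage example with random, uninformative scores:
rng = np.random.default_rng(1)
scores = rng.normal(size=(4, 8))      # 4 trials, K = 8 examples each
pos = rng.integers(0, 8, size=4)      # index of the positive example
print(info_nce_loss(scores, pos))     # compare with chance level, log(8) ≈ 2.08
```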
Negative samples enforce normalization.
We can shed light on the role played by the negative examples by considering them separately from the positive example in the posterior probability of a positive example:
$$\hat{p}(s_{k^*} = 1 \mid \boldsymbol{y}_1, \ldots, \boldsymbol{y}_K; \boldsymbol{\theta}) = \frac{\exp\{\Delta E_{\boldsymbol{\theta}}(\boldsymbol{y}_{k^*})\}}{\exp\{\Delta E_{\boldsymbol{\theta}}(\boldsymbol{y}_{k^*})\} + \sum_{j \neq k^*} \exp\{\Delta E_{\boldsymbol{\theta}}(\boldsymbol{y}_j)\}}. \tag{10.8}$$
Now notice that the negative-sample terms sum approximately to a constant:
$$\sum_{j \neq k^*} \exp\{\Delta E_{\boldsymbol{\theta}}(\boldsymbol{y}_j)\} = \sum_{j \neq k^*} \frac{\exp\{-E_{\boldsymbol{\theta}}(\boldsymbol{y}_j)\}}{p_{\text{n}}(\boldsymbol{y}_j)} \approx (K - 1)\, \mathbb{E}_{p_{\text{n}}}\!\left[\frac{\exp\{-E_{\boldsymbol{\theta}}(\boldsymbol{Y})\}}{p_{\text{n}}(\boldsymbol{Y})}\right] = (K - 1) \int \exp\{-E_{\boldsymbol{\theta}}(\boldsymbol{y})\}\, \mathrm{d}\boldsymbol{y} =: (K - 1)\, Z_{\boldsymbol{\theta}}.$$
The approximate equality becomes more exact as the number of negative examples increases. (And technically, the final equality requires the model and noise distributions to have the same support.) Together with Eq. 10.8, this says that, if we had in hand an expression for the normalizer $Z_{\boldsymbol{\theta}}$, we could do without the negative samples altogether; they would drop out of the loss function. Indeed, the loss now becomes
$$\mathcal{L}_{\text{InfoNCE}}(\boldsymbol{\theta}) \approx -\mathbb{E}\!\left[\log \frac{\exp\{\Delta E_{\boldsymbol{\theta}}(\boldsymbol{Y}_{k^*})\}}{\exp\{\Delta E_{\boldsymbol{\theta}}(\boldsymbol{Y}_{k^*})\} + (K - 1)\, Z_{\boldsymbol{\theta}}}\right] \approx \log(K Z_{\boldsymbol{\theta}}) - \mathbb{E}\!\left[\Delta E_{\boldsymbol{\theta}}(\boldsymbol{Y}_{k^*})\right], \tag{10.9}$$
where the final expression follows for large $K$. (The authors of the original paper [51] interpret this approximation as a lower bound when the model distribution matches the data distribution. Presumably the idea is that, for a very good model, the noise-to-model ratio will usually be less than one when evaluated on positive examples; therefore the neglected $\exp\{\Delta E_{\boldsymbol{\theta}}(\boldsymbol{y}_{k^*})\}$ will dominate the neglected $Z_{\boldsymbol{\theta}}$. It would take more work to prove this.) This makes sense: the whole point of using negative examples was to force unnormalized models to learn the correct normalization. Since we want to use models for which computing $Z_{\boldsymbol{\theta}}$ is intractable, we will not use Eq. 10.9 as our objective; but we will use it below to prove that optimizing the multi-example NCE loss (Eq. 10.7) increases mutual information in a certain setting.
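The claim that the negative-sample terms sum to roughly $(K-1)\,Z_{\boldsymbol{\theta}}$ is easy to check numerically. Below is a toy sketch under our own assumptions (a one-dimensional unnormalized Gaussian "model" with known normalizer, standard-normal noise); it compares the importance-sampling estimate implied by the negative samples with the normalizer computed by brute-force quadrature:

```python
import numpy as np

rng = np.random.default_rng(2)

def model_unnorm(y):
    # Unnormalized model exp{-E_theta(y)}: a Gaussian bump, deliberately
    # scaled so that its true normalizer is Z = 3.0.
    return 3.0 * np.exp(-0.5 * (y - 2.0)**2) / np.sqrt(2 * np.pi)

def noise_pdf(y):
    # Normalized noise density p_n(y): standard normal.
    return np.exp(-0.5 * y**2) / np.sqrt(2 * np.pi)

# Left-hand side: the negative-sample sum, averaged over many trials.
K = 1024
y_neg = rng.normal(size=(1000, K - 1))                  # K-1 noise samples per trial
neg_sum = (model_unnorm(y_neg) / noise_pdf(y_neg)).sum(axis=1)
print("negative-sample sum / (K-1):", neg_sum.mean() / (K - 1))   # ≈ 3.0

# Right-hand side: the normalizer Z_theta, by brute-force Riemann sum.
grid = np.linspace(-10.0, 15.0, 100_001)
dy = grid[1] - grid[0]
print("Z_theta by quadrature:      ", (model_unnorm(grid) * dy).sum())  # ≈ 3.0
```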
(2) Modeling the energy difference.
We have assumed up to this point that our source of “noise” samples also provides an evaluatable expression for their probability. What if we have only samples from the noise distribution? Can we still learn a model of the positive data?
One obvious solution is to learn a model for the negative as well as the positive samples; for example, to build a parameterized model for the noise energy, $E_{\text{n}}(\boldsymbol{y}; \boldsymbol{\phi})$, and use it in the generative model. But if we wanted to recover a normalized version of the model energy, $E_{\boldsymbol{\theta}}(\boldsymbol{y})$, we would have to be able to compute, or at least to know, the normalizer for this noise energy, which is troubling. However, as noted at the outset, getting a probability model for the data, normalized or unnormalized, is not the goal of InfoNCE. So instead we will directly model the energy difference, $\Delta E_{\boldsymbol{\theta}}(\boldsymbol{y})$, i.e. the left-hand rather than right-hand side of Eq. 10.3. Rather than asking for the probabilities of an example under the two models (positive and negative), we are asking for its relative probability.
One subtlety with modeling $\Delta E_{\boldsymbol{\theta}}(\boldsymbol{y})$ directly is that we are still at liberty to interpret this as fitting $E_{\boldsymbol{\theta}}(\boldsymbol{y})$ only, that is to say, not fitting the noise energy, $E_{\text{n}}(\boldsymbol{y})$. In other words, we can attribute any error in $\Delta E_{\boldsymbol{\theta}}(\boldsymbol{y})$ to an error in $E_{\boldsymbol{\theta}}(\boldsymbol{y})$ rather than in $E_{\text{n}}(\boldsymbol{y})$. Consequently, the sum over negative examples in the denominator of Eq. 10.8 can still be interpreted as approximately $(K-1)\,Z_{\boldsymbol{\theta}}$, and the equation still goes through. We will use it below.
(3) Contrasting a conditional with a marginal distribution.
In the third departure from the original NCE, the InfoNCE method proposes to model a conditional distribution, $p(\boldsymbol{y} \mid \boldsymbol{x})$, given some auxiliary variable $\boldsymbol{x}$. More importantly, we use the data marginal, $p(\boldsymbol{y})$, as the noise distribution. Thus the marginal has switched roles, from data to “noise,” or more felicitously from the source of positive examples to the source of negative ones. The intuition behind this choice of distributions is that a model that can distinguish them must be able to extract information about $\boldsymbol{y}$ from $\boldsymbol{x}$.
This can be made precise in the language of information theory. However, the information we aim to increase is not precisely a mutual information, neither between $\boldsymbol{X}$ and $\boldsymbol{Y}$ nor between anything else, because it depends on two different distributions: the model, $\hat{p}(\boldsymbol{y} \mid \boldsymbol{x}; \boldsymbol{\theta})$, and the data, $p(\boldsymbol{x}, \boldsymbol{y})$. The standard mutual information can of course be written as
$$I[\boldsymbol{X}; \boldsymbol{Y}] = H[\boldsymbol{Y}] - H[\boldsymbol{Y} \mid \boldsymbol{X}].$$
The information quantity we are interested in retains the marginal entropy over $\boldsymbol{Y}$, since the model has no effect on it (see previous section), but replaces the conditional entropy with the conditional cross entropy:
$$\tilde{I}[\boldsymbol{X}; \boldsymbol{Y}] := H[\boldsymbol{Y}] - H^{\times}[\boldsymbol{Y} \mid \boldsymbol{X}], \qquad H^{\times}[\boldsymbol{Y} \mid \boldsymbol{X}] := -\mathbb{E}_{p(\boldsymbol{x}, \boldsymbol{y})}\!\left[\log \hat{p}(\boldsymbol{Y} \mid \boldsymbol{X}; \boldsymbol{\theta})\right]. \tag{10.10}$$
We might accordingly call this (for want of something better) the “cross mutual information.” Intuitively, it is the portion of the (actual) entropy of $\boldsymbol{Y}$ that is explained by $\boldsymbol{X}$ under the model $\hat{p}(\boldsymbol{y} \mid \boldsymbol{x}; \boldsymbol{\theta})$.
Now, Gibbs’s inequality tells us that $H^{\times}[\boldsymbol{Y} \mid \boldsymbol{X}] \geq H[\boldsymbol{Y} \mid \boldsymbol{X}]$, so consequently $\tilde{I}[\boldsymbol{X}; \boldsymbol{Y}] \leq I[\boldsymbol{X}; \boldsymbol{Y}]$: the cross mutual information is never greater than the actual mutual information. Equality is reached when the model matches the true data conditional. Although this is also the point at which the posterior cross entropy in Eq. 10.7 reaches its minimum, this is not quite the same as saying that improving (decreasing) the latter increases the cross mutual information. Still, it is intuitive, since we expect training to oblige the model to make increasing use of $\boldsymbol{x}$ in order to distinguish the conditional data from the marginal data. And indeed, we can show this. The cross mutual information of Eq. 10.10 between $\boldsymbol{X}$ and $\boldsymbol{Y}$ can be written more explicitly in terms of log probabilities, and then related to the (approximate) loss function in Eq. 10.9:
$$\tilde{I}[\boldsymbol{X}; \boldsymbol{Y}] = \mathbb{E}_{p(\boldsymbol{x}, \boldsymbol{y})}\!\left[\log \frac{\hat{p}(\boldsymbol{Y} \mid \boldsymbol{X}; \boldsymbol{\theta})}{p(\boldsymbol{Y})}\right] \approx \log K - \mathcal{L}_{\text{InfoNCE}}(\boldsymbol{\theta}). \tag{10.11}$$
The final (approximate) equality follows because of the (somewhat subtle) fact that the expectation includes only positive samples, to wit, samples in which $\boldsymbol{y}$ is correctly paired with $\boldsymbol{x}$, and therefore Eq. 10.9 applies.
Eq. 10.11 tells us that decreasing the posterior cross entropy (on the right-hand side of Eq. 10.11) increases, at least approximately, the cross mutual information (on the left). The larger $K$, the less approximate the final equality (see Eq. 10.8). (And although this also increases the $\log K$ term in Eq. 10.11, and therefore the discrepancy between the cross mutual information and the cross entropy, it does not increase the discrepancy between their gradients.) In sum, minimizing the NCE loss in Eq. 10.7, with $\Delta E_{\boldsymbol{\theta}}$ defined to be the difference between the conditional and marginal energies, maximizes the information extracted from $\boldsymbol{x}$ by the function that assigns energies to $\boldsymbol{y}$, $\Delta E_{\boldsymbol{\theta}}(\boldsymbol{y}, \boldsymbol{x})$.
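As a sanity check on Eq. 10.11, we can evaluate $\log K - \mathcal{L}$ in a case where the true mutual information is known, for instance a correlated bivariate Gaussian, for which $I = -\tfrac{1}{2}\log(1 - \rho^2)$. The sketch below is our own illustration (not from [51]): it plugs the optimal critic, the true log density ratio, into the loss of Eq. 10.7 by Monte Carlo and compares the result with the true mutual information.

```python
import numpy as np

rng = np.random.default_rng(4)
RHO, K, TRIALS = 0.8, 64, 2000
true_MI = -0.5 * np.log(1 - RHO**2)                  # ≈ 0.51 nats

def log_ratio(x, y):
    # Optimal critic: log p(y|x) - log p(y) for the bivariate Gaussian,
    # where p(y|x) = N(rho*x, 1 - rho^2) and p(y) = N(0, 1).
    return (-0.5 * (y - RHO * x)**2 / (1 - RHO**2) + 0.5 * y**2
            - 0.5 * np.log(1 - RHO**2))

losses = []
for _ in range(TRIALS):
    x = rng.normal()
    y_pos = RHO * x + np.sqrt(1 - RHO**2) * rng.normal()   # from p(y|x)
    y_neg = rng.normal(size=K - 1)                         # from the marginal p(y)
    scores = np.concatenate([[log_ratio(x, y_pos)], log_ratio(x, y_neg)])
    losses.append(np.logaddexp.reduce(scores) - scores[0]) # -log softmax at the positive
loss = np.mean(losses)

print("log K - loss:", np.log(K) - loss)    # approximate lower bound on the true MI
print("true MI:     ", true_MI)
```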
But now notice that there is nothing mathematical to distinguish the roles played by $\boldsymbol{x}$ and $\boldsymbol{y}$. In terms of the data, they are either paired (positive examples) or unpaired (negative examples), so they play symmetrical roles. In terms of the model, they enter the loss only through the generic function $\Delta E_{\boldsymbol{\theta}}(\boldsymbol{y}, \boldsymbol{x})$, which is learned and has no pre-specified role for its first and second arguments. So we can equally interpret descent of the InfoNCE loss as learning to extract useful information from $\boldsymbol{y}$ about $\boldsymbol{x}$ rather than the other way around. Indeed, perhaps the most felicitous interpretation, which emphasizes this symmetry, is that the training scheme asks the model to distinguish the joint distribution, $p(\boldsymbol{x}, \boldsymbol{y})$, from the product of the marginals, $p(\boldsymbol{x})\,p(\boldsymbol{y})$.
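In practice one rarely samples from the marginal $p(\boldsymbol{y})$ explicitly; instead, given a batch of pairs drawn from the joint, the $\boldsymbol{y}$'s belonging to other pairs serve as approximate samples from the marginal. Here is a minimal sketch of this common construction, with hypothetical names of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)

def build_trials(x_batch, y_batch, K):
    # Each row of (x_batch, y_batch) is a positive pair drawn from the joint
    # p(x, y). For each pair we build one "trial" of K examples: the true y
    # (the positive) plus K-1 y's borrowed from other rows of the batch,
    # which stand in for draws from the marginal p(y) (the "noise").
    B = len(x_batch)
    trials_y, positive_index = [], []
    for i in range(B):
        others = rng.choice(np.delete(np.arange(B), i), size=K - 1, replace=False)
        ys = np.concatenate([[y_batch[i]], y_batch[others]])
        perm = rng.permutation(K)                     # hide the positive's position
        trials_y.append(ys[perm])
        positive_index.append(int(np.argmin(perm)))   # where index 0 ended up
    return np.array(trials_y), np.array(positive_index)

# Usage with scalar x, y for brevity:
x = rng.normal(size=32)
y = x + 0.1 * rng.normal(size=32)        # y carries information about x
trials, pos = build_trials(x, y, K=8)
print(trials.shape, pos[:5])             # (32, 8), positions of the positives
```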
(4) Modeling future samples of a sequence.
There are many possibilities, but one nice application of InfoNCE is to time-series data, and in particular to learning to extract useful information from the “auxiliary variable,” here the sequence history $\boldsymbol{x}_{\leq t} := (\boldsymbol{x}_1, \ldots, \boldsymbol{x}_t)$, about the variable of interest, a future sample $\boldsymbol{x}_{t+\tau}$ (for some positive integer $\tau$). That is, we want to learn how to “summarize” sequences of random variables so as best to predict their future states. (For example, for linear dynamical systems, the optimal summary is a weighted sum of past states, with weights decaying exponentially into the past.) Thus the positive and negative (“noise”) distributions are, respectively, $p(\boldsymbol{x}_{t+\tau} \mid \boldsymbol{x}_{\leq t})$ and the marginal $p(\boldsymbol{x}_{t+\tau})$.
As lately discussed, the authors model the difference between the conditional and unconditional energies, rather than the energies themselves. In particular, they let this model have the form
$$\Delta E_{\boldsymbol{\theta}}(\boldsymbol{x}_{t+\tau}, \boldsymbol{x}_{\leq t}) = g_{\text{enc}}(\boldsymbol{x}_{t+\tau})^{\mathrm{T}}\, \mathbf{W}_{\tau}\, g_{\text{RNN}}\!\left(g_{\text{enc}}(\boldsymbol{x}_1), \ldots, g_{\text{enc}}(\boldsymbol{x}_t)\right),$$
where $g_{\text{enc}}$ is a static “encoder” ANN and $g_{\text{RNN}}$ is an RNN. In order to decrease the posterior cross entropy (Eq. 10.7), the encoder and the RNN must extract representations from the data history (on the one hand) and a future sample (on the other) that expose the shared information between them to a bilinear form. The parameters of $g_{\text{enc}}$ and $g_{\text{RNN}}$, along with the matrices $\mathbf{W}_{\tau}$, are all learned by stochastic gradient descent of Eq. 10.7.
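The following forward-pass-only NumPy sketch is in the spirit of this construction, but the specific choices (a linear-tanh encoder, an Elman-style RNN, a single bilinear matrix for the lag $\tau$, random rather than trained parameters) are our own simplifications, not the architecture of [51]:

```python
import numpy as np

rng = np.random.default_rng(5)

# Dimensions (arbitrary, for illustration): observations, codes, RNN state, lag.
D_OBS, D_CODE, D_STATE, TAU = 16, 8, 8, 3

# Parameters (randomly initialized here; in practice learned by SGD on Eq. 10.7).
W_enc = rng.normal(size=(D_CODE, D_OBS)) / np.sqrt(D_OBS)      # static encoder
W_in  = rng.normal(size=(D_STATE, D_CODE)) / np.sqrt(D_CODE)   # RNN input weights
W_rec = rng.normal(size=(D_STATE, D_STATE)) / np.sqrt(D_STATE) # RNN recurrent weights
W_tau = rng.normal(size=(D_CODE, D_STATE)) / np.sqrt(D_STATE)  # bilinear form for lag TAU

def encode(x):
    # Static "encoder" ANN (here just a linear-tanh layer).
    return np.tanh(W_enc @ x)

def summarize(x_history):
    # Simple Elman-style RNN run over the encoded history; returns final state.
    h = np.zeros(D_STATE)
    for x_t in x_history:
        h = np.tanh(W_in @ encode(x_t) + W_rec @ h)
    return h

def delta_E(x_future, x_history):
    # Modeled energy *difference* (conditional vs. marginal): a bilinear
    # function of the encoded future sample and the RNN summary of the history.
    return encode(x_future) @ W_tau @ summarize(x_history)

# Score one positive future sample against 7 negatives drawn from later in the
# (synthetic) sequence, standing in for the marginal, then take the softmax
# cross entropy of Eq. 10.7 for this single trial.
sequence = rng.normal(size=(50, D_OBS))
history, positive = sequence[:20], sequence[20 + TAU - 1]
negatives = sequence[rng.choice(np.arange(25, 50), size=7, replace=False)]
scores = np.array([delta_E(positive, history)] +
                  [delta_E(neg, history) for neg in negatives])
loss = np.logaddexp.reduce(scores) - scores[0]
print("per-trial InfoNCE loss:", loss)
```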
10.1.3 “Local” NCE
There is another way of generalizing NCE, subtly different from InfoNCE [43]. In short, although (as before) only one out of $K$ samples will be positive, our generative model will now be ignorant of this fact (cf. Fig. 10.1A, the graphical model for InfoNCE, with Fig. 10.1B). It will instead (incorrectly) treat each “example” as independent of the others, and furthermore assume (incorrectly) that positive and negative examples are equally likely. We can still compute the posterior distribution over categorical random variables (one-hot vectors $\boldsymbol{s}$) under this model by aggregating together the relevant examples, even though the model doesn’t know that they form a group:
$$\hat{p}(\boldsymbol{s} \mid \boldsymbol{y}_1, \ldots, \boldsymbol{y}_K; \boldsymbol{\theta}) = \prod_{k=1}^{K} \hat{p}(s_k \mid \boldsymbol{y}_k; \boldsymbol{\theta}) = \prod_{k=1}^{K} \sigma\!\left(\Delta E_{\boldsymbol{\theta}}(\boldsymbol{y}_k)\right)^{s_k} \sigma\!\left(-\Delta E_{\boldsymbol{\theta}}(\boldsymbol{y}_k)\right)^{1 - s_k}.$$
The loss under the data distribution is then
$$\mathcal{L}_{\text{local}}(\boldsymbol{\theta}) = -\mathbb{E}_{(\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_K, \boldsymbol{S}) \sim p}\!\left[\log \sigma\!\left(\Delta E_{\boldsymbol{\theta}}(\boldsymbol{Y}_{k^*})\right) + \sum_{j \neq k^*} \log \sigma\!\left(-\Delta E_{\boldsymbol{\theta}}(\boldsymbol{Y}_j)\right)\right].$$
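For comparison with the softmax version above, here is a minimal NumPy sketch of this "local" loss (names ours); note that each example contributes its own independent Bernoulli term:

```python
import numpy as np

def local_nce_loss(delta_E, positive_index):
    # delta_E:        shape (num_trials, K); noise-minus-model energy differences.
    # positive_index: shape (num_trials,); which of the K examples is data.
    # Each example is treated as an independent binary classification:
    # the positive should score sigma(delta_E), each negative sigma(-delta_E).
    def log_sigmoid(z):
        return -np.logaddexp(0.0, -z)
    rows = np.arange(delta_E.shape[0])
    pos_term = log_sigmoid(delta_E[rows, positive_index])
    neg_term = (log_sigmoid(-delta_E).sum(axis=1)
                - log_sigmoid(-delta_E[rows, positive_index]))
    return -(pos_term + neg_term).mean()

# Tiny usage example:
rng = np.random.default_rng(6)
scores = rng.normal(size=(4, 8))
pos = rng.integers(0, 8, size=4)
print(local_nce_loss(scores, pos))
```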