10.1 Noise-Contrastive Objectives
Here we focus on (2), setting aside the problem of generating samples.
One possible solution is to design a loss function that is minimized only for an energy corresponding to a normalized distribution, i.e. for which

$\int \exp\{-E(y;\theta)\}\, dy = 1.$
10.1.1 Noise-Contrastive Estimation
The basic intuition behind noise-contrastive estimation (NCE) [9]
is that one such objective is distinguishing data from noise.
More precisely, we let the task be to discriminate good or “positive” samples, drawn from the data distribution $p(y)$, from bad or “negative” samples, drawn from a known “noise” distribution $p_n(y)$.
Mathematically, if we let the binary random variable $c$ indicate whether a sample was drawn from the data ($c = 1$) or from the noise distribution ($c = 0$), and set the prior uniform, $p(c=1) = p(c=0) = 1/2$, then the joint distribution of the data is

$p(y, c) = p(c)\, p(y|c), \qquad p(y|c=1) = p(y), \quad p(y|c=0) = p_n(y).$    (10.1)
We will give our generative model the same form, except that our model for the positive data will not be normalized. For notational symmetry between the model and noise distribution, we also write the noise distribution in terms of an energy, $p_n(y) = \exp\{-E_n(y)\}$. However, we define this energy such that this noise distribution is indeed normalized. Our generative model is then

$\hat{p}(y \mid c=1; \theta) = \exp\{-E(y;\theta)\}, \qquad \hat{p}(y \mid c=0) = \exp\{-E_n(y)\} = p_n(y),$

with the same uniform prior over $c$. Note well that $\exp\{-E(y;\theta)\}$ is not required to integrate to one. We nevertheless apply Bayes’ rule as though it were a proper emission distribution, yielding the (normalized) posterior probability that a sample is positive,

$\hat{p}(c=1 \mid y; \theta) = \frac{\exp\{-E(y;\theta)\}}{\exp\{-E(y;\theta)\} + \exp\{-E_n(y)\}} = \sigma\big(E_n(y) - E(y;\theta)\big),$    (10.2)

with $\sigma(\cdot)$ the logistic (sigmoid) function.
The key result that makes NCE work is that the cross entropy of this posterior (see Eq. 10.4 below) is minimized only when the model matches the data distribution, that is, when $\exp\{-E(y;\theta)\} = p(y)$; in particular, the optimal energy is automatically normalized. Intuitively, a model that misallocates probability mass, for instance by making its implicit normalizer too small or too large, will misclassify some noise samples as data (or some data samples as noise), and it pays for these mistakes in cross entropy.
Notice, however, that these mistakes will be less noticeable if the data and noise distributions are very different from each other—e.g., if the bulks of the probability masses of the distributions are very far from each other. In this case, the model could assign (e.g.) overly high probability to the data (by making the normalizer too small) without making the noise samples particularly probable under the model. Technically, the normalized energy of the data distribution is guaranteed to be the unique solution to the loss based on the posterior in Eq. 10.2 (see below) as long as the noise distribution is supported wherever the data distribution is. But for finite training samples (the situation in which we usually find ourselves), the guarantee is voided. The problem would appear to be more acute for more expressive model distributions.
Quasi-generative learning.
The cross-entropy loss is the negative log of the posterior distribution (Eq. 10.2), averaged under the data (Eq. 10.1):

$\mathcal{L}(\theta) = -\mathbb{E}_{p(y,c)}\big[\log \hat{p}(c \mid y;\theta)\big] = -\tfrac{1}{2}\,\mathbb{E}_{p(y)}\Big[\log \sigma\big(E_n(y) - E(y;\theta)\big)\Big] - \tfrac{1}{2}\,\mathbb{E}_{p_n(y)}\Big[\log \sigma\big(E(y;\theta) - E_n(y)\big)\Big].$    (10.4)
This is evidently a discriminative problem, but with a twist.
The canonical generative approach to binary classification is to model the generative distribution of each class, $p(y|c)$, along with the class prior, and then to classify with the posterior delivered by Bayes’ rule. Here we have indeed built such a model, but the emission for the positive class is left unnormalized, and the classification is only a pretext: the posterior serves merely as a device for fitting the energy of the data.
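To make this concrete, here is a minimal NumPy sketch of the binary NCE objective in Eq. 10.4. The Gaussian-shaped energies, the parameterization (a mean, a log-scale, and an offset standing in for a learnable log-normalizer), and all function names are illustrative assumptions rather than anything prescribed by NCE; the point is only the structure of the loss, a cross entropy of the logistic function of the energy difference.

    import numpy as np

    rng = np.random.default_rng(0)

    def model_energy(y, theta):
        # Unnormalized, Gaussian-shaped energy E(y; theta); the offset c shifts the
        # implicit normalizer, which NCE must get right. (Illustrative choice.)
        mu, log_s, c = theta
        return 0.5 * ((y - mu) / np.exp(log_s)) ** 2 + c

    def noise_energy(y):
        # Standard-normal noise written as a *normalized* energy: E_n(y) = -log p_n(y).
        return 0.5 * y ** 2 + 0.5 * np.log(2 * np.pi)

    def log_sigmoid(z):
        # Numerically stable log of the logistic function.
        return -np.logaddexp(0.0, -z)

    def nce_loss(theta, y_data, y_noise):
        # Cross entropy of the posterior "this sample is positive" (Eqs. 10.2, 10.4):
        # the posterior is sigma(E_n(y) - E(y; theta)).
        z_data = noise_energy(y_data) - model_energy(y_data, theta)
        z_noise = noise_energy(y_noise) - model_energy(y_noise, theta)
        return -0.5 * (log_sigmoid(z_data).mean() + log_sigmoid(-z_noise).mean())

    y_data = rng.normal(loc=1.0, scale=0.5, size=10_000)   # "positive" samples
    y_noise = rng.normal(loc=0.0, scale=1.0, size=10_000)  # "negative" samples
    print(nce_loss((0.0, 0.0, 0.0), y_data, y_noise))

Minimizing this loss in $\theta$ with any gradient-based optimizer drives $\exp\{-E(y;\theta)\}$ toward the data density, normalization included.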
[[Nice properties of the estimator….]]
10.1.2 InfoNCE
Van den Oord and colleagues propose to put NCE to a very different purpose [51]. Rather than attempting to learn a parametric form for the probability of observed samples, they aim to extract useful features from data. In order to do so, they introduce what amounts to four novel variations on NCE, which we discuss one at a time.
(1) Generalizing to multiple “examples.”
Suppose that the observation consists not of a single sample but of $K$ of them, $y_1, \ldots, y_K$, exactly one of which (the “positive” example) is drawn from the data distribution, the remaining $K-1$ (the “negative” examples) being drawn independently from the noise distribution.

In this setup, the latent variable is categorical (conceived as a one-hot vector, or simply as the index $k$ of the positive example), and the joint distribution of the data is

$p(y_1, \ldots, y_K, k) = p(k)\, p(y_k) \prod_{j \neq k} p_n(y_j), \qquad p(k) = 1/K.$    (10.5)
Again we have set the prior uniform, since we have no reason to make any one of the elements more or less likely to be noise than any other.
We emphasize that this is not a mixture model: a single sample contains all $K$ elements, one drawn from the data distribution and the other $K-1$ drawn from the noise distribution; the categorical variable merely records which element is the positive one.
The generative model takes the same form, with the model distribution taking the place of the data marginal. Writing it in terms of energies, we obtain

$\hat{p}(y_1, \ldots, y_K \mid k; \theta) = \exp\{-E(y_k;\theta)\} \prod_{j \neq k} \exp\{-E_n(y_j)\}.$

Again we ignore the fact that the emission is unnormalized and simply compute a (normalized) posterior distribution with Bayes’ rule,

$\hat{p}(k \mid y_1, \ldots, y_K; \theta) = \frac{\exp\{-E(y_k;\theta)\}\prod_{j \neq k}\exp\{-E_n(y_j)\}}{\sum_{l=1}^{K}\exp\{-E(y_l;\theta)\}\prod_{j \neq l}\exp\{-E_n(y_j)\}} = \frac{\exp\{E_n(y_k) - E(y_k;\theta)\}}{\sum_{l=1}^{K}\exp\{E_n(y_l) - E(y_l;\theta)\}};$

that is, the softmax function applied to the differences between the noise and model energies. The loss is again the cross entropy of this posterior, averaged under the data distribution (Eq. 10.5):

$\mathcal{L}(\theta) = -\mathbb{E}\big[\log \hat{p}(k \mid y_1, \ldots, y_K; \theta)\big] = -\mathbb{E}\Big[\log \big[\operatorname{softmax}\big(E_n(y_1) - E(y_1;\theta),\, \ldots,\, E_n(y_K) - E(y_K;\theta)\big)\big]_k\Big].$
In the final line, we are selecting only that output of the softmax function that corresponds to the actual positive sample (whose index will of course differ from trial to trial).
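A schematic NumPy version of this $K$-way loss may help fix ideas. The random numbers standing in for the energy differences are placeholders; in practice each entry would be $E_n(y_j) - E(y_j;\theta)$ computed from the model and noise energies.

    import numpy as np

    def infonce_loss(diff, positive_index):
        # diff: (batch, K) array whose [b, j] entry is E_n(y_j) - E(y_j; theta)
        #       for the j-th candidate in observation b.
        # positive_index: (batch,) array giving which candidate is the data sample.
        # Stable log-softmax across the K candidates, then select the positive entry.
        m = diff.max(axis=1, keepdims=True)
        log_normalizer = np.log(np.exp(diff - m).sum(axis=1)) + m[:, 0]
        log_posterior = diff[np.arange(diff.shape[0]), positive_index] - log_normalizer
        return -log_posterior.mean()

    rng = np.random.default_rng(1)
    batch, K = 8, 16
    diff = rng.normal(size=(batch, K))      # stand-in energy differences
    pos = rng.integers(0, K, size=batch)    # index of the positive sample per observation
    print(infonce_loss(diff, pos))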
Negative samples enforce normalization.
We can shed light on the role played by the negative examples by considering them separately from the positive example in the posterior probability of a positive example:

$\hat{p}(k \mid y_1, \ldots, y_K; \theta) = \frac{\exp\{E_n(y_k) - E(y_k;\theta)\}}{\exp\{E_n(y_k) - E(y_k;\theta)\} + \sum_{j \neq k}\exp\{E_n(y_j) - E(y_j;\theta)\}}.$

Now notice that the negative-sample terms sum approximately to a constant:

$\frac{1}{K-1}\sum_{j \neq k}\exp\{E_n(y_j) - E(y_j;\theta)\} \approx \mathbb{E}_{p_n(y)}\!\left[\frac{\exp\{-E(y;\theta)\}}{p_n(y)}\right] = \int \exp\{-E(y;\theta)\}\, dy =: Z(\theta).$    (10.8)
The approximate equality becomes more exact as the number of negative examples increases. (And technically, the final equality requires the model and noise distributions to have the same support.) Eq. 10.8 says that, if we had in hand an expression for the normalizer, we could do without the negative samples altogether—they drop out of the loss function. Indeed, the loss now becomes

$\mathcal{L}(\theta) \approx \mathbb{E}_{p(y)}\Big[E(y;\theta) - E_n(y) + \log\big(\exp\{E_n(y) - E(y;\theta)\} + (K-1)\,Z(\theta)\big)\Big] \approx \mathbb{E}_{p(y)}\big[E(y;\theta)\big] + \log Z(\theta) - \mathbb{E}_{p(y)}\big[E_n(y)\big] + \log(K-1),$

where the final line follows for large $K$, in which case the lone positive-sample term inside the logarithm is negligible next to $(K-1)Z(\theta)$. Up to terms that do not depend on $\theta$, this is just the negative log-likelihood of the normalized model $\exp\{-E(y;\theta)\}/Z(\theta)$.
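The claim that the negative-sample terms estimate the normalizer is easy to check numerically. In the sketch below, the Gaussian forms and all constants are conveniently chosen assumptions; the sample average of $\exp\{E_n(y) - E(y;\theta)\}$ over noise draws is compared with the model’s true normalizer, which is available in closed form for this toy energy.

    import numpy as np

    rng = np.random.default_rng(2)

    # Unnormalized model: exp{-E(y)} = a * Normal(y; mu, s^2), so the true normalizer is a.
    mu, s, a = 1.0, 0.5, 3.0

    def model_energy(y):
        return 0.5 * ((y - mu) / s) ** 2 + np.log(s * np.sqrt(2 * np.pi)) - np.log(a)

    def noise_energy(y):
        # Standard-normal noise, written as a normalized energy: E_n(y) = -log p_n(y).
        return 0.5 * y ** 2 + 0.5 * np.log(2 * np.pi)

    for K in (10, 100, 10_000, 1_000_000):
        y_noise = rng.normal(size=K)
        Z_hat = np.mean(np.exp(noise_energy(y_noise) - model_energy(y_noise)))
        print(f"{K:>9} noise samples: estimated Z = {Z_hat:6.3f}   (true Z = {a})")

The estimate is just importance sampling of $\int \exp\{-E(y;\theta)\}\,dy$ with the noise distribution as proposal, which is why it requires the noise distribution to cover the support of the model.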
(2) Modeling the energy difference.
We have assumed up to this point that the source of our “noise” samples also provides an evaluatable expression for their probability. What if we have only samples from the noise distribution? Can we still learn a model of the positive data?
One obvious solution is to learn a model for the negative as well as the positive samples; for example, to build a parameterized model for the noise energy, $E_n(y;\theta_n)$, and fit it to the negative samples. But notice that the posteriors above depend on the two energies only through their difference, $E_n(y) - E(y;\theta)$, so a more parsimonious alternative, and the one adopted by InfoNCE, is to model this difference directly with a single parameterized function.

One subtlety with modeling the difference is that we thereby give up on the original aim of NCE, a (normalized) model of the data density: at the optimum, the modeled difference equals the log-ratio of the data and noise densities, so without an evaluatable noise density, the data density itself cannot be recovered. For the purpose of extracting useful features, however, the ratio suffices.
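As a sketch of what modeling the difference might look like in practice, the following uses a small two-layer network as a stand-in for $E_n(y) - E(y;\theta)$ and feeds its outputs into the same $K$-way softmax cross entropy as above. The architecture, sizes, and names are assumptions for illustration, not the authors’ choices.

    import torch
    import torch.nn as nn

    class EnergyDifference(nn.Module):
        # Scalar model of E_n(y) - E(y; theta); larger outputs mean "more data-like."
        def __init__(self, dim, hidden=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

        def forward(self, y):               # y: (..., dim)
            return self.net(y).squeeze(-1)  # (...)

    f = EnergyDifference(dim=10)
    y = torch.randn(8, 16, 10)              # 8 observations, each with K = 16 candidates
    pos = torch.randint(0, 16, (8,))        # index of the positive sample in each
    logits = f(y)                           # (8, 16) modeled energy differences
    loss = nn.functional.cross_entropy(logits, pos)   # K-way posterior cross entropy
    loss.backward()                         # gradients flow into the difference model

Note that neither energy is ever evaluated on its own; only their difference is.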
(3) Contrasting a conditional with a marginal distribution.
In the third departure from the original NCE, the InfoNCE method proposes to learn to model a conditional distribution, $p(y|x)$, for some auxiliary variable $x$, while using the corresponding marginal distribution, $p(y)$, as the noise distribution. That is, the positive example is drawn jointly with $x$, whereas the negative examples are drawn from the marginal, independently of $x$. Intuitively, the model is thereby pressed to capture precisely the information that $x$ carries about $y$.
This can be made precise in the language of information theory.
However, the information we aim to increase is not precisely a mutual information, neither between $X$ and $Y$ under the data distribution, nor between them under the model. The information quantity we are interested in retains the marginal entropy over $Y$ under the data distribution, but replaces the data’s conditional entropy with a cross entropy, taken under the data but evaluated with the model conditional:

$I^{\times}(Y;X) := H_{p}[Y] - H^{\times}_{p,\hat{p}}[Y|X] = \mathbb{E}_{p(x,y)}\!\left[\log\frac{\hat{p}(y|x;\theta)}{p(y)}\right].$
We might accordingly call this (for want of something better) the “cross mutual information.”
Intuitively, it is the portion of the (actual) entropy of $Y$ that the model explains away by conditioning on $X$.
Now, Gibbs’s inequality tells us that the cross entropy can be no smaller than the corresponding entropy, $H^{\times}_{p,\hat{p}}[Y|X] \ge H_{p}[Y|X]$, so the cross mutual information is a lower bound on the true mutual information. It is also, approximately, what the InfoNCE loss measures:

$I^{\times}(Y;X) = \mathbb{E}_{p(x,y)}\big[E_n(y) - E(y|x;\theta)\big] \approx \log K - \mathcal{L}(\theta).$    (10.11)

The final (approximate) equality follows because of the (somewhat subtle) fact that the expectation includes only positive samples, to wit, samples in which $y$ was indeed drawn from the conditional distribution, $p(y|x)$, rather than from the marginal; the remaining, negative-sample terms in the loss behave as in Eq. 10.8, contributing (for a roughly normalized model) approximately $\log K$.
Eq. 10.11 tells us that decreasing the posterior cross entropy (on the right-hand side of Eq. 10.11) increases, at least approximately, the cross mutual information (on the left).
The larger $K$, the more negative samples, and so the better the approximation in Eq. 10.11.
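Combining the last two observations, and under the same approximations (with $I^{\times}$ the cross mutual information as defined above, a notation adopted here for convenience):

$I(Y;X) \;\ge\; I^{\times}(Y;X) \;\approx\; \log K - \mathcal{L}(\theta).$

Since the loss is nonnegative, the right-hand side can never exceed $\log K$: a $K$-way classification can certify at most $\log K$ nats of shared information, which is another way of seeing why more negative samples help.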
But now notice that there is nothing mathematical to distinguish the roles played by $x$ and $y$: mutual information is symmetric in its arguments, and the auxiliary variable can be any quantity that is jointly distributed with the samples of interest.
(4) Modeling future samples of a sequence.
There are many possibilities, but one nice application of InfoNCE is to time-series data, and in particular to learning to extract useful information from the “auxiliary variable” $x_t$, here a summary of the sequence up to time $t$, about samples $y_{t+k}$ lying $k$ steps in the future.

As lately discussed, the authors model the difference between the conditional and unconditional energies, rather than the energies themselves. In particular, they let this model have the form

$E_n(y_{t+k}) - E(y_{t+k} \mid x_t; \theta) = z_{t+k}^{\mathrm{T}} W_k x_t, \qquad z_{t+k} = g_{\text{enc}}(y_{t+k}),$

where $g_{\text{enc}}$ is a learned encoder of individual samples, $x_t$ is the output of an autoregressive network run over the encodings $z_1, \ldots, z_t$ (a GRU in [51]), and $W_k$ is a learned matrix, one for each number of steps $k$ into the future.
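A compact PyTorch sketch of this arrangement follows. The encoder, the network sizes, and the use of the other sequences in a batch as negative samples are illustrative assumptions (in [51] the encoder and the autoregressive network are domain-specific, e.g. a convolutional network and a GRU for audio); only the log-bilinear score and the $K$-way cross entropy are the point.

    import torch
    import torch.nn as nn

    class CPCSketch(nn.Module):
        def __init__(self, y_dim=20, z_dim=32, c_dim=64, max_steps=4):
            super().__init__()
            self.encoder = nn.Linear(y_dim, z_dim)                  # g_enc: y_t -> z_t
            self.context = nn.GRU(z_dim, c_dim, batch_first=True)   # x_t from z_1..z_t
            self.W = nn.ModuleList(nn.Linear(c_dim, z_dim, bias=False)
                                   for _ in range(max_steps))       # one W_k per step k

        def loss_at(self, y, t, k):
            # y: (batch, T, y_dim). Score y_{t+k} against the context x_t; the other
            # sequences in the batch serve as the K - 1 negative samples.
            z = self.encoder(y)                         # (batch, T, z_dim)
            x, _ = self.context(z[:, :t + 1])           # (batch, t+1, c_dim)
            pred = self.W[k - 1](x[:, -1])              # (batch, z_dim), i.e. W_k x_t
            scores = pred @ z[:, t + k].T               # (batch, batch) bilinear scores
            targets = torch.arange(y.shape[0])          # positives lie on the diagonal
            return nn.functional.cross_entropy(scores, targets)

    model = CPCSketch()
    y = torch.randn(8, 50, 20)                          # 8 sequences of length 50
    print(model.loss_at(y, t=30, k=2))

In practice the losses for several offsets $k$ (and several times $t$) are summed, and after training the encoder and context network are kept as feature extractors.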
10.1.3 “Local” NCE
There is another way of generalizing NCE, subtly different from InfoNCE [43].
In short, although (as before) only one out of
The loss under the data distribution is then