Chapter 9 Learning with Reparameterizations
9.0.1 A duality between generative and discriminative learning
As we have seen, in generative models, expressive power typically comes at the price of ease of inference (and vice versa).
So far we have explored three different strategies for this trade-off: (1) severely limit expressive power to conjugate or pseudo-conjugate prior distributions, in order to allow for exact inference (Chapter 7); (2) allow arbitrarily expressive generative models, but then approximate inference with a separate recognition model that is itself arbitrary but highly parameterized (Sections 8.1 and 8.2), or with a generic homogenizer (Section 8.3); or (3) allow arbitrarily expressive generative models, but make inference correct only up to simplifying (but erroneous) independence assumptions (Section 8.4).
Now we introduce a fourth strategy: let the latent variables of the model be related to the observed variables by an invertible (and therefore deterministic) transformation.
This makes the model marginal over the observations computable in closed form, via the change-of-variables formula, and hence allows the log-likelihood to be maximized directly.
Below we explore two such models in the usual way, starting with simple linear transformations and then moving to more complicated functions.
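As a concrete (if toy) preview of this strategy, the following sketch fits an invertible linear map to data by ascending the exact log-likelihood given by the change-of-variables formula. The linear map, the standard-normal prior on the latent variables, and all of the names here are illustrative assumptions for this sketch, not the models developed below.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: observations y produced by linearly mixing independent Gaussian
# sources (the mixing matrix is part of the illustration, not of the text).
D, N = 2, 5000
A_true = np.array([[2.0, 0.5],
                   [0.3, 1.0]])
Y = A_true @ rng.standard_normal((D, N))        # columns are observations

# Invertible model: x = W y, with a standard-normal prior on x.  The
# change-of-variables formula gives the exact marginal:
#     log p(y) = log N(W y; 0, I) + log|det W|.
def avg_log_likelihood(W, Y):
    X = W @ Y
    log_prior = -0.5 * np.sum(X**2, axis=0) - 0.5 * D * np.log(2.0 * np.pi)
    return log_prior.mean() + np.linalg.slogdet(W)[1]

# Gradient of the average log-likelihood with respect to W (derived by hand):
#     dJ/dW = W^{-T} - W S,  where S is the empirical second moment of y.
S = (Y @ Y.T) / N
W = np.eye(D)
for _ in range(2000):
    W += 0.05 * (np.linalg.inv(W).T - W @ S)    # plain gradient ascent

print("average log-likelihood:", avg_log_likelihood(W, Y))
print("W^T W S (should be close to the identity):\n", W.T @ W @ S)

At the optimum the map simply whitens the data; the models explored below begin with exactly this kind of linear transformation and then move to more flexible invertible functions.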
But let us begin with a more abstract formulation, in order to make contact with the unsupervised, discriminative learning problems of Section 5.2.
In that section, we related observations, $\boldsymbol{y}$, to “latent” output variables, $\hat{\boldsymbol{x}}$, through a parameterized, deterministic map,
$$\hat{\boldsymbol{x}} = \boldsymbol{f}(\boldsymbol{y}; \boldsymbol{\theta}), \tag{9.1}$$
which we now additionally require to be invertible.
For generative models, it is also necessary to specify a prior distribution over the latent variables. Although the discriminative model makes no such specification, its training objective—maximization of the latent (“output”) entropy—will tend to produce independent outputs. Therefore, we shall assume that the generative model’s latent variables are independent—and, for now, nothing else:
$$\hat{p}(\boldsymbol{x}; \boldsymbol{\theta}) = \prod_k \hat{p}_k(x_k; \boldsymbol{\theta}) = \prod_k \frac{\mathrm{d}\hat{P}_k}{\mathrm{d}x_k}(x_k; \boldsymbol{\theta}).$$
In the second equality, we have simply expressed the probability distributions in terms of the cumulative distribution functions (CDFs), $\hat{P}_k(x_k; \boldsymbol{\theta}) := \int_{-\infty}^{x_k} \hat{p}_k(x'; \boldsymbol{\theta})\, \mathrm{d}x'$, of the individual latent variables.
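As a small, self-contained illustration of specifying a factorized prior through CDFs, the sketch below uses the logistic CDF (a sigmoidal squashing function) and recovers each density as the derivative of its CDF; the particular choice of CDF and the function names are assumptions of the example.

import numpy as np

def logistic_cdf(x):
    # CDF of the standard logistic distribution (a sigmoidal squashing function).
    return 1.0 / (1.0 + np.exp(-x))

def logistic_pdf(x):
    # Density obtained by differentiating the CDF: dP/dx = P(x) * (1 - P(x)).
    P = logistic_cdf(x)
    return P * (1.0 - P)

def log_prior(x):
    # Factorized prior: sum of independent, per-component log-densities.
    return np.sum(np.log(logistic_pdf(x)), axis=-1)

x = np.array([0.3, -1.2, 2.0])
print(log_prior(x))

# Numerical check that each density really is the derivative of its CDF.
eps = 1e-6
print(np.allclose((logistic_cdf(x + eps) - logistic_cdf(x - eps)) / (2 * eps),
                  logistic_pdf(x)))

Any strictly increasing, differentiable CDF would serve equally well here.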
In the discriminative models of Section 5.2, rather than being specified explicitly, the distribution of latent variables was inherited from the data distribution: it is simply the pushforward of $p(\boldsymbol{y})$ through the map of Eq. 9.1. Taking the generative model’s expected log-likelihood under the data distribution, with the prior written in terms of its CDFs as above, then reproduces the training objective of that section.
We have recovered Eq. 5.29, but from a different model [3].
Both models use the same map between observations and “latent” variables, given by Eq. 9.1.
But the discriminative model proceeds to pass the outputs of this map through squashing functions, whose joint entropy it then maximizes; the generative model instead interprets those same squashing functions as the CDFs of its prior over the latent variables, and maximizes the log-likelihood of the observations.
Thus, in the deterministic, invertible setting, maximizing mutual information through a discriminative model is equivalent to density estimation in a generative model.
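The equivalence can be spelled out in the notation assumed above (which need not match that of Section 5.2). Writing $\hat{\boldsymbol{x}} = \boldsymbol{f}(\boldsymbol{y}; \boldsymbol{\theta})$ for the outputs of the map and $\hat{u}_k := \hat{P}_k(\hat{x}_k; \boldsymbol{\theta})$ for the squashed outputs, the change-of-variables formula gives
$$\mathbb{E}_{\boldsymbol{y}}\!\left[\log \hat{p}(\boldsymbol{y}; \boldsymbol{\theta})\right]
= \mathbb{E}_{\boldsymbol{y}}\!\left[\sum_k \log \hat{p}_k(\hat{x}_k; \boldsymbol{\theta})
+ \log\left|\det \frac{\partial \boldsymbol{f}}{\partial \boldsymbol{y}}\right|\right]
= \mathrm{H}[\hat{\boldsymbol{u}}] - \mathrm{H}[\boldsymbol{y}],$$
where the expectations are under the data distribution and $\mathrm{H}[\cdot]$ denotes differential entropy. The second equality holds because the squashed outputs are themselves an invertible, differentiable function of the observations, with Jacobian determinant $\prod_k \hat{p}_k(\hat{x}_k; \boldsymbol{\theta}) \cdot \left|\det \partial \boldsymbol{f} / \partial \boldsymbol{y}\right|$. Since the data entropy $\mathrm{H}[\boldsymbol{y}]$ does not depend on $\boldsymbol{\theta}$, maximizing the generative model’s log-likelihood and maximizing the entropy of the squashed discriminative outputs select the same parameters.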
It is also useful to re-express the density-estimation problem in terms of
where
We require a reparameterization of the data,