Chapter 10 Learning Energy-Based Models

One of the basic problems we have been grappling with in fitting generative models to data is how to make the model sufficiently expressive. For example, some of the complexity or “lumpiness” of the data distribution can be explained as the effect of marginalizing out some latent variables—as in a mixture of Gaussians. As we have seen, GMMs are not sufficient to model (e.g.) natural images, so we need to introduce more complexity. Latent-variable models like the VAE attempt to push the remaining complexity into the mean (or other parameters) of the emission distribution, by making it (or them) a deep neural-network function of the latent variables. Normalizing flows likewise map simply-distributed latent variables into variables with more complicated distributions, although they treat the output of the neural network itself as the random variable of interest (we don’t bother to add a little Gaussian noise)—but at the price that the network must be invertible.

An alternative to all of these is to model the unnormalized distribution of observed variables, or equivalently, the energy:

$$
E(\bm{\hat{y}}, \bm{\theta}) := -\log \hat{p}(\bm{\hat{y}}; \bm{\theta}) - \log Z(\bm{\theta})
\quad\implies\quad
\hat{p}(\bm{\hat{y}}; \bm{\theta}) = \frac{1}{Z(\bm{\theta})}\exp\bigl\{-E(\bm{\hat{y}}, \bm{\theta})\bigr\}.
$$
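To make the setup concrete, here is a minimal sketch (not from the text) of an energy-based model in PyTorch: the energy $E(\bm{\hat{y}}, \bm{\theta})$ is a small neural network mapping each observation to a real number, so the unnormalized log-density is just its negation. The class name `EnergyNet`, the hidden sizes, and the activation are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Maps an observation y to a scalar energy E(y; theta).
    The unnormalized log-density is -E(y; theta); the normalizer
    Z(theta) is never computed."""
    def __init__(self, dim_y, dim_hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_y, dim_hidden), nn.SiLU(),
            nn.Linear(dim_hidden, dim_hidden), nn.SiLU(),
            nn.Linear(dim_hidden, 1),          # scalar output: the energy
        )

    def forward(self, y):
        return self.net(y).squeeze(-1)         # shape: (batch,)

energy = EnergyNet(dim_y=2)
y = torch.randn(5, 2)                          # a batch of (hypothetical) observations
log_p_unnorm = -energy(y)                      # log p-hat, up to the constant -log Z(theta)
```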

The advantage is that $E(\cdot, \bm{\theta})$ can be an arbitrarily complex function mapping to the real line and, consequently, we are not limited to distributions with a known parametric form (like Gaussian or Poisson), or to those that can be constructed out of invertible transformations of noise. The seemingly fatal disadvantages are that (1) without a parametric model, it is not immediately obvious how to generate samples; and (2) without the normalizer, we cannot assign a probability to any datum (although we can assign relative probabilities to any pair of data, since $Z(\bm{\theta})$ cancels in the ratio). And computing the normalizer will be intractable if we let $E$ be particularly complex, which was the whole point of taking this approach!
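That last observation is worth spelling out: because $Z(\bm{\theta})$ is the same for every observation, $\log\hat{p}(\bm{y}_1;\bm{\theta}) - \log\hat{p}(\bm{y}_2;\bm{\theta}) = E(\bm{y}_2,\bm{\theta}) - E(\bm{y}_1,\bm{\theta})$. Continuing the hypothetical `EnergyNet` sketch above, comparing two data points therefore never touches the normalizer:

```python
# Relative probability of two observations: the normalizer cancels,
# so only an energy difference is needed.
y1 = torch.randn(1, 2)
y2 = torch.randn(1, 2)
log_ratio = energy(y2) - energy(y1)     # log[ p(y1) / p(y2) ]
print(f"p(y1)/p(y2) = {log_ratio.exp().item():.3f}")
```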