Chapter 2 Directed Generative Models

Generative vs. discriminative models.

A generative model margin: generative models  specifies a joint distribution over all random variables of interest. Now, what counts as a random variable, as opposed to a parameter, can itself be a decision for the modeler—at least for Bayesians (non-frequentists); but for now we set aside this question. Instead we might wonder what circumstances could justify specifying less than the entire joint distribution. [[See discussion in [21].]] One such circumstance is the construction of maps, e.g., from variables $\bm{Y}$ to $\bm{X}$. Such maps can be constructed by considering only the conditional distribution, $p(\bm{x} \mid \bm{y})$, and ignoring the marginal distribution of $\bm{Y}$. These are known as discriminative models. margin: discriminative models

Perhaps now the case for discriminative learning of maps seems, not just plausible, but overwhelming. When is it helpful to model the joint distribution, $p(\bm{x}, \bm{y})$, in the construction of a map (function) from $\bm{Y}$ to $\bm{X}$? One clear candidate is when we have some idea of the generative process, and it runs in the other direction. That is, suppose that data were (causally) generated by drawing some $\bm{x}$ from $p(\bm{x})$, followed by drawing a $\bm{y}$ from $p(\bm{y} \mid \bm{x})$. Then it seems reasonable to build a model with matching structure, $\hat{p}(\bm{\hat{x}}, \bm{\hat{y}}; \bm{\theta}) = \hat{p}(\bm{\hat{y}} \mid \bm{\hat{x}}; \bm{\theta})\,\hat{p}(\bm{\hat{x}}; \bm{\theta})$. If we do want a map from $\bm{Y}$ to $\bm{X}$, we will need to apply Bayes's theorem, which converts a prior distribution, $\hat{p}(\bm{\hat{x}}; \bm{\theta})$, and an emission distribution,¹ $\hat{p}(\bm{\hat{y}} \mid \bm{\hat{x}}; \bm{\theta})$, into a posterior distribution, $\hat{p}(\bm{\hat{x}} \mid \bm{\hat{y}}; \bm{\theta})$.

¹ The standard terms—prior, likelihood, and posterior—are, unfortunately, overloaded. For example, the “maximum-likelihood” estimate refers to the likelihood of the parameters, $\bm{\theta}$, not of a random variable, $\bm{\hat{X}}$. Indeed, the term “likelihood” was introduced in a purely frequentist context by Fisher [8]. Following the literature on hidden Markov models, this book typically refers to $\hat{p}(\bm{\hat{y}} \mid \bm{\hat{x}}; \bm{\theta})$ as the “emission” rather than the “likelihood,” except when it is to be interpreted as a function of $\bm{\hat{X}}$, which at least minimizes one collision.

We recall that Bayes's “theorem” margin: Bayes's theorem  is just a rearrangement of the definition of conditional probabilities:

\begin{equation}
\begin{split}
\hat{p}(\bm{\hat{x}} \mid \bm{\hat{y}}; \bm{\theta})
&= \frac{\hat{p}(\bm{\hat{y}} \mid \bm{\hat{x}}; \bm{\theta})\,\hat{p}(\bm{\hat{x}}; \bm{\theta})}{\hat{p}(\bm{\hat{y}}; \bm{\theta})} \\
&= \frac{\hat{p}(\bm{\hat{y}} \mid \bm{\hat{x}}; \bm{\theta})\,\hat{p}(\bm{\hat{x}}; \bm{\theta})}{\int_{\bm{\hat{x}}} \hat{p}(\bm{\hat{y}} \mid \bm{\hat{x}}; \bm{\theta})\,\hat{p}(\bm{\hat{x}}; \bm{\theta})\,\mathrm{d}\bm{\hat{x}}} \\
&\propto \hat{p}(\bm{\hat{y}} \mid \bm{\hat{x}}; \bm{\theta})\,\hat{p}(\bm{\hat{x}}; \bm{\theta}).
\end{split}
\tag{2.1}
\end{equation}

The last formulation is, although less explicit, also common.² Indeed, for certain distributions, applying Bayes's theorem will not require computing the normalizer explicitly; instead, we shall simply recognize the parametric family to which the unnormalized product of prior and likelihood belongs. Won't we need the normalizer to make further computations with the posterior distribution? Not necessarily: for some distributions, the computations can be written purely in terms of the cumulants, which can be computed independently of the normalizer.

² I have written Bayes's theorem with $\bm{\hat{y}}$ set as a particular value rather than as a generic argument, in order to emphasize that the theorem provides a distribution over $\bm{\hat{X}}$ for a particular observation—in this case, one drawn from the generative distribution, $\hat{p}(\bm{\hat{y}}; \bm{\theta})$. But it is of course true for any observation $\bm{\hat{y}}$, i.e., as a statement about a function of two random variables rather than one. The idea is to emphasize that the formula relates three functions of the same variable, $\bm{\hat{x}}$: the posterior, likelihood, and prior distributions. Thus the “omitted” proportionality constant may depend on $\bm{\hat{y}}$, but not on $\bm{\hat{x}}$.
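
As a concrete illustration of “recognizing the parametric family,” here is a minimal numerical sketch; the scalar Gaussian model and all numerical values below are invented for illustration, not taken from the text. With a Gaussian prior on a scalar latent variable and a Gaussian emission centered on it, the unnormalized product of prior and emission is again a Gaussian, so the posterior mean and variance follow from completing the square, without ever computing the normalizer. The grid-based normalization at the end is included only as a check.

```python
import numpy as np

# Illustrative model: X ~ N(mu0, s0^2), Y | X = x ~ N(x, s^2).
mu0, s0 = 1.0, 2.0      # prior mean and standard deviation (invented)
s = 0.5                 # emission (observation-noise) standard deviation (invented)
y = 3.2                 # a particular observation (invented)

# "Recognize the family": the product of two Gaussians in x is an
# unnormalized Gaussian, whose parameters follow from completing the square.
post_prec = 1.0 / s0**2 + 1.0 / s**2          # posterior precision
post_var = 1.0 / post_prec
post_mean = post_var * (mu0 / s0**2 + y / s**2)
print("closed form: mean = %.4f, var = %.4f" % (post_mean, post_var))

# Check against brute-force normalization on a grid (the normalizer we avoided).
x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
unnorm = np.exp(-0.5 * ((x - mu0) / s0) ** 2 - 0.5 * ((y - x) / s) ** 2)
post = unnorm / (unnorm.sum() * dx)           # explicit normalization
grid_mean = (x * post).sum() * dx
grid_var = ((x - grid_mean) ** 2 * post).sum() * dx
print("grid check:  mean = %.4f, var = %.4f" % (grid_mean, grid_var))
```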

Then again, for other distributions, the normalizer is essential—for example, if $\bm{\hat{X}}$ were categorically distributed. And often it is useful in its own right: for some models, we never make observations of $\bm{X}$, only $\bm{Y}$, so $\hat{p}(\bm{y}; \bm{\theta})$ provides the ultimate measure of our model's worth. Accordingly, in many of the models that we consider below, we shall compute it. That is, we shall convert the prior and emission distributions into the posterior distribution (with Eq. 2.1) and this marginal distribution over the observations. We will thereby have “inverted the arrow” of the model: provided an alternative, albeit completely equivalent, characterization of the joint distribution of $\bm{\hat{X}}$ and $\bm{\hat{Y}}$.
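
For instance, when $\bm{\hat{X}}$ is categorical the normalizer is just a finite sum. The sketch below (a toy model with invented weights, means, and variances, not one from the text) applies Eq. 2.1 with a Gaussian emission: the same sum that normalizes the posterior is the marginal $\hat{p}(\bm{y}; \bm{\theta})$, whose logarithm could serve as a score for the model.

```python
import numpy as np

def gaussian_pdf(y, mu, sigma):
    """Density of a scalar Gaussian N(mu, sigma^2) evaluated at y."""
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Illustrative model: X ~ Categorical(pi), Y | X = k ~ N(mu[k], sigma[k]^2).
pi = np.array([0.5, 0.3, 0.2])       # prior over the three classes (invented)
mu = np.array([-2.0, 0.0, 3.0])      # per-class emission means (invented)
sigma = np.array([1.0, 0.5, 1.5])    # per-class emission standard deviations (invented)

y = 0.4                                   # a particular observation
joint = pi * gaussian_pdf(y, mu, sigma)   # p(x=k) p(y | x=k) for each k
p_y = joint.sum()                         # the normalizer = marginal p(y)
posterior = joint / p_y                   # Bayes's rule, Eq. 2.1

print("posterior over classes:", posterior)
print("log marginal likelihood of y:", np.log(p_y))
```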

Unfortunately (and now we arrive at the rub), this is possible only for a handful of distributions. We shall explore this problem over the course of this chapter.

Where does the generative model come from?

…. [[some examples]] We have discussed the possibility of constructing generative models from “some idea of the generative process,” and in certain cases this includes even the numerical values of parameters; e.g., perhaps they come from a physical process. More frequently, we need to learn these parameters. This task will occupy other chapters in this book, but a basic distinction between learning tasks has implications for our representations themselves.

The distinction is whether or not we ever observe the “query” variables about which, ultimately, we shall make inferences. In one kind of problem, we at some time observe the query variables along with the emissions, i.e., we make observations $\{\bm{x}_n, \bm{y}_n\}_{n=1}^{N}$ from the data distribution $p(\bm{x}, \bm{y})$, and fit a model $\hat{p}(\bm{\hat{x}}, \bm{\hat{y}}; \bm{\theta})$ to these data. In the other kind of problem, we never observe a variable $\bm{X}$, and instead only ever observe $\{\bm{y}_n\}_{n=1}^{N}$ from the data marginal $p(\bm{y})$. In this case there is no $\bm{X}$ to speak of, only $\bm{\hat{X}}$. At first blush, it may seem somewhat mysterious why we would introduce into our model a variable which may have no counterpart in the real world. But such “latent” variables can simplify our observations, as seen most obviously in a mixture model, Fig. LABEL:fig:XXX. Although we never directly observe the value of this latent variable, it seems obvious that it is there. Other examples include [[spatiotemporally extended objects for images…]]
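
To make the appeal of a latent variable concrete, here is a minimal sketch of the generative process just described (a two-component Gaussian mixture with invented parameters): each observation is produced by first drawing an unobserved class and then drawing from that class's simple emission; only the $\bm{y}_n$ are kept, and their marginal distribution is bimodal even though each component is a plain Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-component mixture: X in {0, 1}, Y | X = k ~ N(mu[k], sigma[k]^2).
pi = np.array([0.7, 0.3])        # mixture weights, i.e., the prior over the latent class
mu = np.array([-1.0, 4.0])       # component means (invented)
sigma = np.array([1.0, 0.5])     # component standard deviations (invented)

N = 1000
x = rng.choice(2, size=N, p=pi)      # latent classes: drawn, but never "observed"
y = rng.normal(mu[x], sigma[x])      # observations: all we get to keep

# The dataset {y_n} is what a latent-variable model must explain; each
# component alone is simple, but the marginal over y is bimodal.
print("sample mean of y:", y.mean())
print("fraction generated by component 1:", x.mean())
```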

What is the implication for our representations? Latent-variable models will in general be less expressive than their otherwise equivalent, fully observed counterparts. This is because only certain aspects of the latent variable will ever be identifiable from the observed data. For example, consider a normally distributed latent variable, $\bm{\hat{X}}$, and an emission $\bm{\hat{Y}} \mid \bm{\hat{X}}$ that is normally distributed about an affine function of $\bm{\hat{X}}$. If the offset in that affine function is unknown and to be learned, there is no point in allowing the latent variable a non-zero mean. It provides one degree of freedom too many. More generally, latent-variable models raise questions of the identifiability of their parameters. We discuss these issues below.
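
The redundancy is easy to exhibit numerically. In the linear-Gaussian sketch below (dimensions and parameter values are invented; this is not a model fixed by the text), a non-zero latent mean can be absorbed into the emission offset without changing the marginal distribution of the observations, which is all the data can ever constrain.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative dimensions and parameters.
K, D = 2, 3                                     # latent and observed dimensions
W = rng.normal(size=(D, K))                     # emission loading matrix
b = rng.normal(size=D)                          # emission offset
Psi = np.diag(rng.uniform(0.5, 1.5, size=D))    # emission noise covariance
mu = np.array([2.0, -1.0])                      # a non-zero latent mean

def marginal(W, b, Psi, mu):
    """Marginal mean and covariance of Y when X ~ N(mu, I) and Y|X ~ N(Wx + b, Psi)."""
    return W @ mu + b, W @ W.T + Psi

# Parameterization 1: non-zero latent mean.
m1, C1 = marginal(W, b, Psi, mu)
# Parameterization 2: zero latent mean, with the offset absorbing W @ mu.
m2, C2 = marginal(W, b + W @ mu, Psi, np.zeros(K))

# Identical marginals over Y: the two parameterizations are indistinguishable from data.
print(np.allclose(m1, m2), np.allclose(C1, C2))
```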

Directed graphical models.

margin: Expand me. Introduce the three canonical graphs, figures, etc.

Of course, our model may involve more than just two random variables! Then it may become quite useful to express graphically the statistical dependencies among these variables. And indeed, when the dependencies are described, as here, in terms of probability distributions, we can use these distributions to parameterize a directed acyclic graph, each node corresponding to a random variable.

Now, by the chain rule of probability, the joint distribution can always be written as a product over $N$ conditional distributions (with marginal distributions as a special case), one for each of the $N$ variables in the joint. Thus a one-to-one-to-one relationship is established between nodes, random variables, and conditional distributions. The variable to the left of the vertical bar $|$ therefore determines the assignment of conditional distribution to node—and the variables to the right of the bar, for their part, determine the parents of that node in the graph. That is, for all $i$, the conditional distribution $\hat{p}(\hat{x}_i \mid \hat{x}_{\text{p}(i)})$ is assigned to node $i$, and the nodes in the set $\text{p}(i)$ are connected by directed edges (arrows) to node $i$. (Nodes with only marginal distributions have no parents.) margin: directed graphical model

There are two problems with this approach. First, if the conditional distributions were in fact adduced simply by naïve application of the chain rule of probability, then one node in the graph would have all the others as parents, another would have nearly all, etc. However, the fundamental practical fact about graphical models, as we shall see, is that they are only really useful when this doesn’t happen; indeed, when most of the arrows in the graph are missing. That raises the question: Given the semantics of these directed graphical models, when can we remove arrows? The second problem is that this graph doesn’t capture any of the statistical dependencies particular to our model! The chain rule applies equally to any joint distribution.

The problems are flip sides of the same coin and have a single solution. When there are conditional independencies among variables, some of the conditional distributions simplify: variables disappear from the right-hand side of the vertical bars. Given the rules for constructing the graph lately described, this corresponds to removing arrows from the graph. Thus, missing arrows in the graph represent (conditional) independence statements, and make inference possible, as we shall see.
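
As a small illustration (a toy joint over three binary variables with made-up tables, not an example from the text), the factorization $p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_2)$ parameterizes the graph $X_1 \rightarrow X_2 \rightarrow X_3$; the missing arrow from $X_1$ to $X_3$ encodes the statement that $X_1$ and $X_3$ are conditionally independent given $X_2$, which the sketch below verifies by brute force from the joint.

```python
import numpy as np

# Node distributions for the graph X1 -> X2 -> X3 (illustrative tables).
p1 = np.array([0.6, 0.4])                       # p(x1)
p2_given_1 = np.array([[0.9, 0.1],              # p(x2 | x1): rows indexed by x1
                       [0.3, 0.7]])
p3_given_2 = np.array([[0.2, 0.8],              # p(x3 | x2): rows indexed by x2
                       [0.5, 0.5]])

# The joint is the product of the node distributions (chain rule, with the
# missing-arrow simplification p(x3 | x1, x2) = p(x3 | x2)).
joint = (p1[:, None, None]
         * p2_given_1[:, :, None]
         * p3_given_2[None, :, :])              # shape (x1, x2, x3)
assert np.isclose(joint.sum(), 1.0)

# Check the conditional independence of X1 and X3 given X2 encoded by the missing arrow.
for x2 in (0, 1):
    cond = joint[:, x2, :] / joint[:, x2, :].sum()        # p(x1, x3 | x2)
    outer = np.outer(cond.sum(axis=1), cond.sum(axis=0))  # p(x1 | x2) p(x3 | x2)
    print("x2 =", x2, "factorizes:", np.allclose(cond, outer))
```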

Now, chain-rule decompositions are not unique, and so in practice it is unusual simply to write out the joint in terms of one of these and then to start thinking of what conditional independence statements apply. Instead, one typically proceeds from the other direction, by considering how certain random variables depend on others, constructing conditional distributions accordingly (and the graph along the way), and then finally multiplying them together to produce the joint.

Unsurprisingly, then, conditional (in)dependencies between any two (sets of) variables can be determined with a simple procedure on these directed graphical models. And (exact) inference amounts to some more or less clever application of Bayes's rule, Eq. 2.1, that exploits the graph structure. How easy or hard it is to apply the rule depends on both the structure of the graphical model and the nature of the distributions that parameterize it. In fine, exact inference with Bayes's rule is computationally feasible only for graphs (1) containing only small cliques (fully connected subgraphs) and (2) that are parameterized by distributions that play nicely with one another, i.e., yield tractable integrals under Bayes's rule. The latter group turns out, as we shall see, to be very small. These two considerations motivate attempts to approximate inference. We have encountered one of the fundamental trade-offs explored in this book: between the expressive power of the model and the exactness of the inference in that model.
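
For a feel of how graph structure tames the sums in Eq. 2.1, here is a sketch on a discrete Markov chain (the chain length, state count, and random transition tables are invented for illustration): summing the joint naïvely touches $K^{N}$ terms, while pushing each sum inside the product, which is exactly what the missing arrows license, costs roughly $NK^{2}$ operations and gives the same answer.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)

# Illustrative Markov chain: N discrete variables with K states each.
N, K = 8, 3
p0 = rng.dirichlet(np.ones(K))                                 # p(x_1)
T = [rng.dirichlet(np.ones(K), size=K) for _ in range(N - 1)]  # T[n][i, j] = p(x_{n+2}=j | x_{n+1}=i)

# Brute force: enumerate all K**N joint configurations.
marg_brute = np.zeros(K)
for config in product(range(K), repeat=N):
    p = p0[config[0]]
    for n in range(N - 1):
        p *= T[n][config[n], config[n + 1]]
    marg_brute[config[-1]] += p

# Structured: push each sum inside the product (about N * K**2 operations).
msg = p0.copy()
for n in range(N - 1):
    msg = msg @ T[n]          # sum out x_{n+1}, keeping only a length-K vector
marg_chain = msg

print(np.allclose(marg_brute, marg_chain))   # True: the same marginal over the last variable
```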

We shall discuss general inference algorithms after we have introduced undirected graphical models in Chapter 3. In the following sections, we start with very simple models for which exact inference is possible by simple application of Bayes's rule. But we shall quickly find reason to give up exact inference in return for more expressive models, in which case we shall use simple approximations.

In all cases in this chapter, we focus on models that can be written in terms of “source” variables $\bm{\hat{X}}$ that have no parents (although see below), and a single set of “emissions,” $\bm{\hat{Y}}$, i.e., their children in this graph. The goal of inference will be essentially “to invert the arrow” in the graph between $\bm{\hat{X}}$ and $\bm{\hat{Y}}$. To emphasize the connections between these models, all emissions will generally be normally distributed. To generate different models, we consider prior distributions that differ along two abstract “dimensions”: (1) sparsity; and (2) internal structure of statistical dependence—in particular, we consider source variables that form a Markov chain.

Naming conventions for the models are not wholly satisfactory. Often the most popular name associated with a model arose historically for the inference algorithm, or even the corresponding learning algorithm, rather than the model itself. Where possible, which is not always, I have supplied a name with some basis in the literature that describes the model itself.