Chapter 2 Directed Generative Models
Generative vs. discriminative models.
A generative model††margin: generative models specifies a joint distribution over all random variables of interest.
Now, what counts as a random variable, as opposed to a parameter, can itself be a decision for the modeler—at least for Bayesians (non-frequentists); but for now we set aside this question.
Instead we might wonder what circumstances could justify specifying less than the entire joint distribution.
[[See discussion in [21].]]
One such circumstance is the construction of maps, e.g., from observed variables to the variables we ultimately care about: for such a map, a conditional distribution—a discriminative model—suffices, and nothing need be said about the distribution of the inputs.
Perhaps now the case for discriminative learning of maps seems, not just plausible, but overwhelming.
When, then, is it helpful to model the joint distribution rather than merely a conditional?
We recall that Bayes’s “theorem”††margin: Bayes’s theorem is just a rearrangement of the definition of conditional probabilities: p(z, x) = p(z|x)p(x) = p(x|z)p(z). can be written

p(z|x) = p(x|z)p(z)/p(x) = p(x|z)p(z)/∫ p(x|z′)p(z′) dz′ ∝ p(x|z)p(z).    (2.1)
The last formulation is, although less explicit, also common.2
2 I have written Bayes’s theorem with the normalizer left implicit in the final, proportional formulation because, for some distributions, the posterior can be recognized from the numerator alone, up to this constant of proportionality.
Then again, for other distributions, the normalizer is essential—for example, if the numerator belongs to no family of distributions we can recognize, then p(x) must be computed outright.
Unfortunately (and now we arrive at the rub), this is possible only for a handful of distributions. We shall explore this problem over the course of this chapter.
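To make the point concrete, here is a minimal sketch of the lucky case—a conjugate pairing, for which the posterior can be recognized from the numerator of Bayes’s rule alone, the normalizer coming for free. The Beta–Bernoulli pairing and all numbers below are my own illustration, not an example from the text.

```python
import math

# Illustrative conjugate pairing: Beta prior on a coin's bias z, Bernoulli
# likelihood for the flips x. Conjugacy lets us read off the posterior
# without ever computing the normalizer p(x):
#   Beta(a, b) prior + k heads in n flips  ->  Beta(a + k, b + n - k) posterior.
a, b = 2.0, 2.0          # prior pseudo-counts
k, n = 7, 10             # observed heads, total flips

# Posterior recognized by conjugacy (normalizer never computed explicitly).
post_a, post_b = a + k, b + n - k

def beta_pdf(z, a, b):
    """Density of Beta(a, b) at z."""
    return z**(a - 1) * (1 - z)**(b - 1) * math.gamma(a + b) / (math.gamma(a) * math.gamma(b))

# Check against brute-force numeric normalization on a grid: normalizing
# the numerator p(x|z)p(z) by hand should reproduce Beta(post_a, post_b).
grid = [i / 1000 for i in range(1, 1000)]
unnorm = [z**k * (1 - z)**(n - k) * beta_pdf(z, a, b) for z in grid]
Z = sum(unnorm) * 0.001                       # numeric approximation of p(x)
numeric = [u / Z for u in unnorm]
exact = [beta_pdf(z, post_a, post_b) for z in grid]
assert max(abs(p - q) for p, q in zip(numeric, exact)) < 1e-2
```

For a non-conjugate prior, the final numeric-normalization step is the only option—and in more than a few dimensions, the grid becomes hopeless, which is the rub referred to above.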
Where does the generative model come from?
…. [[some examples]] We have discussed the possibility of constructing generative models from “some idea of the generative process,” and in certain cases this includes even the numerical values of parameters; e.g., perhaps they come from a physical process. More frequently, we need to learn these parameters. This task will occupy other chapters in this book, but a basic distinction between learning tasks has implications for our representations themselves.
The distinction is whether or not we ever observe the “query” variables about which, ultimately, we shall make inferences.
In one kind of problem, we at some time observe the query variables along with the emissions, i.e. we make observations of complete data. In the other, the query variables are latent: they are never observed directly, and everything we learn about them must come through the emissions.
What is the implication for our representations?
Latent-variable models will in general be less expressive than their otherwise equivalent, fully observed counterparts.
This is because only certain aspects of the latent variable will ever be identifiable from the observed data.
For example, consider a normally distributed latent variable feeding a linear-Gaussian emission: the latent’s mean and variance can be absorbed into the emission’s parameters, so we may as well fix the latent to be a standard normal.
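To see this non-identifiability concretely, here is a hypothetical one-dimensional linear-Gaussian sketch (all names and numbers below are illustrative, not from the text): two different settings of the latent’s mean and variance yield exactly the same marginal over the observed variable.

```python
# Hypothetical linear-Gaussian model:
#   z ~ Normal(mu, sigma^2),   x | z ~ Normal(w*z + b, tau^2)
# The marginal over the observed x is Normal(w*mu + b, w^2*sigma^2 + tau^2),
# so (mu, sigma) can be absorbed into (w, b): the latent's location and
# scale are not identifiable from observations of x alone.

def marginal(mu, sigma, w, b, tau):
    """Mean and variance of x after integrating out the latent z."""
    return w * mu + b, w**2 * sigma**2 + tau**2

# Original parameterization, with a non-standard latent.
m1 = marginal(mu=3.0, sigma=2.0, w=0.5, b=1.0, tau=0.1)

# Equivalent model with a *standard* normal latent: absorb mu and sigma
# into the emission parameters w and b.
m2 = marginal(mu=0.0, sigma=1.0, w=0.5 * 2.0, b=1.0 + 0.5 * 3.0, tau=0.1)

assert m1 == m2   # identical marginals: the data cannot distinguish them
```

Only the combinations w·μ + b and w²σ² + τ² ever touch the data, which is exactly the sense in which “only certain aspects of the latent variable” are identifiable.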
Directed graphical models.
††margin: Expand me. Introduce the three canonical graphs, figures, etc. Of course, our model may involve more than just two random variables! Then it may become quite useful to express graphically the statistical dependencies among these variables. And indeed, when the dependencies are described, as here, in terms of probability distributions, we can use these distributions to parameterize a directed acyclic graph, each node corresponding to a random variable.
Now, by the chain rule of probability, the joint distribution can always be written as a product of conditional distributions, one factor per variable: p(x1, …, xN) = p(x1) p(x2|x1) ⋯ p(xN|x1, …, xN−1).
There are two problems with this approach. First, if the conditional distributions were in fact adduced simply by naïve application of the chain rule of probability, then one node in the graph would have all the others as parents, another would have nearly all, etc. However, the fundamental practical fact about graphical models, as we shall see, is that they are only really useful when this doesn’t happen; indeed, when most of the arrows in the graph are missing. That raises the question: Given the semantics of these directed graphical models, when can we remove arrows? The second problem is that this graph doesn’t capture any of the statistical dependencies particular to our model! The chain rule applies equally to any joint distribution.
The problems are flip sides of the same coin and have a single solution.
When there are conditional independencies among variables, some of the conditional distributions simplify: variables disappear from the right-hand sides of the vertical bars, and, correspondingly, arrows disappear from the graph.
Now, chain-rule decompositions are not unique, and so in practice it is unusual simply to write out the joint in terms of one of these and then to start thinking of what conditional independence statements apply. Instead, one typically proceeds from the other direction, by considering how certain random variables depend on others, constructing conditional distributions accordingly (and the graph along the way), and then finally multiplying them together to produce the joint.
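That procedure—write one conditional per node, with only its parents to the right of the bar, and multiply—can be sketched on a toy three-variable graph. The rain/sprinkler/wet-grass structure and all probabilities below are invented for illustration, not taken from the text.

```python
from itertools import product

# Toy directed model: Rain and Sprinkler have no parents; WetGrass has both
# as parents. The joint is the product of one conditional per node -- the
# chain rule with the Rain -> Sprinkler arrow removed, i.e. with the
# conditional independence of Sprinkler from Rain built in.
p_rain = {True: 0.2, False: 0.8}
p_sprinkler = {True: 0.3, False: 0.7}
p_wet = {  # p(wet = True | rain, sprinkler)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.05,
}

def joint(r, s, w):
    """Product of the per-node conditionals."""
    pw = p_wet[(r, s)]
    return p_rain[r] * p_sprinkler[s] * (pw if w else 1 - pw)

# The factored product is a proper joint distribution...
total = sum(joint(r, s, w) for r, s, w in product([True, False], repeat=3))
assert abs(total - 1.0) < 1e-12

# ...and marginalizing out WetGrass recovers p(rain)*p(sprinkler),
# reflecting the missing arrow between the two parents.
p_rs = sum(joint(True, True, w) for w in [True, False])
assert abs(p_rs - 0.2 * 0.3) < 1e-12
```

Note that no conditional here has all preceding variables as parents—the missing arrow is precisely what makes the representation compact.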
Unsurprisingly, then, conditional (in)dependencies between any two (sets of) variables can be determined with a simple procedure on these directed graphical models. And (exact) inference amounts to some more or less clever application of Bayes’s rule, Eq. 2.1, that exploits the graph structure. How easy or hard it is to apply the rule depends on both the structure of the graphical model and the nature of the distributions that parameterize it. In fine, exact inference with Bayes’s rule is computationally feasible only for graphs (1) containing only small cliques (fully connected subgraphs) and (2) that are parameterized by distributions that play nicely with one another, i.e. yield tractable integrals under Bayes’s rule. The latter group turns out, as we shall see, to be very small. These two considerations motivate attempts to approximate inference. We have encountered one of the fundamental trade-offs explored in this book: between the expressive power of the model and the exactness of the inference in that model.
We shall discuss general inference algorithms after we have introduced undirected graphical models in Chapter 3. In the following sections, we start with very simple models for which exact inference is possible by simple application of Bayes’s rule. But we shall quickly find reason to give up exact inference in return for more expressive models, in which case we shall use simple approximations.
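As a preview of such simple exact inference, here is Bayes’s rule applied by brute-force enumeration to a two-variable discrete model. This is a hedged sketch: the states and probabilities are invented for illustration.

```python
# Exact inference by enumeration in a minimal discrete source -> emission
# model: z in {0, 1} is the source; x in {0, 1} is the emission.
prior = {0: 0.6, 1: 0.4}                      # p(z)
lik = {(0, 0): 0.9, (0, 1): 0.1,              # p(x | z)
       (1, 0): 0.2, (1, 1): 0.8}

def posterior(x):
    """Bayes's rule: p(z|x) = p(x|z)p(z) / sum_z' p(x|z')p(z')."""
    unnorm = {z: lik[(z, x)] * prior[z] for z in prior}
    Z = sum(unnorm.values())                  # the normalizer p(x)
    return {z: u / Z for z, u in unnorm.items()}

post = posterior(1)                           # post[1] = 0.32/0.38 ≈ 0.842
assert abs(sum(post.values()) - 1.0) < 1e-12
```

With only two source states, the normalizer is a two-term sum; the difficulty discussed above arises when this sum becomes an exponentially large sum or an intractable integral.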
In all cases in this chapter, we focus on models that can be written in terms of “source” variables, whose samples give rise, in turn, to the observed “emissions.”
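Sampling from any such model follows the direction of the arrows—so-called ancestral sampling: draw the source from its prior, then the emission from its conditional. A minimal sketch, with Bernoulli parameters invented for illustration:

```python
import random

# Ancestral sampling from a source -> emission model: sample the source
# from its prior, then the emission from its conditional -- exactly the
# order of the factorization p(z, x) = p(z) p(x|z).
random.seed(0)

def sample():
    z = random.random() < 0.4          # source: Bernoulli(0.4)
    p_x = 0.8 if z else 0.1            # emission: Bernoulli(0.8) or Bernoulli(0.1)
    x = random.random() < p_x
    return z, x

draws = [sample() for _ in range(100_000)]
frac_z = sum(z for z, _ in draws) / len(draws)
assert abs(frac_z - 0.4) < 0.015       # empirical source rate matches the prior
```

Note that sampling is easy in exactly the direction in which inference (recovering z from x) is hard.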
Naming conventions for the models are not wholly satisfactory. Often the most popular name associated with a model arose historically for the inference algorithm, or even the corresponding learning algorithm, rather than the model itself. Where possible, which is not always, I have supplied a name with some basis in the literature that describes the model itself.