3.1 The exponential-family harmonium

In Chapter 2, we encountered a direct trade-off between the expressivity of the model emission distribution, $\hat{p}\left(\bm{\hat{y}}\middle|\bm{\hat{x}};\bm{\theta}\right)$, and the model posterior, $\hat{p}\left(\bm{\hat{x}}\middle|\bm{\hat{y}};\bm{\theta}\right)$, imposed by Bayes’s theorem. In particular, applying Bayes’s theorem requires integrating the product of the emission and prior ($\hat{p}\left(\bm{\hat{x}};\bm{\theta}\right)$) densities across all configurations of the latent variables, $\bm{\hat{X}}$. For continuous-valued $\bm{\hat{X}}$, this integral is tractable only for specially selected prior and emission densities. For discrete-valued $\bm{\hat{X}}$, the integral becomes a sum, which is computationally feasible only for low-dimensional $\bm{\hat{X}}$: the number of summands is exponential in $\dim\left(\bm{\hat{X}}\right)$.

Suppose instead that we simply declared at the outset our two (so far) desiderata: easily computable emission and posterior distributions. Of course, not every pair of such distributions will be compatible, but perhaps if we start with some very general form for these distributions, we can subsequently determine what restrictions are required for their consistency. In so doing, we shall have derived a rather general undirected graphical model known as the exponential-family harmonium (EFH) [53]. In fact, the EFH was derived as a generalization of the famous restricted Boltzmann machine (RBM) [45], but we shall approach from the other end and present the RBM as a special case of the EFH.

Deriving the joint from two coupled, exponential-family conditionals.

We shall not take the emission and posterior distributions to be fully general, but assume instead that they belong to exponential families. Note that these need not be the same exponential family; indeed, the several elements of (e.g.) $\bm{\hat{X}}$ need not even belong to the same one. Nevertheless, we can specify that the two distributions have the forms

\begin{split}\hat{p}\left(\bm{\hat{x}}\middle|\bm{\hat{y}};\bm{\theta}\right)&=h(\bm{\hat{x}})\exp\left\{\bm{\eta}(\bm{\hat{y}})^{\text{T}}\bm{T}(\bm{\hat{x}})-A(\bm{\eta}(\bm{\hat{y}}))\right\},\\
\hat{p}\left(\bm{\hat{y}}\middle|\bm{\hat{x}};\bm{\theta}\right)&=k(\bm{\hat{y}})\exp\left\{\bm{\zeta}(\bm{\hat{x}})^{\text{T}}\bm{U}(\bm{\hat{y}})-B(\bm{\zeta}(\bm{\hat{x}}))\right\}.\end{split}

Thus, (functions of) $\bm{\hat{x}}$ and $\bm{\hat{y}}$ interact with each other only through an inner product.

Now, the ratio of the conditionals is also the ratio of the marginals,

\frac{\hat{p}\left(\bm{\hat{y}}\middle|\bm{\hat{x}};\bm{\theta}\right)}{\hat{p}\left(\bm{\hat{x}}\middle|\bm{\hat{y}};\bm{\theta}\right)}=\frac{\hat{p}\left(\bm{\hat{y}};\bm{\theta}\right)}{\hat{p}\left(\bm{\hat{x}};\bm{\theta}\right)}=\frac{k(\bm{\hat{y}})\exp\left\{A(\bm{\eta}(\bm{\hat{y}}))\right\}}{h(\bm{\hat{x}})\exp\left\{B(\bm{\zeta}(\bm{\hat{x}}))\right\}}\exp\left\{\bm{\zeta}(\bm{\hat{x}})^{\text{T}}\bm{U}(\bm{\hat{y}})-\bm{\eta}(\bm{\hat{y}})^{\text{T}}\bm{T}(\bm{\hat{x}})\right\},

but we know an additional fact about this ratio: because it equals the ratio of the marginals, it must factor entirely into pieces that refer to at most one of $\bm{\hat{x}}$ or $\bm{\hat{y}}$. The rational factors look fine, but the exponential factor requires that

\bm{\zeta}(\bm{\hat{x}})^{\text{T}}\bm{U}(\bm{\hat{y}})-\bm{\eta}(\bm{\hat{y}})^{\text{T}}\bm{T}(\bm{\hat{x}})=\mu(\bm{\hat{x}})-\nu(\bm{\hat{y}}),

(3.1)

for some functions $\mu$ and $\nu$. It can be shown (see the proof below) that, under some mild conditions, this requires each distribution’s natural parameters to be an affine function of the other distribution’s sufficient statistics,

\begin{split}\bm{\eta}(\bm{\hat{y}})&=\bm{b}_{\hat{x}}+\mathbf{W}_{\hat{y}\hat{x}}\bm{U}(\bm{\hat{y}})\\
\bm{\zeta}(\bm{\hat{x}})&=\bm{b}_{\hat{y}}+\mathbf{W}_{\hat{y}\hat{x}}^{\text{T}}\bm{T}(\bm{\hat{x}}),\end{split}

with a shared, albeit transposed, linear transformation $\mathbf{W}_{\hat{y}\hat{x}}$. With these substitutions the interaction terms in the ratio cancel, leaving only the bias terms, so the marginal distributions are (up to proportionality constants)

\begin{split}\hat{p}\left(\bm{\hat{x}};\bm{\theta}\right)&\propto h(\bm{\hat{x}})\exp\left\{\bm{b}_{\hat{x}}^{\text{T}}\bm{T}(\bm{\hat{x}})+B(\bm{\zeta}(\bm{\hat{x}}))\right\},\\
\hat{p}\left(\bm{\hat{y}};\bm{\theta}\right)&\propto k(\bm{\hat{y}})\exp\left\{\bm{b}_{\hat{y}}^{\text{T}}\bm{U}(\bm{\hat{y}})+A(\bm{\eta}(\bm{\hat{y}}))\right\};\end{split}

and the conditional distributions are

\begin{split}\hat{p}\left(\bm{\hat{x}}\middle|\bm{\hat{y}};\bm{\theta}\right)&=h(\bm{\hat{x}})\exp\left\{\left(\bm{b}_{\hat{x}}+\mathbf{W}_{\hat{y}\hat{x}}\bm{U}(\bm{\hat{y}})\right)^{\text{T}}\bm{T}(\bm{\hat{x}})-A(\bm{\eta}(\bm{\hat{y}}))\right\},\\
\hat{p}\left(\bm{\hat{y}}\middle|\bm{\hat{x}};\bm{\theta}\right)&=k(\bm{\hat{y}})\exp\left\{\left(\bm{b}_{\hat{y}}+\mathbf{W}_{\hat{y}\hat{x}}^{\text{T}}\bm{T}(\bm{\hat{x}})\right)^{\text{T}}\bm{U}(\bm{\hat{y}})-B(\bm{\zeta}(\bm{\hat{x}}))\right\}.\end{split}

Multiplying a conditional by the appropriate marginal yields the joint distribution:

\begin{split}\hat{p}\left(\bm{\hat{x}},\bm{\hat{y}};\bm{\theta}\right)&=\hat{p}\left(\bm{\hat{x}}\middle|\bm{\hat{y}};\bm{\theta}\right)\hat{p}\left(\bm{\hat{y}};\bm{\theta}\right)\\
&\propto h(\bm{\hat{x}})k(\bm{\hat{y}})\exp\left\{\bm{b}_{\hat{y}}^{\text{T}}\bm{U}(\bm{\hat{y}})+\bm{b}_{\hat{x}}^{\text{T}}\bm{T}(\bm{\hat{x}})+\bm{U}(\bm{\hat{y}})^{\text{T}}\mathbf{W}_{\hat{y}\hat{x}}^{\text{T}}\bm{T}(\bm{\hat{x}})\right\}.\end{split}

Thus the joint takes the form of a Boltzmann distribution with negative energy

-E(\bm{\hat{x}},\bm{\hat{y}},\bm{\theta})=\bm{b}_{\hat{y}}^{\text{T}}\bm{U}(\bm{\hat{y}})+\bm{b}_{\hat{x}}^{\text{T}}\bm{T}(\bm{\hat{x}})+\bm{U}(\bm{\hat{y}})^{\text{T}}\mathbf{W}_{\hat{y}\hat{x}}^{\text{T}}\bm{T}(\bm{\hat{x}})+\log\left(h(\bm{\hat{x}})k(\bm{\hat{y}})\right).
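To make the energy concrete, the following is a minimal sketch for one simple special case, assuming binary units with $\bm{T}(\bm{\hat{x}})=\bm{\hat{x}}$, $\bm{U}(\bm{\hat{y}})=\bm{\hat{y}}$, and $h=k=1$ (a binary restricted Boltzmann machine). The parameter values and function names are hypothetical, chosen only for illustration. The sketch checks that the closed-form posterior, a product of independent Bernoulli distributions with natural parameters $\bm{b}_{\hat{x}}+\mathbf{W}_{\hat{y}\hat{x}}\bm{\hat{y}}$, agrees with the posterior obtained by normalizing the Boltzmann joint over all latent configurations.

```python
import itertools
import math

# Hypothetical toy parameters: 2 latent units x, 3 observed units y.
b_x = [0.3, -0.5]
b_y = [0.1, 0.4, -0.2]
W = [[0.7, -1.1, 0.2],   # W[i][j] couples x_i with y_j (n_x rows, n_y columns)
     [-0.4, 0.9, 0.5]]


def neg_energy(x, y):
    """-E(x, y) = b_y^T y + b_x^T x + y^T W^T x, assuming h = k = 1."""
    bias = sum(b * v for b, v in zip(b_x, x)) + sum(b * v for b, v in zip(b_y, y))
    coupling = sum(W[i][j] * x[i] * y[j]
                   for i in range(len(x)) for j in range(len(y)))
    return bias + coupling


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def posterior_closed_form(x, y):
    """p(x | y) from eta(y) = b_x + W y: a product of independent Bernoullis."""
    prob = 1.0
    for i, x_i in enumerate(x):
        eta_i = b_x[i] + sum(W[i][j] * y[j] for j in range(len(y)))
        p_on = sigmoid(eta_i)
        prob *= p_on if x_i == 1 else 1.0 - p_on
    return prob


def posterior_brute_force(x, y):
    """p(x | y) by normalizing exp(-E) over every latent configuration."""
    states = itertools.product([0, 1], repeat=len(b_x))
    norm = sum(math.exp(neg_energy(s, y)) for s in states)
    return math.exp(neg_energy(x, y)) / norm


# Largest disagreement between the two posteriors over all (x, y) pairs.
max_gap = max(
    abs(posterior_closed_form(x, y) - posterior_brute_force(x, y))
    for x in itertools.product([0, 1], repeat=len(b_x))
    for y in itertools.product([0, 1], repeat=len(b_y))
)
print(f"largest discrepancy: {max_gap:.2e}")
```

The brute-force version enumerates all latent configurations, which is affordable here only because there are two latent units; the closed-form version needs one pass over the units.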

The price of trivial inference.

We can now reckon the cost at which our closed-form posterior distribution was bought. We have traded an intractable posterior-distribution normalizer for an intractable joint-distribution normalizer. The normalizer for the marginal distribution $\hat{p}\left(\bm{\hat{y}};\bm{\theta}\right)$ is still intractable, as it is for many directed models, but now so is the normalizer for the prior distribution, $\hat{p}\left(\bm{\hat{x}};\bm{\theta}\right)$.
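The cost can be seen in a toy setting. The sketch below again assumes binary units with identity sufficient statistics and $h=k=1$; the parameters and helper names are hypothetical. It computes the normalizer in two ways: by summing the Boltzmann joint over every configuration, and by summing the unnormalized marginal $k(\bm{\hat{y}})\exp\{\bm{b}_{\hat{y}}^{\text{T}}\bm{U}(\bm{\hat{y}})+A(\bm{\eta}(\bm{\hat{y}}))\}$ over $\bm{\hat{y}}$ alone. The second route sums out the latent variables analytically, yet the number of remaining terms is still exponential in the number of observed units.

```python
import itertools
import math

# Hypothetical toy parameters: binary units, h = k = 1, T(x) = x, U(y) = y.
b_x = [0.3, -0.5]
b_y = [0.1, 0.4, -0.2]
W = [[0.7, -1.1, 0.2],   # W[i][j] couples x_i with y_j
     [-0.4, 0.9, 0.5]]
n_x, n_y = len(b_x), len(b_y)


def binary_states(n):
    """All 2**n configurations of n binary units."""
    return list(itertools.product([0, 1], repeat=n))


def neg_energy(x, y):
    """-E(x, y) = b_y^T y + b_x^T x + y^T W^T x."""
    bias = sum(b * v for b, v in zip(b_x, x)) + sum(b * v for b, v in zip(b_y, y))
    coupling = sum(W[i][j] * x[i] * y[j] for i in range(n_x) for j in range(n_y))
    return bias + coupling


def unnormalized_marginal(y):
    """exp{b_y^T y + A(eta(y))}, with A(eta) = sum_i log(1 + exp(eta_i))."""
    eta = [b_x[i] + sum(W[i][j] * y[j] for j in range(n_y)) for i in range(n_x)]
    log_partition = sum(math.log1p(math.exp(e)) for e in eta)
    return math.exp(sum(b * v for b, v in zip(b_y, y)) + log_partition)


# Summing the joint over everything costs 2**(n_x + n_y) terms.
Z_joint = sum(math.exp(neg_energy(x, y))
              for x in binary_states(n_x) for y in binary_states(n_y))

# Summing out x analytically still leaves 2**n_y terms for the marginal.
Z_marginal = sum(unnormalized_marginal(y) for y in binary_states(n_y))

print(f"Z via joint:    {Z_joint:.6f} ({2 ** (n_x + n_y)} terms)")
print(f"Z via marginal: {Z_marginal:.6f} ({2 ** n_y} terms)")
```

With three observed units the sum has eight terms; with a few hundred it would be out of reach, which is the intractability referred to above.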

…

Enforcing consistency between exponential-family emission and posterior distributions.

We saw above that when the emission and posterior distributions are both in exponential families, the natural parameters are constrained by Eq. 3.1. To simplify the presentation, we repeat the constraint here (with the vector-valued functions named alphabetically):

\mu(\bm{\hat{x}})-\nu(\bm{\hat{y}})=\bm{\gamma}(\bm{\hat{x}})^{\text{T}}\bm{\delta}(\bm{\hat{y}})-\bm{\beta}(\bm{\hat{y}})^{\text{T}}\bm{\alpha}(\bm{\hat{x}}).

(3.2)

It is intuitive that this equation constrains the natural parameters (here, $\bm{\gamma}(\bm{\hat{x}})$ and $\bm{\beta}(\bm{\hat{y}})$): no $\bm{\hat{x}}$-$\bm{\hat{y}}$ interaction terms appear on the left-hand side, so those generated on the right must cancel. This is particularly restrictive since the interactions are created only through inner products. For example, if $\bm{\delta}(\bm{\hat{y}})$ contains only terms quadratic in the elements of $\bm{\hat{y}}$, then $\bm{\beta}(\bm{\hat{y}})$ must contain such terms as well, in order to cancel them (except in the trivial case where $\bm{\gamma}(\bm{\hat{x}})$ is constant).
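A scalar instance may help fix ideas. The particular functions below are assumptions chosen for illustration and are not taken from the text: let $\hat{x}$ and $\hat{y}$ be scalars, with $\gamma(\hat{x})=\hat{x}$, $\delta(\hat{y})=\hat{y}^{2}$, and $\alpha(\hat{x})=\hat{x}$. The right-hand side of Eq. 3.2 is then

\gamma(\hat{x})\delta(\hat{y})-\beta(\hat{y})\alpha(\hat{x})=\hat{x}\left(\hat{y}^{2}-\beta(\hat{y})\right),

which is free of interaction terms only if $\hat{y}^{2}-\beta(\hat{y})$ is a constant, say $-b$. Hence $\beta(\hat{y})=\hat{y}^{2}+b=\delta(\hat{y})+b$: the function $\beta$ must be affine in $\delta$, and the right-hand side reduces to $-b\hat{x}$, a function of $\hat{x}$ alone.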

Let all the functions be polynomials in $\bm{\hat{x}}$ and $\bm{\hat{y}}$ of maximum degree $D$, and define the monomial bases

\begin{split}\bm{v}_{\hat{y}}&:=\left[\hat{y}_{1},\hat{y}_{2},\ldots,\hat{y}_{1}^{2},\hat{y}_{1}\hat{y}_{2},\hat{y}_{1}\hat{y}_{3},\ldots,\hat{y}_{K}^{D}\right]^{\text{T}}\\
\bm{v}_{\hat{x}}&:=\left[\hat{x}_{1},\hat{x}_{2},\ldots,\hat{x}_{1}^{2},\hat{x}_{1}\hat{x}_{2},\hat{x}_{1}\hat{x}_{3},\ldots,\hat{x}_{K}^{D}\right]^{\text{T}}.\end{split}

(Notice that we have omitted the constants from these bases.) For appropriately shaped matrices ($\mathbf{A},\mathbf{B},\mathbf{C},\mathbf{D}$), vectors ($\bm{a},\bm{b},\bm{c},\bm{d},\bm{m},\bm{n}$), and constant ($k$), Eq. 3.2 is equivalent to the equation below, in which the matrix $\mathbf{D}$ is not to be confused with the maximum degree $D$:

\begin{split}\bm{m}^{\text{T}}\bm{v}_{\hat{x}}-\bm{n}^{\text{T}}\bm{v}_{\hat{y}}+k&=\left(\bm{c}+\mathbf{C}\bm{v}_{\hat{x}}\right)^{\text{T}}\left(\bm{d}+\mathbf{D}\bm{v}_{\hat{y}}\right)-\left(\bm{b}+\mathbf{B}\bm{v}_{\hat{y}}\right)^{\text{T}}\left(\bm{a}+\mathbf{A}\bm{v}_{\hat{x}}\right)\\
&=\bm{v}_{\hat{x}}^{\text{T}}\left(\mathbf{C}^{\text{T}}\mathbf{D}-\mathbf{A}^{\text{T}}\mathbf{B}\right)\bm{v}_{\hat{y}}+\bm{v}_{\hat{x}}^{\text{T}}\left(\mathbf{C}^{\text{T}}\bm{d}-\mathbf{A}^{\text{T}}\bm{b}\right)+\left(\bm{c}^{\text{T}}\mathbf{D}-\bm{a}^{\text{T}}\mathbf{B}\right)\bm{v}_{\hat{y}}+\left(\bm{c}^{\text{T}}\bm{d}-\bm{a}^{\text{T}}\bm{b}\right)\end{split}

holding for all values of $\bm{v}_{\hat{x}}$ and $\bm{v}_{\hat{y}}$. Matching coefficients on the two sides therefore gives

\begin{split}k&=\bm{c}^{\text{T}}\bm{d}-\bm{a}^{\text{T}}\bm{b}\\
\bm{m}&=\mathbf{C}^{\text{T}}\bm{d}-\mathbf{A}^{\text{T}}\bm{b}\\
-\bm{n}^{\text{T}}&=\bm{c}^{\text{T}}\mathbf{D}-\bm{a}^{\text{T}}\mathbf{B}\\
0&=\mathbf{C}^{\text{T}}\mathbf{D}-\mathbf{A}^{\text{T}}\mathbf{B}.\end{split}

(3.3)

We shall only make use of the last of these, Eq. 3.3.

Now assume $\mathbf{A}$ and $\mathbf{D}$ are “fat” (that is, $D\geq K$: the monomial bases $\bm{v}_{\hat{x}}$ and $\bm{v}_{\hat{y}}$ have at least as many elements as the vector-valued functions $\bm{\alpha}$ and $\bm{\delta}$, resp.) and have linearly independent rows. Then there exists a (tall) right pseudo-inverse for $\mathbf{A}$, call it $\mathbf{A}^{\dagger}$, such that $\mathbf{A}\mathbf{A}^{\dagger}=\mathbf{I}$; and a (tall) right pseudo-inverse for $\mathbf{D}$, call it $\mathbf{D}^{\dagger}$, such that $\mathbf{D}\mathbf{D}^{\dagger}=\mathbf{I}$. It follows immediately from the last of Eq. 3.3 that

\begin{split}\mathbf{C}^{\text{T}}&=\mathbf{A}^{\text{T}}\mathbf{B}\mathbf{D}^{\dagger},\qquad\mathbf{B}=\left(\mathbf{C}\mathbf{A}^{\dagger}\right)^{\text{T}}\mathbf{D}\\
\implies\left(\mathbf{C}\mathbf{A}^{\dagger}\right)^{\text{T}}&=\mathbf{B}\mathbf{D}^{\dagger}=:\mathbf{W}\\
\implies\mathbf{C}^{\text{T}}&=\mathbf{A}^{\text{T}}\mathbf{W},\qquad\mathbf{B}=\mathbf{W}\mathbf{D},\end{split}

(3.4)

where on the second line we have defined a new matrix $\mathbf{W}$. This allows us to rewrite the functions $\bm{\gamma}(\bm{\hat{x}})$ and $\bm{\beta}(\bm{\hat{y}})$ in terms of $\bm{\alpha}(\bm{\hat{x}})$ and $\bm{\delta}(\bm{\hat{y}})$ (resp.):

\begin{split}\bm{\gamma}(\bm{\hat{x}})&=\mathbf{C}\bm{v}_{\hat{x}}+\bm{c}\\
&=\mathbf{W}^{\text{T}}\mathbf{A}\bm{v}_{\hat{x}}+\bm{c}\\
&=\mathbf{W}^{\text{T}}\left(\mathbf{A}\bm{v}_{\hat{x}}+\bm{a}\right)+\left(\bm{c}-\mathbf{W}^{\text{T}}\bm{a}\right)\\
&=\mathbf{W}^{\text{T}}\bm{\alpha}(\bm{\hat{x}})+\left(\bm{c}-\mathbf{W}^{\text{T}}\bm{a}\right)\\
\bm{\beta}(\bm{\hat{y}})&=\mathbf{B}\bm{v}_{\hat{y}}+\bm{b}\\
&=\mathbf{W}\mathbf{D}\bm{v}_{\hat{y}}+\bm{b}\\
&=\mathbf{W}\left(\mathbf{D}\bm{v}_{\hat{y}}+\bm{d}\right)+\left(\bm{b}-\mathbf{W}\bm{d}\right)\\
&=\mathbf{W}\bm{\delta}(\bm{\hat{y}})+\left(\bm{b}-\mathbf{W}\bm{d}\right).\end{split}

(3.5)
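As a quick consistency check on Eq. 3.5, one can substitute the affine forms back into the right-hand side of Eq. 3.2. The shorthands $\tilde{\bm{c}}:=\bm{c}-\mathbf{W}^{\text{T}}\bm{a}$ and $\tilde{\bm{b}}:=\bm{b}-\mathbf{W}\bm{d}$ are introduced here only for brevity and are not part of the original notation:

\begin{split}\bm{\gamma}(\bm{\hat{x}})^{\text{T}}\bm{\delta}(\bm{\hat{y}})-\bm{\beta}(\bm{\hat{y}})^{\text{T}}\bm{\alpha}(\bm{\hat{x}})&=\left(\mathbf{W}^{\text{T}}\bm{\alpha}(\bm{\hat{x}})+\tilde{\bm{c}}\right)^{\text{T}}\bm{\delta}(\bm{\hat{y}})-\left(\mathbf{W}\bm{\delta}(\bm{\hat{y}})+\tilde{\bm{b}}\right)^{\text{T}}\bm{\alpha}(\bm{\hat{x}})\\
&=\bm{\alpha}(\bm{\hat{x}})^{\text{T}}\mathbf{W}\bm{\delta}(\bm{\hat{y}})-\bm{\delta}(\bm{\hat{y}})^{\text{T}}\mathbf{W}^{\text{T}}\bm{\alpha}(\bm{\hat{x}})+\tilde{\bm{c}}^{\text{T}}\bm{\delta}(\bm{\hat{y}})-\tilde{\bm{b}}^{\text{T}}\bm{\alpha}(\bm{\hat{x}})\\
&=\tilde{\bm{c}}^{\text{T}}\bm{\delta}(\bm{\hat{y}})-\tilde{\bm{b}}^{\text{T}}\bm{\alpha}(\bm{\hat{x}}).\end{split}

The two bilinear terms are transposes of the same scalar and so cancel, leaving a function of $\bm{\hat{y}}$ alone plus a function of $\bm{\hat{x}}$ alone, as the left-hand side of Eq. 3.2 demands. Matching names with the main derivation ($\bm{\alpha}\leftrightarrow\bm{T}$, $\bm{\delta}\leftrightarrow\bm{U}$, $\bm{\beta}\leftrightarrow\bm{\eta}$, $\bm{\gamma}\leftrightarrow\bm{\zeta}$), the matrix $\mathbf{W}$ plays the role of $\mathbf{W}_{\hat{y}\hat{x}}$, while $\tilde{\bm{b}}$ and $\tilde{\bm{c}}$ play the roles of $\bm{b}_{\hat{x}}$ and $\bm{b}_{\hat{y}}$, respectively.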

In short, $\bm{\gamma}(\bm{\hat{x}})$ is an affine function of $\bm{\alpha}(\bm{\hat{x}})$, and $\bm{\beta}(\bm{\hat{y}})$ is an affine function of $\bm{\delta}(\bm{\hat{y}})$; and the linear transformations are transposes of each other.
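The pseudo-inverse argument can also be checked numerically. The sketch below is illustrative only: the matrices are hypothetical, each vector-valued function is assumed to have two components and each monomial basis three elements, and the right pseudo-inverse is computed with the standard formula $\mathbf{M}^{\dagger}=\mathbf{M}^{\text{T}}(\mathbf{M}\mathbf{M}^{\text{T}})^{-1}$, which the text does not state explicitly. Starting from a known $\mathbf{W}$, it builds $\mathbf{B}$ and $\mathbf{C}$ so that the last of Eq. 3.3 holds, and then recovers $\mathbf{W}$ via both expressions on the second line of Eq. 3.4.

```python
def matmul(P, Q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def transpose(P):
    return [list(col) for col in zip(*P)]


def inverse_2x2(M):
    (p, q), (r, s) = M
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]


def right_pinv(M):
    """Right pseudo-inverse M^T (M M^T)^{-1} of a 2-row matrix with independent rows."""
    Mt = transpose(M)
    return matmul(Mt, inverse_2x2(matmul(M, Mt)))


def max_abs_diff(P, Q):
    return max(abs(p - q) for rp, rq in zip(P, Q) for p, q in zip(rp, rq))


# Hypothetical "fat" coefficient matrices (2 by 3) with linearly independent rows.
A = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
D = [[2.0, 1.0, 0.0], [1.0, 0.0, 3.0]]
W_true = [[0.5, -1.5], [2.0, 0.25]]

# Build B and C so that the constraint C^T D = A^T B holds by construction.
B = matmul(W_true, D)              # B = W D
C = matmul(transpose(W_true), A)   # C = W^T A, i.e. C^T = A^T W

# Recover W in the two ways given on the second line of Eq. 3.4.
W_from_B = matmul(B, right_pinv(D))             # B D^dagger
W_from_C = transpose(matmul(C, right_pinv(A)))  # (C A^dagger)^T

print("constraint residual:", max_abs_diff(matmul(transpose(C), D),
                                            matmul(transpose(A), B)))
print("W recovery error:   ", max(max_abs_diff(W_from_B, W_true),
                                  max_abs_diff(W_from_C, W_true)))
```

The check relies on $\mathbf{A}\mathbf{A}^{\dagger}=\mathbf{I}$ and $\mathbf{D}\mathbf{D}^{\dagger}=\mathbf{I}$, which is why the rows of the fat matrices must be linearly independent.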