4.2 Minimizing relative entropy and maximizing likelihood

So we have specified the goal of learning, $\hat{p}(\bm{\hat{y}};\bm{\theta})={p(\bm{y})}$, but on the other hand conceded that it may not be achievable. What we need, given the view just sketched of learning as optimization, is a notion of distance from this goal. Or in other words, if learning is “to make the model distribution, $\hat{p}(\bm{\hat{y}};\bm{\theta})$, more like the data distribution, $p(\bm{y})$,” we need to operationalize “more like.” To do so, we take a somewhat informal detour through elementary information theory.

A guessing game.

Suppose we play a game in which I hide a marble under one of four cups, and you have to determine which by asking a series of yes/no questions. Every time we play, I select the destination cup according to the probabilities pp which, suppose, are

\[
p(\bm{y}) =
\begin{array}{|c|c|c|c|}
\hline
\text{A} & \text{B} & \text{C} & \text{D}\\
\hline
1/2 & 1/4 & 1/8 & 1/8\\
\hline
\end{array}
\qquad\text{(4.1)}
\]

and which you know. Your strategy is to partition the probability space into halves recursively. (Of course, this may not be feasible—e.g., if cup A had probability $1/3$—but we will handle such cases later. We will also address below whether your strategy is any good.) Thus, each of your questions eliminates half the probability space, and it takes $n$ questions to reduce the space to $(\frac{1}{2})^{n}$ of its original size. Put the other way around, to reduce the space to fraction $q$ takes $-\log_{2}q$ questions. margin: flow chart showing guessing outcomes

How many questions will it take on average (across all games)? Here, for example, you would first ask, “Is it cup A?” Half the time you would be right, so half the time you would locate the marble in one guess ($(1/2)(1)$). If not, you would ask, “Is it cup B?” in which case you would again be right half the time you asked the question—i.e., half of the half of the time you played the game, at which point you would have asked two questions ($(1/4)(2)$). Finally, in the $1/4$ of the games that you received two “no”s, you would ask “Is it cup C?” and from either answer deduce the location of the marble ($(1/8)(3)+(1/8)(3)$). Totaling up the average number of questions yields

\[
\begin{split}
\sum_{y}-p(y)\log_{2}p(y) &= -\frac{1}{2}\log_{2}\frac{1}{2}-\frac{1}{4}\log_{2}\frac{1}{4}-\frac{1}{8}\log_{2}\frac{1}{8}-\frac{1}{8}\log_{2}\frac{1}{8}\\
&= \frac{1}{2}(1)+\frac{1}{4}(2)+\frac{1}{8}(3)+\frac{1}{8}(3)=1.75.
\end{split}
\]

It must be emphasized that the logarithms’ arguments are consequences of your guessing strategy, but their coefficients are the result of my behavior across repeated plays of the game. Intuitively, you are allocating questions to each outcome in proportion to how surprising it would be, $-\log_{2}p(y)$. The average number of questions you must ask is therefore equal to how surprised you are on average across games.
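This bookkeeping is easy to check numerically. Below is a minimal Python sketch of ours (not part of the text): it computes the average number of questions, i.e. the expected surprisal in bits, for the hiding distribution of Eq. 4.1.

```python
import math

# Hiding probabilities from Eq. 4.1.
p = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}

def avg_questions(dist):
    """Average surprisal in bits: sum over outcomes of -p(y) * log2 p(y)."""
    return sum(-prob * math.log2(prob) for prob in dist.values())

print(avg_questions(p))  # 1.75
```

The coefficients (my hiding frequencies) and the logarithms (your question counts) enter the sum exactly as in the displayed equation.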

Now perhaps I think you’re winning the game too often, so I try to push the average number of questions up by choosing a new hiding strategy:

\[
q(\bm{z}) =
\begin{array}{|c|c|c|c|}
\hline
\text{A} & \text{B} & \text{C} & \text{D}\\
\hline
1/4 & 1/4 & 1/4 & 1/4\\
\hline
\end{array}.
\qquad\text{(4.2)}
\]

Intuitively, by making the “odds more even,” I have made the problem harder—outcomes are overall more surprising because (e.g.) there is no single obvious answer, like A in the first example. And indeed, even though you tailor your guessing strategy to these new odds, you must on average ask more questions to locate the marble:

\[
\sum_{z}-q(z)\log_{2}q(z) = -\frac{1}{4}\log_{2}\frac{1}{4}-\frac{1}{4}\log_{2}\frac{1}{4}-\frac{1}{4}\log_{2}\frac{1}{4}-\frac{1}{4}\log_{2}\frac{1}{4} = 2.0.
\]

Information entropy.

It should be intuitive that, by hiding the marble “more randomly,” I have increased the number of yes/no questions you must ask to locate it. In a rather different context, Shannon famously sought a mathematical expression for the “randomness” or “uncertainty” of a random variable, $\bm{Y}$, in terms of its probability distribution, that captured two essential notions: that broader distributions should be more uncertain, and that the uncertainty of two independent random variables should be additive (as opposed to sub- or super-additive) [Shannon19XX, 20]. Remarkably, he showed that this expression must be (equivalent to) the minimum number of yes/no questions asked, on average, in a guessing game like ours. More precisely, he showed that any expression satisfying his desiderata must be, up to a scale factor,

\[
-\sum_{\bm{y}}p(\bm{y})\log p(\bm{y}) \eqqcolon \text{H}_{p}[\bm{Y}],
\qquad\text{(4.3)}
\]

a quantity he called entropy margin: information entropy for its resemblance to the quantity of that name in statistical physics, and for which accordingly we use the symbol H (i.e., capital “eta” for entropy). Again interpreting $-\log p(\bm{y})$ as the surprisal margin: surprisal of observation $\bm{y}$, the entropy can be described as the minimum average surprisal.

Because the scaling is irrelevant, a logarithm of any base will do equally well. The natural logarithm is most mathematically convenient, so we default to it throughout; but base 2 yields the convenient interpretation of our guessing game, and allows entropy to be denominated in the familiar quantity of bits.
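Since $\log_{2}y=\ln y/\ln 2$, entropies in nats and in bits differ only by the constant factor $\ln 2$. A quick numerical check, as a hypothetical Python sketch (the variable names are ours):

```python
import math

p = [1/2, 1/4, 1/8, 1/8]  # hiding probabilities of Eq. 4.1

# Entropy in nats (natural logarithm) and in bits (base-2 logarithm).
H_nats = sum(-pi * math.log(pi) for pi in p)
H_bits = sum(-pi * math.log2(pi) for pi in p)

print(H_bits)  # 1.75
# The same quantity in nats is just H_bits scaled by ln 2:
print(abs(H_nats - H_bits * math.log(2)) < 1e-12)  # True
```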

We return to that game now but suppose this time that you do not know my hiding probabilities. In particular, suppose I hide the marble according to the first scheme, Eq. 4.1, but you guess according to the second scheme, Eq. 4.2. Then the number of questions you will need to ask, on average, is

\[
\sum_{\bm{y}}-p(\bm{y})\log_{2}q(\bm{y}) = -\frac{1}{2}\log_{2}\frac{1}{4}-\frac{1}{4}\log_{2}\frac{1}{4}-\frac{1}{8}\log_{2}\frac{1}{4}-\frac{1}{8}\log_{2}\frac{1}{4} = 2.0.
\]

It is no coincidence that more guesses will be required on average under the “wrong” distribution (2.0) than under the right one (1.75). Lest one suspect that this has to do with the entropies of $p$ and $q$, notice that the result still holds when the distributions are reversed: If I hide the marble according to the uniform distribution ($q$, Eq. 4.2) and you guess according to the non-uniform distribution ($p$, Eq. 4.1), then you will likewise have to ask more questions than if you had guessed according to my distribution:

\[
\sum_{\bm{z}}-q(\bm{z})\log_{2}p(\bm{z}) = -\frac{1}{4}\log_{2}\frac{1}{2}-\frac{1}{4}\log_{2}\frac{1}{4}-\frac{1}{4}\log_{2}\frac{1}{8}-\frac{1}{4}\log_{2}\frac{1}{8} = 2.25,
\]

rather than 2.0.

In general, we call the average number of questions required under the strategy derived from $q$, when the marble is hidden according to $p$, the cross entropy margin: cross entropy between $p$ and $q$:

\[
-\sum_{\bm{y}}p(\bm{y})\log q(\bm{y}) \eqqcolon \text{H}_{pq}[\bm{Y}].
\qquad\text{(4.4)}
\]
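Under the same illustrative setup as before (Python, with our own function names), the cross entropy reproduces both averages computed above, and collapses to the ordinary entropy when the two distributions coincide:

```python
import math

p = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}  # Eq. 4.1
q = {"A": 1/4, "B": 1/4, "C": 1/4, "D": 1/4}  # Eq. 4.2

def cross_entropy_bits(p_dist, q_dist):
    """H_pq = -sum_y p(y) log2 q(y): average questions asked when you
    guess according to q_dist but the marble is hidden according to p_dist."""
    return sum(-py * math.log2(q_dist[y]) for y, py in p_dist.items())

print(cross_entropy_bits(p, q))  # 2.0
print(cross_entropy_bits(q, p))  # 2.25
print(cross_entropy_bits(p, p))  # 1.75, the entropy of p itself
```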

That the cross entropy $\text{H}_{pq}[\bm{Y}]$ is never less than the entropy (Gibbs’s inequality margin: Gibbs’s inequality ) follows from Jensen’s inequality (see Section LABEL:sec:XXX):

\[
\begin{aligned}
\text{H}_{pq}[\bm{Y}] - \text{H}_{p}[\bm{Y}] &= \sum_{\bm{y}}-p(\bm{y})\log q(\bm{y}) + \sum_{\bm{y}}p(\bm{y})\log p(\bm{y}) &&\text{Eqs. 4.3 and 4.4}\\
&= \mathbb{E}_{\bm{Y}}\mathopen{}\left[-\log\frac{q(\bm{Y})}{p(\bm{Y})}\right]\mathclose{}\\
&\geq -\log\mathopen{}\left(\mathbb{E}_{\bm{Y}}\mathopen{}\left[\frac{q(\bm{Y})}{p(\bm{Y})}\right]\mathclose{}\right)\mathclose{} &&\text{Jensen's}\\
&= -\log(1)\\
&= 0,
\end{aligned}
\]

with equality only when $p=q$ (again by Jensen’s inequality). Since $q$ could be any distribution, this also shows that the guessing strategy proposed at the outset is the optimal one: no other strategy can yield fewer questions on average. The quantity on the left-hand side is, then, the number of extra questions you have to ask in virtue of having used the wrong strategy. This quantity also has a name, the relative entropy margin: relative entropy , or Kullback-Leibler (KL) divergence:

\[
\operatorname{D_{KL}}\mathopen{}\left\{p(\bm{Y})\,\middle\|\,q(\bm{Y})\right\}\mathclose{} \vcentcolon= \text{H}_{pq}[\bm{Y}] - \text{H}_{p}[\bm{Y}].
\qquad\text{(4.5)}
\]
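A numerical sketch of Eq. 4.5 (hypothetical Python again, with the two distributions of Eqs. 4.1 and 4.2): the divergence is the 0.25 extra questions found above. It happens to come out the same in both directions for this pair, although in general the KL divergence is not symmetric; and it vanishes only when the two distributions coincide.

```python
import math

p = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}  # Eq. 4.1
q = {"A": 1/4, "B": 1/4, "C": 1/4, "D": 1/4}  # Eq. 4.2

def kl_bits(p_dist, q_dist):
    """D_KL{p || q} = sum_y p(y) log2(p(y) / q(y)): extra questions (bits)
    paid for guessing (coding) by q_dist when the data follow p_dist."""
    return sum(py * math.log2(py / q_dist[y]) for y, py in p_dist.items())

print(kl_bits(p, q))  # 0.25 = 2.00 - 1.75
print(kl_bits(q, p))  # 0.25 = 2.25 - 2.00
print(kl_bits(p, p))  # 0.0
```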

Efficient coding.

The guessing-game interpretation of entropy maps straightforwardly onto the classical coding problem. Under an efficient code, more frequent data $\bm{y}$ are assigned shorter code words. And indeed, assigning 1 and 0 (resp.) to the answers “yes” and “no” in our guessing game yields the following binary codes for the four letters:

\[
\text{A}=1 \qquad \text{B}=01 \qquad \text{C}=001 \qquad \text{D}=000.
\]

This is a prefix-free code—no codeword has another codeword as a prefix—so any string of binary digits, like 10011000101, has an unambiguous interpretation (ACADAB). The average codeword length is the entropy of the data—e.g., 1.75 bits under Eq. 4.1. As with the guessing game, designing the code (guessing strategy) according to the wrong distribution costs an extra number of bits (questions) given by the relative entropy of the correct to the incorrect distributions, Eq. 4.5. For example, Eq. 4.2 suggests partitioning the space first into (A or B) vs. (C or D), and thence into individual letters, which is equivalent to the coding scheme

\[
\text{A}=11 \qquad \text{B}=10 \qquad \text{C}=01 \qquad \text{D}=00.
\]

This is also a prefix-free code, but as we have seen, on data drawn from Eq. 4.1 it costs 2.0 bits on average.
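The unambiguous decoding claimed above for the first code can be sketched with a greedy left-to-right scan; this is an illustrative Python fragment of ours, not part of the text:

```python
# The prefix-free code derived from the guessing strategy for Eq. 4.1.
codebook = {"1": "A", "01": "B", "001": "C", "000": "D"}

def decode(bits):
    """Greedy decoding: valid because no codeword prefixes another."""
    letters, buf = [], ""
    for b in bits:
        buf += b
        if buf in codebook:
            letters.append(codebook[buf])
            buf = ""
    if buf:
        raise ValueError("leftover bits do not form a complete codeword")
    return "".join(letters)

print(decode("10011000101"))  # ACADAB

# Average codeword length under Eq. 4.1's probabilities: the entropy.
p = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}
avg_len = sum(p[letter] * len(word) for word, letter in codebook.items())
print(avg_len)  # 1.75
```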

Thus the relative entropy measures how much less efficient it is to use $q$ rather than $p$ to encode some data $\bm{y}$ drawn from $p(\bm{y})$. Since the relative entropy is non-negative and zero only when $p=q$, we have at last operationalized the notion of one distribution being or becoming “more like” another: We make a distribution $q$ more like a distribution $p$ when we decrease the number of extra bits required to encode, according to $q$, data drawn from $p(\bm{y})$.

Minimizing relative entropy.

Most (but not all) of the losses in this book, then, whether for discriminative or generative models, supervised or unsupervised learning, can be written as relative entropies, with the model distribution $\hat{p}(\bm{\hat{y}};\bm{\theta})$ in place of $q$. The generic optimization problem is

\[
\begin{aligned}
\operatorname*{argmin}_{\bm{\theta}}\mathopen{}\left\{\operatorname{D_{KL}}\mathopen{}\left\{p(\bm{Y})\,\middle\|\,\hat{p}(\bm{Y};\bm{\theta})\right\}\mathclose{}\right\}\mathclose{} &= \operatorname*{argmin}_{\bm{\theta}}\mathopen{}\left\{\text{H}_{p\hat{p}}[\bm{Y};\bm{\theta}] - \text{H}_{p}[\bm{Y}]\right\}\mathclose{} &&\text{by def'n, Eq. 4.5} \qquad\text{(4.6)}\\
&= \operatorname*{argmin}_{\bm{\theta}}\mathopen{}\left\{\text{H}_{p\hat{p}}[\bm{Y};\bm{\theta}]\right\}\mathclose{} &&\text{entropy is parameter-free}\\
&\approx \operatorname*{argmin}_{\bm{\theta}}\mathopen{}\left\{\left\langle -\log\hat{p}(\bm{Y};\bm{\theta})\right\rangle_{\bm{Y}}\right\}\mathclose{} &&\text{approx.\ w/ sample average}\\
&= \operatorname*{argmax}_{\bm{\theta}}\mathopen{}\left\{\left\langle \log\hat{p}(\bm{Y};\bm{\theta})\right\rangle_{\bm{Y}}\right\}\mathclose{}\\
&= \operatorname*{argmax}_{\bm{\theta}}\mathopen{}\left\{\log\prod_{n}^{N}\hat{p}(\bm{y}_{n};\bm{\theta})\right\}\mathclose{}\\
&= \operatorname*{argmax}_{\bm{\theta}}\mathopen{}\left\{\prod_{n}^{N}\hat{p}(\bm{y}_{n};\bm{\theta})\right\}\mathclose{} &&\log\text{ is monotonic}.
\end{aligned}
\]

Thus we see that minimizing relative entropy is equivalent to minimizing cross entropy, or again to maximizing the likelihood of the parameters margin: maximum likelihood , which connects our optimization to the classical objective for fitting statistical models.

Maximum-likelihood estimates (MLEs) have long enjoyed widespread use in statistics for their asymptotic properties: Suppose the data were “actually” generated from a parameterized distribution within the family of model distributions, $\hat{p}(\bm{\hat{y}};\bm{\theta})$. Then as the sample size approaches infinity, the MLE converges (in probability) to the true underlying parameter (“consistency”) and achieves the minimum mean squared error among all consistent estimators (“efficiency”). The parameter estimates of models fit by minimizing relative entropy inherit these properties.
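As a small, hypothetical illustration of the chain in Eq. 4.6: for a categorical model, the maximum-likelihood estimate is simply the vector of empirical relative frequencies, and the sample-average negative log-likelihood it achieves is the entropy of the empirical distribution; any other parameter setting (e.g., the uniform one) does worse. All names below are ours.

```python
import math
from collections import Counter

# A hypothetical sample whose relative frequencies match Eq. 4.1.
data = ["A"] * 4 + ["B"] * 2 + ["C"] + ["D"]
N = len(data)

# MLE for a categorical model: empirical relative frequencies.
p_hat = {y: c / N for y, c in Counter(data).items()}

def avg_nll(model, sample):
    """Sample-average negative log-likelihood, in bits (the sample cross entropy)."""
    return sum(-math.log2(model[y]) for y in sample) / len(sample)

nll = avg_nll(p_hat, data)                             # entropy of empirical dist.
nll_uniform = avg_nll({y: 1/4 for y in "ABCD"}, data)  # any other model is worse
print(p_hat)        # {'A': 0.5, 'B': 0.25, 'C': 0.125, 'D': 0.125}
print(nll)          # 1.75
print(nll_uniform)  # 2.0
```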

[………….]

[[Forward and reverse KL]]

[[continuous RVs]]

[[well known losses like squared error and the “binary cross entropy”]]