4.2 Minimizing relative entropy and maximizing likelihood
So we have specified the goal of learning: to make the model distribution match the data distribution as closely as possible. To turn this into a concrete objective, we need a measure of the discrepancy between two distributions. We build up to it with a game.
A guessing game.
Suppose we play a game in which I hide a marble under one of four cups, and you have to determine which by asking a series of yes/no questions.
Every time we play, I select the destination cup according to the probabilities
$$p(A) = \tfrac{1}{2}, \qquad p(B) = \tfrac{1}{4}, \qquad p(C) = \tfrac{1}{8}, \qquad p(D) = \tfrac{1}{8}, \tag{4.1}$$
which you know.
Your strategy is to partition the probability space into halves recursively.
(Of course, this may not be feasible exactly—e.g., if cup A had a probability that were not an integer power of 1/2—but it is for the distributions considered here.)
How many questions will it take on average (across all games)?
Here, for example, you would first ask, “Is it cup A?”
Half the time you would be right, so half the time you would locate the marble in one guess ($\log_2 2 = 1$ question). Otherwise you would ask, “Is it cup B?”, locating the marble a quarter of the time in two questions ($\log_2 4 = 2$); and one more question (“Is it cup C?”) settles the remaining cases, so cups C and D each require three questions ($\log_2 8 = 3$). On average, then, the number of questions is
$$\tfrac{1}{2}\log_2 2 + \tfrac{1}{4}\log_2 4 + \tfrac{1}{8}\log_2 8 + \tfrac{1}{8}\log_2 8 = 1.75.$$
It must be emphasized that the logarithms’ arguments are consequences of your guessing strategy, but their coefficients are the result of my behavior across repeated plays of the game.
Intuitively, you are allocating questions to each outcome in proportion to how surprising it would be, that is, to the log of its inverse probability, $\log_2\!\big(1/p(\text{cup})\big)$.
Now perhaps I think you’re winning the game too often, so I try to push the average number of questions up by choosing a new hiding strategy:
$$p(A) = p(B) = p(C) = p(D) = \tfrac{1}{4}. \tag{4.2}$$
Intuitively, by making the “odds more even,” I have made the problem harder—outcomes are overall more surprising because (e.g.) there is no single obvious answer, like A in the first example. And indeed, even though you tailor your guessing strategy to these new odds, you must on average ask more questions to locate the marble:
$$\tfrac{1}{4}\log_2 4 + \tfrac{1}{4}\log_2 4 + \tfrac{1}{4}\log_2 4 + \tfrac{1}{4}\log_2 4 = 2.0.$$
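These averages are easy to check numerically. Here is a minimal sketch in Python (the array names and the helper `avg_questions` are our own, not part of the text):

```python
import numpy as np

# Hiding probabilities for the two schemes (cups A, B, C, D).
p_first  = np.array([1/2, 1/4, 1/8, 1/8])   # Eq. 4.1
p_second = np.array([1/4, 1/4, 1/4, 1/4])   # Eq. 4.2

def avg_questions(p):
    """Average number of yes/no questions under the halving strategy,
    i.e. sum_i p_i * log2(1/p_i)."""
    return float(np.sum(p * np.log2(1.0 / p)))

print(avg_questions(p_first))   # 1.75
print(avg_questions(p_second))  # 2.0
```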
Information entropy.
It should be intuitive that, by hiding the marble “more randomly,” I have increased the number of yes/no questions you must ask to locate it.
In a rather different context, Shannon famously sought a mathematical expression for the “randomness” or “uncertainty” of a random variable,
a quantity he called entropy††margin:
information entropy
for its resemblance to the quantity of that name in statistical physics, and for which accordingly we use the symbol H (i.e., capital “eta” for entropy).
The expression he arrived at, unique up to a choice of scale under a few natural desiderata, is the expected log inverse probability,
$$H(p) := \mathbb{E}_{p}\!\left[\log\frac{1}{p(Y)}\right] = -\sum_{y} p(y)\log p(y).$$
Again interpreting the log inverse probability of an outcome as the number of questions allotted to it, the entropy is just the average number of yes/no questions required to locate the marble under an optimal guessing strategy.
Because the scaling is irrelevant, a logarithm of any base will do equally well. The natural logarithm is most mathematically convenient, so we default to it throughout; but base 2 yields the interpretation of our guessing game, and allows entropy to be denominated in the familiar quantity of bits.
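To make the change of base concrete, the following sketch (with our own function names) computes the entropy of the first hiding scheme in bits and in nats; the two differ only by the constant factor $\ln 2$:

```python
import numpy as np

def entropy_nats(p):
    """H(p) = -sum_i p_i * ln(p_i), in nats."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p)))

def entropy_bits(p):
    """The same quantity with the base-2 logarithm, in bits."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log2(p)))

p = [1/2, 1/4, 1/8, 1/8]                 # Eq. 4.1
print(entropy_bits(p))                   # 1.75 bits
print(entropy_nats(p))                   # ~1.2130 nats
print(entropy_nats(p) / np.log(2))       # back to 1.75 bits
```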
We return to that game now but suppose this time that you do not know my hiding probabilities. In particular, suppose I hide the marble according to the first scheme, Eq. 4.1, but you guess according to the second scheme, Eq. 4.2. Then the number of questions you will need to ask, on average, is
$$\tfrac{1}{2}\log_2 4 + \tfrac{1}{4}\log_2 4 + \tfrac{1}{8}\log_2 4 + \tfrac{1}{8}\log_2 4 = 2.0,$$
since under the second strategy every cup is allotted two questions.
It is no coincidence that more guesses will be required on average under the “wrong” distribution (2.0) than under the right one (1.75).
Lest one suspect that this has to do with the entropies of the two hiding schemes themselves (recall that the entropy of Eq. 4.2 is exactly 2.0), consider the reverse arrangement, in which I hide the marble according to Eq. 4.2 but you guess according to Eq. 4.1. Then the average number of questions is
$$\tfrac{1}{4}\log_2 2 + \tfrac{1}{4}\log_2 4 + \tfrac{1}{4}\log_2 8 + \tfrac{1}{4}\log_2 8 = 2.25,$$
rather than 2.0.
In general, we call the average number of questions required under the strategy derived from a distribution $q$, when the marble is in fact hidden according to a distribution $p$, the cross entropy,
$$H(p, q) := \mathbb{E}_{p}\!\left[\log\frac{1}{q(Y)}\right] = -\sum_{y} p(y)\log q(y).$$
That the cross entropy can never be smaller than the entropy,
$$H(p, q) \ge H(p),$$
with equality only when $q = p$, is guaranteed by Gibbs’s inequality: no guessing strategy can do better than the one tailored to the true hiding distribution. The difference between the two,
$$D_{\text{KL}}(p \,\|\, q) := H(p, q) - H(p) = \sum_{y} p(y)\log\frac{p(y)}{q(y)}, \tag{4.5}$$
is known as the relative entropy, or Kullback–Leibler (KL) divergence: the average number of extra questions incurred by tailoring your strategy to $q$ rather than to $p$.
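As a numerical sanity check, the sketch below (our own function names) reproduces the cross entropies computed above and verifies the inequality on a few randomly drawn distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    return float(-np.sum(p * np.log2(p)))

def cross_entropy(p, q):
    """Average questions (bits) when guessing per q while hiding per p."""
    return float(-np.sum(p * np.log2(q)))

p = np.array([1/2, 1/4, 1/8, 1/8])   # Eq. 4.1
q = np.array([1/4, 1/4, 1/4, 1/4])   # Eq. 4.2

print(cross_entropy(p, q), entropy(p))   # 2.0  >= 1.75
print(cross_entropy(q, p), entropy(q))   # 2.25 >= 2.0

# The inequality holds for arbitrary distributions as well:
for _ in range(5):
    a = rng.dirichlet(np.ones(4))
    b = rng.dirichlet(np.ones(4))
    assert cross_entropy(a, b) >= entropy(a)
```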
Efficient coding.
The guessing-game interpretation of entropy maps straightforwardly onto the classical coding problem.
Under an efficient code, more frequent data are assigned shorter codewords. For data distributed according to Eq. 4.1, the guessing strategy above suggests assigning each cup a codeword whose length equals the number of questions needed to find it: say, A → 0, B → 10, C → 110, D → 111, for an average length of 1.75 bits per cup. This is a prefix-free code—no codeword has another codeword as a prefix—so any string of binary numbers can be parsed unambiguously into codewords, without the need for separators. Alternatively, we could build the code for the second scheme, Eq. 4.2, assigning every cup a two-bit codeword: A → 00, B → 01, C → 10, D → 11. This is also a prefix-free code, and it is the efficient one when the cups are chosen uniformly, Eq. 4.2; but, as we have seen, when the cups are actually chosen according to Eq. 4.1 it costs 2.0 bits on average.
Thus the relative entropy measures how much less efficient it is to use a code (or guessing strategy) tailored to $q$ when the data are in fact distributed according to $p$: here, $D_{\text{KL}}(p \,\|\, q) = 2.0 - 1.75 = 0.25$ bits per symbol.
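The same arithmetic can be spelled out directly in terms of codeword lengths. The codewords below are one natural choice consistent with the two guessing strategies (a sketch, not necessarily the exact codes intended in the text):

```python
# Expected code length (bits/symbol) under each prefix-free code,
# when cups are hidden according to Eq. 4.1.
p = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}

code_41 = {"A": "0", "B": "10", "C": "110", "D": "111"}   # tailored to Eq. 4.1
code_42 = {"A": "00", "B": "01", "C": "10", "D": "11"}    # tailored to Eq. 4.2

def expected_length(probs, code):
    return sum(probs[c] * len(code[c]) for c in probs)

print(expected_length(p, code_41))   # 1.75 bits
print(expected_length(p, code_42))   # 2.0 bits
print(expected_length(p, code_42) - expected_length(p, code_41))  # 0.25 bits = D_KL
```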
Minimizing relative entropy.
Most (but not all) of the losses in this book, then, whether for discriminative or generative models, supervised or unsupervised learning, can be written as relative entropies, with the model distribution $\hat{p}_{\theta}$ in the role of the second argument and the data distribution $p$, from which we observe samples $y_1, \ldots, y_N$, in the role of the first. Learning then amounts to choosing the parameters $\theta$ so as to minimize this divergence:
$$
\begin{aligned}
\operatorname*{argmin}_{\theta}\, D_{\text{KL}}(p \,\|\, \hat{p}_{\theta})
  &= \operatorname*{argmin}_{\theta}\, \big[ H(p, \hat{p}_{\theta}) - H(p) \big]
      && \text{by def’n, Eq.\ 4.5} \\
  &= \operatorname*{argmin}_{\theta}\, H(p, \hat{p}_{\theta})
      && \text{entropy is parameter-free} \\
  &= \operatorname*{argmin}_{\theta}\, \mathbb{E}_{p}\!\big[ -\log \hat{p}_{\theta}(Y) \big] \\
  &\approx \operatorname*{argmin}_{\theta}\, -\frac{1}{N} \sum_{n=1}^{N} \log \hat{p}_{\theta}(y_n)
      && \text{approx.\ w/sample average} \\
  &= \operatorname*{argmax}_{\theta}\, \sum_{n=1}^{N} \log \hat{p}_{\theta}(y_n).
\end{aligned}
\tag{4.6}
$$
Thus we see that minimizing relative entropy is equivalent to minimizing cross entropy, or again to maximizing the likelihood of the parameters††margin: maximum likelihood, connecting our optimization to the classical objective for fitting statistical models.
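To see the equivalence in action, here is a small sketch (the softmax-parameterized categorical model and all names are our own choices for illustration): gradient descent on the sample-average negative log-likelihood drives the model toward the empirical frequencies, which is also the closed-form maximum-likelihood estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Data" drawn from the hiding distribution of Eq. 4.1.
p_true = np.array([1/2, 1/4, 1/8, 1/8])
y = rng.choice(4, size=10_000, p=p_true)          # samples y_1, ..., y_N
counts = np.bincount(y, minlength=4)

# Model: categorical distribution parameterized by logits theta.
def model_probs(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def neg_log_likelihood(theta):
    """Sample average of -log p_theta(y_n): the empirical cross entropy (in nats)."""
    return -np.mean(np.log(model_probs(theta)[y]))

# Gradient descent on the negative log-likelihood.
theta = np.zeros(4)
for _ in range(2000):
    grad = model_probs(theta) - counts / len(y)   # d NLL / d theta for a softmax model
    theta -= 0.5 * grad

print(model_probs(theta))            # ~ empirical frequencies ~ [0.5, 0.25, 0.125, 0.125]
print(counts / len(y))               # the MLE in closed form
print(neg_log_likelihood(theta) / np.log(2))   # ~1.75: the empirical entropy, in bits
```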
Maximum-likelihood estimates (MLEs) have long enjoyed widespread use in statistics for their asymptotic properties: Suppose the data were “actually” generated from a parameterized distribution within the family of model distributions, i.e., $p = \hat{p}_{\theta_0}$ for some true setting $\theta_0$ of the parameters. Then, as the number of samples $N$ grows, the MLE converges in probability to $\theta_0$; that is, it is consistent. Under mild regularity conditions it is also asymptotically normal and efficient, attaining the lowest achievable asymptotic variance.
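A quick simulation of the consistency claim, under the assumption (ours, for illustration) that the true distribution is the hiding scheme of Eq. 4.1: the categorical MLE is just the vector of empirical frequencies, and its error shrinks as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(2)
p_true = np.array([1/2, 1/4, 1/8, 1/8])   # the "true" parameters

for N in [100, 1_000, 10_000, 100_000]:
    y = rng.choice(4, size=N, p=p_true)
    mle = np.bincount(y, minlength=4) / N      # closed-form MLE for a categorical
    print(N, np.max(np.abs(mle - p_true)))     # error shrinks roughly like 1/sqrt(N)
```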
[………….]
[[Forward and reverse KL]]
[[continuous RVs]]
[[well known losses like squared error and the “binary cross entropy”]]