Entropy, conditional entropy, mutual information, cross entropy, Kullback-Leibler divergence, and perplexity
In mathematical statistics, entropy measures the uncertainty in the outcome of a random phenomenon. Consider the random experiment of tossing a coin. If the coin is fair, entropy is maximized. The uncertainty (entropy) in the experiment is 1, measured in bits. This is also the expected amount of information an outcome of the random experiment carries. If the coin is weighted and comes up heads 100 % of the time, the uncertainty of the random experiment is 0. In that case the outcome of the random experiment gives no information. If the coin is biased, the uncertainty of the random experiment is something from between these two extremes.
Formally, the entropy $H$ of a discrete random variable $X$ is defined as
where $x$ is iterated over all the possible values of $X$. In the above formula, entropy is measured with respect to base 2. Sometimes other bases are used—then the result will not appear in bits, but otherwise it makes no difference in the following discussion.
Conditional entropy and mutual information
If we’re able to gain some information about a random phenomenon through another random phenomenon, our uncertainty is diminished. Conditional entropy measures how much uncertainty in the outcome of one random phenomenon remains, if the outcome of another random phenomenon is known. The entropy of a discrete random variable $X$ conditional on the variable $Y$ is
The latter form of the equation can be derived if it is understood that $0 \log(0)$ should be treated as being equal to zero.
Mutual information is closely related to conditional entropy. As explained above, conditional entropy measures the amount of uncertainty that still remains about a random variable $X$, when we know the value of another random variable $Y$. The amount the uncertainty of $X$ was reduced by knowing the value of $Y$ we call the mutual information of $X$ and $Y$. Thus we can write
$I(X;Y)$, the mutual information of $X$ and $Y$, is a symmetric function:
For discrete random variables it can be written
Cross entropy and Kullback-Leibler divergence
If we model $X$ with an approximate distribution $q(x)$, cross entropy tells us how well $q(x)$ fits the real probability distribution of $X$, and is defined
When $p(x) = q(x)$, cross entropy has its minimum value, which is the entropy $H(p)$. Another similar measure is the Kullback-Leibler divergence or relative entropy (also sometimes called cross entropy):
Kullback-Leibler divergence is non-negative and zero if and only if $p(x) = q(x)$ almost everywhere. Intuitively Kullback-Leibler divergence can be seen as a difference between two distributions, but it is not a true metric and not a symmetric function. Specifically, it measures the information lost, when $q(x)$ is used to approximate $p(x)$.
Perplexity is defined as the exponent of entropy or cross entropy. The perplexity of model $q(x)$ on the random variable $X$ is the exponent of cross entropy:
This is another measure for how well $q(x)$ approximates the distribution $p(x)$. Usually the true distribution $p(x)$ is not known, but instead we have an empirical distribution from observed data, or cross entropy is estimated directly from the data as explained later.
Another application of perplexity is to measure the uncertainty of a random phenomenon. In this case the model is evaluated on the distribution it embodies. The perplexity of a random variable $X$ is the exponent of entropy:
Perplexity tells us how many sides we need on a fair dice to get the same uncertainty as we have on the random variable. We can easily see that the perplexity of a fair coin is 2 (we are “two ways perplexed” about the outcome of the random experiment).
Another perspective to these measures comes from information theory, where the possible values of $X$ are seen as messages to be coded using an optimal coding scheme. Entropy, then, gives the expected message length, cross entropy gives the expected message length when the coding scheme is optimal for $q(x)$ but the data follows $p(x)$, and KL divergence gives the extra message length when the coding scheme is optimal for $q(x)$ but the data follows $p(x)$.
Confusingly, entropy is sometimes used to refer to the quantity of information that the outcome of a random variable contains. This quantity is also called the self-information, and defined as:
Entropy is the expected value of self-information.
For example, when tossing a fair coin, the probability that the outcome is heads is 0.5, so the self-information of that event is $-\log_2(0.5) = 1$. The event “heads” is said to carry 1 bit of information.
Estimation from a sample and language modeling
Often we would like to measure these quantities, but instead of the true probability distribution we have a sample drawn from it. There are actually many ways to estimate the entropy from a sample. One possible way to compute the empirical entropy is to estimate each of the probabilities $p(x)$ as the fraction of $x$ in the sample.
A known distribution $p(x)$ can be compared to a sample $x_1 x_2 \ldots x_N$ using empirical cross entropy, defined as the Monte Carlo estimate of cross entropy:
For example, in natural language processing, it is usual to model a sequence of words $w_1 w_2 \ldots w_N$ as the outcome of a random phenomenon whose probability is given by a language model:
Cross entropy can be used to evaluate how well the model predicts a given text sequence. In this context, cross entropy is defined as if the words were independent observations:
and perplexity is defined as the exponent of cross entropy:
The latter form of the equation brings us another, equal, definition for the perplexity of a word sequence as the geometric mean of the word conditional probabilities: