Graph all the things

analyzing all the things you forgot to wonder about

Measuring Entropy

interests: statistics

I started this post almost 5 years ago, did some interesting math, then... forgot about it? Anyway, it's done now.

In information theory, entropy measures how much information a distribution contains. A fair coin toss, or a uniform distribution over 2 bins, has 1 bit of entropy. A uniform distribution over 4 bins has 2 bits of entropy, over 8 bins has 3 bits of entropy, and so forth.

Things start to get interesting when the distribution is not uniform. How much information is there in a coin toss that always lands on heads? None at all, or 0 bits. How about if the coin toss is somewhere in between uniform and deterministic - say 90% chance heads? It turns out to be 0.47 bits. In other words, the best possible compression scheme for this coin would convert an average string of 100 heads and tails into just 47 0's and 1's.

But this question gets even more interesting when you look at empirical distributions instead of exact distributions. What is a good estimate for the entropy of a coin toss of unknown probability if I've tried it 100 times and gotten 90 heads?

Here you can play around with a few distributions and different estimates of entropy:

trueprobabilityempiricalprobability(n=100)xentropy estimate / bitsexact:naive:Miller-Madow:jackknife:

What is a Discrete Distribution's Entropy?

For a discrete random variable X between 1 and m with P(X=x) = p_x, entropy is defined as

H[X] = -\sum_{x=1}^mp_x\log(p_x)
You can choose whatever base you want for the log. If it's base 2, you get bits, and if it's base e, you get nats. I'll be using nats for the rest of this post.

Given an empirical distribution \hat{X} from n samples x_1, \ldots x_n, one could evaluate the "naive" entropy estimator by mimicking the formula for H[X]. If a value x appears k_x times in our samples, we can let \hat{p}_x = k_x / n and approximate

\hat{H}_{ML}[\hat{X}] \approx -\sum_{x=1}^m\hat{p}_x\log(\hat{p}_x)
This is also known as the "maximum likelihood" estimator for entropy.

However, it turns out that \hat{H}_{ML} consistently underestimates entropy. We can do a bit better with a different estimator.

Deriving a Better Estimator for Entropy

An estimator \hat{H}[\hat{X}] is unbiased if it is equal to the the true value H[X] on average. You may know that other summary statistics of probabilitiy distributions have nice unbiased estimators. For instance, to estimate the mean E[X] of a (finite mean) distribution, we can take the mean of of the samples:

E[X] = E\left[\frac{1}{n}\sum_{i=1}^nx_i\right]
In other words, \frac{1}{n}\sum_{i=1}^nx_i is an unbiased estimator of the expected value of X.

Might we find an unbiased estimator for entropy?

H[X] = E\left[?\right]
No. Don't even try. There is none.

So rather than trying to derive an unbiased estimator, we'll have to settle for deriving a less biased one. In fact, finding less biased estimators of entropy is a real topic of research. One such formula is the Miller-Madow corrected estimator:

\hat{H}_{MM}[\hat{X}] = \hat{H}_{ML}[\hat{X}] + \frac{m - 1}{2n}
The Miller-Madow formula simply adds a small term to the naive estimator depending on the number of bins and samples.

We'll derive it from scratch. The approach is to consider a true distribution and approximate the expected value of the \hat{H}_{ML}[X] over random samples. Evaluating E[\hat{H}_{ML}], we'll make some careful approximations and show that it is roughly H - (m-1)/(2n).

The expected value of the naive estimator is

E\left[\hat{H}_{ML}(\hat{X})\right] = -\sum_{x=1}^mh_x
h_x = \sum_{k=1}^{n}\binom{n}{k} p_x^k(1-p_x)^{n-k}\frac{k}{n}\ln\left(\frac{k}{n}\right)
Shifting the start index to 0 simplifies this slightly:
h_x = p_x\sum_{k=0}^{n-1}\binom{n-1}{k} p_x^k(1-p_x)^{n-1-k}\ln\left(\frac{k+1}{n}\right)

To simplify this expression further, we need an approximation. Assuming the vast majority of the probability comes from k in the ballpark of np_x (a reasonable assumption whenever np_x > 5 or so), we can make this local estimate for the natural log term:

\ln t  \approx \ln p + (1 - e^{-(t-p)/p})
Approximating a log with an exponential seems strange, but it makes the math nice. An alternative is to use a 2nd order Taylor expansion (which we'll have to do at the end anyway), but I found this a little cleaner.
Plot of log(t) and two approximations. The exponential approximation comes closer.

After we plug this in with t=(k+1)/n and p=p_x and apply the binomial formula, we get

h_x \approx p_x\left(\ln p_x + 1 - e^{1 - \frac1{np_x}}\left(1 - p_x\left(1-  e^{-\frac1{np_x}}\right)\right)^{n-1}\right)
No more summations! And the p_x\ln p_x term is becoming clear! Assuming that n is large, this further approximates to
h_x \approx p_x\left(\ln p_x + 1 - e^{1 -\frac1{np_x} - (n-1)p_x\left(1-e^{-\frac1{np_x}}\right)}\right)
Oof, double exponents. But by applying a 2nd order approximation in 1/(np_x) and doing some algebra, we get
h_x \approx p_x\ln p_x - \frac{p_x}{2n} + \frac{1}{2n}
Returning to \hat{H}_{ML}[\hat{X}],
E\left[\hat{H}_{ML}(\hat{X})\right] \approx -\sum_{x=1}^m \left(p_x\ln p_x - \frac{p_x}{2n} + \frac{1}{2n}\right)
E\left[\hat{H}_{ML}(\hat{X})\right] \approx -\frac{m - 1}{2n} - \sum_{x=1}^m p_x\ln p_x

Now we have an improved, less biased estimator!

H_{MM}[X] \approx \hat{H}_{ML}(\hat{X}) + \frac{m - 1}{2n}
This is exactly the Miller-Madow correction. Maybe the reason I abandoned this 5 years ago was because I realized it was a pre-existing result.

Fun Fact: Infinite Entropy Distributions

There are distributions with infinite entropy! For instance, if X is unbounded and P(X=x) \propto \frac{1}{x\ln(x)^2}, then \sum_{x=0}^\infty P(x) converges but -\sum_{x=0}^\infty P(x)\ln P(x) is unbounded (approximately proportional to \ln(\ln x_0)) when summed up to x_0). Obviously, trying to estimate the entropy of infinite-entropy distributions will catastrophically fail.

I find it interesting that a single draw from this distribution contains infinite information. No matter what compression scheme you use, the expected number of bits needed to represent a single draw is infinite. Fortunately these distributions are quite rare. I think they all have probability decaying at a rate of 1/(x\ln(x)^\alpha) for some 1 < \alpha \le 2 (I know my wording here isn't precise, but you know what I mean).

Appendix: The Jackknife

There is another estimator people often use called the jackknife. It tends to be slightly better than the Miller-Madow correction, but the formula is more complicated. If \hat{H}_{ML - j} represents the naive entropy estimator on the sample without the jth element, the jackknife is defined as

\hat{H}_{JK} = n\hat{H}_{ML} - \frac{n - 1}{n}\sum_{j=1}^n\hat{H}_{ML-j}
I also implemented this one in the above visualization.