Graph all the things
analyzing all the things you forgot to wonder about
2023-02-02
interests: statistics
I started this post almost 5 years ago, did some interesting math, then... forgot about it? Anyway, it's done now.
In information theory, entropy measures how much information a distribution contains. A fair coin toss, or a uniform distribution over 2 bins, has 1 bit of entropy. A uniform distribution over 4 bins has 2 bits of entropy, over 8 bins has 3 bits of entropy, and so forth.
Things start to get interesting when the distribution is not uniform. How much information is there in a coin toss that always lands on heads? None at all, or 0 bits. How about if the coin toss is somewhere in between uniform and deterministic - say 90% chance heads? It turns out to be 0.47 bits. In other words, the best possible compression scheme for this coin would convert an average string of 100 heads and tails into just 47 0's and 1's.
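In case you want to double-check that 0.47 number, here's a quick Python sketch (the helper name is just something I made up for this post):

```python
import math

def coin_entropy_bits(p_heads: float) -> float:
    """Entropy of a biased coin in bits: -sum of p * log2(p) over both outcomes."""
    total = 0.0
    for p in (p_heads, 1.0 - p_heads):
        if p > 0:  # treat 0 * log(0) as 0
            total -= p * math.log2(p)
    return total

print(coin_entropy_bits(0.5))  # 1.0 bit for a fair coin
print(coin_entropy_bits(0.9))  # about 0.469 bits for the 90% heads coin
```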
But this question gets even more interesting when you look at empirical distributions instead of exact distributions. What is a good estimate for the entropy of a coin toss of unknown probability if I've tried it 100 times and gotten 90 heads?
Here you can play around with a few distributions and different estimates of entropy:
For a discrete random variable $X$ taking values between 1 and $n$ with $P(X = i) = p_i$, entropy is defined as

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$
You can choose whatever base you want for the log. If it's base 2, you get bits, and if it's base $e$, you get nats. I'll be using nats for the rest of this post.

Given an empirical distribution from $N$ samples $X_1, \ldots, X_N$, one could evaluate the "naive" entropy estimator $\hat{H}$ by mimicking the formula for $H$. If the value $i$ appears $N_i$ times in our samples, we can let $\hat{p}_i = N_i / N$ and approximate

$$\hat{H} = -\sum_{i=1}^{n} \hat{p}_i \log \hat{p}_i$$
This is also known as the "maximum likelihood" estimator for entropy. However, it turns out that $\hat{H}$ consistently underestimates entropy. We can do a bit better with a different estimator.
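To make the naive estimator concrete, here's a short Python sketch (the function name is mine; this isn't the code behind the visualization), applied to the 90-heads-out-of-100 example from earlier:

```python
import math
from collections import Counter

def naive_entropy(samples) -> float:
    """Naive (maximum likelihood) entropy estimate in nats:
    plug the empirical frequencies p_hat_i = N_i / N into the entropy formula."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# 90 heads and 10 tails out of 100 tosses
samples = ["H"] * 90 + ["T"] * 10
print(naive_entropy(samples))  # about 0.325 nats (0.469 bits)
```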
An estimator is unbiased if it is equal to the true value on average. You may know that other summary statistics of probability distributions have nice unbiased estimators. For instance, to estimate the mean of a (finite mean) distribution, we can take the mean of the samples:

$$\hat{\mu} = \frac{1}{N} \sum_{j=1}^{N} X_j$$
In other words, $\hat{\mu}$ is an unbiased estimator of the expected value of $X$. Might we find an unbiased estimator for entropy?
No. Don't even try. There is none.

So rather than trying to derive an unbiased estimator, we'll have to settle for deriving a less biased one. In fact, finding less biased estimators of entropy is a real topic of research. One such formula is the Miller-Madow corrected estimator:

$$\hat{H}_{MM} = \hat{H} + \frac{n - 1}{2N}$$
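In code, the correction is just one extra term on top of the naive estimate. A sketch, with the caveat that I'm using the number of observed bins as $n$, since in practice you may not know the true number of bins:

```python
import math
from collections import Counter

def miller_madow_entropy(samples) -> float:
    """Miller-Madow corrected entropy estimate in nats:
    the naive estimate plus (n - 1) / (2 * N)."""
    counts = Counter(samples)
    total = len(samples)
    naive = -sum((c / total) * math.log(c / total) for c in counts.values())
    # Using the number of observed bins as n (the true bin count may be unknown).
    return naive + (len(counts) - 1) / (2 * total)

samples = ["H"] * 90 + ["T"] * 10
print(miller_madow_entropy(samples))  # naive 0.325 + 1/200 correction = about 0.330 nats
```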
The Miller-Madow formula simply adds a small term to the naive estimator depending on the number of bins $n$ and the number of samples $N$.

We'll derive it from scratch. The approach is to consider a true distribution $p_1, \ldots, p_n$ and approximate the expected value of the naive estimator over random samples. Evaluating $E[\hat{H}]$, we'll make some careful approximations and show that it is roughly $H - \frac{n-1}{2N}$.
The expected value of the naive estimator is

$$E[\hat{H}] = -\sum_{i=1}^{n} E[\hat{p}_i \log \hat{p}_i] = -\sum_{i=1}^{n} \sum_{k=1}^{N} P(N_i = k)\, \frac{k}{N} \log \frac{k}{N}$$

where

$$P(N_i = k) = \binom{N}{k} p_i^k (1 - p_i)^{N-k}$$

Shifting the start index to 0 (using $\frac{k}{N}\binom{N}{k} = \binom{N-1}{k-1}$ and substituting $j = k - 1$) simplifies this slightly:

$$E[\hat{p}_i \log \hat{p}_i] = p_i \sum_{j=0}^{N-1} \binom{N-1}{j} p_i^j (1 - p_i)^{N-1-j} \log \frac{j+1}{N}$$

To simplify this expression further, we need an approximation. Assuming the vast majority of the probability comes from $j$ in the ballpark of $N p_i$ (a reasonable assumption whenever $N p_i$ is not too small), we can make this local estimate for the natural log term: for $x$ near $x_0$,

$$\log x \approx \log x_0 + 1 - e^{-\frac{x - x_0}{x_0}}$$
Approximating a log with an exponential seems strange, but it makes the math nice. An alternative is to use a 2nd order Taylor expansion (which we'll have to do at the end anyway), but I found this a little cleaner. After we plug this in with $x = \frac{j+1}{N}$ and $x_0 = p_i$ and apply the binomial formula, we get

$$E[\hat{p}_i \log \hat{p}_i] \approx p_i \left( \log p_i + 1 - e^{1 - \frac{1}{N p_i}} \left( 1 - p_i + p_i e^{-\frac{1}{N p_i}} \right)^{N-1} \right)$$

No more summations! And the $p_i \log p_i$ term is becoming clear! Assuming that $N$ is large, this further approximates to

$$E[\hat{p}_i \log \hat{p}_i] \approx p_i \left( \log p_i + 1 - e^{\,1 - \frac{1}{N p_i} - (N-1)\, p_i \left(1 - e^{-\frac{1}{N p_i}}\right)} \right)$$

Oof, double exponents. But by applying a 2nd order approximation in $\frac{1}{N p_i}$ and doing some algebra, we get

$$E[\hat{p}_i \log \hat{p}_i] \approx p_i \log p_i + \frac{1 - p_i}{2N}$$

Returning to $E[\hat{H}]$,

$$E[\hat{H}] \approx -\sum_{i=1}^{n} \left( p_i \log p_i + \frac{1 - p_i}{2N} \right) = H - \frac{n-1}{2N}$$

Now we have an improved, less biased estimator!
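As a sanity check on the derivation, here's a small simulation sketch (the distribution, sample size, and trial count are arbitrary choices of mine): it averages the naive estimate over many sample sets from a known distribution and compares the shortfall to $\frac{n-1}{2N}$.

```python
import math
import random
from collections import Counter

def naive_entropy(samples) -> float:
    """Naive entropy estimate in nats."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# An arbitrary 4-bin distribution with known entropy
probs = [0.4, 0.3, 0.2, 0.1]
true_h = -sum(p * math.log(p) for p in probs)

n_bins, n_samples, n_trials = len(probs), 100, 20000
random.seed(0)
avg_naive = sum(
    naive_entropy(random.choices(range(n_bins), weights=probs, k=n_samples))
    for _ in range(n_trials)
) / n_trials

print(true_h - avg_naive)              # how far the naive estimate falls short of H on average
print((n_bins - 1) / (2 * n_samples))  # predicted shortfall: (n - 1) / (2N) = 0.015
```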
This is exactly the Miller-Madow correction. Maybe the reason I abandoned this 5 years ago was because I realized it was a pre-existing result.

There are distributions with infinite entropy! For instance, if $X$ is unbounded and $p_i \propto \frac{1}{i \log^2 i}$ (for $i \ge 2$), then $\sum_i p_i$ converges but $-\sum_i p_i \log p_i$ is unbounded (approximately proportional to $\log \log k$ when summed up to $i = k$). Obviously, trying to estimate the entropy of infinite-entropy distributions will catastrophically fail.
I find it interesting that a single draw from this distribution contains infinite information. No matter what compression scheme you use, the expected number of bits needed to represent a single draw is infinite. Fortunately these distributions are quite rare. I think they all have probability decaying at a rate of roughly $\frac{1}{i \log^c i}$ for some $1 < c \le 2$ (I know my wording here isn't precise, but you know what I mean).
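Here's a tiny sketch of that divergence using the $\frac{1}{i \log^2 i}$ example (the cutoffs are arbitrary): the total mass settles down while the entropy-style sum keeps creeping upward.

```python
import math

# Unnormalized weights w_i = 1 / (i * log(i)^2) for i >= 2. The total mass converges,
# but the entropy-style sum of -w_i * log(w_i) grows without bound, roughly like log(log(k)).
# (Normalizing by the finite total mass would only rescale and shift the entropy,
# so it stays infinite.)
mass, entropy = 0.0, 0.0
checkpoints = {10**2, 10**4, 10**6}  # arbitrary cutoffs
for i in range(2, 10**6 + 1):
    w = 1.0 / (i * math.log(i) ** 2)
    mass += w
    entropy -= w * math.log(w)
    if i in checkpoints:
        print(f"k={i:>7}  mass={mass:.4f}  entropy={entropy:.4f}")
```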
There is another estimator people often use called the jackknife. It tends to be slightly better than the Miller-Madow correction, but the formula is more complicated. If $\hat{H}_{-j}$ represents the naive entropy estimator on the sample without the $j$th element, the jackknife is defined as

$$\hat{H}_{JK} = N \hat{H} - \frac{N-1}{N} \sum_{j=1}^{N} \hat{H}_{-j}$$
I also implemented this one in the above visualization.
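If you'd rather poke at the jackknife outside the visualization, here's a Python sketch of the formula above (my own implementation, not the one behind the visualization). Rather than recomputing the naive estimate $N$ times, it uses the fact that the leave-one-out estimate only depends on which bin the removed element came from:

```python
import math
from collections import Counter

def naive_entropy_from_counts(counts, total) -> float:
    """Naive entropy estimate in nats from a list of bin counts."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def jackknife_entropy(samples) -> float:
    """Jackknifed entropy estimate: N * H_hat - (N - 1) / N * sum of leave-one-out estimates."""
    counts = Counter(samples)
    total = len(samples)
    h_full = naive_entropy_from_counts(counts.values(), total)

    # The leave-one-out estimate only depends on which bin the removed sample
    # came from, so compute one estimate per bin and weight it by the bin's count.
    loo_sum = 0.0
    for value, count in counts.items():
        reduced = [c - 1 if v == value else c for v, c in counts.items()]
        loo_sum += count * naive_entropy_from_counts(reduced, total - 1)

    return total * h_full - (total - 1) / total * loo_sum

samples = ["H"] * 90 + ["T"] * 10
print(jackknife_entropy(samples))  # about 0.330 nats, vs. 0.325 for the naive estimate
```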