
Graph all the things
analyzing all the things you forgot to wonder about
2023-02-02
interests: statistics
I started this post almost 5 years ago, did some interesting math, then... forgot about it? Anyway, it's done now.
In information theory, entropy measures how much information a distribution contains. A fair coin toss, or a uniform distribution over 2 bins, has 1 bit of entropy. A uniform distribution over 4 bins has 2 bits of entropy, over 8 bins has 3 bits of entropy, and so forth.
Things start to get interesting when the distribution is not uniform. How much information is there in a coin toss that always lands on heads? None at all, or 0 bits. How about if the coin toss is somewhere in between uniform and deterministic - say 90% chance heads? It turns out to be 0.47 bits. In other words, the best possible compression scheme for this coin would convert an average string of 100 heads and tails into just 47 0's and 1's.
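As a quick sanity check on that 0.47 figure, here's a minimal Python sketch (the helper name `coin_entropy_bits` is just for illustration):

```python
import math

def coin_entropy_bits(p_heads: float) -> float:
    """Entropy in bits of a coin that lands heads with probability p_heads."""
    entropy = 0.0
    for p in (p_heads, 1.0 - p_heads):
        if p > 0.0:  # convention: 0 * log(0) = 0
            entropy -= p * math.log2(p)
    return entropy

print(coin_entropy_bits(0.5))  # 1.0 bit for a fair coin
print(coin_entropy_bits(1.0))  # 0.0 bits for a coin that always lands heads
print(coin_entropy_bits(0.9))  # ~0.469 bits for a 90%-heads coin
```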
But this question gets even more interesting when you look at empirical distributions instead of exact distributions. What is a good estimate for the entropy of a coin toss of unknown probability if I've tried it 100 times and gotten 90 heads?
Here you can play around with a few distributions and different estimates of entropy.
For a discrete random variable $X$ taking values $i$ between $1$ and $K$, with $p_i = P(X = i)$, entropy is defined as
$$H(X) = -\sum_{i=1}^{K} p_i \log p_i.$$
(Taking $\log$ base 2 measures entropy in bits; in the derivation below, $\log$ means the natural log, which measures entropy in nats.)
Given an empirical distribution $\hat{p}$ built from $N$ samples $x_1, \dots, x_N$, one could evaluate the "naive" entropy estimator by mimicking the formula for $H(X)$.
If a value $i$ appears $n_i$ times in our samples, we can let $\hat{p}_i = n_i / N$ and approximate
$$\hat{H} = -\sum_{i=1}^{K} \hat{p}_i \log \hat{p}_i.$$
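Here's a minimal sketch of that plug-in estimator, applied to the 90-heads-out-of-100 sample from earlier (the function name `naive_entropy` is mine):

```python
import math
from collections import Counter

def naive_entropy(samples) -> float:
    """Plug-in entropy estimate (in nats) from a list of observed values."""
    counts = Counter(samples)
    n_total = len(samples)
    h = 0.0
    for count in counts.values():
        p_hat = count / n_total
        h -= p_hat * math.log(p_hat)
    return h

# 90 heads and 10 tails out of 100 tosses
samples = ["H"] * 90 + ["T"] * 10
print(naive_entropy(samples))                # ~0.325 nats
print(naive_entropy(samples) / math.log(2))  # ~0.469 bits
```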
However, it turns out that $\hat{H}$ consistently underestimates entropy.
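You can watch the underestimation happen with a small simulation. This sketch (my own setup, with an arbitrary 4-value distribution) averages the naive estimate over many repeated samples:

```python
import math
import random
from collections import Counter

# A small known distribution over 4 values.
values = [0, 1, 2, 3]
probs = [0.4, 0.3, 0.2, 0.1]
true_h = -sum(p * math.log(p) for p in probs)  # ~1.28 nats

random.seed(0)
n_samples, n_trials = 50, 2000
total = 0.0
for _ in range(n_trials):
    draws = random.choices(values, weights=probs, k=n_samples)
    counts = Counter(draws)
    # Naive (plug-in) estimate for this trial, accumulated into the total.
    total -= sum((c / n_samples) * math.log(c / n_samples) for c in counts.values())

print(f"true entropy:           {true_h:.4f} nats")
print(f"average naive estimate: {total / n_trials:.4f} nats")
# The average naive estimate comes out noticeably below the true entropy.
```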
We can do a bit better with a different estimator.
An estimator $\hat{\theta}$ is unbiased if it is equal to the true value $\theta$ on average, i.e. $E[\hat{\theta}] = \theta$.
You may know that other summary statistics of probability distributions have nice unbiased estimators.
For instance, to estimate the mean $\mu$ of a (finite mean) distribution, we can take the mean of the samples:
$$\hat{\mu} = \frac{1}{N} \sum_{j=1}^{N} x_j, \qquad E[\hat{\mu}] = \mu.$$
Might we find an unbiased estimator for entropy? Unfortunately not: no unbiased estimator of entropy exists for discrete distributions, no matter the (finite) sample size. So rather than trying to derive an unbiased estimator, we'll have to settle for deriving a less biased one. In fact, finding less biased estimators of entropy is a real topic of research. One such formula is the Miller-Madow corrected estimator:
$$\hat{H}_{MM} = \hat{H} + \frac{K - 1}{2N},$$
where $K$ is the number of distinct values (in practice, the number actually observed in the sample).
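As a sketch, the correction is a one-line addition on top of the naive estimate (the function name is mine):

```python
import math
from collections import Counter

def miller_madow_entropy(samples) -> float:
    """Miller-Madow corrected entropy estimate (in nats)."""
    counts = Counter(samples)
    n_total = len(samples)
    naive = -sum((c / n_total) * math.log(c / n_total) for c in counts.values())
    k_observed = len(counts)  # number of distinct values seen
    return naive + (k_observed - 1) / (2 * n_total)

samples = ["H"] * 90 + ["T"] * 10
print(miller_madow_entropy(samples))  # naive ~0.325 nats, corrected ~0.330 nats
```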
We'll derive it from scratch.
The approach is to consider a true distribution and approximate the expected value of the naive estimator $\hat{H}$ over random samples.
Evaluating $E[\hat{H}]$, we'll make some careful approximations and show that it is roughly $H(X) - \frac{K-1}{2N}$.
The expected value of the naive estimator is
$$E[\hat{H}] = -\sum_{i=1}^{K} E\left[\frac{n_i}{N} \log \frac{n_i}{N}\right] = -\sum_{i=1}^{K} \sum_{k=0}^{N} \binom{N}{k} p_i^k (1 - p_i)^{N - k} \, \frac{k}{N} \log \frac{k}{N},$$
since each count $n_i$ follows a binomial distribution with parameters $N$ and $p_i$ (using the convention $0 \log 0 = 0$).
To simplify this expression further, we need an approximation.
Assuming the vast majority of the probability comes from $k$ in the ballpark of $N p_i$ (a reasonable assumption whenever $N p_i \gtrsim 5$ or so), we can make this local estimate for the natural log term:
$$\log \frac{k}{N} \approx \log p_i + \frac{k - N p_i}{N p_i} - \frac{(k - N p_i)^2}{2 (N p_i)^2}.$$
After we plug this in with $E[n_i] = N p_i$ and $E[(n_i - N p_i)^2] = N p_i (1 - p_i)$ and apply the binomial formula, we get
$$E[\hat{H}] \approx -\sum_{i=1}^{K} p_i \log p_i - \sum_{i=1}^{K} \frac{1 - p_i}{2N} = H(X) - \frac{K - 1}{2N}.$$
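For completeness, here is one way the plug-in algebra can go, keeping only the leading-order terms (this is my reconstruction of the elided step):
$$
\begin{aligned}
E\!\left[\frac{n_i}{N}\log\frac{n_i}{N}\right]
  &\approx E\!\left[\frac{n_i}{N}\left(\log p_i + \frac{n_i - N p_i}{N p_i} - \frac{(n_i - N p_i)^2}{2 (N p_i)^2}\right)\right] \\
  &\approx p_i \log p_i + \frac{E\!\left[(n_i - N p_i)^2\right]}{N^2 p_i} - \frac{E\!\left[(n_i - N p_i)^2\right]}{2 N^2 p_i} \\
  &= p_i \log p_i + \frac{N p_i (1 - p_i)}{2 N^2 p_i}
   = p_i \log p_i + \frac{1 - p_i}{2N},
\end{aligned}
$$
using $E[n_i] = N p_i$, the binomial variance $E[(n_i - N p_i)^2] = N p_i (1 - p_i)$, and $n_i \approx N p_i$ in the factor multiplying the squared term. Summing over $i$ and negating gives the expression above, since $\sum_{i=1}^{K} (1 - p_i) = K - 1$. Adding $\frac{K-1}{2N}$ back onto $\hat{H}$ is exactly the Miller-Madow correction.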
Now we have an improved, less biased estimator!
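A simulation makes the improvement visible. This sketch (again with an arbitrary 4-value distribution of my choosing) compares the average naive and Miller-Madow estimates to the truth, along with the predicted bias:

```python
import math
import random
from collections import Counter

values = [0, 1, 2, 3]
probs = [0.4, 0.3, 0.2, 0.1]
true_h = -sum(p * math.log(p) for p in probs)

def naive_entropy(samples):
    """Plug-in entropy estimate (in nats)."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

random.seed(1)
n_samples, n_trials = 50, 2000
naive_sum = mm_sum = 0.0
for _ in range(n_trials):
    draws = random.choices(values, weights=probs, k=n_samples)
    h = naive_entropy(draws)
    k_observed = len(set(draws))
    naive_sum += h
    mm_sum += h + (k_observed - 1) / (2 * n_samples)

print(f"true entropy:                  {true_h:.4f} nats")
print(f"average naive estimate:        {naive_sum / n_trials:.4f} nats")
print(f"average Miller-Madow estimate: {mm_sum / n_trials:.4f} nats")
print(f"predicted naive bias -(K-1)/2N: {-(len(values) - 1) / (2 * n_samples):.4f}")
```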
There are distributions with infinite entropy!
For instance, if $X$ is unbounded and $p_i \propto \frac{1}{i \log^2 i}$ (for $i \geq 2$), then $\sum_i p_i$ converges but $-\sum_i p_i \log p_i$ is unbounded (approximately proportional to $\log \log n$ when summed up to $i = n$).
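A quick numerical sketch of that growth, using the unnormalized weights $w_i = \frac{1}{i \ln^2 i}$ (the cutoffs below are my choice for illustration; normalizing would only shift and rescale the values, not change the divergence):

```python
import math

def partial_entropy(n_max: int) -> float:
    """-sum_{i=2}^{n_max} w_i * ln(w_i), with unnormalized w_i = 1/(i * ln(i)^2)."""
    total = 0.0
    for i in range(2, n_max + 1):
        w = 1.0 / (i * math.log(i) ** 2)
        total -= w * math.log(w)
    return total

for n in (10**2, 10**3, 10**4, 10**5, 10**6):
    print(f"n = {n:>8}   partial entropy = {partial_entropy(n):7.3f}"
          f"   ln(ln(n)) = {math.log(math.log(n)):.3f}")
# The partial sums keep creeping upward roughly in step with ln(ln(n)),
# instead of converging to a finite value.
```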
Obviously, trying to estimate the entropy of infinite-entropy distributions will catastrophically fail.
I find it interesting that a single draw from this distribution contains infinite information.
No matter what compression scheme you use, the expected number of bits needed to represent a single draw is infinite.
Fortunately these distributions are quite rare.
I think they all have probability decaying at a rate of roughly $\frac{1}{i \log^c i}$ for some $1 < c \leq 2$ (I know my wording here isn't precise, but you know what I mean).
There is another estimator people often use called the jackknife.
It tends to be slightly better than the Miller-Madow correction, but the formula is more complicated.
If $\hat{H}_{-j}$ represents the naive entropy estimator on the sample without the $j$th element, the jackknife is defined as
$$\hat{H}_{JK} = N \hat{H} - \frac{N - 1}{N} \sum_{j=1}^{N} \hat{H}_{-j}.$$
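Here's a minimal sketch of the jackknife (the function names are mine), compared against the other two estimators on the 90-heads sample:

```python
import math
from collections import Counter

def naive_entropy(samples):
    """Plug-in entropy estimate (in nats)."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def jackknife_entropy(samples):
    """Jackknife-corrected entropy estimate (in nats)."""
    n = len(samples)
    h_full = naive_entropy(samples)
    # Sum of the naive estimator over all N leave-one-out subsamples.
    # Removing any of the c copies of a value gives the same subsample,
    # so grouping by value needs only K re-estimations instead of N.
    counts = Counter(samples)
    h_loo_sum = 0.0
    for value, c in counts.items():
        loo = samples.copy()
        loo.remove(value)  # drop one occurrence of this value
        h_loo_sum += c * naive_entropy(loo)
    return n * h_full - (n - 1) / n * h_loo_sum

samples = ["H"] * 90 + ["T"] * 10
naive = naive_entropy(samples)
mm = naive + (len(set(samples)) - 1) / (2 * len(samples))
print(f"naive:        {naive:.4f} nats")
print(f"Miller-Madow: {mm:.4f} nats")
print(f"jackknife:    {jackknife_entropy(samples):.4f} nats")
```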