Graph all the things
analyzing all the things you forgot to wonder about
2017-05-11
interests: US history, unsupervised learning, interactive visualizations
Some historians find it fun to rank presidents, and they do this regularly in the Siena Research Institute Presidents Study, ranking all past presidents on 20 different criteria. I noticed that the rankings are heavily redundant. Just look at "overall ability" versus "executive ability" - that's 96% correlation! To make a bit more sense of this 20-dimensional data (a ranking for each category for each president), I plotted each president's rankings in 20-dimensional space, then rotated and recentered them so that the axes line up with the directions the data varies in.
Click on a president to follow him, and compare different axes!
This is called Principal Component Analysis (PCA). The resulting dimensions are called principal components and are ranked from most to least descriptive; the first (most descriptive) one is called the first principal component.
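If you want to see the "recenter, then rotate" step in code, here is a minimal sketch using NumPy. The data is a synthetic stand-in (40 made-up "presidents" ranked in 5 made-up "categories"), not the actual study data, and `pca_rotate` is just a name I picked for illustration.

```python
import numpy as np

def pca_rotate(data):
    """Recenter the rows of `data`, then rotate them onto the eigenvectors of the covariance."""
    centered = data - data.mean(axis=0)        # recenter every column at zero
    cov = np.cov(centered, rowvar=False)       # columns are the variables
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh handles symmetric matrices, ascending order
    order = np.argsort(eigvals)[::-1]          # reorder so the most descriptive axis comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs                # coordinates of each row along the new axes
    return scores, eigvecs, eigvals

# Synthetic stand-in: one hidden "quality" score plus noise, converted to ranks 1..40
rng = np.random.default_rng(0)
quality = rng.normal(size=40)
noisy = quality[:, None] + 0.3 * rng.normal(size=(40, 5))
ranks = noisy.argsort(axis=0).argsort(axis=0) + 1

scores, components, variances = pca_rotate(ranks.astype(float))
print(np.round(variances, 2))  # variance along each new axis, largest first
```

Each column of `components` is one of the new axes, and each row of `scores` is one "president" expressed in those axes.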
My takeaways from this are:
Here's the proportion of the variance in the data that's conveyed by the first few components:
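As a rough sketch of how such proportions can be computed: each eigenvalue of the covariance matrix is the variance along one component, so dividing by the sum of the eigenvalues gives each component's share. The data below is synthetic, and `explained_variance_ratio` is a hypothetical helper name, not something from the study.

```python
import numpy as np

def explained_variance_ratio(data):
    """Share of the total variance carried by each principal component, largest first."""
    cov = np.cov(data, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]    # eigvalsh returns ascending order, so reverse
    eigvals = np.clip(eigvals, 0.0, None)      # guard against tiny negative rounding errors
    return eigvals / eigvals.sum()

# Synthetic data: four columns that mostly share a single underlying factor
rng = np.random.default_rng(1)
shared = rng.normal(size=(200, 1))
data = shared + 0.5 * rng.normal(size=(200, 4))

ratio = explained_variance_ratio(data)
cumulative = np.cumsum(ratio)                  # share conveyed by the first k components
print(np.round(cumulative, 3))
```

When the columns are heavily redundant, as in this toy example, the first entry of `cumulative` is already close to 1.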
This part relies on knowledge of covariance and basic linear algebra.
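Let $X$ be a random variable in $\mathbb{R}^n$ with covariance matrix $\Sigma$ (in this case, $n = 20$ and the variables are the rankings, so the covariance is Spearman's rank covariance). The principal components of the probability distribution are the normalized eigenvectors of $\Sigma$. They have important properties that make PCA meaningful:
In fact, PCA can be derived from that last bullet. If you remember some basic formulas for covariance, then for any fixed vectors $u, v \in \mathbb{R}^n$ we have $\operatorname{Cov}(u^\top X, v^\top X) = u^\top \Sigma v$.
Since we must have $\operatorname{Cov}(v_i^\top X, v_j^\top X) = v_i^\top \Sigma v_j = 0$ for $i \neq j$, the uncorrelated components $v_1, \dots, v_n$ we are looking for must be orthogonal under the covariance matrix $\Sigma$. Since all covariance matrices are symmetric and positive semidefinite, an orthogonal set of directions satisfying that condition is given by the eigenvectors of $\Sigma$ (unique up to scaling when the eigenvalues are distinct). And if we use the normalized eigenvectors, we get $\operatorname{Var}(v_i^\top X) = v_i^\top \Sigma v_i = \lambda_i$, where $\lambda_i$ is the eigenvalue belonging to $v_i$, which gives us that nice easy way to order the components by importance like I mentioned earlier.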
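If you'd rather check this numerically than trust the algebra, here is a small sketch. It verifies, on a synthetic sample, that the covariance of two projections of the data equals u^T S v (with S the sample covariance matrix), that projections onto two different eigenvectors are uncorrelated, and that the variance along an eigenvector equals its eigenvalue. The mixing matrix, the vectors `u` and `v`, and the helper `projected_cov` are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 3-variable sample with correlated columns
mixing = np.array([[1.0, 0.0, 0.0],
                   [0.8, 0.6, 0.0],
                   [0.5, 0.3, 0.4]])
samples = rng.normal(size=(500, 3)) @ mixing.T
sigma = np.cov(samples, rowvar=False)          # sample covariance matrix

def projected_cov(u, v):
    """Sample covariance between the projections of the data onto u and onto v."""
    return np.cov(samples @ u, samples @ v)[0, 1]

# 1. The bilinear formula holds for arbitrary directions
u = np.array([1.0, -2.0, 0.5])
v = np.array([0.0, 1.0, 3.0])
identity_gap = abs(projected_cov(u, v) - u @ sigma @ v)

# 2. Eigenvector directions are uncorrelated, with variance equal to the eigenvalue
eigvals, eigvecs = np.linalg.eigh(sigma)       # ascending eigenvalues, orthonormal columns
v1, v2 = eigvecs[:, -1], eigvecs[:, -2]        # the two most descriptive directions
cross_cov = projected_cov(v1, v2)
var_along_v1 = projected_cov(v1, v1)

print(identity_gap, cross_cov, var_along_v1 - eigvals[-1])  # all should be near zero
```

All three printed numbers should be zero up to floating-point rounding, because the identities hold exactly for the sample covariance, not just in expectation.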