Graph all the things
analyzing all the things you forgot to wonder about
*everyone's conjugate transpose
The title is a joke, but was inspired by the 110-page tome What Every Programmer Should Know About Memory, which I think is not a joke?
I'm going to try explaining these 6 useful matrix decompositions:
The first 4 (SVD, Eigen, Schur, PSD eigen) help with geometry and exponentiation. The last 2 (LU and Cholesky) help calculate determinants and solve for in systems of equations like . Scroll to the bottom if you need a reminder about any terminology or matrix classes.
Hot take: if you only know one matrix decomposition, make it SVD (forget Eigen).
A 2x2 matrix has 4 degrees of freedom that describe a linear distortion on 2D vectors. The usual representation of
The SVD shows that every matrix does these same 3 things. In other words, for any matrix , there exist unitary matrices () and () and a nonnegative diagonal matrix () such that
The amounts stretches by along the axes are called the singular values. If you take a (hyper)circle and apply a matrix to it, you get back a (hyper)ellipse whose radii are the singular values. This has wide-sweeping applications in dimensionality reduction and projective geometry.
At Hive, I once used SVD to help find interesting clusters of television shows, including this one:
The data amounted to a giant matrix of (viewers x TV shows), over 1M x 10k in size, where each entry was a 1 if the specific viewer had watched the TV show and 0 otherwise. I wanted to apply k-means to cluster the TV shows, but the data was too big, so I first used SVD to reduce dimensionality via a technique called PCA. I used Apache Spark to get the biggest singular values and an slice of . I then projected onto the column vectors of , thereby reducing the dimensionality of the shows from 10k x 1M to 10k x . Applying k-means to the reduced data gave us lots of fun clusters to look at.
The eigendecomposition explains what happens after repeated (or fractional, or inverse, or exponential) application of a matrix. That makes it good at describing how linear systems evolve over time or space, which is extremely important; any system of ordinary linear differential equations can be solved by a matrix exponential. If
Intuitively, when the eigendecomposition is possible, there are certain eigenvectors that the matrix stretches by their respective eigenvalues without rotating. More concretely, there exists an inverble matrix (containing the eigenvectors) and a diagonal matrix (containing the eigenvalues) such that
As I hinted, its most important property is easy exponentiation. Just apply the exponents to the entries of the diagonal matrix:
But beware: the eigendecomposition can be confusing.
What happens when you exponentiate a matrix that has no eigendecomposition? Polynomial growth (often mixed with more exponentials). For instance, is not diagonalizable. It turns out that exponentiating it gives
According to the Schur decomposition, every square matrix is equivalent to an upper triangular matrix if we rotate (and flip) our axes carefully. Precisely, there exist a unitary matrix and upper triangular matrix such that
Similarly to the eigendecomposition, this has the property that
The Schur decomposition also provides the (generalized!) eigenvalues! The main diagonal of is identical the eigendecomposition's term.
The holy trinity of decompositions is the eigendecomposition of a PSD matrix. It gives you the best of 3 worlds: an SVD where and are the same matrix, an eigendecomposition where is unitary, and a Schur decomposition where is diagonal. Where is unitary and is diagonal, you get
This does it all. The singular values equal the eigenvalues. The PSD matrix can be decomposed into orthogonal invariant subspaces, so with the right choice of axes, you can reason about each dimension indepently from the others. And if is real, then both and are as well.
It is so special that we should have a different word for it. Maybe "Cauchy decomposition" would be historically accurate, but to abide Stigler's Law, I propose "Keanu decomposition".
These last 2 matrix decompositions are completely different from the first 4. Instead of applying rotations and stretches, they break square matrices into two triangular matrices. This enables quickly solving for in systems of equations like , as well as calculating . In particular, LU with partial pivoting is what Lapack (and therefore numpy etc.) use for solving general systems of equations.
If you harken back to Algebra II, you may remember doing tedious Gaussian elimination to solve linear systems. Computing the LU decomposition requies the same -flop Gaussian elimination, but spits out some handy byproducts: a lower triangular matrix , upper triangular matrix , and permutation matrix such that
One of the easiest things you can do with this is compute 's determinant. The determinant of a triangular matrix is the product of its diagonal, and the determinant of a permutation matrix is , so
The other thing you can do is solve linear systems. That includes computing matrix inverses (solving )! Intuitively, it's relatively easy to solve for in a triangular system of equations like
Details and caveats:
Positive-definite matrices make everything better, and the LU decomposition is no exception. It upgrades into the Cholesky decomposition, with these benfits:
|Method||flops for solves constituting systems of equations|
The biggest use case for the Cholesky decomposition is solving least squares problems. A linear regression has the same solution for as the system of equations . Assuming there are at least linearly independent vectors in , is positive-definite.
Least squares fits right into the form of the Cholesky decomposition, which makes sense - it's exactly what Cholesky designed it for. He worked as a survey officer for the French in WWI, using it to triangulate distances quickly. Unfortunately he died of battle wounds just a few months before the war ended. But his work was published posthumously, and even bigger names like Alan Turing have studied his decomposition's properties.
At Lyft we used the Cholesky decomposition in a contextual assortment bandit service. In other words, we would take a few numbers of information (e.g. how urban the user's location is) and recommend a subset of possible options based on that (e.g. regular Lyft ride, bicycle, public transit). We learned how to go from inputs (contexts) to outputs (assortments) over time by observing some loss (e.g. conversion rate). The crux of this problem was semi-frequently evaluating terms that looked like
I used complex matrices for this post to avoid a lot of duplicate terminology for real matrices. Some types of matrices are special cases of others:
After careful review, these matrix decompositions did not make the cut: