Graph all the things
analyzing all the things you forgot to wonder about
2023-03-27
interests: linear algebra
*everyone's conjugate transpose
The title is a joke, but was inspired by the 110-page tome What Every Programmer Should Know About Memory, which I think is not a joke?
I'm going to try explaining these 6 useful matrix decompositions:
[Figure: Decompositions]
The first 4 (SVD, Eigen, Schur, PSD eigen) help with geometry and exponentiation. The last 2 (LU and Cholesky) help calculate determinants and solve for $x$ in systems of equations like $Ax = b$. Scroll to the bottom if you need a reminder about any terminology or matrix classes.
Hot take: if you only know one matrix decomposition, make it SVD (forget Eigen).
A 2x2 matrix has 4 degrees of freedom that describe a linear distortion on 2D vectors. The usual representation of

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}$$

can be unhelpful when reasoning geometrically. More intuitively, a 2x2 matrix acts on a vector by doing these 3 things in order:

1. rotate (and maybe flip) it,
2. stretch it along the coordinate axes,
3. rotate (and maybe flip) it again.

The SVD shows that every matrix does these same 3 things. In other words, for any $m \times n$ matrix $A$, there exist unitary matrices $U$ ($m \times m$) and $V$ ($n \times n$) and a nonnegative diagonal matrix $\Sigma$ ($m \times n$) such that

$$A = U \Sigma V^*$$
So $A$ first rotates via $V^*$ in $n$ dimensions, then stretches along the coordinate axes and changes dimension from $n$ to $m$ via $\Sigma$ (either truncating or adding on 0's), then rotates and flips via $U$ in $m$ dimensions. The amounts $\Sigma$ stretches by along the axes are called the singular values. If you take a (hyper)circle and apply a matrix to it, you get back a (hyper)ellipse whose radii are the singular values. This has wide-sweeping applications in dimensionality reduction and projective geometry.
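To see this numerically, here is a small numpy sketch (my own illustration, not from the post) that factors a rectangular matrix and checks that the unit circle maps to an ellipse whose largest radius is the top singular value:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0],
              [0.0, 1.0]])                   # a 3x2 example matrix

# full_matrices=True gives U (3x3), s (singular values), Vt = V* (2x2)
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the m x n diagonal Sigma and confirm A = U @ Sigma @ V*
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)
assert np.allclose(A, U @ Sigma @ Vt)

# The unit circle maps to an ellipse whose radii are the singular values
theta = np.linspace(0, 2 * np.pi, 1000)
circle = np.vstack([np.cos(theta), np.sin(theta)])   # 2 x 1000 points on the circle
image = A @ circle                                   # mapped into 3D
print(np.linalg.norm(image, axis=0).max(), s.max())  # both ≈ the largest singular value
```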
At Hive, I once used SVD to help find interesting clusters of television shows, including this one:
The data amounted to a giant matrix of (viewers x TV shows), over 1M x 10k in size, where each entry was a 1 if the specific viewer had watched the TV show and 0 otherwise. I wanted to apply k-means to cluster the TV shows, but the data was too big, so I first used SVD to reduce dimensionality via a technique called PCA. I used Apache Spark to get the $k$ biggest singular values and an $m \times k$ slice of $U$ (here $m$ = 1M viewers). I then projected the data onto the column vectors of $U$, thereby reducing the dimensionality of the shows from 10k x 1M to 10k x $k$. Applying k-means to the reduced data gave us lots of fun clusters to look at.
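The real thing ran on Spark over a 1M x 10k matrix; as a toy illustration of the same recipe (my own sketch, with made-up data and a hypothetical $k$), here is the reduce-then-cluster step using numpy and scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy stand-in for the (viewers x shows) 0/1 watch matrix
views = (rng.random((2000, 300)) < 0.05).astype(float)   # 2000 viewers x 300 shows

k = 20                                                    # components to keep (hypothetical)
U, s, Vt = np.linalg.svd(views, full_matrices=False)      # economy SVD
U_k = U[:, :k]                                            # viewers x k slice of U

# Each show is a column of `views` (a viewer-dimensional vector); project the
# shows onto the top-k left singular vectors to get a shows x k representation.
# (Strict PCA would mean-center first; skipped here for brevity.)
shows_reduced = views.T @ U_k                             # 300 x k instead of 300 x 2000

clusters = KMeans(n_clusters=10, n_init=10).fit_predict(shows_reduced)
print(np.bincount(clusters))                              # cluster sizes
```

At the real scale you would compute only the top $k$ singular vectors with a randomized or distributed solver rather than a full SVD.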
The eigendecomposition explains what happens after repeated (or fractional, or inverse, or exponential) application of a matrix. That makes it good at describing how linear systems evolve over time or space, which is extremely important; any system of linear ordinary differential equations with constant coefficients can be solved by a matrix exponential. If

$$\frac{dx}{dt} = Ax$$

then we can solve for $x$ at any point in time with

$$x(t) = e^{At} x(0)$$

Engineers use the eigendecomposition to model how skyscrapers respond to earthquakes. Physicists use it to understand quantum systems. Biologists use it to model population genetics. Intuitively, when the eigendecomposition is possible, there are certain eigenvectors that the matrix stretches by their respective eigenvalues without rotating. More concretely, there exists an invertible matrix $X$ (containing the eigenvectors as columns) and a diagonal matrix $\Lambda$ (containing the eigenvalues) such that

$$A = X \Lambda X^{-1}$$
As I hinted, its most important property is easy exponentiation. Just apply the exponents to the entries of the diagonal matrix:

$$A^n = X \Lambda^n X^{-1}$$

Similarly,

$$e^{At} = X e^{\Lambda t} X^{-1}$$
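Here's a quick numpy/scipy sketch of that property (my own illustration, not from the post): exponentiate the diagonal entries, then conjugate back, and compare against direct computation.

```python
import numpy as np
from numpy.linalg import eig, inv, matrix_power
from scipy.linalg import expm

A = np.array([[0.5, 0.4],
              [0.2, 0.8]])             # a diagonalizable example matrix

evals, X = eig(A)                      # A = X @ diag(evals) @ inv(X)
X_inv = inv(X)

# Integer powers: only the diagonal gets exponentiated
n = 10
A_pow = X @ np.diag(evals ** n) @ X_inv
assert np.allclose(A_pow, matrix_power(A, n))

# Matrix exponential: e^{At} = X e^{Lambda t} X^{-1}
t = 2.0
A_exp = X @ np.diag(np.exp(evals * t)) @ X_inv
assert np.allclose(A_exp, expm(A * t))
```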
But beware: the eigendecomposition can be confusing. What happens when you exponentiate a matrix that has no eigendecomposition? Polynomial growth (often mixed with more exponentials). For instance,

$$\begin{pmatrix} 2 & 1 \\ -1 & 0 \end{pmatrix}$$

is not diagonalizable. It turns out that exponentiating it gives

$$\begin{pmatrix} 2 & 1 \\ -1 & 0 \end{pmatrix}^n = \begin{pmatrix} 1+n & n \\ -n & 1-n \end{pmatrix}$$
The Schur decomposition helps explain behavior like this. According to the Schur decomposition, every square matrix is equivalent to an upper triangular matrix if we rotate (and flip) our axes carefully. Precisely, there exist a unitary matrix $Q$ and an upper triangular matrix $T$ such that

$$A = Q T Q^*$$
Similarly to the eigendecomposition, this has the property that

$$A^n = Q T^n Q^*$$
It is much harder to exponentiate an upper triangular matrix than a diagonal matrix, but it's still easier than exponentiating a general square matrix directly. For instance, in the example above, the Schur decomposition gives

$$Q = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} \quad \text{and} \quad T = \begin{pmatrix} 1 & 2 \\ 0 & 1 \end{pmatrix}$$

It is easy to show that

$$T^n = \begin{pmatrix} 1 & 2n \\ 0 & 1 \end{pmatrix}$$

and conjugating by $Q$ gives the result above. Even when exponentiating the upper triangular matrix by brute force is necessary, it requires roughly a quarter as much work as a general square matrix. The Schur decomposition also provides the (generalized!) eigenvalues! The main diagonal of $T$ is identical to the eigendecomposition's $\Lambda$ term.
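A scipy sketch (again my own, not from the post) of the same idea, applied to the defective example matrix above:

```python
import numpy as np
from scipy.linalg import schur

A = np.array([[2.0, 1.0],
              [-1.0, 0.0]])           # the non-diagonalizable example above

T, Q = schur(A)                        # A = Q @ T @ Q*, with T upper triangular
assert np.allclose(A, Q @ T @ Q.conj().T)

# The diagonal of T carries the eigenvalues, even though A has no eigendecomposition
print(np.diag(T))                      # [1. 1.]

# Powers via the Schur form: A^n = Q T^n Q*
n = 5
A_pow = Q @ np.linalg.matrix_power(T, n) @ Q.conj().T
assert np.allclose(A_pow, np.linalg.matrix_power(A, n))
```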
The holy trinity of decompositions is the eigendecomposition of a PSD matrix. It gives you the best of 3 worlds: an SVD where $U$ and $V$ are the same matrix, an eigendecomposition where $X$ is unitary, and a Schur decomposition where $T$ is diagonal. Where $Q$ is unitary and $\Lambda$ is diagonal, you get

$$A = Q \Lambda Q^*$$
This does it all. The singular values equal the eigenvalues. The PSD matrix can be decomposed into orthogonal invariant subspaces, so with the right choice of axes, you can reason about each dimension independently from the others. And if $A$ is real, then both $Q$ and $\Lambda$ are as well.
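A short numpy check of these claims (my own, on a randomly generated PSD matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T                              # B B^T is symmetric positive semidefinite

evals, Q = np.linalg.eigh(A)             # A = Q @ diag(evals) @ Q^T, with Q orthogonal
assert np.allclose(A, Q @ np.diag(evals) @ Q.T)
assert np.allclose(Q.T @ Q, np.eye(4))   # eigenvectors are orthonormal

# For a PSD matrix, the singular values are exactly the eigenvalues
svals = np.linalg.svd(A, compute_uv=False)
assert np.allclose(np.sort(svals), np.sort(evals))
```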
It is so special that we should have a different word for it. Maybe "Cauchy decomposition" would be historically accurate, but to abide by Stigler's Law, I propose "Keanu decomposition".
These last 2 matrix decompositions are completely different from the first 4. Instead of applying rotations and stretches, they break square matrices into two triangular matrices. This enables quickly solving for $x$ in systems of equations like $Ax = b$, as well as calculating $\det(A)$. In particular, LU with partial pivoting is what LAPACK (and therefore numpy etc.) uses for solving general systems of equations.
If you harken back to Algebra II, you may remember doing tedious Gaussian elimination to solve linear systems. Computing the LU decomposition requires the same $\frac{2}{3}n^3$-flop Gaussian elimination, but spits out some handy byproducts: a lower triangular matrix $L$, upper triangular matrix $U$, and permutation matrix $P$ such that

$$PA = LU$$
One of the easiest things you can do with this is compute $A$'s determinant. The determinant of a triangular matrix is the product of its diagonal, and the determinant of a permutation matrix is $\pm 1$, so

$$\det(A) = \pm \prod_i L_{ii} \prod_i U_{ii}$$
It only took $O(n^3)$ flops to calculate this. Compare that to $O(n!)$ flops for the usual cofactor-expansion algorithm they teach in school. The other thing you can do is solve linear systems. That includes computing matrix inverses (solving $AX = I$)! Intuitively, it's relatively easy to solve for $x$ in a triangular system of equations like

$$\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}$$
Here one can trivially deduce the value of $x_1$, then substitute that into the equation for $x_2$ (i.e. "forward substitution"). This requires only $O(n^2)$ floating-point operations (flops). Once we have the LU decomposition, we can first solve for $y$ in $Ly = Pb$, then solve for $x$ in $Ux = y$. Computing the LU decomposition and solving like this is exactly as hard as (and procedurally almost identical to) Gaussian elimination on an augmented matrix. Its advantage is that we can now solve multiple systems of equations quickly, reusing the LU decomposition (see the sketch after the flop-count table below). Details and caveats:
Positive-definite matrices make everything better, and the LU decomposition is no exception. It upgrades into the Cholesky decomposition $A = L L^*$, with these benefits:
| Method | flops for $k$ solves constituting $n \times n$ systems of equations |
| --- | --- |
| Gaussian elimination | $\approx \frac{2}{3} k n^3$ |
| LU decomposition | $\approx \frac{2}{3} n^3 + 2 k n^2$ |
| Cholesky decomposition | $\approx \frac{1}{3} n^3 + 2 k n^2$ |
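To make the reuse concrete, here's a scipy sketch (my own, not from the post) that factors once and then solves several right-hand sides, first with LU (which also hands us the determinant) and then with Cholesky on a positive-definite matrix:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve, cho_factor, cho_solve

rng = np.random.default_rng(0)
n = 100
A = rng.standard_normal((n, n))
bs = rng.standard_normal((n, 10))        # 10 right-hand sides

# LU with partial pivoting: pay the n^3-cost factorization once, ~2n^2 per solve after
lu, piv = lu_factor(A)                   # wraps LAPACK getrf
xs = lu_solve((lu, piv), bs)             # wraps LAPACK getrs
assert np.allclose(A @ xs, bs)

# The determinant falls out of the factorization: +/- the product of U's diagonal
sign = (-1.0) ** np.count_nonzero(piv != np.arange(n))
assert np.allclose(sign * np.prod(np.diag(lu)), np.linalg.det(A))

# Cholesky: positive-definite matrices only, but about half the factorization cost
S = A @ A.T + np.eye(n)                  # a positive-definite matrix
c, low = cho_factor(S)
ys = cho_solve((c, low), bs)
assert np.allclose(S @ ys, bs)
```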
The biggest use case for the Cholesky decomposition is solving least squares problems. A linear regression $Ax \approx b$ has the same solution for $x$ as the system of equations $A^* A x = A^* b$. Assuming there are at least $n$ linearly independent rows in $A$, $A^* A$ is positive-definite.
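As a sketch of how that looks in code (mine, not the post's), here's a least-squares fit solved through the normal equations with a Cholesky solve, checked against numpy's built-in least squares:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 5))       # 1000 observations, 5 features
b = A @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.standard_normal(1000)

# Normal equations: (A* A) x = A* b, where A* A is positive-definite
AtA = A.T @ A
Atb = A.T @ b
x = cho_solve(cho_factor(AtA), Atb)

x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_ref)
```

One caveat: forming $A^* A$ squares the condition number, so for ill-conditioned problems a QR- or SVD-based solver is the safer choice.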
Least squares fits right into the form of the Cholesky decomposition, which makes sense - it's exactly what Cholesky designed it for. He worked as a survey officer for the French in WWI, using it to triangulate distances quickly. Unfortunately he died of battle wounds just a few months before the war ended. But his work was published posthumously, and even bigger names like Alan Turing have studied his decomposition's properties.
At Lyft we used the Cholesky decomposition in a contextual assortment bandit service. In other words, we would take a few numbers describing the context (e.g. how urban the user's location is) and recommend a subset of possible options based on that (e.g. regular Lyft ride, bicycle, public transit). We learned how to go from inputs (contexts) to outputs (assortments) over time by observing some loss (e.g. conversion rate). The crux of this problem was semi-frequently evaluating terms that looked like

$$x^* A^{-1} x$$
Fortunately $A$ is a positive-definite matrix! We were able to evaluate these terms fairly quickly thanks to the Cholesky decomposition. And... user experiences were improved? Value was delivered to key stakeholders? Something like that.

I used complex matrices for this post to avoid a lot of duplicate terminology for real matrices. Some types of matrices are special cases of others:
[Figure: Matrices]
Definitions:
After careful review, these matrix decompositions did not make the cut: