Graph all the things

analyzing all the things you forgot to wonder about

6 Matrix Decompositions Everyone* Should Know

2023-03-27
interests: linear algebra

*everyone's conjugate transpose

The title is a joke, but was inspired by the 110-page tome What Every Programmer Should Know About Memory, which I think is not a joke?

I'm going to try explaining these 6 useful matrix decompositions:

Decompositions

The first 4 (SVD, Eigen, Schur, PSD eigen) help with geometry and exponentiation. The last 2 (LU and Cholesky) help calculate determinants and solve for in systems of equations like . Scroll to the bottom if you need a reminder about any terminology or matrix classes.

Singular Value Decomposition (SVD)

• Formula:
• Applies to: all matrices!
• Useful for: geometry

Hot take: if you only know one matrix decomposition, make it SVD (forget Eigen).

A 2x2 matrix has 4 degrees of freedom that describe a linear distortion on 2D vectors. The usual representation of

can be unhelpful when reasoning geometrically. More intuitively, a 2x2 matrix acts on a vector by doing these 3 things in order:
• rotates by an angle (1 degree of freedom)
• stretches along the x and y axes by different ratios (2 degrees of freedom)
• rotates by another angle (and maybe flips) (1 degree of freedom)

The SVD shows that every matrix does these same 3 things. In other words, for any matrix , there exist unitary matrices () and () and a nonnegative diagonal matrix () such that

So first rotates via in dimensions, then stretches along the coordinate axes and changes dimension to via (either truncating or adding on 0's), then rotates and flips via in dimensions.

The amounts stretches by along the axes are called the singular values. If you take a (hyper)circle and apply a matrix to it, you get back a (hyper)ellipse whose radii are the singular values. This has wide-sweeping applications in dimensionality reduction and projective geometry.

At Hive, I once used SVD to help find interesting clusters of television shows, including this one:

• Overhaulin'
• Bitchin' Rides
• Graveyard Carz

The data amounted to a giant matrix of (viewers x TV shows), over 1M x 10k in size, where each entry was a 1 if the specific viewer had watched the TV show and 0 otherwise. I wanted to apply k-means to cluster the TV shows, but the data was too big, so I first used SVD to reduce dimensionality via a technique called PCA. I used Apache Spark to get the biggest singular values and an slice of . I then projected onto the column vectors of , thereby reducing the dimensionality of the shows from 10k x 1M to 10k x . Applying k-means to the reduced data gave us lots of fun clusters to look at.

Eigendecomposition

• Formula:
• Applies to: diagonalizable matrices
• Useful for: exponentiation

The eigendecomposition explains what happens after repeated (or fractional, or inverse, or exponential) application of a matrix. That makes it good at describing how linear systems evolve over time or space, which is extremely important; any system of ordinary linear differential equations can be solved by a matrix exponential. If

then we can solve at any point in time with
Engineers use the eigendecomposition to model how skyscrapers respond to earthquakes. Physicists use it to understand quantum systems. Biologists use it to model population genetics.

Intuitively, when the eigendecomposition is possible, there are certain eigenvectors that the matrix stretches by their respective eigenvalues without rotating. More concretely, there exists an invertible matrix (containing the eigenvectors) and a diagonal matrix (containing the eigenvalues) such that

As I hinted, its most important property is easy exponentiation. Just apply the exponents to the entries of the diagonal matrix:

Similarly,

But beware: the eigendecomposition can be confusing.

• The eigenvectors aren't generally orthogonal. So if you imagine stretching the space along each eigenvector by its eigenvalue sequentially, you'd be wrong. Somehow the matrix jointly stretches the eigenvectors simultaneously.
• Part of the decomposition is , which has almost no useful properties.
• The eigenvectors and values do not have any important geometric interpretation.

Schur Decomposition

• Formula:
• Applies to: square matrices
• Useful for: exponentiation

What happens when you exponentiate a matrix that has no eigendecomposition? Polynomial growth (often mixed with more exponentials). For instance, is not diagonalizable. It turns out that exponentiating it gives

The Schur decomposition helps explain behavior like this.

According to the Schur decomposition, every square matrix is equivalent to an upper triangular matrix if we rotate (and flip) our axes carefully. Precisely, there exist a unitary matrix and upper triangular matrix such that

Similarly to the eigendecomposition, this has the property that

It is much harder to exponentiate an upper triangular matrix than a diagonal matrix, but it's still easier than exponentiating a square matrix directly. For instance, in the example above, the Schur decomposition gives and . It is easy to show that , and conjugating by gives the result above. Even when exponentiating the upper triangular matrix by brute force is necessary, it requires roughly a quarter as much work as a square matrix.

The Schur decomposition also provides the (generalized!) eigenvalues! The main diagonal of is identical the eigendecomposition's term.

Positive-semidefinite (PSD) Eigendecomposition

• Formula:
• Applies to: PSD matrices
• Useful for: geometry, exponentiation

The holy trinity of decompositions is the eigendecomposition of a PSD matrix. It gives you the best of 3 worlds: an SVD where and are the same matrix, an eigendecomposition where is unitary, and a Schur decomposition where is diagonal. Where is unitary and is diagonal, you get

This does it all. The singular values equal the eigenvalues. The PSD matrix can be decomposed into orthogonal invariant subspaces, so with the right choice of axes, you can reason about each dimension indepently from the others. And if is real, then both and are as well.

It is so special that we should have a different word for it. Maybe "Cauchy decomposition" would be historically accurate, but to abide Stigler's Law, I propose "Keanu decomposition".

LU Decomposition (with partial pivoting)

• Formula:
• Applies to: square matrices
• Useful for: determinants, solving linear systems

These last 2 matrix decompositions are completely different from the first 4. Instead of applying rotations and stretches, they break square matrices into two triangular matrices. This enables quickly solving for in systems of equations like , as well as calculating . In particular, LU with partial pivoting is what Lapack (and therefore numpy etc.) use for solving general systems of equations.

If you harken back to Algebra II, you may remember doing tedious Gaussian elimination to solve linear systems. Computing the LU decomposition requies the same -flop Gaussian elimination, but spits out some handy byproducts: a lower triangular matrix , upper triangular matrix , and permutation matrix such that

One of the easiest things you can do with this is compute 's determinant. The determinant of a triangular matrix is the product of its diagonal, and the determinant of a permutation matrix is , so

It only took flops to calculate this. Compare that to flops for the usual algorithm they teach in school.

The other thing you can do is solve linear systems. That includes computing matrix inverses (solving )! Intuitively, it's relatively easy to solve for in a triangular system of equations like

Here one can trivially deduce the value of , then substitute that into the equation for (i.e. "forward substitution"). This requres only floating-point operations (flops). Once we have the LU decomposition, we can first solve for in , then solve for in . Computing the LU decomposition and solving like this is exactly as hard as (and procedurally almost identical to) Gaussian elimination on an augmented matrix. Its advantage is that we can now solve multiple systems of equations quickly, reusing the LU decomposition.

Details and caveats:

• Partial pivoting refers to dynamically swapping rows to avoid division by 0 during Gaussian elimination. And with a bit more care, if we always divide by the largest magnitude number possible, we also improve numerical stability. That's why the permutation matrix is necessary.
• If is not invertible, and/or won't be either, so one should check for 0's on their diagonals before trying to solve linear systems.
• If is and is with , we can do better than the standard LU procedure by actually evaluting . The total cost is still , but computers are better and more parallel at matrix multiplications than forward/backward substitution.

Cholesky Decomposition

• Formula:
• Applies to: Positive-definite matrices
• Useful for: determinants, solving linear systems

Positive-definite matrices make everything better, and the LU decomposition is no exception. It upgrades into the Cholesky decomposition, with these benfits:

• 2x faster to compute the decomposition.
• typically no need for a permutation matrix.
 Method flops for solves constituting systems of equations Gaussian elimination LU decomposition Cholesky decomposition

The biggest use case for the Cholesky decomposition is solving least squares problems. A linear regression has the same solution for as the system of equations . Assuming there are at least linearly independent vectors in , is positive-definite.

Least squares fits right into the form of the Cholesky decomposition, which makes sense - it's exactly what Cholesky designed it for. He worked as a survey officer for the French in WWI, using it to triangulate distances quickly. Unfortunately he died of battle wounds just a few months before the war ended. But his work was published posthumously, and even bigger names like Alan Turing have studied his decomposition's properties.

At Lyft we used the Cholesky decomposition in a contextual assortment bandit service. In other words, we would take a few numbers of information (e.g. how urban the user's location is) and recommend a subset of possible options based on that (e.g. regular Lyft ride, bicycle, public transit). We learned how to go from inputs (contexts) to outputs (assortments) over time by observing some loss (e.g. conversion rate). The crux of this problem was semi-frequently evaluating terms that looked like

Fortunately is a positive-definite matrix! We were able to evaluate fairly quickly thanks to the Cholesky decomposition. And... user experiences were improved? Value was delivered to key stakeholders? Something like that.

Appendix

I used complex matrices for this post to avoid a lot of duplicate terminology for real matrices. Some types of matrices are special cases of others:

Matrices

Definitions:

• square: .
• invertible: is invertible if there exists an inverse such that . This is equivalent to having only nonzero eigenvalues. Or to having a nonzero determinant.
• conjugate transpose (noun): the result of flipping the matrix across its diagonal and flipping the sign of all imaginary components. denotes the conjugate transpose of .
• conjugating (verb): conjugating by an invertible matrix means taking . I find it really unfortunate that this is based on the same word as the previous term.
• similar: is similar to if you can conjugate by some matrix to get ; for an invertible matrix .
• diagonalizable: similar to a diagonal matrix.
• diagonal: has nonzero entries only on the diagonal.
• Hermitian: equal to its conjugate transpose. If the matrix is real, this means it's symmetric.
• unitary: is unitary if . This really means that acts on vectors by only rotations and flips, preserving distances and angles. We often conjugate by a unitary matrix and write the result as .
• positive-semidefinite (PSD): is positive-semidefinite if for any vector , . Any Hermitian matrix with nonnegative real eigenvalues has this property.
• upper (or lower) triangular: only has nonzero entries on or above (or below) the diagonal.
• permutation: a matrix of 0's and 1's where each column and each row has exactly one 1. During matrix multiplication, these just reorder the rows of the other matrix.
• flop: floating point operation. A single division, multiplication, addition, or subtraction.

Notable Mentions

After careful review, these matrix decompositions did not make the cut:

• polar decomposition. It's equivalent to SVD and just breaks the matrix into fewer parts.
• low-rank decomposition. I'm pretty sure people only go from decomposed form -> full matrix and never the other way around.
• Jordan normal form. It's in between eigen and Schur, and I just can't put that much nuance into one blog post.