Graph all the things
analyzing all the things you forgot to wonder about
2023-09-10
interests: compression (explained for beginners)
A few months ago, I spent several hours struggling to understand the inner workings of a lossless compression technique called Tabled Asymmetric Numeral System (tANS). I couldn't find any good* explanations for the leap involved from vanilla ANS to streaming ANS, so I vowed to create one.
Symbols!
As a silly example, say we want to compress the words "freer reef referee referrer" where each letter is a symbol (ignoring spaces). There are 3 distinct symbols - "e", "f", and "r" (an alphabet size of 3), and 24 symbols total.
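If you want to double-check those counts, a couple lines of Python (which I'll use for all the sketches in this post) will do it:

```python
from collections import Counter

# strip spaces; each remaining letter is a symbol
data = "freer reef referee referrer".replace(" ", "")
print(Counter(data))  # Counter({'e': 11, 'r': 9, 'f': 4})
print(len(data))      # 24
```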
In practice, symbols have a slightly higher-level meaning. The classic DEFLATE format (e.g. gzip) uses 288 distinct symbols: 256 for all literal bytes, 29 meaning "look back at earlier symbols and copy some", 1 for the end of the data block, and 2 for unimplemented purposes. Pco compresses numbers by binning them, and it represents each bin as a symbol.
Turning them into a really big number!
Well, technically two numbers: one regular-size number $n$, say up to $2^{64}$, encoding the total number of symbols, followed by the really big number $x$. In this way we can spend $\log_2(x)$ bits to compress the data. If the compressed data ends up being a gigabit, $x$ will end up being about $2^{10^9}$.
If a symbol has frequency $p$ in the data, we hope to spend about $\log_2(1/p)$ bits to encode it. For example, if we have 2 equally frequent symbols ($p = 1/2$), we expect to spend about 1 bit per symbol. If we have 8 equally frequent symbols ($p = 1/8$), we expect to spend about 3 bits per symbol. In practice, each symbol will have its own arbitrary frequency, so we'd like to spend more bits per uncommon symbol and fewer bits per common one.
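That ideal is just the Shannon information content; here's a quick sketch computing the ideal bit counts for our example (numbers we'll see again shortly):

```python
from collections import Counter
from math import log2

data = "freer reef referee referrer".replace(" ", "")
counts = Counter(data)
for symbol, count in counts.items():
    p = count / len(data)
    print(symbol, round(log2(1 / p), 2))  # e 1.13, r 1.42, f 2.58
```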
We start by making a small index of symbols approximately mimicking the frequency of each symbol. In our silly example above, we might choose an index of "eefr" since "e" is a bit more common than the other symbols. Then we tile it infinitely along the number line:
Vanilla ANS's compression algorithm is dead simple: start with $x_0 = 0$, and when we encode the $i$th symbol, let $x_{i+1}$ be the index at which the $x_i$th instance of that symbol occurs. We use 0-based indexing so that the 0th instance really means the first. The final $x_n$ is our really big number $x$.
Let's do an example, compressing the first 2 symbols "f" and "r". Since $x_0 = 0$ and the first symbol is "f", we set $x_1 = 2$, where the 0th (first) "f" appears in the index:
Then, since $x_1 = 2$ and the second symbol is "r", we set $x_2 = 11$, where the 2nd (third) "r" appears in the index:
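Here's a minimal sketch of that encode step (the helper name is my own invention):

```python
INDEX = "eefr"  # one tile of the infinitely repeated index
M = len(INDEX)

def encode_symbol(x: int, symbol: str) -> int:
    """Return the position of the x-th (0-based) occurrence of the
    symbol in the infinitely tiled index."""
    positions = [i for i, s in enumerate(INDEX) if s == symbol]
    tile, rank = divmod(x, len(positions))
    return tile * M + positions[rank]

x = 0
for symbol in "fr":
    x = encode_symbol(x, symbol)
print(x)  # 11, matching the worked example
```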
It's also simple to decode: starting from the final $x$, iteratively decode the symbol at index $x$, and let the previous $x$ be the number of times that symbol appeared in the index up to (but not including) position $x$. In the example above, we can start from $x_2 = 11$, see that 11 corresponds to an "r", and infer $x_1 = 2$ since 11 is the third instance of an "r". Then we can use $x_1 = 2$ to infer the previous symbol was "f" and so forth.
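And the inverse, which recovers symbols in reverse order (last encoded, first decoded):

```python
def decode_symbol(x: int) -> tuple[str, int]:
    """Return the symbol at position x and the previous state."""
    symbol = INDEX[x % M]
    # count occurrences of the symbol at positions 0..x-1 of the tiled index
    prev_x = (x // M) * INDEX.count(symbol) + INDEX[: x % M].count(symbol)
    return symbol, prev_x

x = 11
for _ in range(2):  # in practice, the stored symbol count n tells us when to stop
    symbol, x = decode_symbol(x)
    print(symbol)  # "r", then "f"
```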
Since half the symbols in our index are "e", each time we encode an "e" we roughly double $x$, adding one bit. And each time we encode an "f" or "r" we roughly quadruple $x$, adding two bits. With a larger index, we could get closer to the ideal number of bits (1.13 per "e", 2.58 per "f", and 1.42 per "r").
To summarize vanilla ANS in a single diagram:
Streaming!
Vanilla ANS is great and easy to understand, but it's impractical because it involves arithmetic on an extremely large integer $x$. To make it faster and more memory-efficient, we need a way to keep $x$ small. We do this by adding a step to encoding called normalization, and a corresponding step to decoding called denormalization (sometimes these are both called "renormalization", but that seems redundant and ambiguous to me):
Normalization and denormalization keep $x$ in a manageable range by writing bits out early instead of letting $x$ grow arbitrarily large. In particular, we require the index size $m$ to be a power of 2, and we keep $x$ in the interval $[m, 2m)$. Emitting bits early like this sacrifices a small amount of information during normalization, hurting compression ratio a smidgen, but spares us from doing impractical arithmetic.
Here's the main thing that tripped me up learning streaming ANS: I thought it would be equivalent to a special case of vanilla ANS with a power-of-2-sized index. This is not the case. Normalization has no vanilla ANS analog, and it reduces the precision of $x$, regardless of index size.
To normalize when compressing a symbol, we count $c$, the number of times the symbol appears in the index, then find the exponent $k$ such that $c \le \lfloor x / 2^k \rfloor < 2c$. We write out the $k$ least significant bits of $x$ to the bitstream, then return $\lfloor x / 2^k \rfloor$ as the value of $x$ normalized to the symbol. Normalization is complete, and we then apply the vanilla ANS step to the normalized $x$, giving the next state.
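In code, we can find $k$ implicitly by shifting bits out one at a time. A sketch, with the bitstream as a plain list:

```python
def normalize(x: int, c: int, out_bits: list[int]) -> int:
    """Emit least significant bits of x until it lands in [c, 2c)."""
    while x >= 2 * c:
        out_bits.append(x & 1)
        x >>= 1
    return x
```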
If you think about it, this next state must be in the interval $[m, 2m)$, since it's the index of the (normalized) $x$th occurrence of the symbol in the tiled index, and that $x$ is between 1 and 2 times the number of occurrences of the symbol in the index.
Denormalization is just the opposite. After undoing vanilla ANS, we get both a small $x$ and the symbol, and then we can find the exponent $k$ such that $m \le x \cdot 2^k < 2m$. We read those $k$ least significant bits $b$ back from the bitstream and get $x \cdot 2^k + b$.
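The corresponding sketch, treating the bit list as a stack since decoding runs in the opposite order from encoding:

```python
def denormalize(x: int, m: int, bits: list[int]) -> int:
    """Read bits back in until x lands in [m, 2m)."""
    while x < m:
        x = (x << 1) | bits.pop()
    return x
```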
Normalization isn't magic, but it satisfies 3 important properties:
1. It keeps $x$ in the manageable interval $[m, 2m)$ at every step, so we never do arithmetic on huge integers.
2. It's reversible: the decoder learns the symbol (and therefore $c$) before denormalizing, so it can recover exactly the $k$ bits that were written.
3. It only costs a tiny amount of compression ratio relative to vanilla ANS.
rANS!
This is just a name for the streaming algorithm described in the previous section. Done!
tANS!
Finding the value of $k$ used during normalization and denormalization on the fly can be expensive. It also affects the number of bits we need to write/read, so CPUs need to block on evaluating $k$ before evaluating the next steps. It can be faster to precompute the information needed for all states of the index.
For instance, in the decoder, we can make a table from every $x$ in the normalized range $[m, 2m)$ to:
- the symbol at position $x$ in the index,
- the number of bits $k$ to read during denormalization, and
- the next state, prior to adding those bits.
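A sketch of building that table for our "eefr" index:

```python
def build_decode_table(index: str) -> dict[int, tuple[str, int, int]]:
    """Map each state x in [m, 2m) to (symbol, bits to read, next-state base)."""
    m = len(index)  # must be a power of 2
    table = {}
    for x in range(m, 2 * m):
        symbol = index[x % m]
        # undo the vanilla ANS step: occurrences of symbol before position x
        prev = (x // m) * index.count(symbol) + index[: x % m].count(symbol)
        k = 0  # bits to read so that (prev << k) lands in [m, 2m)
        while (prev << k) < m:
            k += 1
        table[x] = (symbol, k, prev << k)
    return table

print(build_decode_table("eefr"))
# {4: ('e', 1, 4), 5: ('e', 1, 6), 6: ('f', 2, 4), 7: ('r', 2, 4)}
# to decode: emit the symbol, read k bits b, and jump to state base | b
```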
This is equivalent to rANS, but is usually faster when the size of the data is much larger than the index.
In our "eefr" example with streaming, "e" has 2 occurrences in index of size 4, so every time we denormalize after an "e" during decoding we read exactly 1 bit. So, based on the bitstream, the decoder is choosing between at most 2 symbols to decode next! What if the next symbol is supposed to be the 3rd one though? If we had swapped that next symbol with the 3rd one during compression, how could the decoder possibly decode it?
The trick is that compression was done in the opposite order. It's exactly like the movie Tenet (wait a sec, that won't help anyone understand...). To the decoder, we're talking about the next symbol, but to the encoder, it's the previous symbol. If we swap the $i$th symbol out for another symbol $s$ during compression, we'd end up in a different state that does have the option to decode $s$.
One way of thinking about it: the encoder and decoder each encounter a choice at every symbol. The compressor always has exactly one option per symbol in the alphabet, but the decoder may have more or fewer. Notice that every state has an out-edge (encoding option) for each symbol, but doesn't necessarily have an in-edge (decoding option) for each symbol:
I've skipped over some of the canonical commentary about ANS (e.g. that either compression or decompression needs to be done in reverse order; how ANS compares to Huffman codes; choosing and encoding the index), but other guides cover that decently already. It's worth noting that the research on ANS came out only about a decade ago, so I'm sure learning materials are still improving.
* Jarek Duda, the author of ANS, is brilliant, but his diagrams don't make any damn sense to me: