Graph all the things

analyzing all the things you forgot to wonder about

2016-08-20

I believe there are two types of algorithms worth writing: elegant ones that can be easily understood, and complicated ones that get the job done better. The Sieve of Eratosthenes is an elegant algorithm for finding all the prime numbers between 1 and , and the Sieve of Atkin is a high-performance one:

The little wings around the data points are uncertainties for runtime. Sieve of Eratosthenes uses time, and Sieve of Atkin uses time, with a smaller linear ratio at first but much larger overhead. To see that 50x overhead, here's a log-log plot of the same data:

I wrote these algorithms in Julia, and the code is at the bottom of the page.

The proof that Sieve of Atkin works involves abstract algebra, and is much longer and less obvious than the proof of Sieve of Eratosthenes, but the payoff is splendid. It pleases me that someone thought of this approach to finding primes.

What I didn't mention before is that there is actually an even faster version of Sieve of Atkin, which you can read about here. It runs in time and just memory via "page segmentation": it breaks the range from 1 to into intervals of size roughly , then works on each interval in sequence, using information from the previous interval to educate which numbers to bother checking in the current interval.

However, this "even faster" version is really bothersome to code, and has runtime with such a high constant factor that it never really outperforms the approach I implemented; growth is really slow. As a result, no one ever codes it. Not even me.

If you've ever been taught to properly shoot a basketball (not that I have, but anyway...), you know that the easiest way to do something - the way you already seem to know how to do - is often suboptimal. That's why I'm a bit surprised that so many companies are using python as their language of choice for serious machine learning or data science. Python was developed to be simple and all-purpose, and it's a language that everyone knows, but it is not efficient. Numpy is an attempt to compensate for python's numeric slugishness, and a decent one. But it relies on methods that aren't native to python, and when you need to do any logic in addition to numpy calls, performance drops terribly.

Julia seems like a much more serious language for dealing with data. I couldn't get a python sieve's performance to come close to that of Julia, even after trying several possible implementations:

A huge advantage Julia has is that it compiles. But there are benefits other than speed, like storing boolean vector data in BitArrays where each boolean is actually just one bit. In contrast, numpy allocates an entire byte to each boolean, eating up your memory.

I put my code in this little Github repo. These aren't the fastest implementations (primegen or something similar holds that title), but they are as fast as you can get for this level of effort.

Note that huge constants for Sieve of Atkin are just lookup tables to save a bunch of operations. I precomputed them with another little Julia script.