Sorting in Practice 3: Introsort
How does a typical C++ standard library implement std::sort? The standard does not actually mandate a particular algorithm, but when I last looked at my installation (around 2020), I found something called introsort, and this seems to be a common choice. The name stands for “introspective” sort – it pays attention to how it is doing – and it was intro-duced (haha) in a 1997 paper by David Musser.1 The code also seemed to call __adjust_heap (heapsort?) and __insertion_sort under some circumstances, but we’ll see what those are for.
Depending on your lecturer, you might come out of a freshman algorithms class with a high opinion of merge sort: elegant, easy to implement and clearly linearithmic. Quicksort is actually kind of bad, right? It has a quadratic worst case. The idea behind introsort is that actually, quicksort is really good! We would usually be perfectly happy to use it. Its average runtime on uniformly distributed inputs is linearithmic, and in terms of constant factors it is faster than most other sorting algorithms, such as heapsort, on most inputs. The thing that’s bad about it is its worst case: it can end up doing a quadratic amount of work, and that’s actually really bad – and not just in theory. Sometimes you just have to sort really long lists, and it can be a big problem if, all of a sudden, this one time, it happens to be super expensive.
Let’s think like a theoretical computer scientist for a second. If you have two different algorithms for the same problem whose runtimes behave differently on specific instances (and maybe they have trouble with different instances), then there also exists an algorithm that has, on every instance, the better of the two runtimes. Welcome to big-Oh notation: we can just hide an additional factor 2. You run both algorithms interleaved (one step of A, one step of B, one step of A, &c.) and as soon as one of them finishes, you take its result.
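If you want to see this trick in motion, here is a toy C++ version – entirely my own construction, nothing from Musser’s paper – where two threads stand in for the lockstep interleaving, the two contenders are arbitrarily chosen as std::sort and std::stable_sort, and whichever finishes first supplies the answer:

```cpp
#include <algorithm>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Toy "better of two runtimes" combinator. Two threads stand in for
// the lockstep interleaving; each sorts its own copy of the input and
// the first to finish publishes its result via the shared promise.
std::vector<int> best_of_two(std::vector<int> input) {
    auto result = std::make_shared<std::promise<std::vector<int>>>();
    auto first  = std::make_shared<std::once_flag>();

    auto launch = [&](auto algorithm) {
        std::thread([=] {                        // owns its own copies
            auto copy = input;
            algorithm(copy.begin(), copy.end());
            // only the first finisher gets to set the result
            std::call_once(*first, [&] { result->set_value(std::move(copy)); });
        }).detach();                             // the loser's work is discarded
    };

    launch([](auto f, auto l) { std::sort(f, l); });         // algorithm A
    launch([](auto f, auto l) { std::stable_sort(f, l); });  // algorithm B

    return result->get_future().get();  // ready as soon as either finishes
}
```

(In real code you would stop the losing thread instead of detaching it, but the point stands: the caller only waits for the faster of the two.)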
So how do we fix quicksort’s worst case? Eh, just interleave it with merge sort! Job done: linearithmic worst case. True, but let’s start thinking like a practical computer scientist again. Merge sort is actually pretty much a disaster on typical computer architectures, because it needs a linear amount of scratch memory to merge into. We probably want an in-place sorting algorithm. But now we have another problem: if both algorithms are going to shuffle the same list, at least one of them will have to work on a copy, which is expensive and certainly not in-place.
Quicksort’s problem is its worst case, but how does that actually come about? If we’re unlucky, it repeatedly partitions the data poorly and ends up recursing lots and lots of times on a part of the list that only gets a little bit smaller each time.
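To see just how bad this gets, here is a deliberately naive quicksort – toy code of mine, committing the classic sin of always pivoting on the first element – let loose on input that is already sorted:

```cpp
#include <cstdio>
#include <utility>
#include <vector>

long comparisons = 0;  // global counter, purely for the demonstration

// Deliberately naive quicksort: always pivots on the first element.
void naive_quicksort(std::vector<int>& a, int lo, int hi) {
    if (hi - lo <= 1) return;
    int pivot = a[lo];
    int i = lo;  // a[lo+1..i] holds the elements smaller than the pivot
    for (int j = lo + 1; j < hi; ++j) {
        ++comparisons;
        if (a[j] < pivot) std::swap(a[++i], a[j]);
    }
    std::swap(a[lo], a[i]);          // pivot lands at position i
    naive_quicksort(a, lo, i);       // empty on sorted input...
    naive_quicksort(a, i + 1, hi);   // ...while this part shrinks by only 1
}

int main() {
    int n = 10000;
    std::vector<int> a(n);
    for (int i = 0; i < n; ++i) a[i] = i;  // already sorted
    naive_quicksort(a, 0, n);
    std::printf("%ld comparisons for n = %d\n", comparisons, n);
    // prints 49995000 = n(n-1)/2: each of the n levels of recursion
    // only peeled a single element off the problem.
}
```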
One way to address this problem would be to spend some time finding a good pivot, that is, a good element to partition the list on. But we can’t spend a lot of time on that, because – and I’m quoting Musser here – heapsort is only about 2 to 5 times slower than quicksort on average. Heapsort already has the linearithmic worst case we want; quicksort only beats it on constant factors. So to keep quicksort’s advantage, we can’t really add much of anything to it (in the typical case).
So here’s Musser’s idea. We just let quicksort do its thing unimpeded – going on its merry quick way – except for one thing: we keep track of how deep into the recursion we have gone. If that gets out of hand, we switch to heapsort to finish that part of the list in linearithmic time. Here we use the fact that quicksort partitions the list on its way down into the recursion: any part of the list that a particular quicksort call is working on is already in the correct place as a whole; it is just not internally sorted yet. Switching algorithms all of a sudden – on this part of the list – does not disrupt the rest of the algorithm. Per Musser, we pay about a factor 2 to 5 on average by switching to heapsort, but we cut off the disaster case.
The worst case runtime of this combined algorithm can clearly be made linearithmic by picking the recursion threshold to be $O(\log n)$. Musser picks $2\lfloor \log_2(n)\rfloor$ and presents experimental evidence that introsort is almost as fast as quicksort in almost all cases. And indeed, it kind of must be: quicksort’s disaster case is rare (otherwise its average case would not be good), so we rarely have to fall back to heapsort. This entire trick – the entirety of introsort! – is meaningless in terms of big-Oh worst case: you could just use heapsort from the start. (Or, heaven forbid, a version of quicksort where the pivot is selected by Blum et al.’s linear-time algorithm for the median.2) What introsort does in practice is allow us to get almost the entire good case of quicksort most of the time, paying a little bit of an insurance fee in order to cover the occasional disaster.
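Here is a minimal sketch of the whole construction, assuming a plain vector of ints; the names (introsort_impl and friends) and the median-of-three pivot are my own choices, and a real library version is considerably more refined:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Minimal introsort sketch: quicksort with a depth budget; when the
// budget is exhausted, the current subrange a[lo..hi) is finished off
// with heapsort and the rest of the recursion proceeds undisturbed.
void introsort_impl(std::vector<int>& a, int lo, int hi, int depth) {
    while (hi - lo > 1) {
        if (depth == 0) {  // recursion got out of hand: switch to heapsort
            std::make_heap(a.begin() + lo, a.begin() + hi);
            std::sort_heap(a.begin() + lo, a.begin() + hi);
            return;        // this subrange is now fully sorted
        }
        --depth;
        // Median-of-three pivot: a cheap guard against terrible pivots.
        int mid = lo + (hi - lo) / 2;
        int pivot = std::max(std::min(a[lo], a[mid]),
                             std::min(std::max(a[lo], a[mid]), a[hi - 1]));
        // Three-way split: [ < pivot | == pivot | > pivot ]. The middle
        // block is nonempty (the pivot is an actual element), so both
        // remaining subproblems are strictly smaller.
        auto p = std::partition(a.begin() + lo, a.begin() + hi,
                                [&](int x) { return x < pivot; });
        auto q = std::partition(p, a.begin() + hi,
                                [&](int x) { return !(pivot < x); });
        // Recurse on the right part, loop on the left: bounds stack use.
        introsort_impl(a, int(q - a.begin()), hi, depth);
        hi = int(p - a.begin());
    }
}

void introsort(std::vector<int>& a) {
    // Musser's depth budget: 2 * floor(log2(n)).
    int depth = a.size() > 1 ? 2 * int(std::log2(a.size())) : 0;
    introsort_impl(a, 0, int(a.size()), depth);
}
```

Note how the heapsort fallback only touches the subrange of the one call that overran its budget; everything outside it is handled by the ordinary quicksort recursion, exactly as described above.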
So what are the calls to __insertion_sort? Maybe you can think about that for a second. Insertion sort is quick on almost-sorted lists.
Where does insertion sort go, in … (gestures broadly) … this thing?
Trying to detect whether the array is almost sorted is probably a bad idea, I think: that sounds like it takes a bunch of time that can end up being for nothing. Here is an idea – and this is not actually Musser’s; he calls it “one of the usual optimizations of quicksort” – we will ignore small subproblems in the recursion and clean them up with insertion sort at the end!
We already observed that quicksort’s partition step puts things in approximately the right place, in the sense that it puts them on the correct side of the pivot. At first that does not say a lot, but as we go further down the recursion, things get closer and closer to where they should be. In a quicksort call on a certain span of positions, all of the elements currently there are going to stay in that interval of positions. So let’s do the following: if the interval for a quicksort call is below a certain length threshold, we do … nothing. We just leave it unsorted.
But then the list is not sorted. So all the way at the end – after the entire quicksort/heapsort hybrid thing – we call insertion sort. Once, on the entire list, which is almost sorted, so that’s super fast. If we ended up sorting a particular part of the list with heapsort, that part is actually sorted and insertion sort runs right past it. And any parts where the quicksort recursion bailed are smaller than the size threshold, so nothing is far from where it should be.
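Continuing the sketch from before: all we need is the insertion sort itself, plus a top-level driver. The driver assumes introsort_impl’s loop condition has been changed from hi - lo > 1 to hi - lo > kCutoff, and the cutoff of 16 is a made-up placeholder, not a recommendation:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

void introsort_impl(std::vector<int>&, int, int, int);  // from the sketch above

// Plain insertion sort: linear on almost-sorted input, which is exactly
// what the truncated quicksort recursion leaves behind.
void insertion_sort(std::vector<int>& a) {
    for (std::size_t i = 1; i < a.size(); ++i) {
        int x = a[i];
        std::size_t j = i;
        while (j > 0 && a[j - 1] > x) {  // shift larger elements one right
            a[j] = a[j - 1];
            --j;
        }
        a[j] = x;  // drop x into place
    }
}

const int kCutoff = 16;  // placeholder size threshold; libraries tune this

void introsort_with_cleanup(std::vector<int>& a) {
    int depth = a.size() > 1 ? 2 * int(std::log2(a.size())) : 0;
    // With the loop condition hi - lo > kCutoff, subranges of at most
    // kCutoff elements are simply left alone by the recursion...
    introsort_impl(a, 0, int(a.size()), depth);
    // ...so afterwards no element is more than kCutoff positions away
    // from its final place, and one cheap pass finishes the job.
    insertion_sort(a);
}
```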
Of course now we need to tune this size threshold – and actually we also still need to tune the recursion threshold! Unfortunately, Musser doesn’t seem to mention the size threshold. But then again, it is a “usual optimisation”, so you will be able to find good advice if you look for it. In practice, it is not even your problem: the standard library will contain a tuned implementation that you can just use.
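Concretely, sorting in C++ is a one-liner, with all of the above hidden behind it:

```cpp
#include <algorithm>
#include <vector>

int main() {
    std::vector<int> v = {3, 1, 4, 1, 5, 9, 2, 6};
    std::sort(v.begin(), v.end());  // the tuned hybrid, thresholds and all
}
```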
I still have not told you how to sort a million 32-bit integers. Introsort seems like a reasonable choice, but if performance is extremely important we need to get more specific about the data and the particular hardware. There are explicitly cache-aware sorting algorithms, algorithms that try to space out their reads and comparisons in order to compensate for memory latency in a pipelined processor – all kinds of cool stuff, but I’ll leave it here for now.