Faster Sorting Beyond DeepMind’s AlphaDev

Technical Report, 6 September 2023

ABSTRACT
‍
In spring 2023, DeepMind published improvements to the sorting functions used in standard C and C++ libraries, obtained using deep reinforcement learning. The discovered sorting routines for small sequences yielded speedups by factors of 3 to 5 [Mankowitz, D.J., Michi, A.,Zhernov, A. et al. Faster sorting algorithms discovered using deep reinforcement learning. Nature 618, 257–263 (2023). https://doi.org/10.1038/s41586-023-06004-9]. The results represent a remarkable milestone as they promise to reduce power consumption of a wide range of applications using sorting algorithms. Furthermore, the publication illustrates that AI methods can improve upon the work of human software engineers.
‍
Building upon these results, mimicry discovered further improvements, with shorter code sequences for some of these functions and further speedups by factors up to 3.
‍
We combined mimicry’s unique function discovery technologies, a broader set of CPU instructions, and different algorithmic approaches established using human guidance. Our results illustrate the advantages of pairing human ingenuity with AI-based tooling in creating better software and finding novel routines.

In this report, we provide an overview of our contributions, and outline how they have been obtained. Section 3 describes the novel sorting algorithms, with Sections 4 and 5 detailing how they can efficiently be implemented on x86-64 processors without or with SIMD instructions. Section 6 provides measurement results. We outline potential further work in Section 7 and provide conclusions in Section 8.

(1) OUR CONTRIBUTIONS
‍
We discovered a different algorithmic approach to sorting small arrays. Our sort algorithms identify a shuffle vector for a given short vector of values. The shuffle vector provides an ordering of indices and is applied to the original vector. This approach contrasts with other sorting algorithms, including the one used by AlphaDev, which consider pairs of values, and swap them to establish a sort order.
‍
For sorting 3 elements, we discovered a Sort3 algorithm using SIMD-instructions that achieves 13-41% shorter execution time (depending on microarchitecture, excluding Zen+, on which it is running 9% slower as discussed in Section 6), while using only 9 instead of 17 instructions. Also, we discovered a 14-instruction Sort3 algorithm for unsigned values using scalar, non-SIMD-instructions that executes 17-29% faster (all microarchitectures, including Zen+), as well as a 17-instruction algorithm for signed values using scalar, non-SIMD instructions that is as fast as or up to 24% faster as AlphaDev’s algorithm (all microarchitectures).
‍
Furthermore, we improved the performance of Sort4 algorithms. Our discovered algorithm using SIMD instructions includes 12 instructions, running 43-69% faster than the 28-instruction algorithm identified by AlphaDev (depending on microarchitecture, and including Zen+). Our Sort4 algorithm for unsigned values using scalar, non-SIMD instructions executes 4-31% faster, while only requiring 23 instructions. For sorting signed values, we discovered a 27-instruction algorithm that uses scalar, non-SIMD instructions and runs between 10% slower and up to 24% faster than AlphaDev’s Sort4 algorithm.

(2) OUR APPROACH
‍
Our search for even faster sorting functions started by considering how more parallelism could be exploited, be it for use with SIMD instructions or for higher IPC (instructions per cycle) capabilities available on recent microarchitectures. In this context, we reframed the sorting problem as calculating a shuffle vector containing indices, which will yield the vector with ordered elements when applied to the original vector. Applying such shuffles is supported by single SIMD machine instructions to reorder elements in a vector, or by storing individual values with different indices that are applied in the addressing mode when using scalar instructions. Thus, the problems remain to determine how to efficiently calculate an appropriate shuffle vector given an unsorted vector, and how to best map the calculations onto the CPU- and micro-architecture. In particular, we used mimicry’s function finder technology to identify and validate such shuffle calculation functions. Further, we applied technologies derived from advanced compilers for mapping such functions onto the architecture.
‍
We suppose that AlphaDev may have identified similar functions if applied to this algorithmic problem. Its ingoing solution space provided for exploring any order of compare, move, and conditional move instructions, which limits solutions to algorithms that move individual elements based on comparison results. It remains to be seen whether AI methods may discover structurally novel algorithmic solutions, as it was done by our engineers.

(3) THE NOVEL ALGORITHMIC SOLUTION
‍
A shuffle vector contains a set of indices for reordering a data vector. There are two types of shuffle vectors, both used in our algorithms: Write shuffle vectors specifying at which index a given element should be written, and read shuffle vectors determining from which index a given element should be read. The function wshuf(v, vs) implementing a write shuffle operation with v being the data vector and vs being the write shuffle vector would thus execute

dest[vs[i]] = v[i], for all i
‍
while rshuf(v, vs) implementing a read shuffle operation would execute

dest[i] = v[vs[i]], for all i.

A simple (but generally not efficient) way to calculate a write shuffle for a vector is to count for each element the number of elements that should be stored before this element according to the sorting criteria, e.g. the number of elements that are less than this element.

If there are multiple elements with the same value in the source vector, they will receive the same index. To make the result usable for sorting, it must be distinguished between source elements before the current element, and those after, such that a non-repeating sequence of indices to write emerges. E.g. one may count the number of elements, for which the number of elements being less or equal to the current element are counted, and those after the current for which only those being less are counted¹. The algorithm would thus be:

for all elements v_i: vs_i = count(v_j ≤ v_i, for all j < i) + count(v_j< v_i, for all j > i).
‍
The number of comparisons for a vector of length n would be n*(n-1). This can be reduced by a factor of two, as elements are compared pairwise twice.

Our Sort3 function thus looks as follows:
(1) vs₂ = count(v₀ ≤ v₂,v₁ ≤ v₂)
(2) c = v₀ < v₁
(3) vs₀, vs₁ = map(vs₂, c)
(4) v = wshuf(v, vs)

Note how vs₀ and vs₁ are fully determined by vs₂and c: c provides which of v₀ or v₁ should be at the lower index and at the upper index, respectively. Depending on vs₂, the lower and upper indices are (1,2), (0,2), or (0,1). As the calculation cannot be expressed with one or two machine instructions, mimicry’s function finder has opted to implement the function using a mapping table indexed by vs₂ and c.

Sort4 is analogous, with vs₃ assuming the role of vs₂ above, and c becoming a bit vector of multiple comparison results (namely all comparisons in Sort3). It thus looks as follows:
(1) vs₃ = count(v₀ ≤ v₃, v₁ ≤ v₃, v₂ ≤ v₃)
(2) c = v₀ < v₁ | v₁ < v₂ | v₂< v₀
(3) vs₀, vs₁, vs₂ = map(vs₃, c)
(4) v = wshuf(v, vs)

¹Mimicry’s function finder has sometimes selected the opposite order in the presented implementation at instruction level. The choice neither affects the result nor performance, and may change with future versions of function finder, or with additional features how human engineers can influence its behavior. We note that from a theoretical point of view the definition presented above is preferable, as it results in a stable sort.

(4) MAPPING THE ALGORITHM TO THE X86-64 ARCHITECTURE WITH SCALAR INSTRUCTIONS
‍
For the implementation in Sort3mu (mimicry’s Sort3 implementation for unsigned values with 14 instructions), function finder identified a couple of unexpected machine instructions in the x86-64 architecture reducing code size and improving speed. The usual implementation of counting comparison results using a cmp (compare) instruction, followed by a setcc (set byte on condition code) instruction and an add instruction, can be reduced to the cmp instruction followed by adc (add with carry) instruction. For unsigned values, the carry flag represents whether the first operand is below the second operand of the comparison instruction. To establish a proper count, the corresponding count register needs to be initialized to 0. The counter initialization operation can be merged with the first counting by using a subtraction with borrow (sbb) instruction instead: sbb r_i, r_i sets r_i to 0 if the carry flag is not set, and to -1 if the carry flag is set. The resulting count (with later adc operations) is thus offset by -1. Note that to invert the sign in the count, the comparison parameters for the first comparison need to be exchanged.
‍
The calculation of vs₂ for Sort3mu is:
(1) cmp v2, v0 // note: swapped arguments
(2) sbb vs2, vs2
(3) cmp v1, v2
(4) adc $0, vs2 // vs2 has a range -1..1

Picture 1 shows the complete algorithm for Sort3mu with 14 scalar instructions, and the mapping tables for vs₀ and vs₁.

Picture 1: mimicry’s scalarSort3 algorithm for unsigned values

Sort4mu is similar to Sort3mu. Function finder represented the concatenation of the different comparison results for c by shifting the previous result 1 bit to the left. The shift can be merged into the adc operation by adding the previous result instead of 0 as in Sort3, i.e. c is determined by adc c, c as c = c_old+c_old+carry, which equals c = c_old*2+carry or c = (c_old<<1)+carry.

For sorting signed values, there are no operations that extract the comparison result and merge it into another value in a single operation. The setcc instruction, which can set a byte based on any comparison result, requires an additional instruction to combine it with previous comparison results. As this overhead would been countered for every comparison (i.e. 6 times for Sort4), a better solution is to map the int32 range to the uint32 range for comparison, and using the above algorithm. Adding -MIN(int32) to every v_i and keeping that value for comparisons achieves the mapping. Note that the comparison value is separate from v_i, and would ordinarily be computed by moving v_j into another register and then adding MIN(int32). Function finder has used lea instructions instead, which provide generalized addition operations that allow the destination operand to differ from any of the source operands.

Picture 2 shows the complete algorithm for Sort4ms, including the mapping from int32 value range to uint32 value range.

Picture 2: mimicry’s scalar Sort4 algorithm for signed values

(5) USING X86-64 SIMD INSTRUCTIONS
‍
The above algorithm can also be implemented using SIMD instructions. We limited the scope to the widely available AVX2, AVX and SSE instructions, and ignored AVX-512 extensions available only on some high-end processors. The SIMD instructions allow multiple values to be loaded or stored at once, multiple comparisons performed at once and values be shuffled with a single instruction. The index used in the mapping function can be extracted with the MOVMASK operation, providing a mask with each bit corresponding to a comparison result between two values. The respective SIMD sorting algorithm is

(1) vc = load vector v
(2) vc’ = shuffle vector vc
(3) c = parallel compare vc, vc’
(4) mask = movmask(c)
(5) vc = rshuffle(vc, map(mask))
(6) v = store vc

One difference to the scalar algorithm is that shuffle operations in the x86-64 SIMD architecture implement read shuffles. It is thus not possible to compute a part of the shuffle by counting the number of elements that are less than a selected element. However, for Sort3, a 3-bit mask created from three comparisons, and for Sort4, a 6-bit mask created from six comparisons includes all information needed to select a shuffle which puts the elements in order.

Picture 3 shows the complete SIMD Sort4 algorithm. We omitted the large mapping table from this paper, which includes 64 entries in the 16-byte shuffle format directly usable with the SIMD shuffle instruction. Working code is available from our repository: https://github.com/mimicry-ai/sort.

Picture 3: mimicry’s Sort4 algorithm with SIMD instructions

We obtained very high performance for Sort4, twice as fast as the best scalar implementation. For Sort3, the results are less attractive, as the available instructions to load and store parts of vectors (e.g. 3 values instead of 4) are relatively slow.

(6) MEASUREMENT RESULTS
‍
Table 1 lists the measured performance for our Sort functions as compared to the original libc++ implementation and AlphaDev’s routines². In our measurements, we executed 2bn sort operations with randomized short vectors, on four different microarchitectures including Intel Sandy Bridge, Intel Coffee Lake, AMD Zen+, and AMD Zen2. We averaged the execution time per sort operation, and we took the geometric mean of the relative speedup factors over the original libc++ function. With AlphaDev providing 5.32 times faster Sort3 performance in our measurements, our functions further enhanced it to a factor of 5.97 to 6.92. For Sort4, AlphaDev improved upon the original libc++ function by a factor of 4.25, which we accelerated further to factors between 4.66 and 10.03.

Table 1: Observed performance of Sort3 and Sort4 algorithms

One remarkable finding was the performance of SIMD versions of Sort3. While on some microarchitectures, it achieved a speedup factor between 6.7 and 7.3, faster than any other algorithm, on the Zen+ microarchitecture it ran only 5 times faster than the original libc++ function, compared to a factor of 5.4 for AlphaDev’s function and factors between 7.2 and 7.7 for mimicry’s scalar Sort3 functions. We attribute this weak performance to particularly slow instructions for reading and writing partial vectors (VPMASKMOV) on Zen+. We did not investigate other approaches for reading and writing partial vectors, such as with VPINSRD and VPEXTRD instructions.
‍
We also found the SIMD version of Sort4 running 18-20% faster than the SIMD version of Sort3 on all tested microarchitectures except for Zen+, and 56% faster on Zen+, despite encompassing less work. It illustrates the relatively high cost of masking out parts of SIMD vectors when loading and storing them on current microarchitectures.
‍
²We consider the comparison to the original libc++ std::sort routine as not completely fair, as that function handles variable-length vectors of any size. The code generated by clang version 10.0.0-4 with option -O3, however, does not include instructions to handle variable length arrays, but only small overhead to call the std::sort function and register spill code, which does not distort the measurement to a large extent. We kept it for lack of a better reference point, noting that it does not affect the relative performance figures between the various optimized functions.

(7) FUTURE WORK
‍
We suppose the outlined methods could be applied to longer vectors. One particular concern may become the size of the lookup tables, with their size growing exponentially in the number of needed comparisons, which grows quadratically with vector length n. Denser representations may provide some relief, at the cost of a few additional instructions to unpack the data. Additionally, splitting longer vectors into two shorter sort operations and then merging the results may become an attractive solution.

How variable length vector sorting can be combined with the methods mentioned above is something that remains to be investigated.
‍
It would be interesting to measure how the use of above algorithms could improve sorting routines in general, such as in standard libraries.
‍
On a larger scale, it is an open question on whether algorithmic improvements such as those described here may have been found by an AI on its own. Related to this question, assuming human ingenuity combined with AI methods provides better results than either on its own, we see great value in finding ways how such joint human/AI work can be performed

(8) CONCLUSIONS
‍
AlphaDev’s as well as our results have shown that significant improvements in high-quality software that has not been touched for many years are possible. This work focused on performance and related power consumption, but improvements in other areas such as scalability, maintainability, security, or resource consumption may well be within reach. AI methods have been found invaluable tools in discovering and exploring such improvements. We expect such methods combined with human ingenuity unlocking great innovation potential as methods mature and software engineers determine how to make best use of them.
‍
The source code for mentioned algorithms is available open source under Apache 2.0 license at https://github.com/mimicry-ai/sort.

Download Tech Report