Branchless Quicksort faster than std::sort and pdqsort with C and C++ API

Posted by chkas@reddit | programming | View on Reddit | 4 comments

[-]

sheppyrun@reddit

i've looked at a lot of branchless sort implementations and the pattern is usually the same. the compiler can vectorize the inner loop, which is where most of the speedup comes from. modern x86 branch predictors are already good enough that mispredictions aren't the bottleneck in most cases. what matters more is whether the speedup holds on random input vs nearly-sorted, because the memory access pattern changes completely. on arm the gap can be dramatic since the branch predictor architecture is different. i'd like to see numbers on both.

[-]

matthieum@reddit

modern x86 branch predictors are already good enough that mispredictions aren't the bottleneck in most cases

Wouldn't mispredictions be the norm on random data?

[-]

Determinant@reddit

No, if it's mispredicting every branch then it accidentally learned the negated version of the pattern. High quality random data should turn the branch predictor into a coin toss with 50% accuracy.

[-]

chkas@reddit (OP)

Thanks for the feedback, but a few things work differently here:

@Vectorization: Compilers cannot auto-vectorize standard Quicksort partitioning due to data-dependent pointer movements. The speedup comes purely from avoiding pipeline stalls.

@Branch Predictors: Modern predictors are great, but with completely random data, the branch is a 50/50 chance. There is no pattern to learn.

@Benchmarks: I actually included both architectures in the post. The table shows the numbers for both the Apple M1 (ARM) and the AMD Ryzen (x86).