Branchless Quicksort faster than std::sort and pdqsort with C and C++ API
Posted by chkas@reddit | programming | View on Reddit | 4 comments
Posted by chkas@reddit | programming | View on Reddit | 4 comments
sheppyrun@reddit
i've looked at a lot of branchless sort implementations and the pattern is usually the same. the compiler can vectorize the inner loop, which is where most of the speedup comes from. modern x86 branch predictors are already good enough that mispredictions aren't the bottleneck in most cases. what matters more is whether the speedup holds on random input vs nearly-sorted, because the memory access pattern changes completely. on arm the gap can be dramatic since the branch predictor architecture is different. i'd like to see numbers on both.
matthieum@reddit
Wouldn't mispredictions be the norm on random data?
Determinant@reddit
No, if it's mispredicting every branch then it accidentally learned the negated version of the pattern. High quality random data should turn the branch predictor into a coin toss with 50% accuracy.
chkas@reddit (OP)
Thanks for the feedback, but a few things work differently here:
@Vectorization: Compilers cannot auto-vectorize standard Quicksort partitioning due to data-dependent pointer movements. The speedup comes purely from avoiding pipeline stalls.
@Branch Predictors: Modern predictors are great, but with completely random data, the branch is a 50/50 chance. There is no pattern to learn.
@Benchmarks: I actually included both architectures in the post. The table shows the numbers for both the Apple M1 (ARM) and the AMD Ryzen (x86).