A Multi‑Threaded Branchless Quicksort in C

[-]

chkas@reddit (OP)

The logic in your post is misleading. You call the stores "redundant," but in modern architecture, those stores are exactly what buy you performance. Trading one store cycle for a potential 20-cycle branch misprediction penalty is a massive win, not a waste.

Regarding your critique of macros like IS_LOWER: these are standard idioms used to swap comparison logic easily. If you think a simple inline function or standard code is enough, you are free to try and debunk this by writing a Quicksort that matches these speeds on random data without using branchless techniques. You will find that no matter how much you inline, the branch predictor will hit a wall that only branchless logic can break through, as you are simply arguing against the reality of how modern CPUs actually execute code.

The accusation regarding SEO is merely intended to distract from the real issue. These aren't theoretical changes - they are fundamental optimizations. Even in small loops where vectorization isn't the main factor, the branchless version consistently outperforms the conditional one because it prevents pipeline stalls.

Furthermore, with larger loops, branchless code allows the compiler to safely use SIMD instructions, which is often impossible with an if-block. On an Apple M1, the difference becomes massive. I’ll let the code and the numbers speak for themselves.

// SPDX-License-Identifier: MIT
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

#define MAXSIZE 100000

int numbers[MAXSIZE];
int small_numbers[MAXSIZE];

void init(void) {
    for (int i = 0; i < MAXSIZE; i++) {
        numbers[i] = rand() % 1000;
    }
}
double t(void) {
    static double t0;
    struct timeval tv;
    gettimeofday(&tv, NULL);
    double h = t0;
    t0 = tv.tv_sec + tv.tv_usec / 1000000.0;
    return t0 - h;
}
void test_if(int size) {
    for (int i = 0, j = 0; i < size; i++) {
        if (numbers[i] < 500) {
            small_numbers[j] = numbers[i];
            j += 1;
        }
    }
}
void test_bl(int size) {
    for (int i = 0, j = 0; i < size; i++) {
        small_numbers[j] = numbers[i];
        j += (numbers[i] < 500);
    }
}

int main(void) {
    init();
    t();
    for (int i = 0; i < 1000000; i++) test_if(1000);
    printf("%.3fs\n", t());
    for (int i = 0; i < 1000000; i++) test_bl(1000);
    printf("%.3fs\n", t());
    for (int i = 0; i < 1000; i++) test_if(100000);
    printf("%.3fs\n", t());
    for (int i = 0; i < 1000; i++) test_bl(100000);
    printf("%.3fs\n", t());
    return 0;
}

Results:

test_if(1000): 0.660s
test_bl(1000): 0.322s

test_bl(100000): 0.315s
test_bl(100000): 0.036s

[-]

Embarrassed-Media-62@reddit

Thanks ChatGPT

[-]

chkas@reddit (OP)

The "making it branchless" seems to add redundant stores to the code to remove the conditionals. Which is going to be faster on some systems and slower on others.

It runs faster on all modern CPUs.

I don't get why people write macros like "IS_LOWER(a,b)".

The reason for this is that this code can also be applied to types that cannot be compared using the < operator. That is why there is also the definition #define TYP int. This is C, not C++.

[-]

happyscrappy@reddit

It runs faster on all modern CPUs.

To make such an assertion you would have to test it first, wouldn't you?

And it's not even true. Compile this for a microcontroller and you'll find the opposite. Try a Cortex-M or a low-end RISC-V and you'll see rapidly that you assumed instead of knowing.

This is C, not C++.

You can use inline functions. It'll restrict your compatibility to only compilers written in the past 40 years though. Since you're already using // comments I expect that won't present much of an issue.

[-]

chkas@reddit (OP)

Yes, branchless programming isn't really practical on MCUs, but the article isn't aimed at MCUs anyway. After all, it's also talking about threads.

... and not a<b ...

How do you do that with an inline function?

[-]

happyscrappy@reddit

Yes, branchless programming isn't really practical on MCUs, but the article isn't aimed at MCUs anyway. After all, it's also talking about threads.

Branchless programming is practical on MCUs. Just not the tradeoff you made. Cortex-Ms have conditional execution on instructions. They thrive on branchless and the compiler will put in branchless code for you.

// Type your code here, or load an example.
int findthehighervalue(int a, int b)
{
    if (a < b)
    {
        return a;
    }
    else
    {
        return b;
    }
}

::

findthehighervalue(int, int):
        cmp     r0, r1
        movge   r0, r1
        bx      lr

(actually ARM7-A as godbolt doesn't do ARM7-M. The code would be the same in this case)

How do you do that with an inline function?

(me) I don't get why people write macros like

I feel like I expressed myself well enough.

[-]

chkas@reddit (OP)

The inline keyword is only a hint, not a command.

[-]

happyscrappy@reddit

Absolutely. But there's no command in macros either. If you subscribe to the idea that the compiler can do whatever it wants regardless of your indications then you gotta follow that through, right?

There is an opposite of inlining called outlining and compilers can do it any time they want.

It doesn't have to implement your conditions as branches. And it doesn't have to implement your "branchless" code without branches. It'll do it the most efficient way it can. What the poster really should have emphasized is he wrote a different algorithm that intentionally does redundant stores, it makes some stores non-conditional. Compilers are more reticent to add and remove stores than to change program flow so he got a particular compiler he used to issue different (straight line) code on one processor doing it this way. Whereas on ARMv7 for example it would already have been straightline code regardless as the architecture has conditional execution.

[-]

chkas@reddit (OP)

Compilers don't just "outline" tiny sorting networks for fun - also not at -O3. Using macros just ensures the code is right there for the optimizer to work with.

As for the "redundant stores": that’s the whole point. We are trading a predictable store for a messy branch that would otherwise stall the pipeline. Mentioning ARMv7 is a bit of a stretch here - on modern chips like the M1, you have to be very specific with your C code if you want the compiler to actually use CSEL or CMOV.

The results in the write-up show how much this helps. Feel free to compile it yourself - you should see a similar speedup on Intel x64 systems too. At the end of the day, intentional branchless code is a standard trick in high-performance libs for a reason.

[-]

happyscrappy@reddit

Compilers don't just "outline" tiny sorting networks for fun

I didn't say anything about for fun. Reductio ad absurdum.

also not at -O3

First you don't say what you were targeting, now you want to introduce compiler options you didn't list too. Why?

Using macros just ensures the code is right there for the optimizer to work with.

There's no good reason to use the macro. The code is right there as long as it is in the same compilation unit. And it is.

Mentioning ARMv7 is a bit of a stretch here - on modern chips like the M1, you have to be very specific with your C code if you want the compiler to actually use CSEL or CMOV.

There are still a crapton of ARMv7s in this world and ARMv8s running v7 code (Raspberry Pis, though they are moving people away from this now for big RAM configs).

And the compilers use CSEL or CMOV when it is efficient. It's not that hard to get a compiler to emit straight line code when it is efficient. You're just hung up on this idea of extra stores. No, you can't get a compiler to create extra stores. There's a reason for that.

void thirdthing(int *ap, int *bp)
{
    for (int i = 0; i < 1000; i++)
    {
        int a = ap[i];
        int b = bp[i];
        int c = b;
        if (a < b)
        {
            c = a;
        }
        ap[i] = c;
    }
}

}

That produces CSEL. if you use the ternary operator it'll actually vectorize the code. But if you make the stride other than 1 to avoid that then it uses CSEL.

The results in the write-up show how much this helps

The code you are comparing does more than just remove branches or add extra stores. You have 3 special cases for sorting.

[-]

chkas@reddit (OP)

Theoretical compiler behavior is one thing; actual benchmarks are another.

To test your claim that the compiler "just handles it," I ran this minimal test on the M1 (Clang -O3). It compares a standard if against the "extra store" branchless version you criticized:

// SPDX-License-Identifier: MIT
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

#define SIZE (50 * 1000000)

int numbers[SIZE];
int small_numbers[SIZE];

void init(void) {
    for (int i = 0; i < SIZE; i++) {
        numbers[i] = rand() % 1000;
    }
}
double t(void) {
    static double t0;
    struct timeval tv;
    gettimeofday(&tv, NULL);
    double h = t0;
    t0 = tv.tv_sec + tv.tv_usec / 1000000.0;
    return t0 - h;
}
void test() {
    for (int i = 0, j = 0; i < 1000; i++) {
        if (numbers[i] < 500) {
            small_numbers[j] = numbers[i];
            j += 1;
        }
    }
}
void testbl() {
    for (int i = 0, j = 0; i < 1000; i++) {
        small_numbers[j] = numbers[i];
        j += (numbers[i] < 500);
    }
}

int main(void) {

    init();
    t();
    for (int i = 0; i < 1000000; i++) test();
    printf("%.3fs\n", t());
    for (int i = 0; i < 1000000; i++) testbl();
    printf("%.3fs\n", t());
    return 0;
}

Results: Standard if: 0.621s Branch-free: 0.322s

The compiler had every chance to emit CSEL for the first version, but it didn't. The result is a 2x performance gap because the hardware handles the "extra store" much better than a mispredicted branch.

This isn't about ARMv7 or "fun" - it's about the fact that intentional branchless code still beats compiler heuristics in high-performance paths. If you can get the if version to match 0.322s, I’d love to see the flags.

[-]

happyscrappy@reddit

"criticized", huh?

I'm sorry, did I get something wrong? You can't take constructive criticism? Your change will be better on some architectures and worse on some others. Do you actually have a counter to this or are you just going to go to "works on my machine" over and over?

I never said the compiler will "just handle it" for changes which involve adding stores. In fact, I said it is unlikely to change one into the other by adding stores because compilers are unlikely to assume adding stores is a valid optimization. Because changing the number of stores is more likely to change the side effects it is not likely to do so. Even though for sufficiently self-contained programs it can be shown that it is valid to do so.

I changed your code to only test actually straightlining code by changing test() to this:

void test() {
    for (int i = 0, j = 0; i < 1000; i++) {
        small_numbers[j] = numbers[i];
        if (numbers[i] < 500) {
            j += 1;
        }
    }
}

And I get these results on my machine (Apple M4):

$ ./a.out          
0.497s
0.501s

Successive runs produce these results:

$ ./a.out
0.501s
0.498s

$ ./a.out
0.496s
0.496s

And I reversed the calls to testbl() and test() to try to give the "cache advantage" to the non-tweaked code.

Why are the results the same? Because the functions do the same thing:

_test:
0000000100000638        mov     x8, #0x0
000000010000063c        mov     w9, #0x0
0000000100000640        adrp    x10, 8 ; 0x100008000
0000000100000644        add     x10, x10, #0x8
0000000100000648        adrp    x11, 48836 ; 0x10bec4000
000000010000064c        add     x11, x11, #0x208
0000000100000650        ldr     w12, [x10, x8]
0000000100000654        str     w12, [x11, w9, uxtw #2]
0000000100000658        cmp     w12, #0x1f4
000000010000065c        cinc    w9, w9, lt
0000000100000660        add     x8, x8, #0x4
0000000100000664        cmp     x8, #0xfa0
0000000100000668        b.ne    0x100000650
000000010000066c        ret

testbl has the same code, of course. As you can guess from what these instructions do.

Showing once again that it's really not hard to get a compiler to produce straight line code, despite what you said.

This isn't about ARMv7 [..] - it's about the fact that intentional branchless code still beats compiler heuristics in high-performance paths. If you can get the if version to match 0.322s, I’d love to see the flags.

It isn't about ARMv7? Really, where in your link did it say that? You're just saying this because you wrote a different algorithm that is faster on some architectures and slower on others.

As to "fun". You bright up "fun", not me. I found it odd. Now you say it's odd. At least we agree on one thing I suppose.

If you can get the if version to match 0.322s, I’d love to see the flags.

Holy cow dude. You know that's just your machine, right? Not even just your arch, but your machine. No code gets that speed on my machine, despite having a similar arch.

If there's a message here, it's not "ignore stores, remove your ifs". Ifs to not determine whether your code is branchless or not. The message is that if you really only have one machine you care about in terms of performance then consider that machine when designing your algorithm. And that's what you've done. You've created an alternate algorithm that speeds up execution on your machine, as you've measured.

Btw, I hate that t() function. The 2nd task to be measured is going to end up eating a portion of the execution of printf() because t() is called before printf() is called. On my Raspberry Pi3 (running as 64-bit OS) if I run the conditional store version second it adds a HALF SECOND to the time it is scored with. But weirdly if I run the always store case second it does not.

I changed it to be like this:

    t();
    for (int i = 0; i < 1000000; i++) test();
    double t1 = t();
    t();
    for (int i = 0; i < 1000000; i++) testbl();
    double t2 = t();
    printf("%.3fs\n", t1);
    printf("%.3fs\n", t2);
    return 0;

In the below numbers, test() and testbl() are your original implementations. I consistently see the reps of test() taking about 8.1s and testbl() about 3.4s. But if I remove the second bare call to t() after the double t1 = t() then I see the test() reps consistently taking 8.6s. It's absolutely reproducible. Even if I reverse the test() and testbl() calls the extra time ends up on test() or nowhere, never on testbl(). That doesn't make any sense.

The calls to testbl() are always quicker, a lot quicker. But the difference from making a change which should do nothing significant scares the heck out of me. What is causing this and why? I tried making some more changes and I can "scare the half second away", but I can't find out why it is there in some cases.

Okay, I'm done with this. I don't care enough anymore to chase any of this around.

[-]

chkas@reddit (OP)

You are misinterpreting your own results, and it's honestly getting a bit tiring to see you move the goalposts to avoid the point. Your modified test() is still branchless because you moved the store outside of the if-block.

In your code, the if-statement only guards a simple increment (j += 1). Any modern compiler will turn that into a CINC or CSEL instruction. Your own assembly output proves this: it shows a CINC, not a branch. You didn't debunk the "extra store" method; you accidentally used it and proved exactly why it's faster.

To see the actual performance drop, you have to put the store back inside the if-block where it would be in a traditional implementation. That is what creates the true control-flow bottleneck that stresses the branch predictor.

A true conditional store is slow because it forces a branch. Your "if" version and my branchless version are both fast because the store always happens, which allows the compiler to produce straight-line code. You aren't fighting my logic - you are confirming it by using the same "extra store" trick to get your M4 to produce the same assembly for both functions.

[-]

happyscrappy@reddit

Your modified test() is still branchless because you moved the store outside of the if-block.

Been my point for quite some time now. That you are measuring branchless by what you see in the code and ignoring what the compiler compiles to. You're not actually engaging in a discussion I guess, just repeating what you started with. Which was never valid to start.

So yeah, that's why you see a CINC in there, because it is branchless despite having an "if" in the code.

And yes, a store is redundant if the value stored is never used before it is written over. It's a term of art in computer architecture science. And in the real world.too. But you're not actually paying attention to anything other than what you said.

'Dead store elimination (DSE) is an optimization technique that removes redundant store operations in computer programs, improving efficiency.'

Sheesh.

A true conditional store is slow because it forces a branch

ARMv7. Remember? You really don't actually know what you are talking about.

(me) If there's a message here, it's not "ignore stores, remove your ifs". Ifs do not determine whether your code is branchless or not.

You're in over your head and banging on me about what you don't understand.

[-]

chkas@reddit (OP)

You're moving the goalposts again, and it’s exhausting going in circles. Congratulations on winning the upvote war and helping my post tank with your comments, but textbook definitions aren't a substitute for technical reality. You call the store "redundant" by the book, but ignore that it’s a deliberate architectural trade-off. I’m done with this discussion.

[-]

double-you@reddit

Yeah, that's pretty much as useful as people pointing directly to their repositories. I assume those short descriptions tell you things if you already know how to do all of it, which makes me question the point of the whole article.

The only reason I can think of for the IS_LOWER() is if you expect to make other types of comparisons at some point (structs, strings, ...).

[-]

happyscrappy@reddit

I guess that's it. The code reads like code which was written to work with generics and when that functionality was removed it wasn't removed from this code.