You have a function. It takes a virtual class A as a parameter. This function just calls foo() on this parameter. For whatever reason this function is not inlined by the inliner. You are telling me that the branch predictor and speculative execution is going to, 100% of the time, jump to the correct implementation of foo?
If there are virtual functions then there can be more than one thing. The whole point of runtime polymorphism is that although you know that you have an A at compile time, the runtime type can be different and can dispatch to any number of different foos. That's a branch.
It is true that if you never leverage this feature then all calls can be devirtualized. But this is like saying that because all of your logical branches statically resolve to "true" or "false" after constant folding, imperative programming is actually branchless.
dynamic dispatch is the mechanism to remove branching. the distinction is baked into classes. it just happens that in smalltalk classes are runtime constructs. nothing static is needed.
Maybe we have a different understanding of what a branch is.
I am not saying that there is a syntactic branch in program. I am saying that semantically control will jump to multiple different possible locations based on some runtime information such that we cannot do some sort of out of order execution across the control branch.
The thing that matters here is removing the semantic branch so that your program can get maximum benefit from pipelining and out of order execution.
Branchless is a different (lower) complexity class, so you can't leave the translation fully to the compiler. Well, perhaps you can if you're very aware of exactly how compilers work and how your code translates, but at that point it's much easier to just write the code branchless.
My issue with this is that not all code is meant to be equally maintainable. But people often treat all code as though the likelihood of change were equally distributed across the whole code base.
I can predict with relative certainty that about two thirds of any legacy code base is maybe updated once a year. Sometimes parts of the code never change after they are created.
i don’t personally buy this argument since you can’t, by definition, predict where issues are going to be hiding in your code - unless of course you’re infallible and only intentionally inserting bugs for fun!
it’s also very difficult to predict what the future holds for any part of a codebase. i wouldn’t feel comfortable writing obscure and difficult to maintain code on the basis that i think it will go untouched for years unless i had a very good reason to do so
I would love to go and dig deeper into this rabbit hole, but doing so would significantly depart from the original point of the OP, which is that sometimes, when performance is critical enough, one can use less obvious coding patterns to achieve the same result while wasting much less compute than the more common pattern would.
My comment was in response to someone mentioning readability as a Holy Grail to be applied equally in all parts of the code base, as though they were all equal.
I am in fact not advocating “predicting” which code paths are more or less volatile and which ones require less readable implementation based on a gut feeling alone.
Sometimes, shaving a few milliseconds off a tight loop is worth adopting a more obscure implementation, because of the compound interest this earns you in the real world: smaller cloud bills, the ability to crunch massive amounts of data with fewer resources, game engine optimisation, etc.
These use cases are rare and most developers will never need to worry about them for their whole career. But they do exist and dissing on those tricks just because you never needed them is shortsighted.
you were replying to me, and i hope it’s clear from the response you are replying to here (from me) that i’m not advocating readability above all else in all cases. however, i *do* think that readability should be preferred in the absence of any other specific requirement to the contrary, since in general readable = maintainable.
What is considered “maintainable” is demonstrably different for each use case, individual developer, team or project.
Also, I was not advocating writing intentionally obscure code just because it is off the often visited or changed path. Your reading of this was purely your own.
It’s just that not all parts of code need to be designed with similar flexibility in mind.
Code that changes more often needs to be easier to replace and extend than code that just sits there since the beginning of the project and bootstraps your application or configures your JSON serialisation and is never touched again after initial setup for example.
I feel like the less often code is updated, the more heavily you should lean into readability. If the code is updated only once a year, then after 5 years it's unlikely that the original author is still around, and the style is likely different from the rest of the code base. The worst situation to be in is "don't touch that code, no one knows what it does and it's always just worked".
I suppose you could have a code base where there really is no rhyme or reason to anything, but when people usually discuss code readability, it seems to me like it's all just subjective things, familiarity and personal preferences.
If you write C or Haskell, you don't need to concern yourself with people who don't know the language. I don't know Haskell, so this stuff is unreadable for me personally. If I wanted to maintain a Haskell code base, it's on me to learn how to read it first. Same goes for domain specific code.
You can hire people who are already familiar with how to read branchless code, or you can teach them how to do it. The availability of developers familiar with something can be a factor, but "readability" in this sense is not a black and white issue. For example, last week someone here said that SIMD intrinsics were completely unreadable, while I could instantly recognize what the code in question did.
if the performance gain is important to you then sure, hire people who can work with code of that nature, my point was just that it isn’t “free” i.e. you shouldn’t do this just because you *can* since doing it hurts readability and maintainability
You can still name variables and functions, you can still place comments. And you can just follow the one path, instead of mentally needing to track all possible states.
That said, it’s different looking code. You need some experience with reading that code type. Just as with functional code or recursive code, it needs to click in your brain before you can easily read it.
That depends on what you're trying to solve. If the mapping is the difficult part, the technical side isn't. Branchless is just not a great fit for those kinds of problems.
If you're thinking about any kind of business logic, it's probably not the type of problem where branchless shines. Branchless is great if you need a provable one-to-one mapping, if you need a bound on computation/memory/latency, if you need provably correct code, if you need highly parallelizable code. So if the computation itself is the hard problem for some reason.
That isn't a branchless thing, that is a micro-optimization thing...
You don't typically get weird bugs from micro-optimized code anyway as so much effort has been spent on it. Unless we are including "I just rewrote it and it broke" which IMO is a different problem than bug density.
There's still a branch in the logic here, it's just implicit.
Technically, you could write any program with implicit branches, as there are Turing-complete systems (like cellular automata) without explicit branching. But it wouldn't be easy to read or maintain.
Better: Have two functions, foo_naive and foo_fast. Fuzz test them against one another, and include both in your benchmark harness. Have a third one, foo, that just calls foo_fast; that's the one you export.
Not only do you have simple code as a reference for the opaque code and an engineer-grade proof of equivalence, you have proof that the optimisation is actually necessary, which matters because premature optimisation is the root of all evil. All in all, those changes make the code self-documenting, so you can get rid of the comment.
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%"
But now if foo_fast can't be inlined, I have a jmp... And if foo_fast ever needs upgrading, then someone forgets to update foo_naive and now you have no idea what the right behaviour is.
That's a trivial inline; if a compiler can't do that, it's a bug. If the language supports it, just do a symbol redirection and export foo_fast as foo or whatever. The two implementations will never get out of sync, because you're running the test suites regularly. Aren't you. The right behaviour is usually the naive version, because of "code that obviously contains no bugs" vs "code that contains no obvious bugs" -- that's why it's called naive: it should be obvious. If you want to make a change to the semantics -- well, usually you don't -- you'd start by changing foo_naive, then once you're happy with the functionality you change foo_fast to match. But only after re-benchmarking: the new foo_naive might be fast enough, at which point you just remove foo_fast and the test scaffolding and rename foo_naive to foo.
It isn’t trivial. All kinds of things prevent inlining. Even if fast_foo inlines into foo, foo itself now may not inline. The extra redirection may push you over a limit. Just add a comment.
So now I have to do double the work to update my function? Just add a comment.
Now I have to test two functions? Just add a comment. What if my fast_foo makes assumptions that naive doesn’t? I.e. stable vs unstable sorts? Alignment? Exploiting UB? Vector instructions? Just add a comment.
You're not even trying to get the point, are you, and in fact you hate maintainable code. Functional equivalence of the two functions is the point; do you know what a fuzz test is. And yes, I've used compilers which couldn't do that kind of inline reliably... 30 years ago. Now get off my lawn.
Fuzzing won’t prove equivalence… it proves that for some subset of inputs the outputs matched. Maybe the fast path is ok with losing stability in a sort? What about concurrency and memory ordering? I can think of a multitude of reasons it won’t prove equivalence. Just leave a comment.
Now I have to pay the cost of maintaining a fuzz harness for what? I already wrote the fast code and it runs. Why drag this vestigial function around for literally zero gain? Just leave a comment.
What kind of function? You are literally saying it can inline without knowing the body. What if I unroll a loop and blow the inline size budget? What if I use recursion and it can’t reason about stack depth? What if I exploit registers and increase pressure on the register allocator? Etc. etc. Just leave a comment.
Now if you had said property testing, I could have maybe conceded some points.
It’s not unmaintainable. Just not familiar to you.
You're not telling me anything new, there's a reason why I called it an engineer-grade proof.
Maybe the fast path is ok with losing stability in a sort? What about concurrency and memory ordering? I can think of a multitude of reasons it won’t prove equivalence.
I said "functional equivalence" and you come with memory ordering.
Now I have to pay the cost of maintaining a fuzz harness for what?
For being able to test in general, not just for this thing.
I already wrote the fast code and it runs.
You probably didn't yet, you wrote the slow code and are currently benchmarking.
Why drag this vestigial function around for literally zero gain?
So you're sure that you didn't change semantics when you optimised.
What kind of function? You are literally saying it can inline without knowing the body.
Irrelevant, you troll. You were complaining about the stub possibly breaking inlining. If the whole thing can't inline in the first place then that's a you problem, not the compiler's, and doing it your way won't make it inline either.
Now if you had said property testing, I could have maybe conceded some points.
Let me expand a little. When I first saw this code, I had to think about what it was doing and how it worked. When someone sees this comment, they should also think. I wrote the comment quickly while going through all of my morning emails and such. If I had actually written this, I probably would have put a little more thought and text into it. My primary language throughout my career has been Perl. There are plenty of instances where a compact, efficient piece of code of mine is preceded by a multi-line comment. Comments are cheap. Use them.
Often the "make it fast" requires substantial changes to global design.
"Oops our implementation is incredibly cache unfriendly so now we need to switch to a data oriented design" is not something you want to identify part way through.
And sometimes you know ahead of time that the thing you are building has particular performance needs and shifting entire architectures is a waste of time.
I don't think you are - OP is saying some forms of "make it fast" are such that they can't be the last step here (other than perhaps "write a prototype, then rewrite it from scratch in a very different way"). If you know you need the performance, it can be important to do it up-front, because that performance may require a fundamentally different architecture of your application, even if the naive approach works correctly but isn't fast.
Then your architecture should probably support fast from the start, but that doesn't mean that the implementation needs to be fast.
You can write your whole business logic intended to run on a 10000 CPU cluster in a framework that is a bunch of well-defined functions stubbed out in postgres, written in a week. Having a clear and battle-tested relational algebra semantics for the bespoke stuff you'll need to do to make it really fast certainly won't hurt. And postgres is probably fast enough at least for a limited-scale rollout that the client can play around with.
And I meant the same thing. You cannot use just any software architecture if you need the oomph of 10000 CPUs. So you write your business logic in a way that does scale, even if the implementation of the backing infrastructure doesn't actually support that scale (yet), and you run the whole thing on maybe a single Threadripper.
"Make it work using relational algebra" is as much "make it work" as "make it work using pointer chasing". Neither is, in itself, a fast implementation, though the former is possible to make fast in general and at scale, while the latter will likely never be fast.
That's not how design works. As soon as you make it work, other code tends to rely on it and then you can't change your interface when you realize it is unable to be optimized.
You need to understand your performance goals first. Then design a system that works, is right, and is fast. All at the same time. It's an iteration loop that you do over and over before you ever ship a single line of code.
Not every public interface is equal. C++'s std::regex will forever be slower than launching Perl in C++. This is because its ABI is forever stable.
Stability is at odds with performance. This is why interfaces are so powerful: they provide a clear contract. But sometimes that contract is not good enough in practice and limits the number of valid implementations too much.
One such example is strcat. Its interface is a footgun leading to O(n^2) behaviour, because its contract requires implementations to iterate over the string to find the \0 terminator. No amount of making it fast after that interface is stable will help you fix that.
nothing stops you from changing your code to do exactly as you've said if that is the most optimal performance solution. provided that you've made it work and made it right first.
No one who says "make it work make it good make it fast" ever follows it up with "but actually make sure to do all 3 of those things before you promise a stable interface or have customers relying on stability". Prototypes become production code very quickly
that is why it's important to have good engineering discipline. you don't know how long your code will live, so you don't build more than you need or skimp on things like tests and ci/cd
The context is public interfaces, they can't change. And this literally just happened to me like a month or two ago. Suddenly we've found a serious problem that the messages were too big, and the format -- the public interface -- had to change somehow. But we had to ensure partial backward compatibility, because other code already relies on it. We've managed to mitigate the problem to some extent, but it's nowhere near optimal and possibly might come back to bite us again in the future. Partially my fault, because I wanted to redesign it before, but other things came up, I forgot about it and things just kinda happened. Nothing more permanent than a temporary solution, as people like to say.
C++'s std::regex will forever be slower than launching Perl in C++
Wait what? Perl's regexen are notoriously not regular, you can get quadratic and worse behaviour from them. From a quick scan of the docs std::regex are regular.
My bad, bad example. I had gotten that from word of mouth, and now that I'm trying to source benchmarks I'm not finding anything. The rest of it is true though; I just can't find anything showing that launching Perl is faster than regex in the C++ standard library.
Market makers translate latency optimizations to dollars. It’s a business priority. FWIW Jane Street, of all of these firms, probably has the highest emphasis on writing good code.
Even for something low level, this sort of optimization is only relevant when you have code that can both easily be made branchless and has branches that are roughly equally likely. In practice, most branches have a strong bias one way or the other.
I programmed all my life, admittedly not so low level, but I would have never guessed that always copying memory around would be faster than a simple "if" condition. I would have claimed the inverse a thousand times over, thinking that "if" is probably the most optimized operation around.
They also have write caches, so in the eyes of your program a read access can actually be slower than a write access. (Think of waiting for an email vs. putting an email into your email client's outbox and immediately continuing with your work.)
Another thing is that while almost all modifications to the program counter (aka instruction pointer) need a pipeline flush, returning from a function is an exception. That's because return addresses are copied into a special cache too, and that cache is always last-in-first-out. A function that can be optimized so that each return translates directly into a RET can be faster than a function where a return has to jump to the end to do some clean-up.
Well you’re kind of right, “if” *is* one of the most optimized operations around. We’ve put in so much work into optimizing it because it’s so inefficient but developers want to use it because it’s much easier than writing branchless code.
And it’s hard to make blanket statements about performance, it may be in some cases the overhead of copying is higher and using “if” actually is faster. Depends on the workload and context, a lot of this stuff boils down to just profiling and trying a few things.
It's worth noting that branchless approaches are also relevant for constant time code, which is often relevant in security and cryptography related contexts.
And as long as we're noting that, note that the compiler might still find a branching version faster and optimize to something that isn't constant time, unless you tell it not to.
Interesting question, is this something you could do for the general case? Like, <constant-time> arbitrary_code() </constant-time> and the compiler would just emit (however (in)efficient) code that always runs in constant time. It feels like this is either trivial or unsolvable lol
The problem is that the sort of people who are attracted to compiler dev really don't feel like implementing this sort of thing. It very much goes against the grain of what compilers do, so while it's conceptually trivial it will take a lot of effort since it's so diametrically opposed to what compilers do. Maybe with AI these sorts of tasks can be done, but then again I think the people who care about constant time evaluation are not the kind of people who want to let an LLM run roughshod over the compiler.
That's not my area of expertise, but the way I envision it we mostly need guarantees.
Guarantees from the CPU manufacturers first and foremost that some instructions are performed in constant time (and ideally constant energy consumption as well, but at least constant time), or maybe at the very least a guarantee that their execution time won't change in the future. That's because the issues don't stop at compiler optimizations. I think ARM has started to do just that.
Then I'd ask the same of the compiler: some way to say "This section of the code will always be compiled the same way using a standard set of rules that are public and are guaranteed to use the special constant time instruction set". That way you could write functions that would still require care to be constant-time, but at least you would have the tools to do so.
It feels like this is either trivial or completely unsolvable lol
Like most problems in CS.
They could pad out certain branches with noops but then again - I don't think you can control out of order execution (eg some cores have more ALUs, some less), you can't control the physical memory layout (although LaurieWired did a video about that and you sort of can) so you'll have time differences because of caching etc.
constant time that a theoretical secure compiler would be concerned with would be not introducing branches in marked code that aren't already present. general constant time code (i.e. the attacker doesn't have physical access to the machine to do power analysis, and you won't be using any instructions that are variable time based on their input) works by
no secret indexes into data. either an alternative approach is used (bitslicing AES) or the data is read in full every time and bit masking with 0 is used to ignore the indices you don't want. you don't want the compiler to realize "hey, they're masking off every index except i" and short-circuit to only read index i.
no branches based on secret data. algorithms are (or well, should be) designed to do the maximal number of loops regardless of input data. conditionally swapping two values is done with bit masks. comparing two secret arrays is done by xor-ing every pair in the array (to get the difference) and or-ing these into a single result, so that zero means equal and non-zero means not equal. you don't want the compiler to realize you're doing something conditionally and introduce a conditional jump, or that you're just comparing two arrays and early-exiting as soon as a difference is encountered.
as long as these rules are followed and the compiler doesn't "help you out", any timing variance will not be due to secret data and thus won't reveal anything to an attacker. out of order execution won't matter because you're executing the same code the same number of times regardless of input, and cache timing won't matter because you're reading/writing the same locations in the same order regardless of input.
And even this has led to fun attacks where power usage and/or heat become a side channel; nop sleds can be optimized at the microcode level to the processor idling. The solution winds up being to just be obsessive about every detail at every level, and fortunately there are reasonable and widely available primitives for most of the simple things nowadays.
If you have truly arbitrary code, then you can't even know if it terminates. And it is hard to make non-terminating code constant time (At least in theory, if we just assume we all die within a billion or so years, then it is kind of easy to limit the execution time, but the usefulness may be questionable).
Constant time yes, easily. You can write this yourself with a timer. Unfortunately time isn't the only resource we want to be consistent over, and "constant time" is usually "constant with respect to some variable" ;-). For instance string comparison needs to be consistent for strings of some particular length, but it's okay for it to take longer to compare longer strings. In practice we want a fairly small set of primitives which operate consistently with respect to every externally observable resource, and outside of very small kernels which utilize those primitives just assume it's completely unsolvable.
You could do something like that to hint to the compiler that you prefer branchless optimizations if possible, but someone has to add that support into the compiler.
branchless programming is hard mode but constant time is a nightmare. I'm so happy for open source tooling like tacit -- they make checking one's work so much easier - used to be near impossible unless you were an applied statistician.
There is a bug in the 2nd implementation. If there are no numbers less than 500, small_numbers[0] will be the last value in numbers which will be >= 500.
Not really. In both cases a real implementation would return "j" as the number of small numbers found. If it is zero you would ignore the first element same as with the branching approach.
Uninitialized arrays are filled with junk values. So even if no numbers are less than 500, overwriting the 1st element in the branchless loop doesn't matter: it's still treated as a junk value until "j" gets incremented, at which point the 1st element is finally treated as a real value.
In that case, you'd need to carry the "j" variable outside of the loop block in order to ensure that you iterate through real values instead of junk ones.
that's not a bug, the size is determined by j, small_numbers[0] is past the end
I don't understand. When there are no numbers less than 500, small_numbers[0] = numbers[SIZE-1].
If the last element in numbers is 999, then small_numbers[0] has 999 when the function returns, which I would consider a bug because, lacking a return j, the caller would have definitely initialised small_numbers with a sentinel value (INT_MAX or similar), which is overwritten for element 0 of small_numbers.
the caller would have definitely initialised small_numbers with a sentinel value (INT_MAX or similar), which is overwritten for element 0 of small_numbers
When there are no numbers less than 500, small_numbers[0] = numbers[SIZE-1].
Yes - but small_numbers[0] is past the end. There are 0 numbers in the list.
This isn't really different to when there's one number. You'd have, say, {10, 999} in the memory of the list, but the size would be 1, so only the "10" is considered part of the list.
the caller would have definitely initialised small_numbers with a sentinel value (
Part of the information returned is a list, and thus it's size. Relying on a sentinel value to indicate the end isn't generally a good approach (C strings being a common one we're stuck with now, despite length-prefixed being generally better).
If that's a concern for your application, it's trivial to add a single if statement (outside the loop, so not a significant performance penalty), to reset the value at small_numbers[j] to your default value.
Not necessarily a bug. At the end, j will only be zero if no numbers are less than 500, and will be SIZE if every number is less than 500. So j is a counter.
This kind of logic is a common source of "off by one" errors in C and C++, notably the buffer overflow and underflow errors. Naming the variable well really helps.
Yes, that's true. However in this example, j is defined inside the for loop and cannot be used outside of it to check how many numbers were less than 500. I doubt defining j outside the loop would change the results of the article.
Every time the last number of the input array is 500 or greater, it will leave a wrong number in the output array. But there is j, which can be used as the count of valid values (and which in fact would be 0 in your example).
Doesn’t this just swap one condition for another? How is checking `small_numbers[j] > 500` not considered a branch? Does every target have a more efficient conditional increment instruction?
Using actual if-statements and still getting it to optimize to a conditional move is another thing. But just a comparison like this is not a branch. On x86 you do the comparison with cmp, and then extract the bit you want out of the flags register.
Yeah, instructions are pipelined and you start processing next instructions before you're even done with executing previous ones. To keep the pipeline full you need to predict jumps, otherwise the pipeline would always stop at the first jump, which are extremely common in practice. Normal jumps are fine, predictable conditional jumps are fine. By reducing unpredictable jumps you still get to benefit from this entire optimization.
A famous stackoverflow question was about a list of numbers, where doing an operation over them was faster if you sorted them first.
I believe the operation was something like if (i < someNum).
By sorting it, they made the branch predictor learn the branch for the first section of numbers, then it learned the second, making it faster even with the additional sorting.
Yeah, that makes sense. I think I was hung up on the semantics of using the word “branchless” rather than the broader implications of whether the code is optimized or not.
I was thinking of branching as more abstract: your behavior changes based on the result of a comparison, hence branching (generically). That is true when using `cmp` as well.
The > operator is just a math operator like + or /. It simply produces a number, 0 or 1 (rather than the sum or quotient). Mentally substitute it with another common math operator and you'll agree there is no control flow branching, just an ALU computation placed in a register.
In case your target CPUs don't support conditional moves you can do this:
// (Free Pascal pseudo code)
// Iterate through an array and copy all numbers less than 500 into a new array.
const
Capacity = 100000;
Iterations = 1000;
MaxVal = 2000;
type
Index = 0..(Capacity - 1);
var
Numbers : array[Index] of i32; // i32 is a 32-bit signed integer
Small_Numbers : array[Index] of i32;
c, i, j, tmp : i32;
// init
Randomize;
for i := 0 to high(Numbers) do Numbers[i] := Random(MaxVal); // 0..1999
// main loop
for c := 1 to Iterations do begin
j := 0;
for i := 0 to high(Numbers) do begin
tmp := Numbers[i];
if (tmp < 500) then begin
Small_Numbers[j] := tmp;
Inc(j);
end;
end;
end;
// Alternative version.
const
  Capacity   = 100000;
  Iterations = 1000;
  MaxVal     = 2000;
type
  Index = 0..(Capacity - 1);
var
  Src          : array[Index] of i32;
  Dest         : array[0..1, Index] of i32; // 1 = smaller, 0 = larger
  c, i, j, tmp : i32;

// init
Randomize;
for i := 0 to high(Src) do begin
  Src    [i] := Random(MaxVal); // 0..1999
  Dest[1, i] := MaxVal; // mark as unused via invalid value (a "smaller" slot can never hold >= 500)
  Dest[0, i] := 0;      // mark as unused via invalid value (a "larger" slot can never hold < 500)
end;

// main loop
for c := 1 to Iterations do begin
  for i := 0 to high(Src) do begin
    tmp := Src[i];
    j := tmp - 500; // if tmp is smaller then j will be negative, i.e. the highest bit will be set
    j := u32(j) SHR ((SizeOf(j) * 8) - 1); // get highest bit
    Dest[j, i] := tmp; // write value
  end;
end;
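The same sign-bit selection can be sketched in C (names hypothetical): `(tmp - 500)` is negative exactly when the value is small, and shifting out the sign bit picks the destination row without any conditional jump.

```c
#include <stdint.h>
#include <stddef.h>

#define CAPACITY 8

/* Route each value into one of two rows without a branch:
 * row 1 receives values < 500, row 0 receives the rest. */
void partition_by_sign(const int32_t *src, size_t n,
                       int32_t dest[2][CAPACITY]) {
    for (size_t i = 0; i < n; i++) {
        int32_t tmp = src[i];
        /* sign bit of (tmp - 500): 1 if tmp < 500, else 0 */
        uint32_t row = (uint32_t)(tmp - 500) >> 31;
        dest[row][i] = tmp;  /* unconditional store */
    }
}
```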
It's not a branch in the sense that control flow won't be steered based on the outcome, so the branch predictor won't be involved.
What you usually want is to turn a branching instruction like a jle into something like a cmov or as in the example, unconditionally adding the result of a comparison. That is, moving away from control flow into a built-in like "assign-if-compare-less-than" or "add comparison result".
Internally in the microcode, there is likely still some conditional logic happening, and something like cmov is not as quick as a straight addition. But it won't affect the global flow, since it can't branch the full execution flow, only in its own instruction computation.
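As a sketch, the shape that compilers commonly, though not guaranteed, lower to a cmov (x86) or csel (ARM) is a plain ternary over integers:

```c
/* Typically compiled to cmov/csel at -O2 rather than a branch --
 * not guaranteed, but this is the shape optimizers look for. */
int clamp_min(int x, int lo) {
    return (x < lo) ? lo : x;
}
```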
You're correct that there's nothing preventing the compiler from transforming the pointer bump into a branch, and that occasionally occurs. The code uses a pattern that the compiler is likely, but not guaranteed, to emit as branchless. It generally works because compilers are designed with this in mind: their authors know programmers reach for this unconventional pattern deliberately, and tune the heuristics toward branchless code when they see it.
It would be better if there were more explicit ways to specify branchless, but the standard language lacks such facilities.
This also applies in the other direction. if (numbers > 500) is not guaranteed to be generated as a branch either, and if the compiler emits it as branchless there's no defined way to reverse it. But this usually isn't a problem as compilers are inclined to use branches due to the general effectiveness of branch prediction.
And now if your data lands on just one branch, you end up with a loop-carried data dependency that is most likely much slower. When making decisions like this you should always be aware of the costs, not just the benefits.
j only changes by an increment; it's not a heavy data dependency like this one. It's not a matter of the compiler being able to parallelise it but the CPU: with the branch-free version every load depends on the previous load, as that controls the offset, whereas with the branch version, if it predicts the branch, the only dependency is the increment, and for something like this that's virtually free.
Think there's a mistake with the threshold in the quickbench, it's generating numbers in [128, 255) and then testing v > 128 which makes the branch highly predictable. Changing kThreshold to 192 causes the BM_BranchWrite case to slow down drastically:
The loads shouldn't be dependent; it's the stores that are, due to the store address. But the only loop-carried dependency is a single integer add with 1c latency. All the loads and compares can run in parallel.
If you read my comment that was intentional to demonstrate when this approach is slower which is highly predictable data, unpredictable data the branch free version is better hands down.
Each loop store depends on all previous loads in the branch-free approach. The increment to j cannot run in parallel; it must wait for all previous loads, which means its use in the store address also must wait. So you have all the stores, compares and increments backing up, waiting for data.
But I'm not sure why anything but the address add would be in the bottleneck critical path. It is true that the increment can't occur until previous loads complete, but the OOO logic shouldn't have a problem unrolling the loop and running the loads + compares ahead of the store and store increment which will max out at 1/cycle. The only thing I could think of would be getting stalled by load/store conflict checking, but I'm pretty sure modern x86 CPUs have good memory disambiguation prediction.
Looking at the llvm-mca output for both loops, I'm wondering if the issue has to do more with a specific addition optimization in Intel CPUs. For highly taken cases, the branchy version also has a loop carried dependency on the address addition. But I believe recent Intel CPUs are able to optimize small constant adds in the renamer, allowing it to execute these dependent adds faster than 1/clock, which is why llvm-mca is showing them as 0 latency. Forcing the compiler to increment from a register instead of from an immediate constant results in BM_BranchWrite slowing down:
In that case, the load dependencies aren't involved, it's all about the addition chain and the CPU being able to execute chains of dependent ADD reg,imm8 faster than ADD reg,reg.
Good catch with the `add reg, imm8` difference. [Here's a slightly different](https://quick-bench.com/q/Y1MOs3_VAhPOh02ZszCNmJim6zs) approach with the `volatile` moved out of `filterAbove` so it isn't occurring every time.
All good points. My assumption was that the loop should be able to execute most of it out of order, the only real constraint being that the store depends on all previous loads, the compares of those loads, and the increment before it can execute, whereas the branching approach only depends on the increment.
To be honest, my initial comment probably oversells the loop-carried dependency, as it doesn't hold everything back the way it would if the dependency were on the load; memory is still the biggest bottleneck.
The comments here are strange. Of course any program with performance-sensitive or nested loops needs to avoid unpredictable branches in the inner loop. This can be done in a variety of ways, some more "tricky" like in the article and some more architectural (look up data-oriented design).
Ignoring the problem is only fine if you don't care about writing good code. That's fine if that's what you want to do, but then why are you on a programming subreddit? I also think it's totally fine if you didn't know about branch prediction and are learning something new today, but again the attitude should be to learn and not to bury your head in the sand.
I remember way back in the day when these types of posts would be met with responses of many patterns for resolving the cpu pipelining issues, and discussions surrounding the merits of each.
I don't get why you are getting downvoted. I know not everything that is performant is maintainable, or the other way around, but personally I'm sick of the "premature optimization is bad" phrase. You can write maintainable code that is performant.
Like you mentioned, I personally prefer data-oriented design with functional programming, as it just makes it so easy to see what fails.
I think we all need to come to the agreement that object-oriented design and heavy use of state machines is also quite niche!
But to join in the chorus: branchless programming kinda needs to advertise itself. Otherwise there are usually bigger bottlenecks!
I think they are getting downvotes because most devs here and in general never work with something low level enough to find out it matters and think compilers are magical things that are perfect.
Yes I think when branches are your bottleneck something went wrong in the design earlier on, the best programmers I know simply avoid this issue by not creating hard to predict branches in tight loops to begin with.
This is a great trick that every programmer should be aware of, but if you're willing to dig a bit deeper, another 5x improvement can be readily gained by vectorizing the loop:
typedef int ix16 [[gnu::vector_size(64), gnu::aligned(1)]];
typedef unsigned short u16;

u16 ix16_to_mask(ix16 x) { return __builtin_ia32_cvtd2mask512(x); }

void small_avx512(void) {
    assert(SIZE % 16 == 0 && "Why wouldn't it be?");
    ix16* A = (ix16*)numbers;
    for (int i = 0, j = 0; i < SIZE / 16; i++) {
        u16 mask = ix16_to_mask(A[i] < 500);
        __builtin_ia32_compressstoresi512_mask((void*)&small_numbers[j], A[i], mask);
        j += __builtin_popcount(mask);
    }
}
Time if: 0.394
Time bl: 0.068
Time simd: 0.012
godbolt link (might take a couple reruns to get a box with avx512 on it)
logically it's about the same, but it uses the vpcompressd instruction to pack several integers at once, corresponding to the bits set in mask.
Unfortunately this handy instruction is unavailable on avx2, though there are some threads on SO on how you can emulate it.
For a large enough dataset even more can be done re: data access patterns and prefetching but that's way out of scope for this thread to be honest.
likely/unlikely attributes will actually incentivize the compiler to prefer branches over branchless code. Clang has __builtin_unpredictable, which in theory should incentivize the compiler to prefer branchless code, but it seemed kinda busted to me the last time I tried it, not sure if it works. Check the compiler output when using it.
likely and unlikely can also rearrange the branches. According to the Intel optimization manual, that can change the default prediction, when the branch isn't in the history yet. By default backward branches are taken, and forward branches are not taken. So for example the conditional jump back at the end of a loop is assumed to actually loop back. And if you tag some branch as unlikely, that branch will go to the end of the function, and the CPU will assume it just falls through to the likely code by default.
If the branch is predictable then the branched version is as fast (or faster) regardless of a likely attribute. If it's not predictable then "likely" doesn't really do anything anyway.
A "likely" attribute just affects code generation. When branching, a processor either falls through to the next instruction or jumps to a different one. Processors are slightly better at falling through than jumping, so a compiler will place the "likely" code in the fall-through case when it can. The performance difference is small (maybe ~10%) compared to a mispredicted branch.
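A minimal sketch with GCC/Clang's `__builtin_expect` (the C++20 attribute spelling is `[[likely]]`/`[[unlikely]]`): the hint only shapes code layout, keeping the expected path on the fall-through side. Function and data names here are just for illustration.

```c
#include <stddef.h>

/* Negative entries are assumed rare; marking the check "unlikely"
 * lets the compiler keep the accumulate path as the fall-through. */
long sum_nonnegative(const int *v, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (__builtin_expect(v[i] < 0, 0))
            continue;  /* cold path, laid out away from the hot loop */
        s += v[i];
    }
    return s;
}
```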
Interesting and a good trick. It is generating memory writes though when the other option doesn't so that might have some unintended side effects performance wise because you are doing more cache flushes.
I remember the EPIC instruction set had conditional flags so you could just skip an instruction conditionally without doing any branching and that was supposed to be faster. I wonder how many other architectures have this sort of thing built in and whether that changes the logic here.
I get this and I appreciate the post. I read it and will have it sitting in the back of my mind now when programming.
However… I kinda hate that CPUs work like that. I also hate hyperthreading. I get it, it's performant, and ultimately things are fast because of all of it.
But there's just something about it that bugs me. I just don't like this idea that the CPU fucks around with code and tries to run it better. I'm sure it does, but what the flying fuck. I honestly felt a little vindicated when all those security holes were revealed when it came to hyperthreading.
Look I get that most of the time it works every time.
Not wrong, and this does make optimization more difficult, especially when different CPUs behave very differently. But in general the performance gains from caching, branch prediction, and out of order execution are just too massive to ignore. Disable all that and even the most advanced modern CPU core runs at the speed of an uncached 80386.
Embedded tech has more cases that swing the other way, where the cost or variance is too much and a simpler in-order core is used instead. Don't need a fat OOO core when you have a known platform with a compact in-order CPU core + dedicated on-chip SRAM that can give reliably low latency.
Optimising can be a bit weird like that. Back in the day it was a straightforward matter of stuff like counting cycles, but as CPUs have got more complicated, the dominant factor in speed has become "Do stuff that doesn't fuck up the hardware level optimisations". We kind of end up with the tail wagging the dog a bit: the stuff that is there to speed up user-written code now becomes the target for that code.
But the new code now has to perform a memory store at array[j], in a location that depends on whether the previous loop iteration incremented j. Shouldn't this also prevent unrolling / pipeline optimizations? I see that it's faster in implementation, but I wonder why, in terms of what the CPU ends up actually doing. How does it deal with having to store the data in a conditionally dependent position?
This is a great article for real time applications and very low level libraries. However to be honest, 99.99% of developers don't need to worry about this :)
I read an interesting blog post a while back where the author claimed to have written a nearly "if-less" Java codebase by using exceptions for all logical decisions instead of just errors. The code samples were elegant to read, but on the other far extreme of performance from this post!
Aside from performance, branchless paths can make it easier for the reader to reason about a segment of code, and easier to write tests for it. An old colleague used to evangelize the technique for these reasons.
Of course, like anything, you should make your code easy to understand and modify first. Don't make it obtuse just to get a tiny bit of performance, unless you've measured it and you need that performance.
this seems like the kind of thing where if there's a simple change that would result in more efficient code, the compiler will generate better code with optimizations enabled.
i think it depends heavily on the layer. inner loop of a database engine, branchless tricks pay off. wiring up business logic, the readability tradeoff usually isn't worth it. and yeah Fidoz nailed it, most of us aren't at Jane Street, the perf gap doesn't matter at our scale.
"On this benchmark, -O3 did not yield additional speedup, only larger generated code" - so this little maneuver is pointless if you use compiler properly
Fidoz@reddit
Clever, but I've always preferred readable and maintainable over performant... But to be fair, I'm not at Jane Street
JustOneAvailableName@reddit
Branchless is very maintainable, because code only ever executes one path. Most bugs are caused by branches.
Frolo_NA@reddit
define the branches at programming time and use polymorphism at runtime so that the code only executes one path.
UncleMeat11@reddit
Runtime polymorphism is a branch.
Frolo_NA@reddit
the execution path is already known as soon as you instantiate an object.
UncleMeat11@reddit
The execution path of a branch is known as soon as you know its inputs.
The question is: does the branch predictor know enough to take the correct jump every single time? That's just not the case with runtime polymorphism.
Frolo_NA@reddit
yes we're in agreement.
UncleMeat11@reddit
You have a function. It takes a virtual class A as a parameter. This function just calls foo() on this parameter. For whatever reason this function is not inlined by the inliner. You are telling me that the branch predictor and speculative execution is going to, 100% of the time, jump to the correct implementation of foo?
Frolo_NA@reddit
i'm saying when you have an instance of A, you send it the message foo.
what foo does can only be 1 thing based on the class of A. there is no other option
UncleMeat11@reddit
If there are virtual functions then there can be more than one thing. The whole point of runtime polymorphism is that although you know that you have an A at compile time, the runtime type can be different and can dispatch to any number of different foos. That's a branch.
Frolo_NA@reddit
you move the branch to the "program writing phase", not the "execution phase".
program writing phase and compile time are not exactly the same thing.
UncleMeat11@reddit
Smalltalk has dynamic dispatch.
It is true that if you never leverage this feature then all calls can be devirtualized. But this is like saying that because all of your logical branches statically resolve to "true" or "false" after constant folding, imperative programming is actually branchless.
Frolo_NA@reddit
you don't need to do this though.
dynamic dispatch is the mechanism to remove branching. the distinction is baked into classes. it just happens that in smalltalk classes are runtime constructs. nothing static is needed.
UncleMeat11@reddit
Maybe we have a different understanding of what a branch is.
I am not saying that there is a syntactic branch in program. I am saying that semantically control will jump to multiple different possible locations based on some runtime information such that we cannot do some sort of out of order execution across the control branch.
The thing that matters here is removing the semantic branch so that your program can get maximum benefit from pipelining and out of order execution.
Frolo_NA@reddit
i think that helps clarify things. i don't know how the smalltalk maps to cpu instructions here.
imaami@reddit
How do you implement runtime polymorphism without branches?
Frolo_NA@reddit
parse into polymorphic objects or have them from the start. like smalltalk
imaami@reddit
What does that mean in practice? I mean, it seems branching would be somewhat unavoidable, except maybe with heavy use of function pointers.
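The dispatch being discussed can be sketched with plain C function pointers (the `Shape` type and names here are hypothetical): the call site contains no `if`, but the call target is chosen from runtime data, so the CPU still has to predict an indirect branch.

```c
/* Each object carries its own behaviour; the caller never tests a type
 * tag, but the indirect call target depends on runtime data. */
typedef struct Shape Shape;
struct Shape {
    int (*area)(const Shape *);
    int w, h;
};

static int rect_area(const Shape *s)     { return s->w * s->h; }
static int triangle_area(const Shape *s) { return s->w * s->h / 2; }

int total_area(const Shape *shapes, int n) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += shapes[i].area(&shapes[i]); /* indirect call per element */
    return total;
}
```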
Frolo_NA@reddit
it means that when you receive an input, you give it a class. then as soon as the class is known, branching is removed.
in 'a mentoring course on smalltalk' by andres valloud, chapter 3 explains this way of thinking
JustOneAvailableName@reddit
Branchless is a different (lower) complexity class, so you can't leave translation fully to the compiler. Well, perhaps if you're very aware about how compilers work exactly and how your code translates, but at that point it's much easier to actually write the code branchless.
rtybanana@reddit
i can see your argument but also if code is unreadable then it’s unmaintainable, readability and maintainability are very closely related
Luolong@reddit
My issue with this is that not all code is meant to be equally maintainable. But people often treat all code as though likelihood of change is equally distributed across the whole code base.
I can predict with relative certainty that about two thirds of any legacy code base is maybe updated once a year. Sometimes some parts of the code never change since they are created.
rtybanana@reddit
i don’t personally buy this argument since you can’t, by definition, predict where issues are going to be hiding in your code - unless of course you’re infallible and only intentionally inserting bugs for fun!
it’s also very difficult to predict what the future holds for any part of a codebase. i wouldn’t feel comfortable writing obscure and difficult to maintain code on the basis that i think it will go untouched for years unless i had a very good reason to do so
Luolong@reddit
I would love to go and dig deeper into this rabbit hole, but doing so would significantly depart from the original point of the OP, which is that sometimes, when performance is critical enough, one can use less obvious coding patterns to achieve the same result using much less compute than the more common pattern would.
My comment was in response to someone mentioning readability as a Holy Grail to be applied equally in all parts of the code base as they were all equal.
I am in fact not advocating “predicting” which code paths are more or less volatile and which ones require less readable implementation based on a gut feeling alone.
Sometimes, shaving a few milliseconds off a tight loop is worth adopting a more obscure implementation, because of the compound interest this earns you in the real world: smaller cloud bills, the ability to crunch massive amounts of data with fewer resources, game engine optimisation, etc.
These use cases are rare and most developers will never need to worry about them for their whole career. But they do exist and dissing on those tricks just because you never needed them is shortsighted.
rtybanana@reddit
you were replying to me, and i hope it’s clear from the response you are replying to here (from me) that i’m not advocating readability above all else in all cases. however, i *do* think that readability should be preferred in the absence of any other specific requirement to the contrary, since in general readable = maintainable.
chucker23n@reddit
And this, kids, is why security holes sometimes linger for decades!
Luolong@reddit
Perfect misread!
chucker23n@reddit
How is that a misread?
To treat some code as "eh, it doesn't have to be maintainable; it'll rarely change" means you risk having issues linger in there.
Luolong@reddit
What is considered “maintainable” is demonstrably different for each use case, individual developer, team or project.
Also, I was not advocating writing intentionally obscure code just because it is off the often visited or changed path. Your reading of this was purely your own.
It’s just that not all parts of code need to be designed with similar flexibility in mind.
Code that changes more often needs to be easier to replace and extend than code that just sits there since the beginning of the project and bootstraps your application or configures your JSON serialisation and is never touched again after initial setup for example.
Context matters!
zzz165@reddit
I feel like the less often code is updated, the more heavily you should lean into readability. If the code is updated only once a year, then after 5 years it's unlikely that the original author is still around, and the style is likely different from the rest of the code base. The worst situation to be in is “don’t touch that code, no one knows what it does and it’s always just worked”.
cdb_11@reddit
I suppose you could have a code base where there really is no rhyme or reason to anything, but when people usually discuss code readability, it seems to me like it's all just subjective things, familiarity and personal preferences.
If you write C or Haskell, you don't need to concern yourself with people who don't know the language. I don't know Haskell, so this stuff is unreadable for me personally. If I wanted to maintain a Haskell code base, it's on me to learn how to read it first. Same goes for domain specific code.
You can hire people who are already familiar with how to read branchless code, or you can teach them how to do it. The availability of developers familiar with something can be a factor, but "readability" in this sense is not a black and white issue. For example, last week someone here said that SIMD intrinsics were completely unreadable, while I could instantly recognize what the code in question did.
rtybanana@reddit
absolutely, as with all things engineering, ymmv.
if the performance gain is important to you then sure, hire people who can work with code of that nature, my point was just that it isn’t “free” i.e. you shouldn’t do this just because you *can* since doing it hurts readability and maintainability
Rustywolf@reddit
My unreadable code is very maintainable as long as you dont violate the core tenant of never updating it
nilsph@reddit
* tenet
imaami@reddit
*George
Rustywolf@reddit
Stupid autocorrect
CJKay93@reddit
Maintenance plan: don't.
JustOneAvailableName@reddit
You can still name variables and functions, you can still place comments. And you can just follow the one path, instead of mentally needing to track all possible states.
That said, it’s different looking code. You need some experience with reading that code type. Just as with functional code or recursive code, it needs to click in your brain before you can easily read it.
Guvante@reddit
Branches aren't the problem, mapping business logic to programming logic is.
And branchless forces a more complex mapping.
JustOneAvailableName@reddit
That depends on what you're trying to solve. If the mapping is the difficult part, the technical side isn't. Branchless is just not a great fit for those kinds of problems.
Guvante@reddit
Why do you presume that "that combination of branches" isn't a failure to map business logic?
JustOneAvailableName@reddit
If you're thinking about any kind of business logic, it's probably not the type of problem where branchless shines. Branchless is great if you need a provable one-to-one mapping, if you need a bound on computation/memory/latency, if you need provably correct code, if you need highly parallelizable code. So if the computation itself is the hard problem for some reason.
Guvante@reddit
That isn't a branchless thing that is micro-optimization thing...
You don't typically get weird bugs from micro-optimized code anyway as so much effort has been spent on it. Unless we are including "I just rewrote it and it broke" which IMO is a different problem than bug density.
Ok-Scheme-913@reddit
I mean, then most of maths would be trivial which is obviously not the case.
currentscurrents@reddit
There's still a branch in the logic here, it's just implicit.
Technically, you could write any program with implicit branches, as there are Turing-complete systems (like cellular automata) without explicit branching. But it wouldn't be easy to read or maintain.
mpersico@reddit
One comment is all that’s needed. // This is an optimization to avoid branching.
barsoap@reddit
Better: Have two functions, `foo_naive` and `foo_fast`. Fuzz test them against one another, and include them in your benchmark harness. Have a third one, `foo`, that just calls `foo_fast`; that's the one you export.
Not only do you have simple code as a reference for the opaque code and an engineer-grade proof of equivalence, you have proof that the optimisation is actually necessary, which is required because premature optimisation is the root of all evil. All in all those changes turn the code self-documenting, so you can get rid of the comment.
Ok_Chemistry_6387@reddit
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%"
But now if `foo_fast` can't be inlined, I have a jmp... and if `foo_fast` ever needs upgrading, then someone forgets to update `foo_naive` and now you have no idea what the right behaviour is.
Just add a comment.
barsoap@reddit
That's a trivial inline. If a compiler can't do that, it's a bug. If the language supports it, just do a symbol redirection, `export foo_fast as foo` or whatever.
The two implementations will never get out of sync because you're running the test suites regularly. Aren't you. The right behaviour is usually the naive version, because "code that obviously contains no bugs" vs "code that contains no obvious bugs" -- that's why it's called naive. It should be obvious. If you want to make a change to the semantics, well, usually you don't; also, you'd start by changing `foo_naive`, then once you're happy with the functionality you change `foo_fast` to match. But only after re-benchmarking: the new `foo_naive` might be fast enough, at which point you just remove `foo_fast` and the test stuff etc., and rename `foo_naive` to `foo`.
Ok_Chemistry_6387@reddit
It isn’t trivial. All kinds of things prevent inlining. Even if it inlines fast_foo to foo… foo now may not inline.
The extra redirection may push you over a limit. Just add a comment
So now I have to do double the work to update my function? Just add a comment.
Now I have to test two functions? Just add a comment. What if my fast_foo makes assumptions that naive doesn't? E.g. stable vs unstable sorts? Alignments? Exploits UB? Vector instructions? Just add a comment.
barsoap@reddit
You're not even trying to get the point, are you, and in fact you hate maintainable code. Functional equivalence of the two functions is the point; do you know what a fuzz test is. And yes, I've used compilers which couldn't do that kind of inline reliably... 30 years ago. Now get off my lawn.
Ok_Chemistry_6387@reddit
Fuzzing won't prove equivalence… it proves that for some subset of inputs, some output matched. Maybe the fast path is OK with losing stability in a sort? What about concurrency and memory ordering? I can think of a multitude of reasons it won't prove equivalency. Just leave a comment.
Now I have to pay the cost of maintaining a fuzz harness for what? I already wrote the fast code and it runs. Why drag this vestigial function around for literally zero gain? Just leave a comment.
What kind of function? You are literally saying it can inline without knowing the body. What if I unroll a loop and blow the inline size cost? What if I use recursion and it can't reason about stack depth? What if I exploit registers and increase pressure on the register allocator? Etc etc. Just leave a comment.
Now if you had said property testing, I could have maybe conceded some points.
It’s not unmaintainable. Just not familiar to you.
barsoap@reddit
You're not telling me anything new, there's a reason why I called it an engineer-grade proof.
I said "functional equivalence" and you come with memory ordering.
For being able to test in general, not just for this thing.
You probably didn't yet, you wrote the slow code and are currently benchmarking.
So you're sure that you didn't change semantics when you optimised.
Irrelevant, you troll. You were complaining about the stub possibly breaking inlining. If the whole thing can't inline in the first place then that's a you problem, not the compiler's, and doing it your way won't make it inline either.
You what.
sliversniper@reddit
Code is still unreadable.
It's like using short variable names.
// This is an optimization to avoid hand fatigue.
mpersico@reddit
Let me expand a little. When I first saw this code, I had to think about what it was doing and how it worked. When someone sees this comment, they should also think. I wrote the comment quickly while going through all of my morning emails and such. If I actually wrote this, I probably would have put a little more thought and text into it. My primary language throughout my career has been Perl. There are plenty of instances where a compact, efficient piece of code of mine is preceded by a multi-line comment. Comments are cheap. Use them.
Frolo_NA@reddit
people act like these are a tradeoff. they aren't exactly.
make it work, make it right, make it fast. in that order
UncleMeat11@reddit
Often the "make it fast" requires substantial changes to global design.
"Oops our implementation is incredibly cache unfriendly so now we need to switch to a data oriented design" is not something you want to identify part way through.
Frolo_NA@reddit
there is a reason why make it fast is the last step on there. if you do the first 2, sometimes you won't even need the 3rd at all
UncleMeat11@reddit
And sometimes you know ahead of time that the thing you are building has particular performance needs and shifting entire architectures is a waste of time.
Frolo_NA@reddit
i think we're on the same page.
building the right thing is covered by make it work and make it right
Brian@reddit
I don't think you are - OP is saying some forms of "make it fast" are such that they can't be the last step here (other than perhaps, "write a prototype, then rewrite it from scratch in a very different way"). If you know you need the performance, it can be important to do it up-front if that performance requires a fundamentally different architecture of your application, even if that approach does work correctly, but isn't fast.
Frolo_NA@reddit
if you aren't building the right thing in the first place, how fast it goes is completely useless
SimpleNovelty@reddit
Yeah, the first step should always be know what you need. Requirements fundamentally shape where the effort needs to go.
barsoap@reddit
Then your architecture should probably support fast from the start, but that doesn't mean that the implementation needs to be fast.
You can write your whole business logic intended to run on a 10000 CPU cluster in a framework that is a bunch of well-defined functions stubbed out in postgres, written in a week. Having a clear and battle-tested relational algebra semantics for the bespoke stuff you'll need to do to make it really fast certainly won't hurt. And postgres is probably fast enough at least for a limited-scale rollout that the client can play around with.
UncleMeat11@reddit
By "architecture" I mean the large scale design of your program. Not the physical machine you are running on.
barsoap@reddit
And I meant the same thing. You cannot use just any software architecture if you need the oomph of 10000 CPUs. So you write your business logic in a way that does scale, even if the implementation of the backing infrastructure doesn't actually support that scale (yet), and you run the whole thing on maybe a single Threadripper.
"Make it work using relational algebra" is as much "make it work" as "make it work using pointer chasing". Neither is, in itself, a fast implementation, though the former is possible to make fast in general and at scale, while the latter will likely never be fast.
cdb_11@reddit
Rewriting everything three times would be ideal maybe, but it's too expensive.
Frolo_NA@reddit
its faster if you do it continuously as you work.
Ok_Chemistry_6387@reddit
Right and fast are often at competing ends of a spectrum. Performance has to be considered from day 1.
Frolo_NA@reddit
Good design means you can avoid locking into strict choices like that for as long as possible
fumei_tokumei@reddit
Always making good design choices takes time and therefore has a tradeoff.
Frolo_NA@reddit
its not equally weighted or equal priority like people are implying and good engineering habits and discipline make those choices easier to make.
Ok_Chemistry_6387@reddit
That is a pipe dream that I have never seen manifest in reality.
max123246@reddit
That's not how design works. As soon as you make it work, other code tends to rely on it and then you can't change your interface when you realize it is unable to be optimized.
You need to understand your performance goals first. Then design a system that works, is right, and is fast. All at the same time. It's an iteration loop that you do over and over before you ever ship a single line of code.
Bakoro@reddit
If your public interface is stable, then the internal implementations can change. That's how you can ship fast and increase performance as you go.
max123246@reddit
Not every public interface is equal. C++'s `std::regex` will forever be slower than launching a Perl process from C++. This is because its ABI is forever stable.
Stability is at odds with performance. This is why interfaces are so powerful: they provide a clear contract. But sometimes that contract is not good enough in practice and limits the number of valid implementations too much.
One such example is `strcat`. Its interface is a footgun leading to O(n^2), because its contract requires implementations to iterate over the string to find a `\0` character. No amount of making it fast after that interface is stable will help you fix that
Frolo_NA@reddit
nothing stops you from changing your code to do exactly as you've said if that is the most optimal performance solution. provided that you've made it work and made it right first.
max123246@reddit
No one who says "make it work make it good make it fast" ever follows it up with "but actually make sure to do all 3 of those things before you promise a stable interface or have customers relying on stability". Prototypes become production code very quickly
Frolo_NA@reddit
that is why its important to have good engineering discipline. you don't know how long your code will live so you don't build more than you need or skimp on things like tests and ci/cd
cdb_11@reddit
The context is public interfaces, they can't change. And this literally just happened to me like a month or two ago. Suddenly we've found a serious problem that the messages were too big, and the format -- the public interface -- had to change somehow. But we had to ensure partial backward compatibility, because other code already relies on it. We've managed to mitigate the problem to some extent, but it's nowhere near optimal and possibly might come back to bite us again in the future. Partially my fault, because I wanted to redesign it before, but other things came up, I forgot about it and things just kinda happened. Nothing more permanent than a temporary solution, as people like to say.
barsoap@reddit
Wait what? Perl's regexen are notoriously not regular, you can get quadratic and worse behaviour from them. From a quick scan of the docs std::regex are regular.
max123246@reddit
My bad, bad example. I had gotten that from word of mouth and now that I'm trying to source benchmarks I'm not finding anything. The rest of it is true though, I just can't find anything showing that launching Perl is faster than regex in the C++ standard library
Frolo_NA@reddit
as well has having the test harness in place to make those changes with confidence
Frolo_NA@reddit
that is exactly how design works.
that's why you test first and code after.
Careful-Nothing-2432@reddit
Market makers translate latency optimizations to dollars. It’s a business priority. FWIW Jane Street, of all of these firms, probably has the highest emphasis on writing good code.
dangerbird2@reddit
I mean, if you're at Jane Street, you'd know with Ocaml you can have both 😉
NotFromSkane@reddit
Or based on people I know who've interned there, you can have neither.
npisnotp@reddit
It depends on the context of the software.
If you're writing a REST API, avoiding branch mispredictions is probably a bad fit and a code reviewer will ask you to remove it.
If you're writing a CPU scheduler, ignoring branch mispredictions can absolutely obliterate the OS performance and thus can be considered negligent.
slaymaker1907@reddit
Even for something low level, this sort of optimization is only relevant when you have code that can both easily be made branchless and has branches that are roughly equally likely. In practice, most branches have a strong bias one way or the other.
Relbang@reddit
We all do, just sometimes performance becomes more important over readability
rtybanana@reddit
you were replying to me, and i hope it’s clear from the response you are replying to here (from me) that i’m not advocating readability above all else in all cases. however, i *do* think that readability should be preferred in the absence of any other specific requirement to the contrary, since in general readable = maintainable.
double-you@reddit
TIL there are conditional increment opcodes (apple) or something that can be combined to do the same (x86).
I always figured most of "branchless" is just BS since the condition still exists (`if (b > x)` vs `a += (b > x)`). Good article. Tests and disassembly included.
Dunge@reddit
I programmed all my life, admittedly not so low level, but I would have never guessed that always copying memory around would be faster than a simple "if" condition. I would have claimed the inverse a thousand times over thinking that "if" are probably the most optimized operations around.
ShinyHappyREM@reddit
CPUs are funny like that.
They also have write caches, so in the eyes of your program a read access can actually be slower than a write access. (Think of waiting for an email vs. putting an email into your email client's outbox and immediately continuing with your work.)
Another thing is that while almost all modifications to the program counter (aka instruction pointer) need a pipeline flush, returning from a function is an exception. That's because return addresses are copied into a special cache too, and that cache is always last-in-first-out. A function that can be optimized so that each `return` translates directly into a `RET` can be faster than a function where a `return` has to jump to the end to do some clean-up.
Careful-Nothing-2432@reddit
Well you’re kind of right, “if” *is* one of the most optimized operations around. We’ve put in so much work into optimizing it because it’s so inefficient but developers want to use it because it’s much easier than writing branchless code.
And it’s hard to make blanket statements about performance, it may be in some cases the overhead of copying is higher and using “if” actually is faster. Depends on the workload and context, a lot of this stuff boils down to just profiling and trying a few things.
cym13@reddit
It's worth noting that branchless approaches are also relevant for constant time code, which is often relevant in security and cryptography related contexts.
tom-morfin-riddle@reddit
And as long as we're noting that, note that the compiler might still find a branching version faster and optimize to something that isn't constant time, unless you tell it not to.
cym13@reddit
Sadly. We really need better ways than rewriting everything in assembly to support constant-time code.
Chisignal@reddit
Interesting question, is this something you could do for the general case? Like,
`<constant-time> arbitrary_code() </constant-time>` and the compiler would just emit (however (in)efficient) code that always runs in constant time. It feels like this is either trivial or unsolvable lol
PrimozDelux@reddit
The problem is that the sort of people who are attracted to compiler dev really don't feel like implementing this sort of thing. It very much goes against the grain of what compilers do, so while it's conceptually trivial it will take a lot of effort. Maybe with AI these sorts of tasks can be done, but then again I think the people who care about constant time evaluation are not the kind of people who want to let an LLM run roughshod over the compiler.
I'm still unsure if it's trivial or unsolvable
cym13@reddit
That's not my area of expertise, but the way I envision it we mostly need guarantees.
Guarantees from the CPU manufacturers first and foremost that some instructions are performed in constant time (and ideally constant energy consumption as well, but at least constant time), or maybe at the very least a guarantee that their execution time won't change in the future. That's because the issues don't stop at compiler optimizations. I think ARM has started to do just that.
Then I'd ask the same of the compiler: some way to say "This section of the code will always be compiled the same way using a standard set of rules that are public and are guaranteed to use the special constant time instruction set". That way you could write functions that would still require care to be constant-time, but at least you would have the tools to do so.
well-litdoorstep112@reddit
Like most problems in CS.
They could pad out certain branches with noops but then again - I don't think you can control out of order execution (eg some cores have more ALUs, some less), you can't control the physical memory layout (although LaurieWired did a video about that and you sort of can) so you'll have time differences because of caching etc.
floodyberry@reddit
constant time that a theoretical secure compiler would be concerned with would be not introducing branches in marked code that aren't already present. general constant time code (i.e. the attacker doesn't have physical access to the machine to do power analysis, and you won't be using any instructions that are variable time based on their input) works by
no secret indexes in to data. either an alternative approach is used (bitslicing aes) or the data is read in full every time and bit masking with 0 is used to ignore the indices you don't want. you don't want the compiler to realize "hey they're masking off every index except i" and short circuit to only read index i
no branches based on secret data. algorithms are (or well, should be) designed to do the maximal number of loops regardless of input data. conditionally swapping two values is done with bit masks. comparing two secret arrays is done by xoring every pair in the array (to get the difference) and oring these in to a single result so that zero means equal and non-zero means not equal. you don't want the compiler to realize you're doing something conditionally and introduce a conditional jump, or that you're just comparing two arrays and early exiting as soon as a difference is encountered.
as long as these rules are followed and the compiler doesn't "help you out", any timing variance will not be due to secret data and thus won't reveal anything to an attacker. out of order execution won't matter because you're executing the same code the same number of times regardless of input, and cache timing won't matter because you're reading/writing the same locations in the same order regardless of input.
tom-morfin-riddle@reddit
And even this has led to fun attacks where power usage and/or heat become a side channel; nop sleds can be optimized at the microcode level to the processor idling. The solution winds up being obsessive attention to every detail at every level, and fortunately there are reasonable and widely available primitives for most of the simple things nowadays.
fumei_tokumei@reddit
If you have truly arbitrary code, then you can't even know if it terminates. And it is hard to make non-terminating code constant time (At least in theory, if we just assume we all die within a billion or so years, then it is kind of easy to limit the execution time, but the usefulness may be questionable).
tom-morfin-riddle@reddit
Constant time yes, easily. You can write this yourself with a timer. Unfortunately time isn't the only resource we want to be consistent over, and "constant time" is usually "constant with respect to some variable" ;-). For instance string comparison needs to be consistent for strings of some particular length, but it's okay for it to take longer to compare longer strings. In practice we want a fairly small set of primitives which operate consistently with respect to every externally observable resource, and outside of very small kernels which utilize those primitives just assume it's completely unsolvable.
New-Anybody-6206@reddit
You could do something like that to hint to the compiler that you prefer branchless optimizations if possible, but someone has to add that support into the compiler.
jaesharp@reddit
branchless programming is hard mode but constant time is a nightmare. I'm so happy for open source tooling like tacit -- they make checking one's work so much easier - used to be near impossible unless you were an applied statistician.
scatmanFATMAN@reddit
There is a bug in the 2nd implementation. If there are no numbers less than 500, small_numbers[0] will be the last value in numbers which will be >= 500.
MiikaH@reddit
Not really. In both cases a real implementation would return "j" as the number of small numbers found. If it is zero you would ignore the first element same as with the branching approach.
OffbeatDrizzle@reddit
well well well, looks like we need a branch after all
josefx@reddit
One branch at the end of the loop, instead of an unpredictable branch every iteration.
gliese946@reddit
That's a branch you have to test once, when j is returned, as opposed to 1000 times within the loop.
White_C4@reddit
Arrays, when left uninitialized, are filled with junk values. Even if no numbers are less than 500, overwriting the 1st element in the branchless loop doesn't matter. It's still treated as a junk value until "j" gets incremented, at which point the 1st element would finally be treated as a real value.
In that case, you'd need to carry the "j" variable outside of the loop block in order to ensure that you iterate through real values instead of junk ones.
Sopel97@reddit
that's not a bug, the size is determined by `j`
jlelanthran@reddit
I don't understand. When there are no numbers less than 500, `small_numbers[0] = numbers[SIZE-1]`. If the last element in numbers is `999`, then `small_numbers[0]` has `999` when the function returns, which I would consider a bug because, lacking a `return j`, the caller would have definitely initialised `small_numbers` with a sentinel value (`INT_MAX` or similar), which is overwritten for element `0` of `small_numbers`.
Sopel97@reddit
maybe it's a bug because small_numbers[0] causes a segfault? we'll never know...
SemaphoreBingo@reddit
How?
Sopel97@reddit
imagine a scenario in which the caller allocates just enough elements for the result
SemaphoreBingo@reddit
It's pre-allocated:
I can certainly imagine scenarios in which the API has changed and a bug exists but that's not this one.
Sopel97@reddit
oh, I didn't notice the full code, didn't scroll down previously. Even more perplexing to me why someone would think there's any bugs.
lelanthran@reddit
I'm not making up any scenarios - the entire program is presented in the post. It's essentially a single scenario.
Sopel97@reddit
well, in this case there is obviously no bug, because no one observes the produced values so their correctness does not matter
ric2b@reddit
-O3 be like.
ShinyHappyREM@reddit
found the C/C++ compiler writer
Sopel97@reddit
btw. this is the scenario that you made up
Brian@reddit
When there are no numbers less than 500, small_numbers[0] = numbers[SIZE-1].
Yes - but small_numbers[0] is past the end. There are 0 numbers in the list.
This isn't really different to when there's one number. You'd have, say, `{10, 999}` in the memory of the list, but the size would be `1`, so only the "10" is considered part of the list.
Part of the information returned is a list, and thus its size. Relying on a sentinel value to indicate the end isn't generally a good approach (C strings being a common example we're stuck with now, despite length-prefixed being generally better).
SemaphoreBingo@reddit
Optimist.
thegreatpotatogod@reddit
If that's a concern for your application, it's trivial to add a single if statement (outside the loop, so not a significant performance penalty), to reset the value at small_numbers[j] to your default value.
jayd16@reddit
Its pseudo code but the idea is that you would return j as the number of valid results.
ShinyHappyREM@reddit
The code initializing the data set could've made sure that this isn't the case.
pfp-disciple@reddit
Not necessarily a bug. At the end, j will only be zero if no numbers are less than 500, and will be SIZE if every number is less than 500. So j is a counter.
This kind of logic is a common source of "off by one" errors in C and C++, notably the buffer overflow and underflow errors. Naming the variable well really helps.
scatmanFATMAN@reddit
Yes, that's true. However, in this example, j is defined inside the `for` loop and cannot be used outside of it to check how many numbers were less than 500. I doubt defining j outside the loop would change the results of the article.
pfp-disciple@reddit
Fair point
ivny93@reddit
Every time that the last number of the input array is greater than 500 it will keep a wrong number in the output array. But there is j that can be used as the count of valid values (which in fact should be 0 in your example)
Fredifrum@reddit
Based on the title, this could easily be a self help article about overthinking decisions and hypotheticals, lol.
jpfed@reddit
Someone should tell my anxiety that speculative execution carries heavy costs!
elsjpq@reddit
You wouldn't want to have a meltdown
backfire10z@reddit
Doesn’t this just swap one condition for another? How is checking `small_numbers[j] > 500` not considered a branch? Does every target have a more efficient conditional increment instruction?
cdb_11@reddit
Using actual if-statements and still getting it to optimize to a conditional move is another thing. But just a comparison like this is not a branch. On x86 you do the comparison with `cmp`, and then extract the bit you want out of the flags register.
backfire10z@reddit
Ah, so when they say branchless they specifically mean trying to remove `jmp` rather than `cmp`, I see.
cdb_11@reddit
Yeah, instructions are pipelined and you start processing next instructions before you're even done with executing previous ones. To keep the pipeline full you need to predict jumps, otherwise the pipeline would always stop at the first jump, which are extremely common in practice. Normal jumps are fine, predictable conditional jumps are fine. By reducing unpredictable jumps you still get to benefit from this entire optimization.
Ok-Scheme-913@reddit
A famous stackoverflow question was about a list of numbers, where doing an operation over them was faster if you sorted them first.
I believe the operation was something like if (i < someNum).
By sorting it, they made the branch predictor learn the branch for the first section of numbers, then it learned the second, making it faster even with the additional sorting.
fumei_tokumei@reddit
First year CS students wondering why the Θ(n log n) algorithm is preferred over the Θ(n) one.
backfire10z@reddit
Yeah, that makes sense. I think I was hung up on the semantics of using the word “branchless” rather than the broader implications of whether the code is optimized or not.
Thanks for all the explanation!
levelstar01@reddit
Well yes because a jump is a branch and a comparison is not?
backfire10z@reddit
I was thinking of branching as more abstract: your behavior changes based on the result of a comparison, hence branching (generically). That is true when using `cmp` as well.
Keavon@reddit
The `>` operator is just a math operator like `+` or `/`. It simply produces a number, 0 or 1 (rather than the sum or quotient). Mentally substitute it with another common math operator and you'll agree there is no control flow branching, just an ALU computation placed in a register.
backfire10z@reddit
That’s actually a really nice way of thinking about it. You’re right, thanks.
ShinyHappyREM@reddit
In case your target CPUs don't support conditional moves you can do this:
backfire10z@reddit
Yeah, this is what I was imagining when I made my original comment, though I think I was thinking about “compareless” rather than “branchless”.
gnuban@reddit
It's not a branch in the sense that control flow won't be steered based on the outcome, so the branch predictor won't be involved.
What you usually want is to turn a branching instruction like a `jle` into something like a `cmov`, or as in the example, unconditionally adding the result of a comparison. That is, moving away from control flow into a built-in like "assign-if-compare-less-than" or "add comparison result".
Internally in the microcode, there is likely still some conditional logic happening, and something like `cmov` is not as quick as a straight addition. But it won't affect the global flow, since it can't branch the full execution flow, only its own instruction's computation.
ack_error@reddit
You're correct that there's nothing preventing the compiler from transforming the pointer bump into a branch, and that occasionally occurs. The code is trying to use a pattern that the compiler is more likely to but not guaranteed to emit as branchless. It generally works because compilers are designed with this in mind. The compiler authors know that and tune the heuristics toward branchless when such an unconventional pattern is used.
It would be better if there were more explicit ways to specify branchless, but the standard language lacks such facilities.
This also applies in the other direction.
`if (numbers > 500)` is not guaranteed to be generated as a branch either, and if the compiler emits it as branchless there's no defined way to reverse it. But this usually isn't a problem, as compilers are inclined to use branches due to the general effectiveness of branch prediction.
unique_ptr@reddit
Because the compiled assembly results in a conditional increment which does not generate a branch because the instruction is always executed.
ReDucTor@reddit
And now if your data lands on just one branch you end up with a loop carried data dependency that is most likely much slower. Making decisions like this you should always be aware of the costs not just the benefits.
rooktakesqueen@reddit
Both versions of the loop as written have a loop carried dependency on
j, neither could be trivially parallelized.ReDucTor@reddit
j only changes with an increment; it's not a heavy data dependency like this one. It's not a matter of the compiler being able to parallelise it but the CPU: with the branchfree version every load depends on the previous load, as that controls the offset, whereas with the branch version, if it predicts the branch, it's only dependent on the increment, and for something like this that's virtually free.
ack_error@reddit
Think there's a mistake with the threshold in the quickbench, it's generating numbers in [128, 255) and then testing v > 128 which makes the branch highly predictable. Changing kThreshold to 192 causes the BM_BranchWrite case to slow down drastically:
https://quick-bench.com/q/10sO4gyp9fel6fXlss-TgY8z1ZU
The loads shouldn't be dependent, it's the stores that are due to the store address. But the only loop carried dependency is a single integer add with 1c latency. All loads and compares can all run in parallel.
ReDucTor@reddit
If you read my comment that was intentional to demonstrate when this approach is slower which is highly predictable data, unpredictable data the branch free version is better hands down.
Each loop store depends on all previous loads in the branch free approach, the increment to j cannot run in parallel it must wait for all previous loads, which means that the usage of it for the store also must wait. So you have all stores, compares and increments backing up waiting for data.
ack_error@reddit
Apologies, I did miss that part.
But I'm not sure why anything but the address add would be in the bottleneck critical path. It is true that the increment can't occur until previous loads complete, but the OOO logic shouldn't have a problem unrolling the loop and running the loads + compares ahead of the store and store increment which will max out at 1/cycle. The only thing I could think of would be getting stalled by load/store conflict checking, but I'm pretty sure modern x86 CPUs have good memory disambiguation prediction.
Looking at the llvm-mca output for both loops, I'm wondering if the issue has to do more with a specific addition optimization in Intel CPUs. For highly taken cases, the branchy version also has a loop carried dependency on the address addition. But I believe recent Intel CPUs are able to optimize small constant adds in the renamer, allowing it to execute these dependent adds faster than 1/clock, which is why llvm-mca is showing them as 0 latency. Forcing the compiler to increment from a register instead of from an immediate constant results in BM_BranchWrite slowing down:
https://quick-bench.com/q/Y1MOs3_VAhPOh02ZszCNmJim6zs
In that case, the load dependencies aren't involved, it's all about the addition chain and the CPU being able to execute chains of dependent ADD reg,imm8 faster than ADD reg,reg.
ReDucTor@reddit
Good catch with the `add reg, imm8` difference, [here's a slightly different](https://quick-bench.com/q/Y1MOs3_VAhPOh02ZszCNmJim6zs) approach with the `volatile` moved out of `filterAbove` so it isn't occurring every time.
All good points. My assumption is the loop should be able to execute most of it out of order, the only real constraint being that the store is dependent on all previous loads, the compares of those loads, and the increment before it can execute, whereas the branching approach is only dependent on the increment.
To be honest, my initial comment is probably overselling the loop carried dependency, as it doesn't completely hold back everything the way it would if the dependency was on the load; the memory is still the biggest bottleneck.
Awesan@reddit
The comments here are strange. Of course any program with performance sensitive or nested loops needs to avoid unpredictable branches in the inner loop. This can be done in a variety of ways, some more "tricky" like in the article and some more architectural (look up data-oriented design).
Ignoring the problem is only fine if you don't care about writing good code. That's fine if that's what you want to do, but then why are you on a programming subreddit? I also think it's totally fine if you didn't know about branch prediction and are learning something new today, but again the attitude should be to learn and not to bury your head in the sand.
500AccountError@reddit
I remember way back in the day when these types of posts would be met with responses of many patterns for resolving the cpu pipelining issues, and discussions surrounding the merits of each.
Hardrocketjs@reddit
I don't get why you are getting downvoted. I know not everything that is performant is maintainable, or the other way around, but personally I'm sick of the "premature optimization is bad" phrase. You can write maintainable code that is performant. Like you mentioned, I personally prefer data-oriented design with functional programming, as it just makes it so easy to see what fails. I think we all need to come to the agreement that object-oriented design and heavy use of state machines is also quite niche!
But to join in the chorus branchless programming kinda needs to advertise itself. Otherwise there are usually bigger bottlenecks!
-Knul-@reddit
AFAIK premature optimization is about 2 things:
1) you don't know beforehand where the bottlenecks are, so until you know, optimization isn't going to help too much
2) err on the side of having readable and maintainable code rather than performant.
Of course that doesn't mean that you should ignore performance completely, as apparently some do.
Icy-Concentrate2076@reddit
I think they are getting downvotes because most devs here and in general never work with something low level enough to find out it matters and think compilers are magical things that are perfect.
Awesan@reddit
Yes I think when branches are your bottleneck something went wrong in the design earlier on, the best programmers I know simply avoid this issue by not creating hard to predict branches in tight loops to begin with.
pillmillipedes@reddit
This is a great trick that every programmer should be aware of, but if you're willing to dig a bit deeper, another 5x improvement can be readily gained by vectorizing the loop:
Time if: 0.394
Time bl: 0.068
Time simd: 0.012
godbolt link (might take a couple reruns to get a box with avx512 on it)
logically it's about the same, but it uses the `vpcompressd` instruction to pack several integers at once, corresponding to the bits set in `mask`.
Unfortunately this handy instruction is unavailable on AVX2, though there are some threads on SO on how you can emulate it.
For a large enough dataset even more can be done re: data access patterns and prefetching but that's way out of scope for this thread to be honest.
QbProg@reddit
Do likely attributes obtain the same effect as branchless version?
cdb_11@reddit
`likely`/`unlikely` attributes will actually incentivize the compiler to prefer branches over branchless code. Clang has `__builtin_unpredictable` that in theory should incentivize the compiler to prefer branchless code, but it seemed kinda busted to me the last time I tried it, not sure if it works. Check the compiler output when using it.
`likely` and `unlikely` can also rearrange the branches. According to the Intel optimization manual, that can change the default prediction when the branch isn't in the history yet. By default backward branches are taken, and forward branches are not taken. So for example the conditional jump back at the end of a loop is assumed to actually loop back. And if you tag some branch as unlikely, that branch will go to the end of the function, and the CPU will assume it just falls through to the likely code by default.
AOEIU@reddit
If the branch is predictable then the branched version is as fast (or faster) regardless of a likely attribute. If it's not predictable then "likely" doesn't really do anything anyway.
A "likely" attribute just effects code generation. When branching a processor either falls through to the next instruction or does a jump to a different instruction. Processors are slightly better at falling through then jumping, so if a compiler can place "likely" code code in the fallthrough case. The performance difference is small (maybe ~10%) compared to a mis-predicted branch.
ShinyHappyREM@reddit
See the comments of this article for some info on that.
In short: likely/unlikely may affect the code generation, depending on the compiler, but at runtime it makes less of a difference.
andymaclean19@reddit
Interesting and a good trick. It is generating memory writes though when the other option doesn't so that might have some unintended side effects performance wise because you are doing more cache flushes.
I remember the EPIC instruction set had conditional flags so you could just skip an instruction conditionally without doing any branching and that was supposed to be faster. I wonder how many other architectures have this sort of thing built in and whether that changes the logic here.
Liquid_Magic@reddit
I get this and I appreciate the post. I read it and will have this sitting in the back of my mind now when programming.
However… I kinda hate that CPUs work like that. I also hate hyper threading. I get it and it's performant and ultimately things are fast because of all of it.
But there’s just something about it that bugs me. I just don’t like this idea that the CPU fucks around with code and tries to run it better. I’m sure it does but what the flying fuck. I honestly felt a little vindicated when all those security vulnerabilities were revealed when it came to hyper threading.
Look I get that most of the time it works every time.
It just all feels like shenanigans.
ack_error@reddit
Not wrong, and this does make optimization more difficult, especially when different CPUs behave very differently. But in general the performance gains from caching, branch prediction, and out of order execution are just too massive to ignore. Disable all that and even the most advanced modern CPU core runs at the speed of an uncached 80386.
Embedded tech has more cases that swing the other way, where the cost or variance is too much and a simpler in-order core is used instead. Don't need a fat OOO core when you have a known platform with a compact in-order CPU core + dedicated on-chip SRAM that can give reliably low latency.
Brian@reddit
Optimising can be a bit weird like that. Back in the day it was a straightforward matter of stuff like counting cycles, but as CPUs have got more complicated, the dominant factor in speed has become "Do stuff that doesn't fuck up the hardware level optimisations". We kind of end up with the tail wagging the dog a bit: the stuff that is there to speed up user-written code now becomes the target for that code.
bart9h@reddit
The title is misleading.
The optimized version slows ME down.
wrecklord0@reddit
But the new code now has to perform a memory store at array[j] in a location that depends on whether the previous loop iteration incremented j. Shouldn't this also prevent unrolling / pipeline optimizations? I see that it's faster in implementation but I wonder why, in terms of what the CPU ends up actually doing. How does it deal with having to store the data in a conditionally dependent position?
mattatghlabs@reddit
Before I even read the article, I assumed this was a strategic question. Often people will get mired in project planning because of all the what ifs.
Sea_Cap_2320@reddit
This is a great article for real time applications and very low level libraries. However to be honest, 99.99% of developers don't need to worry about this :)
Hiddencamper@reddit
True but I’d like my fresh out of the box windows laptop to use less than 16 gb ram and 40% cpu on meaningless bs and poorly optimized code.
TheRealPomax@reddit
It's knowing what's possible, even if you'll never need it, that makes you a better programmer =P
KentWallace@reddit
I read an interesting blog post a while back where the author claimed to have written a nearly "if-less" Java codebase by using exceptions for all logical decisions instead of just errors. The code samples were elegant to read but on the opposite far extreme of performance from this post!
Leverkaas2516@reddit
Aside from performance, branchless paths can make it easier for the reader to reason about a segment of code, and easier to write tests for it. An old colleague used to evangelize the technique for these reasons.
Of course, like anything, you should make your code easy to understand and modify first. Don't make it obtuse just to get a tiny bit of performance, unless you've measured it and you need that performance.
SemaphoreBingo@reddit
Why is small_numbers being updated at all? Also on -O1 godbolt shows that gcc 16.1 x64 converts to a cmp.
Raknarg@reddit
this seems like the kind of thing where if there's a simple change that would result in more efficient code, the compiler will generate better code with optimizations enabled.
Booty_Bumping@reddit
Your compiler is already making all reasonable branchless optimizations for you, the vast majority of the time.
oliver_extracts@reddit
I think it depends heavily on the layer. Inner loop of a database engine, branchless tricks pay off. Wiring up business logic, the readability tradeoff usually isn't worth it. And yeah Fidoz nailed it, most of us aren't at Jane Street, the perf gap doesn't matter at our scale.
DrNybble@reddit
Look at the generated code with Godbolt; a lot of times I've seen that the compiler generates branchless code from an "if" statement anyway.
user_overflow@reddit
"On this benchmark, -O3 did not yield additional speedup, only larger generated code" - so this little maneuver is pointless if you use the compiler properly
tomvorlostriddle@reddit
I thought that was some vague inspirational phrase about there is no trying and not taking no for an answer etc.