You have a function. It takes a virtual class A as a parameter. This function just calls foo() on this parameter. For whatever reason this function is not inlined by the inliner. You are telling me that the branch predictor and speculative execution is going to, 100% of the time, jump to the correct implementation of foo?
If there are virtual functions then there can be more than one thing. The whole point of runtime polymorphism is that although you know that you have an A at compile time, the runtime type can be different and can dispatch to any number of different foos. That's a branch.
It is true that if you never leverage this feature then all calls can be devirtualized. But this is like saying that because all of your logical branches statically resolve to "true" or "false" after constant folding, imperative programming is actually branchless.
dynamic dispatch is the mechanism to remove branching. the distinction is baked into classes. it just happens that in smalltalk classes are runtime constructs. nothing static is needed.
Maybe we have a different understanding of what a branch is.
I am not saying that there is a syntactic branch in program. I am saying that semantically control will jump to multiple different possible locations based on some runtime information such that we cannot do some sort of out of order execution across the control branch.
The thing that matters here is removing the semantic branch so that your program can get maximum benefit from pipelining and out of order execution.
Branchless is a different (lower) complexity class, so you can't leave the translation fully to the compiler. Well, perhaps you can if you're very aware of exactly how compilers work and how your code translates, but at that point it's much easier to just write the code branchless.
My issue with this is that not all code is meant to be equally maintainable. But people often treat all code as though the likelihood of change were equally distributed across the whole code base.
I can predict with relative certainty that about two thirds of any legacy code base is maybe updated once a year. Sometimes parts of the code never change after they are created.
i don’t personally buy this argument since you can’t, by definition, predict where issues are going to be hiding in your code - unless of course you’re infallible and only intentionally inserting bugs for fun!
it’s also very difficult to predict what the future holds for any part of a codebase. i wouldn’t feel comfortable writing obscure and difficult to maintain code on the basis that i think it will go untouched for years unless i had a very good reason to do so
I would love to go and dig deeper into this rabbit hole, but doing so would significantly depart from the original point of the OP, which is that sometimes, when performance is critical enough, one can use less obvious coding patterns to achieve the same result while wasting much less compute than the more common pattern would.
My comment was in response to someone mentioning readability as a Holy Grail to be applied equally in all parts of the code base, as though they were all equal.
I am in fact not advocating “predicting” which code paths are more or less volatile and which ones require less readable implementation based on a gut feeling alone.
Sometimes, shaving a few milliseconds off a tight loop is worth adopting a more obscure implementation, because of the compound interest this earns you in the real world: smaller cloud bills, the ability to crunch massive amounts of data with fewer resources, game engine optimisation, etc.
These use cases are rare and most developers will never need to worry about them for their whole career. But they do exist and dissing on those tricks just because you never needed them is shortsighted.
you were replying to me, and i hope it’s clear from the response you are replying to here (from me) that i’m not advocating readability above all else in all cases. however, i *do* think that readability should be preferred in the absence of any other specific requirement to the contrary, since in general readable = maintainable.
What is considered “maintainable” is demonstrably different for each use case, individual developer, team or project.
Also, I was not advocating writing intentionally obscure code just because it is off the often visited or changed path. Your reading of this was purely your own.
It’s just that not all parts of code need to be designed with similar flexibility in mind.
Code that changes more often needs to be easier to replace and extend than code that just sits there since the beginning of the project and bootstraps your application or configures your JSON serialisation and is never touched again after initial setup for example.
I feel like the less often code is updated, the more heavily you should lean into readability. If the code is updated only once a year, then after 5 years it's unlikely that the original author is still around, and the style is likely different from the rest of the code base. The worst situation to be in is "don't touch that code, no one knows what it does and it's always just worked".
I suppose you could have a code base where there really is no rhyme or reason to anything, but when people usually discuss code readability, it seems to me like it's all just subjective things, familiarity and personal preferences.
If you write C or Haskell, you don't need to concern yourself with people who don't know the language. I don't know Haskell, so this stuff is unreadable for me personally. If I wanted to maintain a Haskell code base, it's on me to learn how to read it first. Same goes for domain specific code.
You can hire people who are already familiar with how to read branchless code, or you can teach them how to do it. The availability of developers familiar with something can be a factor, but "readability" in this sense is not a black and white issue. For example, last week someone here said that SIMD intrinsics were completely unreadable, while I could instantly recognize what the code in question did.
if the performance gain is important to you then sure, hire people who can work with code of that nature, my point was just that it isn’t “free” i.e. you shouldn’t do this just because you *can* since doing it hurts readability and maintainability
You can still name variables and functions, you can still place comments. And you can just follow the one path, instead of mentally needing to track all possible states.
That said, it’s different looking code. You need some experience with reading that code type. Just as with functional code or recursive code, it needs to click in your brain before you can easily read it.
That depends on what you're trying to solve. If the mapping is the difficult part, the technical side isn't. Branchless is just not a great fit for those kinds of problems.
If you're thinking about any kind of business logic, it's probably not the type of problem where branchless shines. Branchless is great if you need a provable one-to-one mapping, if you need a bound on computation/memory/latency, if you need provably correct code, if you need highly parallelizable code. So if the computation itself is the hard problem for some reason.
That isn't a branchless thing, that is a micro-optimization thing...
You don't typically get weird bugs from micro-optimized code anyway as so much effort has been spent on it. Unless we are including "I just rewrote it and it broke" which IMO is a different problem than bug density.
There's still a branch in the logic here, it's just implicit.
Technically, you could write any program with implicit branches, as there are Turing-complete systems (like cellular automata) without explicit branching. But it wouldn't be easy to read or maintain.
Better: Have two functions, foo_naive and foo_fast. Fuzz test them against one another, and include both in your benchmark harness. Have a third one, foo, that just calls foo_fast; that's the one you export.
Not only do you have simple code as a reference for the opaque code and an engineer-grade proof of equivalence, you have proof that the optimisation is actually necessary, which matters because premature optimisation is the root of all evil. All in all, those changes make the code self-documenting, so you can get rid of the comment.
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%"
But now if foo_fast can't be inlined, I have a jmp... And if foo_fast ever needs upgrading, then someone forgets to update foo_naive and now you have no idea what the right behaviour is.
That's a trivial inline; if a compiler can't do that, it's a bug. If the language supports it, just do a symbol redirection and export foo_fast as foo or whatever. The two implementations will never get out of sync, because you're running the test suites regularly. Aren't you. The right behaviour is usually the naive version, because of "code that obviously contains no bugs" vs "code that contains no obvious bugs" -- that's why it's called naive: it should be obvious. If you want to make a change to the semantics -- well, usually you don't -- you'd start by changing foo_naive, then once you're happy with the functionality you change foo_fast to match. But only after re-benchmarking: the new foo_naive might be fast enough, at which point you just remove foo_fast and the test scaffolding and rename foo_naive to foo.
It isn’t trivial. All kinds of things prevent inlining. Even if fast_foo inlines into foo, foo itself now may not inline. The extra redirection may push you over a limit. Just add a comment.
So now I have to do double the work to update my function? Just add a comment.
Now I have to test two functions? Just add a comment. What if my fast_foo makes assumptions that naive doesn’t? I.e. stable vs unstable sorts? Alignment? Exploiting UB? Vector instructions? Just add a comment.
You're not even trying to get the point, are you, and in fact you hate maintainable code. Functional equivalence of the two functions is the point; do you know what a fuzz test is. And yes, I've used compilers which couldn't do that kind of inline reliably... 30 years ago. Now get off my lawn.
Fuzzing won’t prove equivalence… it proves that for some subset of inputs the outputs matched. Maybe the fast path is ok with losing stability in a sort? What about concurrency and memory ordering? I can think of a multitude of reasons it won’t prove equivalence. Just leave a comment.
Now I have to pay the cost of maintaining a fuzz harness for what? I already wrote the fast code and it runs. Why drag this vestigial function around for literally zero gain? Just leave a comment.
What kind of function? You are literally saying it can inline without knowing the body. What if I unroll a loop and blow the inline size budget? What if I use recursion and it can’t reason about stack depth? What if I exploit registers and increase pressure on the register allocator? Etc. etc. Just leave a comment.
Now if you had said property testing, I could have maybe conceded some points.
It’s not unmaintainable. Just not familiar to you.
You're not telling me anything new, there's a reason why I called it an engineer-grade proof.
Maybe the fast path is ok with losing stability in a sort? What about concurrency and memory ordering? I can think of a multitude of reasons it won’t prove equivalence.
I said "functional equivalence" and you come with memory ordering.
Now I have to pay the cost of maintaining a fuzz harness for what?
For being able to test in general, not just for this thing.
I already wrote the fast code and it runs.
You probably didn't yet, you wrote the slow code and are currently benchmarking.
Why drag this vestigial function around for literally zero gain?
So you're sure that you didn't change semantics when you optimised.
What kind of function? You are literally saying it can inline without knowing the body.
Irrelevant, you troll. You were complaining about the stub possibly breaking inlining. If the whole thing can't inline in the first place then that's a you problem, not the compiler's, and doing it your way won't make it inline either.
Now if you had said property testing, I could have maybe conceded some points.
Let me expand a little. When I first saw this code, I had to think about what it was doing and how it worked. When someone sees this comment, they should also think. I wrote the comment quickly while going through all of my morning emails and such. If I had actually written this, I probably would have put a little more thought and text into it. My primary language throughout my career has been Perl. There are plenty of instances where a compact, efficient piece of code of mine is preceded by a multi-line comment. Comments are cheap. Use them.
Often the "make it fast" requires substantial changes to global design.
"Oops our implementation is incredibly cache unfriendly so now we need to switch to a data oriented design" is not something you want to identify part way through.
And sometimes you know ahead of time that the thing you are building has particular performance needs and shifting entire architectures is a waste of time.
I don't think you are - OP is saying some forms of "make it fast" are such that they can't be the last step here (other than perhaps "write a prototype, then rewrite it from scratch in a very different way"). If you know you need the performance, it can be important to do it up-front, because that performance may require a fundamentally different architecture of your application, even if the naive approach works correctly but isn't fast.
Then your architecture should probably support fast from the start, but that doesn't mean that the implementation needs to be fast.
You can write your whole business logic intended to run on a 10000 CPU cluster in a framework that is a bunch of well-defined functions stubbed out in postgres, written in a week. Having a clear and battle-tested relational algebra semantics for the bespoke stuff you'll need to do to make it really fast certainly won't hurt. And postgres is probably fast enough at least for a limited-scale rollout that the client can play around with.
And I meant the same thing. You cannot use just any software architecture if you need the oomph of 10000 CPUs. So you write your business logic in a way that does scale, even if the implementation of the backing infrastructure doesn't actually support that scale (yet), and you run the whole thing on maybe a single Threadripper.
"Make it work using relational algebra" is as much "make it work" as "make it work using pointer chasing". Neither is, in itself, a fast implementation, though the former is possible to make fast in general and at scale, while the latter will likely never be fast.
That's not how design works. As soon as you make it work, other code tends to rely on it and then you can't change your interface when you realize it is unable to be optimized.
You need to understand your performance goals first. Then design a system that works, is right, and is fast. All at the same time. It's an iteration loop that you do over and over before you ever ship a single line of code.
Not every public interface is equal. C++'s std::regex will forever be slower than launching Perl in C++. This is because its ABI is forever stable.
Stability is at odds with performance. This is why interfaces are so powerful: they provide a clear contract. But sometimes that contract is not good enough in practice and limits the number of valid implementations too much.
One such example is strcat. Its interface is a footgun leading to O(n^2) behaviour, because its contract requires implementations to iterate over the string to find the \0 terminator. No amount of making it fast after that interface is stable will help you fix that.
nothing stops you from changing your code to do exactly as you've said if that is the most optimal performance solution. provided that you've made it work and made it right first.
No one who says "make it work make it good make it fast" ever follows it up with "but actually make sure to do all 3 of those things before you promise a stable interface or have customers relying on stability". Prototypes become production code very quickly
that is why it's important to have good engineering discipline. you don't know how long your code will live, so you don't build more than you need or skimp on things like tests and ci/cd
The context is public interfaces, they can't change. And this literally just happened to me like a month or two ago. Suddenly we've found a serious problem that the messages were too big, and the format -- the public interface -- had to change somehow. But we had to ensure partial backward compatibility, because other code already relies on it. We've managed to mitigate the problem to some extent, but it's nowhere near optimal and possibly might come back to bite us again in the future. Partially my fault, because I wanted to redesign it before, but other things came up, I forgot about it and things just kinda happened. Nothing more permanent than a temporary solution, as people like to say.
C++'s std::regex will forever be slower than launching Perl in C++
Wait what? Perl's regexen are notoriously not regular, you can get quadratic and worse behaviour from them. From a quick scan of the docs std::regex are regular.
My bad, bad example. I had gotten that from word of mouth, and now that I'm trying to source benchmarks I'm not finding anything. The rest of it is true though; I just can't find anything showing that launching Perl is faster than regex in the C++ standard library.
Market makers translate latency optimizations to dollars. It’s a business priority. FWIW Jane Street, of all of these firms, probably has the highest emphasis on writing good code.
Even for something low level, this sort of optimization is only relevant when you have code that can both easily be made branchless and has branches that are roughly equally likely. In practice, most branches have a strong bias one way or the other.
I programmed all my life, admittedly not so low level, but I would have never guessed that always copying memory around would be faster than a simple "if" condition. I would have claimed the inverse a thousand times over, thinking that "if" is probably the most optimized operation around.
They also have write caches, so in the eyes of your program a read access can actually be slower than a write access. (Think of waiting for an email vs. putting an email into your email client's outbox and immediately continuing with your work.)
Another thing is that while almost all modifications to the program counter (aka instruction pointer) need a pipeline flush, returning from a function is an exception. That's because return addresses are copied into a special cache too, and that cache is always last-in-first-out. A function that can be optimized so that each return translates directly into a RET can be faster than a function where a return has to jump to the end to do some clean-up.
Well you’re kind of right, “if” *is* one of the most optimized operations around. We’ve put in so much work into optimizing it because it’s so inefficient but developers want to use it because it’s much easier than writing branchless code.
And it’s hard to make blanket statements about performance, it may be in some cases the overhead of copying is higher and using “if” actually is faster. Depends on the workload and context, a lot of this stuff boils down to just profiling and trying a few things.
It's worth noting that branchless approaches are also relevant for constant time code, which is often relevant in security and cryptography related contexts.
And as long as we're noting that, note that the compiler might still find a branching version faster and optimize to something that isn't constant time, unless you tell it not to.
Interesting question, is this something you could do for the general case? Like, <constant-time> arbitrary_code() </constant-time> and the compiler would just emit (however (in)efficient) code that always runs in constant time. It feels like this is either trivial or unsolvable lol
The problem is that the sort of people who are attracted to compiler dev really don't feel like implementing this sort of thing. It very much goes against the grain of what compilers do, so while it's conceptually trivial it will take a lot of effort since it's so diametrically opposed to what compilers do. Maybe with AI these sorts of tasks can be done, but then again I think the people who care about constant time evaluation are not the kind of people who want to let an LLM run roughshod over the compiler.
That's not my area of expertise, but the way I envision it we mostly need guarantees.
Guarantees from the CPU manufacturers first and foremost that some instructions are performed in constant time (and ideally constant energy consumption as well, but at least constant time), or maybe at the very least a guarantee that their execution time won't change in the future. That's because the issues don't stop at compiler optimizations. I think ARM has started to do just that.
Then I'd ask the same of the compiler: some way to say "This section of the code will always be compiled the same way using a standard set of rules that are public and are guaranteed to use the special constant time instruction set". That way you could write functions that would still require care to be constant-time, but at least you would have the tools to do so.
It feels like this is either trivial or completely unsolvable lol
Like most problems in CS.
They could pad out certain branches with noops but then again - I don't think you can control out of order execution (eg some cores have more ALUs, some less), you can't control the physical memory layout (although LaurieWired did a video about that and you sort of can) so you'll have time differences because of caching etc.
constant time that a theoretical secure compiler would be concerned with would be not introducing branches in marked code that aren't already present. general constant time code (i.e. the attacker doesn't have physical access to the machine to do power analysis, and you won't be using any instructions that are variable time based on their input) works by
no secret indexes into data. either an alternative approach is used (bitslicing AES) or the data is read in full every time and bit masking with 0 is used to ignore the indices you don't want. you don't want the compiler to realize "hey, they're masking off every index except i" and short-circuit to only read index i.
no branches based on secret data. algorithms are (or well, should be) designed to do the maximal number of loops regardless of input data. conditionally swapping two values is done with bit masks. comparing two secret arrays is done by xor-ing every pair in the array (to get the difference) and or-ing these into a single result, so that zero means equal and non-zero means not equal. you don't want the compiler to realize you're doing something conditionally and introduce a conditional jump, or that you're just comparing two arrays and early-exiting as soon as a difference is encountered.
as long as these rules are followed and the compiler doesn't "help you out", any timing variance will not be due to secret data and thus won't reveal anything to an attacker. out of order execution won't matter because you're executing the same code the same number of times regardless of input, and cache timing won't matter because you're reading/writing the same locations in the same order regardless of input.
And even this has led to fun attacks where power usage and/or heat become a side channel; nop sleds can be optimized at the microcode level to the processor idling. The solution winds up being to just be obsessive about every detail at every level, and fortunately there are reasonable and widely available primitives for most of the simple things nowadays.
If you have truly arbitrary code, then you can't even know if it terminates. And it is hard to make non-terminating code constant time (At least in theory, if we just assume we all die within a billion or so years, then it is kind of easy to limit the execution time, but the usefulness may be questionable).
Constant time yes, easily. You can write this yourself with a timer. Unfortunately time isn't the only resource we want to be consistent over, and "constant time" is usually "constant with respect to some variable" ;-). For instance string comparison needs to be consistent for strings of some particular length, but it's okay for it to take longer to compare longer strings. In practice we want a fairly small set of primitives which operate consistently with respect to every externally observable resource, and outside of very small kernels which utilize those primitives just assume it's completely unsolvable.
You could do something like that to hint to the compiler that you prefer branchless optimizations if possible, but someone has to add that support into the compiler.
branchless programming is hard mode but constant time is a nightmare. I'm so happy for open source tooling like tacit -- they make checking one's work so much easier - used to be near impossible unless you were an applied statistician.
There is a bug in the 2nd implementation. If there are no numbers less than 500, small_numbers[0] will be the last value in numbers which will be >= 500.
Not really. In both cases a real implementation would return "j" as the number of small numbers found. If it is zero you would ignore the first element same as with the branching approach.
Uninitialized arrays are filled with junk values. So even if no numbers are less than 500, overwriting the 1st element in the branchless loop doesn't matter: it's still treated as a junk value until "j" gets incremented, at which point the 1st element is finally treated as a real value.
In that case, you'd need to carry the "j" variable outside of the loop block in order to ensure that you iterate through real values instead of junk ones.
that's not a bug, the size is determined by j, small_numbers[0] is past the end
I don't understand. When there are no numbers less than 500, small_numbers[0] = numbers[SIZE-1].
If the last element in numbers is 999, then small_numbers[0] has 999 when the function returns, which I would consider a bug because, lacking a return j, the caller would have definitely initialised small_numbers with a sentinel value (INT_MAX or similar), which is overwritten for element 0 of small_numbers.
the caller would have definitely initialised small_numbers with a sentinel value (INT_MAX or similar), which is overwritten for element 0 of small_numbers
When there are no numbers less than 500, small_numbers[0] = numbers[SIZE-1].
Yes - but small_numbers[0] is past the end. There are 0 numbers in the list.
This isn't really different to when there's one number. You'd have, say, {10, 999} in the memory of the list, but the size would be 1, so only the "10" is considered part of the list.
the caller would have definitely initialised small_numbers with a sentinel value (
Part of the information returned is a list, and thus it's size. Relying on a sentinel value to indicate the end isn't generally a good approach (C strings being a common one we're stuck with now, despite length-prefixed being generally better).
If that's a concern for your application, it's trivial to add a single if statement (outside the loop, so not a significant performance penalty), to reset the value at small_numbers[j] to your default value.
Not necessarily a bug. At the end, j will only be zero if no numbers are less than 500, and will be SIZE if every number is less than 500. So j is a counter.
This kind of logic is a common source of "off by one" errors in C and C++, notably the buffer overflow and underflow errors. Naming the variable well really helps.
Yes, that's true. However in this example, j is defined inside the for loop and cannot be used outside of it to check how many numbers were less than 500. I doubt defining j outside the loop would change the results of the article.
Every time the last number of the input array is 500 or greater, it will leave a wrong number in the output array. But there is j, which can be used as the count of valid values (and which in fact would be 0 in your example).
Doesn’t this just swap one condition for another? How is checking `small_numbers[j] > 500` not considered a branch? Does every target have a more efficient conditional increment instruction?
Using actual if-statements and still getting it to optimize to a conditional move is another thing. But just a comparison like this is not a branch. On x86 you do the comparison with cmp, and then extract the bit you want out of the flags register.
Yeah, instructions are pipelined and you start processing next instructions before you're even done with executing previous ones. To keep the pipeline full you need to predict jumps, otherwise the pipeline would always stop at the first jump, which are extremely common in practice. Normal jumps are fine, predictable conditional jumps are fine. By reducing unpredictable jumps you still get to benefit from this entire optimization.
A famous stackoverflow question was about a list of numbers, where doing an operation over them was faster if you sorted them first.
I believe the operation was something like if (i < someNum).
By sorting it, they made the branch predictor learn the branch for the first section of numbers, then it learned the second, making it faster even with the additional sorting.
Yeah, that makes sense. I think I was hung up on the semantics of using the word “branchless” rather than the broader implications of whether the code is optimized or not.
I was thinking of branching as more abstract: your behavior changes based on the result of a comparison, hence branching (generically). That is true when using `cmp` as well.
The > operator is just a math operator like + or /. It simply produces a number, 0 or 1 (rather than the sum or quotient). Mentally substitute it with another common math operator and you'll agree there is no control flow branching, just an ALU computation placed in a register.
In case your target CPUs don't support conditional moves you can do this:
// (Free Pascal pseudo code)
// Iterate through an array and copy all numbers less than 500 into a new array.
const
Capacity = 100000;
Iterations = 1000;
MaxVal = 2000;
type
Index = 0..(Capacity - 1);
var
Numbers : array[Index] of i32; // i32 is a 32-bit signed integer
Small_Numbers : array[Index] of i32;
c, i, j, tmp : i32;
// init
Randomize;
for i := 0 to high(Numbers) do Numbers[i] := Random(MaxVal); // 0..1999
// main loop
for c := 1 to Iterations do begin
j := 0;
for i := 0 to high(Numbers) do begin
tmp := Numbers[i];
if (tmp < 500) then begin
Small_Numbers[j] := tmp;
Inc(j);
end;
end;
end;
// Alternative version.
const
  Capacity   = 100000;
  Iterations = 1000;
  MaxVal     = 2000;
type
  Index = 0..(Capacity - 1);
var
  Src          : array[Index] of i32;
  Dest         : array[0..1, Index] of i32; // 1 = smaller, 0 = larger
  c, i, j, tmp : i32;

// init
Randomize;
for i := 0 to high(Src) do begin
  Src    [i] := Random(MaxVal); // 0..1999
  Dest[1, i] := MaxVal; // mark as unused via invalid value (a "smaller" slot can never hold >= 500)
  Dest[0, i] := 0;      // mark as unused via invalid value (a "larger" slot can never hold < 500)
end;

// main loop
for c := 1 to Iterations do begin
  for i := 0 to high(Src) do begin
    tmp := Src[i];
    j := tmp - 500; // if tmp is smaller then j will be negative, i.e. the highest bit will be set
    j := u32(j) SHR ((SizeOf(j) * 8) - 1); // get highest bit
    Dest[j, i] := tmp; // write value
  end;
end;
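The same sign-bit selection can be sketched in C (names hypothetical): `(tmp - 500)` is negative exactly when the value is small, and shifting out the sign bit picks the destination row without any conditional jump.

```c
#include <stdint.h>
#include <stddef.h>

#define CAPACITY 8

/* Route each value into one of two rows without a branch:
 * row 1 receives values < 500, row 0 receives the rest. */
void partition_by_sign(const int32_t *src, size_t n,
                       int32_t dest[2][CAPACITY]) {
    for (size_t i = 0; i < n; i++) {
        int32_t tmp = src[i];
        /* sign bit of (tmp - 500): 1 if tmp < 500, else 0 */
        uint32_t row = (uint32_t)(tmp - 500) >> 31;
        dest[row][i] = tmp;  /* unconditional store */
    }
}
```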
It's not a branch in the sense that control flow won't be steered based on the outcome, so the branch predictor won't be involved.
What you usually want is to turn a branching instruction like a jle into something like a cmov or as in the example, unconditionally adding the result of a comparison. That is, moving away from control flow into a built-in like "assign-if-compare-less-than" or "add comparison result".
Internally in the microcode, there is likely still some conditional logic happening, and something like cmov is not as quick as a straight addition. But it won't affect the global flow, since it can't branch the full execution flow, only in its own instruction computation.
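As a sketch, the shape that compilers commonly, though not guaranteed, lower to a cmov (x86) or csel (ARM) is a plain ternary over integers:

```c
/* Typically compiled to cmov/csel at -O2 rather than a branch --
 * not guaranteed, but this is the shape optimizers look for. */
int clamp_min(int x, int lo) {
    return (x < lo) ? lo : x;
}
```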
You're correct that there's nothing preventing the compiler from transforming the pointer bump into a branch, and that occasionally occurs. The code uses a pattern that the compiler is likely, but not guaranteed, to emit as branchless. It generally works because compilers are designed with this in mind: their authors know programmers reach for this unconventional pattern deliberately, and tune the heuristics toward branchless code when they see it.
It would be better if there were more explicit ways to specify branchless, but the standard language lacks such facilities.
This also applies in the other direction. if (numbers > 500) is not guaranteed to be generated as a branch either, and if the compiler emits it as branchless there's no defined way to reverse it. But this usually isn't a problem as compilers are inclined to use branches due to the general effectiveness of branch prediction.
And now if your data lands on just one branch, you end up with a loop-carried data dependency that is most likely much slower. When making decisions like this you should always be aware of the costs, not just the benefits.
j only changes by an increment; it's not a heavy data dependency like this one. It's not a matter of the compiler being able to parallelise it but the CPU: with the branch-free version every load depends on the previous load, as that controls the offset, whereas with the branch version, if it predicts the branch, the only dependency is the increment, and for something like this that's virtually free.
Think there's a mistake with the threshold in the quickbench, it's generating numbers in [128, 255) and then testing v > 128 which makes the branch highly predictable. Changing kThreshold to 192 causes the BM_BranchWrite case to slow down drastically:
The loads shouldn't be dependent; it's the stores that are, due to the store address. But the only loop-carried dependency is a single integer add with 1c latency. All the loads and compares can run in parallel.
If you read my comment that was intentional to demonstrate when this approach is slower which is highly predictable data, unpredictable data the branch free version is better hands down.
Each loop store depends on all previous loads in the branch-free approach. The increment to j cannot run in parallel; it must wait for all previous loads, which means its use in the store address also must wait. So you have all the stores, compares and increments backing up, waiting for data.
But I'm not sure why anything but the address add would be in the bottleneck critical path. It is true that the increment can't occur until previous loads complete, but the OOO logic shouldn't have a problem unrolling the loop and running the loads + compares ahead of the store and store increment which will max out at 1/cycle. The only thing I could think of would be getting stalled by load/store conflict checking, but I'm pretty sure modern x86 CPUs have good memory disambiguation prediction.
Looking at the llvm-mca output for both loops, I'm wondering if the issue has to do more with a specific addition optimization in Intel CPUs. For highly taken cases, the branchy version also has a loop carried dependency on the address addition. But I believe recent Intel CPUs are able to optimize small constant adds in the renamer, allowing it to execute these dependent adds faster than 1/clock, which is why llvm-mca is showing them as 0 latency. Forcing the compiler to increment from a register instead of from an immediate constant results in BM_BranchWrite slowing down:
In that case, the load dependencies aren't involved, it's all about the addition chain and the CPU being able to execute chains of dependent ADD reg,imm8 faster than ADD reg,reg.
Good catch with the `add reg, imm8` difference. [Here's a slightly different](https://quick-bench.com/q/Y1MOs3_VAhPOh02ZszCNmJim6zs) approach with the `volatile` moved out of `filterAbove` so it isn't occurring every time.
All good points. My assumption was that the loop should be able to execute most of it out of order, the only real constraint being that the store depends on all previous loads, the compares of those loads, and the increment before it can execute, whereas the branching approach only depends on the increment.
To be honest, my initial comment probably oversells the loop-carried dependency, as it doesn't hold everything back the way it would if the dependency were on the load; memory is still the biggest bottleneck.
The comments here are strange. Of course any program with performance-sensitive or nested loops needs to avoid unpredictable branches in the inner loop. This can be done in a variety of ways, some more "tricky" like in the article and some more architectural (look up data-oriented design).
Ignoring the problem is only fine if you don't care about writing good code. That's fine if that's what you want to do, but then why are you on a programming subreddit? I also think it's totally fine if you didn't know about branch prediction and are learning something new today, but again the attitude should be to learn and not to bury your head in the sand.
I remember way back in the day when these types of posts would be met with responses of many patterns for resolving the cpu pipelining issues, and discussions surrounding the merits of each.
I don't get why you are getting downvoted. I know not everything that is performant is maintainable, or the other way around, but personally I'm sick of the "premature optimization is bad" phrase. You can write maintainable code that is performant.
Like you mentioned, I personally prefer data-oriented design with functional programming, as it just makes it so easy to see what fails.
I think we all need to come to the agreement that object-oriented design and heavy use of state machines is also quite niche!
But to join in the chorus: branchless programming kinda needs to advertise itself. Otherwise there are usually bigger bottlenecks!
I think they are getting downvotes because most devs here and in general never work with something low level enough to find out it matters and think compilers are magical things that are perfect.
Yes I think when branches are your bottleneck something went wrong in the design earlier on, the best programmers I know simply avoid this issue by not creating hard to predict branches in tight loops to begin with.
This is a great trick that every programmer should be aware of, but if you're willing to dig a bit deeper, another 5x improvement can be readily gained by vectorizing the loop:
typedef int ix16 [[gnu::vector_size(64), gnu::aligned(1)]];
typedef unsigned short u16;

u16 ix16_to_mask(ix16 x) { return __builtin_ia32_cvtd2mask512(x); }

void small_avx512(void) {
    assert(SIZE % 16 == 0 && "Why wouldn't it be?");
    ix16* A = (ix16*)numbers;
    for (int i = 0, j = 0; i < SIZE / 16; i++) {
        u16 mask = ix16_to_mask(A[i] < 500);
        __builtin_ia32_compressstoresi512_mask((void*)&small_numbers[j], A[i], mask);
        j += __builtin_popcount(mask);
    }
}
Time if: 0.394
Time bl: 0.068
Time simd: 0.012
godbolt link (might take a couple reruns to get a box with avx512 on it)
logically it's about the same, but it uses the vpcompressd instruction to pack several integers at once, corresponding to the bits set in mask.
Unfortunately this handy instruction is unavailable on avx2, though there are some threads on SO on how you can emulate it.
For a large enough dataset even more can be done re: data access patterns and prefetching but that's way out of scope for this thread to be honest.
likely/unlikely attributes will actually incentivize the compiler to prefer branches over branchless code. Clang has __builtin_unpredictable, which in theory should incentivize the compiler to prefer branchless code, but it seemed kinda busted to me the last time I tried it, not sure if it works. Check the compiler output when using it.
likely and unlikely can also rearrange the branches. According to the Intel optimization manual, that can change the default prediction, when the branch isn't in the history yet. By default backward branches are taken, and forward branches are not taken. So for example the conditional jump back at the end of a loop is assumed to actually loop back. And if you tag some branch as unlikely, that branch will go to the end of the function, and the CPU will assume it just falls through to the likely code by default.
If the branch is predictable then the branched version is as fast (or faster) regardless of a likely attribute. If it's not predictable then "likely" doesn't really do anything anyway.
A "likely" attribute just affects code generation. When branching, a processor either falls through to the next instruction or jumps to a different one. Processors are slightly better at falling through than jumping, so a compiler will place the "likely" code in the fall-through case when it can. The performance difference is small (maybe ~10%) compared to a mispredicted branch.
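A minimal sketch with GCC/Clang's `__builtin_expect` (the C++20 attribute spelling is `[[likely]]`/`[[unlikely]]`): the hint only shapes code layout, keeping the expected path on the fall-through side. Function and data names here are just for illustration.

```c
#include <stddef.h>

/* Negative entries are assumed rare; marking the check "unlikely"
 * lets the compiler keep the accumulate path as the fall-through. */
long sum_nonnegative(const int *v, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (__builtin_expect(v[i] < 0, 0))
            continue;  /* cold path, laid out away from the hot loop */
        s += v[i];
    }
    return s;
}
```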
Interesting and a good trick. It is generating memory writes though when the other option doesn't so that might have some unintended side effects performance wise because you are doing more cache flushes.
I remember the EPIC instruction set had conditional flags so you could just skip an instruction conditionally without doing any branching and that was supposed to be faster. I wonder how many other architectures have this sort of thing built in and whether that changes the logic here.
I get this and I appreciate the post. I read it and will have it sitting in the back of my mind now when programming.
However… I kinda hate that CPUs work like that. I also hate hyperthreading. I get it, it's performant, and ultimately things are fast because of all of it.
But there's just something about it that bugs me. I just don't like this idea that the CPU fucks around with code and tries to run it better. I'm sure it does, but what the flying fuck. I honestly felt a little vindicated when all those security holes were revealed when it came to hyperthreading.
Look I get that most of the time it works every time.
Not wrong, and this does make optimization more difficult, especially when different CPUs behave very differently. But in general the performance gains from caching, branch prediction, and out of order execution are just too massive to ignore. Disable all that and even the most advanced modern CPU core runs at the speed of an uncached 80386.
Embedded tech has more cases that swing the other way, where the cost or variance is too much and a simpler in-order core is used instead. Don't need a fat OOO core when you have a known platform with a compact in-order CPU core + dedicated on-chip SRAM that can give reliably low latency.
Optimising can be a bit weird like that. Back in the day it was a straightforward matter of stuff like counting cycles, but as CPUs have got more complicated, the dominant factor in speed has become "Do stuff that doesn't fuck up the hardware level optimisations". We kind of end up with the tail wagging the dog a bit: the stuff that is there to speed up user-written code now becomes the target for that code.
But the new code now has to perform a memory store at array[j], in a location that depends on whether the previous loop iteration incremented j. Shouldn't this also prevent unrolling / pipeline optimizations? I see that it's faster in implementation, but I wonder why, in terms of what the CPU ends up actually doing. How does it deal with having to store the data in a conditionally dependent position?
This is a great article for real time applications and very low level libraries. However to be honest, 99.99% of developers don't need to worry about this :)
I read an interesting blog post a while back where the author claimed to have written a nearly "if-less" Java codebase by using exceptions for all logical decisions instead of just errors. The code samples were elegant to read, but on the other far extreme of performance from this post!
Aside from performance, branchless paths can make it easier for the reader to reason about a segment of code, and easier to write tests for it. An old colleague used to evangelize the technique for these reasons.
Of course, like anything, you should make your code easy to understand and modify first. Don't make it obtuse just to get a tiny bit of performance, unless you've measured it and you need that performance.
this seems like the kind of thing where if there's a simple change that would result in more efficient code, the compiler will generate better code with optimizations enabled.
i think it depends heavily on the layer. inner loop of a database engine, branchless tricks pay off. wiring up business logic, the readability tradeoff usually isn't worth it. and yeah Fidoz nailed it, most of us aren't at Jane Street, the perf gap doesn't matter at our scale.
"On this benchmark, -O3 did not yield additional speedup, only larger generated code" - so this little maneuver is pointless if you use compiler properly
Fidoz@reddit
Clever, but I've always preferred readable and maintainable over performant... But to be fair, I'm not at Jane Street
JustOneAvailableName@reddit
Branchless is very maintainable, because code only ever executes one path. Most bugs are caused by branches.
Frolo_NA@reddit
define the branches at programming time and use polymorphism at runtime so that the code only executes one path.
UncleMeat11@reddit
Runtime polymorphism is a branch.
Frolo_NA@reddit
the execution path is already known as soon as you instantiate an object.
UncleMeat11@reddit
The execution path of a branch is known as soon as you know its inputs.
The question is: does the branch predictor know enough to take the correct jump every single time? That's just not the case with runtime polymorphism.
Frolo_NA@reddit
yes we're in agreement.
UncleMeat11@reddit
You have a function. It takes a virtual class A as a parameter. This function just calls foo() on this parameter. For whatever reason this function is not inlined by the inliner. You are telling me that the branch predictor and speculative execution is going to, 100% of the time, jump to the correct implementation of foo?
Frolo_NA@reddit
i'm saying when you have an instance of A, you send it the message foo.
what foo does can only be 1 thing based on the class of A. there is no other option
UncleMeat11@reddit
If there are virtual functions then there can be more than one thing. The whole point of runtime polymorphism is that although you know that you have an A at compile time, the runtime type can be different and can dispatch to any number of different foos. That's a branch.
Frolo_NA@reddit
you move the branch to the "program writing phase", not the "execution phase".
program writing phase and compile time are not exactly the same thing.
UncleMeat11@reddit
Smalltalk has dynamic dispatch.
It is true that if you never leverage this feature then all calls can be devirtualized. But this is like saying that because all of your logical branches statically resolve to "true" or "false" after constant folding, imperative programming is actually branchless.
Frolo_NA@reddit
you don't need to do this though.
dynamic dispatch is the mechanism to remove branching. the distinction is baked into classes. it just happens that in smalltalk classes are runtime constructs. nothing static is needed.
UncleMeat11@reddit
Maybe we have a different understanding of what a branch is.
I am not saying that there is a syntactic branch in program. I am saying that semantically control will jump to multiple different possible locations based on some runtime information such that we cannot do some sort of out of order execution across the control branch.
The thing that matters here is removing the semantic branch so that your program can get maximum benefit from pipelining and out of order execution.
Frolo_NA@reddit
i think that helps clarify things. i don't know how the smalltalk maps to cpu instructions here.
imaami@reddit
How do you implement runtime polymorphism without branches?
Frolo_NA@reddit
parse into polymorphic objects or have them from the start. like smalltalk
imaami@reddit
What does that mean in practice? I mean, it seems branching would be somewhat unavoidable, except maybe with heavy use of function pointers.
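The dispatch being discussed can be sketched with plain C function pointers (the `Shape` type and names here are hypothetical): the call site contains no `if`, but the call target is chosen from runtime data, so the CPU still has to predict an indirect branch.

```c
/* Each object carries its own behaviour; the caller never tests a type
 * tag, but the indirect call target depends on runtime data. */
typedef struct Shape Shape;
struct Shape {
    int (*area)(const Shape *);
    int w, h;
};

static int rect_area(const Shape *s)     { return s->w * s->h; }
static int triangle_area(const Shape *s) { return s->w * s->h / 2; }

int total_area(const Shape *shapes, int n) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += shapes[i].area(&shapes[i]); /* indirect call per element */
    return total;
}
```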
Frolo_NA@reddit
it means that when you receive an input, you give it a class. then as soon as the class is known, branching is removed.
in 'a mentoring course on smalltalk' by andres valloud, chapter 3 explains this way of thinking
JustOneAvailableName@reddit
Branchless is a different (lower) complexity class, so you can't leave translation fully to the compiler. Well, perhaps if you're very aware about how compilers work exactly and how your code translates, but at that point it's much easier to actually write the code branchless.
rtybanana@reddit
i can see your argument but also if code is unreadable then it’s unmaintainable, readability and maintainability are very closely related
Luolong@reddit
My issue with this is that not all code is meant to be equally maintainable. But people often treat all code as though likelihood of change is equally distributed across the whole code base.
I can predict with relative certainty that about two thirds of any legacy code base is maybe updated once a year. Sometimes some parts of the code never change since they are created.
rtybanana@reddit
i don’t personally buy this argument since you can’t, by definition, predict where issues are going to be hiding in your code - unless of course you’re infallible and only intentionally inserting bugs for fun!
it’s also very difficult to predict what the future holds for any part of a codebase. i wouldn’t feel comfortable writing obscure and difficult to maintain code on the basis that i think it will go untouched for years unless i had a very good reason to do so
Luolong@reddit
I would love to go and dig deeper into this rabbit hole, but doing so would significantly depart from the original point of the OP, which is that sometimes, when performance is critical enough, one can use less obvious coding patterns to achieve the same result using much less compute than the more common pattern would.
My comment was in response to someone mentioning readability as a Holy Grail to be applied equally in all parts of the code base as they were all equal.
I am in fact not advocating “predicting” which code paths are more or less volatile and which ones require less readable implementation based on a gut feeling alone.
Sometimes, shaving a few milliseconds off a tight loop is worth adopting a more obscure implementation, because of the compound interest this earns you in the real world: smaller cloud bills, the ability to crunch massive amounts of data with fewer resources, game engine optimisation, etc.
These use cases are rare and most developers will never need to worry about them for their whole career. But they do exist and dissing on those tricks just because you never needed them is shortsighted.
rtybanana@reddit
you were replying to me, and i hope it’s clear from the response you are replying to here (from me) that i’m not advocating readability above all else in all cases. however, i *do* think that readability should be preferred in the absence of any other specific requirement to the contrary, since in general readable = maintainable.
chucker23n@reddit
And this, kids, is why security holes sometimes linger for decades!
Luolong@reddit
Perfect misread!
chucker23n@reddit
How is that a misread?
To treat some code as "eh, it doesn't have to be maintainable; it'll rarely change" means you risk having issues linger in there.
Luolong@reddit
What is considered “maintainable” is demonstrably different for each use case, individual developer, team or project.
Also, I was not advocating writing intentionally obscure code just because it is off the often visited or changed path. Your reading of this was purely your own.
It’s just that not all parts of code need to be designed with similar flexibility in mind.
Code that changes more often needs to be easier to replace and extend than code that just sits there since the beginning of the project and bootstraps your application or configures your JSON serialisation and is never touched again after initial setup for example.
Context matters!
zzz165@reddit
I feel like the less often code is updated, the more heavily you should lean into readability. If the code is updated only once a year, then after 5 years it's unlikely that the original author is still around, and the style is likely different from the rest of the code base. The worst situation to be in is “don’t touch that code, no one knows what it does and it’s always just worked”.
cdb_11@reddit
I suppose you could have a code base where there really is no rhyme or reason to anything, but when people usually discuss code readability, it seems to me like it's all just subjective things, familiarity and personal preferences.
If you write C or Haskell, you don't need to concern yourself with people who don't know the language. I don't know Haskell, so this stuff is unreadable for me personally. If I wanted to maintain a Haskell code base, it's on me to learn how to read it first. Same goes for domain specific code.
You can hire people who are already familiar with how to read branchless code, or you can teach them how to do it. The availability of developers familiar with something can be a factor, but "readability" in this sense is not a black and white issue. For example, last week someone here said that SIMD intrinsics were completely unreadable, while I could instantly recognize what the code in question did.
rtybanana@reddit
absolutely, as with all things engineering, ymmv.
if the performance gain is important to you then sure, hire people who can work with code of that nature, my point was just that it isn’t “free” i.e. you shouldn’t do this just because you *can* since doing it hurts readability and maintainability
Rustywolf@reddit
My unreadable code is very maintainable as long as you dont violate the core tenant of never updating it
nilsph@reddit
* tenet
imaami@reddit
*George
Rustywolf@reddit
Stupid autocorrect
CJKay93@reddit
Maintenance plan: don't.
JustOneAvailableName@reddit
You can still name variables and functions, you can still place comments. And you can just follow the one path, instead of mentally needing to track all possible states.
That said, it’s different looking code. You need some experience with reading that code type. Just as with functional code or recursive code, it needs to click in your brain before you can easily read it.
Guvante@reddit
Branches aren't the problem, mapping business logic to programming logic is.
And branchless forces a more complex mapping.
JustOneAvailableName@reddit
That depends on what you're trying to solve. If the mapping is the difficult part, the technical side isn't. Branchless is just not a great fit for those kinds of problems.
Guvante@reddit
Why do you presume that "that combination of branches" isn't a failure to map business logic?
JustOneAvailableName@reddit
If you're thinking about any kind of business logic, it's probably not the type of problem where branchless shines. Branchless is great if you need a provable one-to-one mapping, if you need a bound on computation/memory/latency, if you need provably correct code, if you need highly parallelizable code. So if the computation itself is the hard problem for some reason.
Guvante@reddit
That isn't a branchless thing that is micro-optimization thing...
You don't typically get weird bugs from micro-optimized code anyway as so much effort has been spent on it. Unless we are including "I just rewrote it and it broke" which IMO is a different problem than bug density.
Ok-Scheme-913@reddit
I mean, then most of maths would be trivial which is obviously not the case.
currentscurrents@reddit
There's still a branch in the logic here, it's just implicit.
Technically, you could write any program with implicit branches, as there are Turing-complete systems (like cellular automata) without explicit branching. But it wouldn't be easy to read or maintain.
mpersico@reddit
One comment is all that’s needed. // This is an optimization to avoid branching.
barsoap@reddit
Better: Have two functions, `foo_naive` and `foo_fast`. Fuzz test them against one another, and include them in your benchmark harness. Have a third one, `foo`, that just calls `foo_fast`; that's the one you export.
Not only do you have simple code as a reference for the opaque code and an engineer-grade proof of equivalence, you have proof that the optimisation is actually necessary, which is required because premature optimisation is the root of all evil. All in all those changes turn the code self-documenting, so you can get rid of the comment.
Ok_Chemistry_6387@reddit
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%"
But now if `foo_fast` can't be inlined, I have a jmp... and if `foo_fast` ever needs upgrading, then someone forgets to update `foo_naive` and now you have no idea what the right behaviour is.
Just add a comment.
barsoap@reddit
That's a trivial inline. If a compiler can't do that, it's a bug. If the language supports it, just do a symbol redirection, `export foo_fast as foo` or whatever.
The two implementations will never get out of sync because you're running the test suites regularly. Aren't you. The right behaviour is usually the naive version, because "code that obviously contains no bugs" vs "code that contains no obvious bugs" -- that's why it's called naive. It should be obvious. If you want to make a change to the semantics, well, usually you don't; also, you'd start by changing `foo_naive`, then once you're happy with the functionality you change `foo_fast` to match. But only after re-benchmarking: the new `foo_naive` might be fast enough, at which point you just remove `foo_fast` and the test stuff etc., and rename `foo_naive` to `foo`.
Ok_Chemistry_6387@reddit
It isn’t trivial. All kinds of things prevent inlining. Even if it inlines fast_foo to foo… foo now may not inline.
The extra redirection may push you over a limit. Just add a comment
So now I have to do double the work to update my function? Just add a comment.
Now I have to test two functions? Just add a comment. What if my fast_foo makes assumptions that naive doesn't? E.g. stable vs unstable sorts? Alignments? Exploits UB? Vector instructions? Just add a comment.
barsoap@reddit
You're not even trying to get the point, are you, and in fact you hate maintainable code. Functional equivalence of the two functions is the point; do you know what a fuzz test is. And yes, I've used compilers which couldn't do that kind of inline reliably... 30 years ago. Now get off my lawn.
Ok_Chemistry_6387@reddit
Fuzzing won't prove equivalence… it proves that for some subset of inputs, some output matched. Maybe the fast path is OK with losing stability in a sort? What about concurrency and memory ordering? I can think of a multitude of reasons it won't prove equivalency. Just leave a comment.
Now I have to pay the cost of maintaining a fuzz harness for what? I already wrote the fast code and it runs. Why drag this vestigial function around for literally zero gain? Just leave a comment.
What kind of function? You are literally saying it can inline without knowing the body. What if I unroll a loop and blow the inline size cost? What if I use recursion and it can't reason about stack depth? What if I exploit registers and increase pressure on the register allocator? Etc etc. Just leave a comment.
Now if you had said property testing, I could have maybe conceded some points.
It’s not unmaintainable. Just not familiar to you.
barsoap@reddit
You're not telling me anything new, there's a reason why I called it an engineer-grade proof.
I said "functional equivalence" and you come with memory ordering.
For being able to test in general, not just for this thing.
You probably didn't yet, you wrote the slow code and are currently benchmarking.
So you're sure that you didn't change semantics when you optimised.
Irrelevant, you troll. You were complaining about the stub possibly breaking inlining. If the whole thing can't inline in the first place then that's a you problem, not the compiler's, and doing it your way won't make it inline either.
You what.
sliversniper@reddit
Code is still unreadable.
It's like using short variable names.
// This is an optimization to avoid hand fatigue.
mpersico@reddit
Let me expand a little. When I first saw this code, I had to think about what it was doing and how it worked. When someone sees this comment, they should also think. I wrote the comment quickly while going through all of my morning emails and such. If I actually wrote this, I probably would have put a little more thought and text into it. My primary language throughout my career has been Perl. There are plenty of instances where a compact, efficient piece of code of mine is preceded by a multi-line comment. Comments are cheap. Use them.
Frolo_NA@reddit
people act like these are a tradeoff. they aren't exactly.
make it work, make it right, make it fast. in that order
UncleMeat11@reddit
Often the "make it fast" requires substantial changes to global design.
"Oops our implementation is incredibly cache unfriendly so now we need to switch to a data oriented design" is not something you want to identify part way through.
Frolo_NA@reddit
there is a reason why make it fast is the last step on there. if you do the first 2, sometimes you won't even need the 3rd at all
UncleMeat11@reddit
And sometimes you know ahead of time that the thing you are building has particular performance needs and shifting entire architectures is a waste of time.
Frolo_NA@reddit
i think we're on the same page.
building the right thing is covered by make it work and make it right
Brian@reddit
I don't think you are - OP is saying some forms of "make it fast" are such that they can't be the last step here (other than perhaps, "write a prototype, then rewrite it from scratch in a very different way"). If you know you need the performance, it can be important to do it up-front if that performance requires a fundamentally different architecture of your application, even if that approach does work correctly, but isn't fast.
Frolo_NA@reddit
if you aren't building the right thing in the first place, how fast it goes is completely useless
SimpleNovelty@reddit
Yeah, the first step should always be know what you need. Requirements fundamentally shape where the effort needs to go.
barsoap@reddit
Then your architecture should probably support fast from the start, but that doesn't mean that the implementation needs to be fast.
You can write your whole business logic intended to run on a 10000 CPU cluster in a framework that is a bunch of well-defined functions stubbed out in postgres, written in a week. Having a clear and battle-tested relational algebra semantics for the bespoke stuff you'll need to do to make it really fast certainly won't hurt. And postgres is probably fast enough at least for a limited-scale rollout that the client can play around with.
UncleMeat11@reddit
By "architecture" I mean the large scale design of your program. Not the physical machine you are running on.
barsoap@reddit
And I meant the same thing. You cannot use just any software architecture if you need the oomph of 10000 CPUs. So you write your business logic in a way that does scale, even if the implementation of the backing infrastructure doesn't actually support that scale (yet), and you run the whole thing on maybe a single Threadripper.
"Make it work using relational algebra" is as much "make it work" as "make it work using pointer chasing". Neither is, in itself, a fast implementation, though the former is possible to make fast in general and at scale, while the latter will likely never be fast.
cdb_11@reddit
Rewriting everything three times would be ideal maybe, but it's too expensive.
Frolo_NA@reddit
its faster if you do it continuously as you work.
Ok_Chemistry_6387@reddit
Right and fast are often at competing ends of a spectrum. Performance has to be considered from day 1.
Frolo_NA@reddit
Good design means you can avoid locking into strict choices like that for as long as possible
fumei_tokumei@reddit
Always making good design choices takes time and therefore has a tradeoff.
Frolo_NA@reddit
its not equally weighted or equal priority like people are implying and good engineering habits and discipline make those choices easier to make.
Ok_Chemistry_6387@reddit
That is a pipe dream that I have never seen manifest in reality.
max123246@reddit
That's not how design works. As soon as you make it work, other code tends to rely on it and then you can't change your interface when you realize it is unable to be optimized.
You need to understand your performance goals first. Then design a system that works, is right, and is fast. All at the same time. It's an iteration loop that you do over and over before you ever ship a single line of code.
Bakoro@reddit
If your public interface is stable, then the internal implementations can change. That's how you can ship fast and increase performance as you go.
max123246@reddit
Not every public interface is equal. C++'s `std::regex` will forever be slower than launching a Perl process from C++. This is because its ABI is forever stable.
Stability is at odds with performance. This is why interfaces are so powerful: they provide a clear contract. But sometimes that contract is not good enough in practice and limits the number of valid implementations too much.
One such example is `strcat`. Its interface is a footgun leading to O(n^2), because its contract requires implementations to iterate over the string to find a `\0` character. No amount of making it fast after that interface is stable will help you fix that
Frolo_NA@reddit
nothing stops you from changing your code to do exactly as you've said if that is the most optimal performance solution. provided that you've made it work and made it right first.
max123246@reddit
No one who says "make it work make it good make it fast" ever follows it up with "but actually make sure to do all 3 of those things before you promise a stable interface or have customers relying on stability". Prototypes become production code very quickly
Frolo_NA@reddit
that is why its important to have good engineering discipline. you don't know how long your code will live so you don't build more than you need or skimp on things like tests and ci/cd
cdb_11@reddit
The context is public interfaces, they can't change. And this literally just happened to me like a month or two ago. Suddenly we've found a serious problem that the messages were too big, and the format -- the public interface -- had to change somehow. But we had to ensure partial backward compatibility, because other code already relies on it. We've managed to mitigate the problem to some extent, but it's nowhere near optimal and possibly might come back to bite us again in the future. Partially my fault, because I wanted to redesign it before, but other things came up, I forgot about it and things just kinda happened. Nothing more permanent than a temporary solution, as people like to say.
barsoap@reddit
Wait what? Perl's regexen are notoriously not regular, you can get quadratic and worse behaviour from them. From a quick scan of the docs std::regex are regular.
max123246@reddit
My bad, bad example. I had gotten that from word of mouth and now that I'm trying to source benchmarks I'm not finding anything. The rest of it is true though, I just can't find anything showing that launching Perl is faster than regex in the C++ standard library
Frolo_NA@reddit
as well has having the test harness in place to make those changes with confidence
Frolo_NA@reddit
that is exactly how design works.
that's why you test first and code after.
Careful-Nothing-2432@reddit
Market makers translate latency optimizations to dollars. It’s a business priority. FWIW Jane Street, of all of these firms, probably has the highest emphasis on writing good code.
dangerbird2@reddit
I mean, if you're at Jane Street, you'd know with Ocaml you can have both 😉
NotFromSkane@reddit
Or based on people I know who've interned there, you can have neither.
npisnotp@reddit
It depends on the context of the software.
If you're writing a REST API, avoiding branch mispredictions is probably a bad fit and a code reviewer will ask you to remove it.
If you're writing a CPU scheduler, ignoring branch mispredictions can absolutely obliterate the OS performance and thus can be considered negligent.
slaymaker1907@reddit
Even for something low level, this sort of optimization is only relevant when you have code that can both easily be made branchless and has branches that are roughly equally likely. In practice, most branches have a strong bias one way or the other.
Relbang@reddit
We all do, just sometimes performance becomes more important over readability
rtybanana@reddit
you were replying to me, and i hope it’s clear from the response you are replying to here (from me) that i’m not advocating readability above all else in all cases. however, i *do* think that readability should be preferred in the absence of any other specific requirement to the contrary, since in general readable = maintainable.
double-you@reddit
TIL there are conditional increment opcodes (apple) or something that can be combined to do the same (x86).
I always figured most of "branchless" is just BS since the condition still exists (`if (b > x)` vs `a += (b > x)`). Good article. Tests and disassembly included.
Dunge@reddit
I programmed all my life, admittedly not so low level, but I would have never guessed that always copying memory around would be faster than a simple "if" condition. I would have claimed the inverse a thousand times over thinking that "if" are probably the most optimized operations around.
ShinyHappyREM@reddit
CPUs are funny like that.
They also have write caches, so in the eyes of your program a read access can actually be slower than a write access. (Think of waiting for an email vs. putting an email into your email client's outbox and immediately continuing with your work.)
Another thing is that while almost all modifications to the program counter (aka instruction pointer) need a pipeline flush, returning from a function is an exception. That's because return addresses are copied into a special cache too, and that cache is always last-in-first-out. A function that can be optimized so that each `return` translates directly into a `RET` can be faster than a function where a `return` has to jump to the end to do some clean-up.
Careful-Nothing-2432@reddit
Well you’re kind of right, “if” *is* one of the most optimized operations around. We’ve put in so much work into optimizing it because it’s so inefficient but developers want to use it because it’s much easier than writing branchless code.
And it’s hard to make blanket statements about performance, it may be in some cases the overhead of copying is higher and using “if” actually is faster. Depends on the workload and context, a lot of this stuff boils down to just profiling and trying a few things.
cym13@reddit
It's worth noting that branchless approaches are also relevant for constant time code, which is often relevant in security and cryptography related contexts.
tom-morfin-riddle@reddit
And as long as we're noting that, note that the compiler might still find a branching version faster and optimize to something that isn't constant time, unless you tell it not to.
cym13@reddit
Sadly. We really need better ways than rewriting everything in assembly to support constant-time code.
Chisignal@reddit
Interesting question, is this something you could do for the general case? Like,
`<constant-time> arbitrary_code() </constant-time>` and the compiler would just emit (however (in)efficient) code that always runs in constant time. It feels like this is either trivial or unsolvable lol
PrimozDelux@reddit
The problem is that the sort of people who are attracted to compiler dev really don't feel like implementing this sort of thing. It very much goes against the grain of what compilers do, so while it's conceptually trivial it will take a lot of effort. Maybe with AI these sorts of tasks can be done, but then again I think the people who care about constant time evaluation are not the kind of people who want to let an LLM run roughshod over the compiler.
I'm still unsure if it's trivial or unsolvable
cym13@reddit
That's not my area of expertise, but the way I envision it we mostly need guarantees.
Guarantees from the CPU manufacturers first and foremost that some instructions are performed in constant time (and ideally constant energy consumption as well, but at least constant time), or maybe at the very least a guarantee that their execution time won't change in the future. That's because the issues don't stop at compiler optimizations. I think ARM has started to do just that.
Then I'd ask the same of the compiler: some way to say "This section of the code will always be compiled the same way using a standard set of rules that are public and are guaranteed to use the special constant time instruction set". That way you could write functions that would still require care to be constant-time, but at least you would have the tools to do so.
well-litdoorstep112@reddit
Like most problems in CS.
They could pad out certain branches with noops but then again - I don't think you can control out of order execution (eg some cores have more ALUs, some less), you can't control the physical memory layout (although LaurieWired did a video about that and you sort of can) so you'll have time differences because of caching etc.
floodyberry@reddit
constant time that a theoretical secure compiler would be concerned with would be not introducing branches in marked code that aren't already present. general constant time code (i.e. the attacker doesn't have physical access to the machine to do power analysis, and you won't be using any instructions that are variable time based on their input) works by
no secret indexes in to data. either an alternative approach is used (bitslicing aes) or the data is read in full every time and bit masking with 0 is used to ignore the indices you don't want. you don't want the compiler to realize "hey they're masking off every index except i" and short circuit to only read index i
no branches based on secret data. algorithms are (or well, should be) designed to do the maximal number of loops regardless of input data. conditionally swapping two values is done with bit masks. comparing two secret arrays is done by xoring every pair in the array (to get the difference) and oring these in to a single result so that zero means equal and non-zero means not equal. you don't want the compiler to realize you're doing something conditionally and introduce a conditional jump, or that you're just comparing two arrays and early exiting as soon as a difference is encountered.
as long as these rules are followed and the compiler doesn't "help you out", any timing variance will not be due to secret data and thus won't reveal anything to an attacker. out of order execution won't matter because you're executing the same code the same number of times regardless of input, and cache timing won't matter because you're reading/writing the same locations in the same order regardless of input.
tom-morfin-riddle@reddit
And even this has led to fun attacks where power usage and/or heat become a side channel; nop sleds can be optimized at the microcode level to the processor idling. The solution winds up being obsessive attention to every detail at every level, and fortunately there are reasonable and widely available primitives for most of the simple things nowadays.
fumei_tokumei@reddit
If you have truly arbitrary code, then you can't even know if it terminates. And it is hard to make non-terminating code constant time (At least in theory, if we just assume we all die within a billion or so years, then it is kind of easy to limit the execution time, but the usefulness may be questionable).
tom-morfin-riddle@reddit
Constant time yes, easily. You can write this yourself with a timer. Unfortunately time isn't the only resource we want to be consistent over, and "constant time" is usually "constant with respect to some variable" ;-). For instance string comparison needs to be consistent for strings of some particular length, but it's okay for it to take longer to compare longer strings. In practice we want a fairly small set of primitives which operate consistently with respect to every externally observable resource, and outside of very small kernels which utilize those primitives just assume it's completely unsolvable.
New-Anybody-6206@reddit
You could do something like that to hint to the compiler that you prefer branchless optimizations if possible, but someone has to add that support into the compiler.
jaesharp@reddit
branchless programming is hard mode but constant time is a nightmare. I'm so happy for open source tooling like tacit -- they make checking one's work so much easier - used to be near impossible unless you were an applied statistician.
scatmanFATMAN@reddit
There is a bug in the 2nd implementation. If there are no numbers less than 500, small_numbers[0] will be the last value in numbers which will be >= 500.
MiikaH@reddit
Not really. In both cases a real implementation would return "j" as the number of small numbers found. If it is zero you would ignore the first element same as with the branching approach.
OffbeatDrizzle@reddit
well well well, looks like we need a branch after all
josefx@reddit
One branch at the end of the loop, instead of an unpredictable branch every iteration.
gliese946@reddit
That's a branch you have to test once, when j is returned, as opposed to 1000 times within the loop.
White_C4@reddit
Arrays, when left uninitialized, are filled with junk values. Even if no numbers are less than 500, overwriting the 1st element in the branchless loop doesn't matter. It's still treated as a junk value until "j" gets incremented, at which point the 1st element would finally be treated as a real value.
In that case, you'd need to carry the "j" variable outside of the loop block in order to ensure that you iterate through real values instead of junk ones.
Sopel97@reddit
that's not a bug, the size is determined by `j`
jlelanthran@reddit
I don't understand. When there are no numbers less than 500, `small_numbers[0] = numbers[SIZE-1]`. If the last element in numbers is `999`, then `small_numbers[0]` has `999` when the function returns, which I would consider a bug because, lacking a `return j`, the caller would have definitely initialised `small_numbers` with a sentinel value (`INT_MAX` or similar), which is overwritten for element `0` of `small_numbers`.
Sopel97@reddit
maybe it's a bug because small_numbers[0] causes a segfault? we'll never know...
SemaphoreBingo@reddit
How?
Sopel97@reddit
imagine a scenario in which the caller allocates just enough elements for the result
SemaphoreBingo@reddit
It's pre-allocated:
I can certainly imagine scenarios in which the API has changed and a bug exists but that's not this one.
Sopel97@reddit
oh, I didn't notice the full code, didn't scroll down previously. Even more perplexing to me why someone would think there's any bugs.
lelanthran@reddit
I'm not making up any scenarios - the entire program is presented in the post. It's essentially a single scenario.
Sopel97@reddit
well, in this case there is obviously no bug, because no one observes the produced values so their correctness does not matter
ric2b@reddit
-O3 be like.
ShinyHappyREM@reddit
found the C/C++ compiler writer
Sopel97@reddit
btw. this is the scenario that you made up
Brian@reddit
When there are no numbers less than 500, small_numbers[0] = numbers[SIZE-1].
Yes - but small_numbers[0] is past the end. There are 0 numbers in the list.
This isn't really different to when there's one number. You'd have, say, `{10, 999}` in the memory of the list, but the size would be `1`, so only the "10" is considered part of the list.
Part of the information returned is a list, and thus its size. Relying on a sentinel value to indicate the end isn't generally a good approach (C strings being a common example we're stuck with now, despite length-prefixed being generally better).
SemaphoreBingo@reddit
Optimist.
thegreatpotatogod@reddit
If that's a concern for your application, it's trivial to add a single if statement (outside the loop, so not a significant performance penalty), to reset the value at small_numbers[j] to your default value.
jayd16@reddit
Its pseudo code but the idea is that you would return j as the number of valid results.
ShinyHappyREM@reddit
The code initializing the data set could've made sure that this isn't the case.
pfp-disciple@reddit
Not necessarily a bug. At the end, j will only be zero if no numbers are less than 500, and will be SIZE if every number is less than 500. So j is a counter.
This kind of logic is a common source of "off by one" errors in C and C++, notably the buffer overflow and underflow errors. Naming the variable well really helps.
scatmanFATMAN@reddit
Yes, that's true. However, in this example, j is defined inside the `for` loop and cannot be used outside of it to check how many numbers were less than 500. I doubt defining j outside the loop would change the results of the article.
pfp-disciple@reddit
Fair point
ivny93@reddit
Every time that the last number of the input array is greater than 500 it will keep a wrong number in the output array. But there is j that can be used as the count of valid values (which in fact should be 0 in your example)
Fredifrum@reddit
Based on the title, this could easily be a self help article about overthinking decisions and hypotheticals, lol.
jpfed@reddit
Someone should tell my anxiety that speculative execution carries heavy costs!
elsjpq@reddit
You wouldn't want to have a meltdown
backfire10z@reddit
Doesn’t this just swap one condition for another? How is checking `small_numbers[j] > 500` not considered a branch? Does every target have a more efficient conditional increment instruction?
cdb_11@reddit
Using actual if-statements and still getting it to optimize to a conditional move is another thing. But just a comparison like this is not a branch. On x86 you do the comparison with `cmp`, and then extract the bit you want out of the flags register.
backfire10z@reddit
Ah, so when they say branchless they specifically mean trying to remove `jmp` rather than `cmp`, I see.
cdb_11@reddit
Yeah, instructions are pipelined and you start processing next instructions before you're even done with executing previous ones. To keep the pipeline full you need to predict jumps, otherwise the pipeline would always stop at the first jump, which are extremely common in practice. Normal jumps are fine, predictable conditional jumps are fine. By reducing unpredictable jumps you still get to benefit from this entire optimization.
Ok-Scheme-913@reddit
A famous stackoverflow question was about a list of numbers, where doing an operation over them was faster if you sorted them first.
I believe the operation was something like if (i < someNum).
By sorting it, they made the branch predictor learn the branch for the first section of numbers, then it learned the second, making it faster even with the additional sorting.
fumei_tokumei@reddit
First year CS students wondering why the Θ(n log n) algorithm is preferred over the Θ(n) one.
backfire10z@reddit
Yeah, that makes sense. I think I was hung up on the semantics of using the word “branchless” rather than the broader implications of whether the code is optimized or not.
Thanks for all the explanation!
levelstar01@reddit
Well yes because a jump is a branch and a comparison is not?
backfire10z@reddit
I was thinking of branching as more abstract: your behavior changes based on the result of a comparison, hence branching (generically). That is true when using `cmp` as well.
Keavon@reddit
The `>` operator is just a math operator like `+` or `/`. It simply produces a number, 0 or 1 (rather than the sum or quotient). Mentally substitute it with another common math operator and you'll agree there is no control flow branching, just an ALU computation placed in a register.
backfire10z@reddit
That’s actually a really nice way of thinking about it. You’re right, thanks.
ShinyHappyREM@reddit
In case your target CPUs don't support conditional moves you can do this:
backfire10z@reddit
Yeah, this is what I was imagining when I made my original comment, though I think I was thinking about “compareless” rather than “branchless”.
gnuban@reddit
It's not a branch in the sense that control flow won't be steered based on the outcome, so the branch predictor won't be involved.
What you usually want is to turn a branching instruction like a `jle` into something like a `cmov`, or as in the example, unconditionally adding the result of a comparison. That is, moving away from control flow into a built-in like "assign-if-compare-less-than" or "add comparison result".
Internally in the microcode, there is likely still some conditional logic happening, and something like `cmov` is not as quick as a straight addition. But it won't affect the global flow, since it can't branch the full execution flow, only its own instruction's computation.
ack_error@reddit
You're correct that there's nothing preventing the compiler from transforming the pointer bump into a branch, and that occasionally occurs. The code is trying to use a pattern that the compiler is more likely to but not guaranteed to emit as branchless. It generally works because compilers are designed with this in mind. The compiler authors know that and tune the heuristics toward branchless when such an unconventional pattern is used.
It would be better if there were more explicit ways to specify branchless, but the standard language lacks such facilities.
This also applies in the other direction.
`if (numbers > 500)` is not guaranteed to be generated as a branch either, and if the compiler emits it as branchless there's no defined way to reverse it. But this usually isn't a problem, as compilers are inclined to use branches due to the general effectiveness of branch prediction.
unique_ptr@reddit
Because the compiled assembly results in a conditional increment which does not generate a branch because the instruction is always executed.
ReDucTor@reddit
And now if your data lands on just one branch you end up with a loop carried data dependency that is most likely much slower. Making decisions like this you should always be aware of the costs not just the benefits.
rooktakesqueen@reddit
Both versions of the loop as written have a loop carried dependency on
j, neither could be trivially parallelized.ReDucTor@reddit
j only changes with an increment; it's not a heavy data dependency like this one. It's not a matter of the compiler being able to parallelise it but the CPU: with the branchfree version every load depends on the previous load, as that controls the offset, whereas with the branch version, if it predicts the branch, it's only dependent on the increment, and for something like this that's virtually free.
ack_error@reddit
Think there's a mistake with the threshold in the quickbench, it's generating numbers in [128, 255) and then testing v > 128 which makes the branch highly predictable. Changing kThreshold to 192 causes the BM_BranchWrite case to slow down drastically:
https://quick-bench.com/q/10sO4gyp9fel6fXlss-TgY8z1ZU
The loads shouldn't be dependent, it's the stores that are due to the store address. But the only loop carried dependency is a single integer add with 1c latency. All loads and compares can all run in parallel.
ReDucTor@reddit
If you read my comment that was intentional to demonstrate when this approach is slower which is highly predictable data, unpredictable data the branch free version is better hands down.
Each loop store depends on all previous loads in the branch free approach, the increment to j cannot run in parallel it must wait for all previous loads, which means that the usage of it for the store also must wait. So you have all stores, compares and increments backing up waiting for data.
ack_error@reddit
Apologies, I did miss that part.
But I'm not sure why anything but the address add would be in the bottleneck critical path. It is true that the increment can't occur until previous loads complete, but the OOO logic shouldn't have a problem unrolling the loop and running the loads + compares ahead of the store and store increment which will max out at 1/cycle. The only thing I could think of would be getting stalled by load/store conflict checking, but I'm pretty sure modern x86 CPUs have good memory disambiguation prediction.
Looking at the llvm-mca output for both loops, I'm wondering if the issue has to do more with a specific addition optimization in Intel CPUs. For highly taken cases, the branchy version also has a loop carried dependency on the address addition. But I believe recent Intel CPUs are able to optimize small constant adds in the renamer, allowing it to execute these dependent adds faster than 1/clock, which is why llvm-mca is showing them as 0 latency. Forcing the compiler to increment from a register instead of from an immediate constant results in BM_BranchWrite slowing down:
https://quick-bench.com/q/Y1MOs3_VAhPOh02ZszCNmJim6zs
In that case, the load dependencies aren't involved, it's all about the addition chain and the CPU being able to execute chains of dependent ADD reg,imm8 faster than ADD reg,reg.
ReDucTor@reddit
Good catch with the `add reg, imm8` difference, [here's a slightly different](https://quick-bench.com/q/Y1MOs3_VAhPOh02ZszCNmJim6zs) approach with the `volatile` moved out of `filterAbove` so it isn't occurring every time.
All good points. My assumption is the loop should be able to execute most of it out of order, the only real constraint being that the store is dependent on all previous loads, the compares of those loads, and the increment before it can execute, whereas the branching approach is only dependent on the increment.
To be honest, my initial comment is probably overselling the loop carried dependency, as it doesn't completely hold back everything the way it would if the dependency was on the load; the memory is still the biggest bottleneck.
Awesan@reddit
The comments here are strange. Of course any program with performance sensitive or nested loops needs to avoid unpredictable branches in the inner loop. This can be done in a variety of ways, some more "tricky" like in the article and some more architectural (look up data-oriented design).
Ignoring the problem is only fine if you don't care about writing good code. That's fine if that's what you want to do, but then why are you on a programming subreddit? I also think it's totally fine if you didn't know about branch prediction and are learning something new today, but again the attitude should be to learn and not to bury your head in the sand.
500AccountError@reddit
I remember way back in the day when these types of posts would be met with responses of many patterns for resolving the cpu pipelining issues, and discussions surrounding the merits of each.
Hardrocketjs@reddit
I don't get why you are getting downvoted. I know not everything that is performant is maintainable, or the other way around, but personally I'm sick of the "premature optimization is bad" phrase. You can write maintainable code that is performant. Like you mentioned, I personally prefer data-oriented design with functional programming, as it just makes it so easy to see what fails. I think we all need to come to the agreement that object-oriented design and heavy use of state machines is also quite niche!
But to join in the chorus branchless programming kinda needs to advertise itself. Otherwise there are usually bigger bottlenecks!
-Knul-@reddit
AFAIK premature optimization is about 2 things:
1) you don't know beforehand where the bottlenecks are, so until you know, optimization isn't going to help too much
2) err on the side of having readable and maintainable code rather than performant.
Of course that doesn't mean that you should ignore performance completely, as apparently some do.
Icy-Concentrate2076@reddit
I think they are getting downvotes because most devs here and in general never work with something low level enough to find out it matters and think compilers are magical things that are perfect.
Awesan@reddit
Yes I think when branches are your bottleneck something went wrong in the design earlier on, the best programmers I know simply avoid this issue by not creating hard to predict branches in tight loops to begin with.
pillmillipedes@reddit
This is a great trick that every programmer should be aware of, but if you're willing to dig a bit deeper, another 5x improvement can be readily gained by vectorizing the loop:
Time if: 0.394
Time bl: 0.068
Time simd: 0.012
godbolt link (might take a couple reruns to get a box with avx512 on it)
logically it's about the same, but it uses the `vpcompressd` instruction to pack several integers at once, corresponding to the bits set in `mask`.
Unfortunately this handy instruction is unavailable on AVX2, though there are some threads on SO on how you can emulate it.
For a large enough dataset even more can be done re: data access patterns and prefetching but that's way out of scope for this thread to be honest.
QbProg@reddit
Do likely attributes obtain the same effect as branchless version?
cdb_11@reddit
`likely`/`unlikely` attributes will actually incentivize the compiler to prefer branches over branchless code. Clang has `__builtin_unpredictable` that in theory should incentivize the compiler to prefer branchless code, but it seemed kinda busted to me the last time I tried it, not sure if it works. Check the compiler output when using it.
`likely` and `unlikely` can also rearrange the branches. According to the Intel optimization manual, that can change the default prediction when the branch isn't in the history yet. By default backward branches are taken, and forward branches are not taken. So for example the conditional jump back at the end of a loop is assumed to actually loop back. And if you tag some branch as unlikely, that branch will go to the end of the function, and the CPU will assume it just falls through to the likely code by default.
AOEIU@reddit
If the branch is predictable then the branched version is as fast (or faster) regardless of a likely attribute. If it's not predictable then "likely" doesn't really do anything anyway.
A "likely" attribute just effects code generation. When branching a processor either falls through to the next instruction or does a jump to a different instruction. Processors are slightly better at falling through then jumping, so if a compiler can place "likely" code code in the fallthrough case. The performance difference is small (maybe ~10%) compared to a mis-predicted branch.
ShinyHappyREM@reddit
See the comments of this article for some info on that.
In short: likely/unlikely may affect the code generation, depending on the compiler, but at runtime it makes less of a difference.
andymaclean19@reddit
Interesting and a good trick. It is generating memory writes though when the other option doesn't so that might have some unintended side effects performance wise because you are doing more cache flushes.
I remember the EPIC instruction set had conditional flags so you could just skip an instruction conditionally without doing any branching and that was supposed to be faster. I wonder how many other architectures have this sort of thing built in and whether that changes the logic here.
Liquid_Magic@reddit
I get this and I appreciate the post. I read it and will have this sitting in the back of my mind now when programming.
However… I kinda hate that CPUs work like that. I also hate hyper threading. I get it and it's performant and ultimately things are fast because of all of it.
But there’s just something about it that bugs me. I just don’t like this idea that the CPU fucks around with code and tries to run it better. I’m sure it does but what the flying fuck. I honestly felt a little vindicated when all those security vulnerabilities were revealed when it came to hyper threading.
Look I get that most of the time it works every time.
It just all feels like shenanigans.
ack_error@reddit
Not wrong, and this does make optimization more difficult, especially when different CPUs behave very differently. But in general the performance gains from caching, branch prediction, and out of order execution are just too massive to ignore. Disable all that and even the most advanced modern CPU core runs at the speed of an uncached 80386.
Embedded tech has more cases that swing the other way, where the cost or variance is too much and a simpler in-order core is used instead. Don't need a fat OOO core when you have a known platform with a compact in-order CPU core + dedicated on-chip SRAM that can give reliably low latency.
Brian@reddit
Optimising can be a bit weird like that. Back in the day it was a straightforward matter of stuff like counting cycles, but as CPUs have got more complicated, the dominant factor in speed has become "Do stuff that doesn't fuck up the hardware level optimisations". We kind of end up with the tail wagging the dog a bit: the stuff that is there to speed up user-written code now becomes the target for that code.
bart9h@reddit
The title is misleading.
The optimized version slows ME down.
wrecklord0@reddit
But the new code now has to perform a memory store at array[j] in a location that depends on whether the previous loop iteration incremented j. Shouldn't this also prevent unrolling / pipeline optimizations? I see that it's faster in implementation but I wonder why, in terms of what the CPU ends up actually doing. How does it deal with having to store the data in a conditionally dependent position?
mattatghlabs@reddit
Before I even read the article, I assumed this was a strategic question. Often people will get mired in project planning because of all the what ifs.
Sea_Cap_2320@reddit
This is a great article for real time applications and very low level libraries. However to be honest, 99.99% of developers don't need to worry about this :)
Hiddencamper@reddit
True but I’d like my fresh out of the box windows laptop to use less than 16 gb ram and 40% cpu on meaningless bs and poorly optimized code.
TheRealPomax@reddit
It's knowing what's possible, even if you'll never need it, that makes you a better programmer =P
KentWallace@reddit
I read an interesting blog post a while back where the author claimed to have written a nearly "if-less" Java codebase by using exceptions for all logical decisions instead of just errors. The code samples were elegant to read but on the opposite far extreme of performance from this post!
Leverkaas2516@reddit
Aside from performance, branchless paths can make it easier for the reader to reason about a segment of code, and easier to write tests for it. An old colleague used to evangelize the technique for these reasons.
Of course, like anything, you should make your code easy to understand and modify first. Don't make it obtuse just to get a tiny bit of performance, unless you've measured it and you need that performance.
SemaphoreBingo@reddit
Why is small_numbers being updated at all? Also on -O1 godbolt shows that gcc 16.1 x64 converts to a cmp.
Raknarg@reddit
this seems like the kind of thing where if there's a simple change that would result in more efficient code, the compiler will generate better code with optimizations enabled.
Booty_Bumping@reddit
Your compiler is already making all reasonable branchless optimizations for you, the vast majority of the time.
oliver_extracts@reddit
I think it depends heavily on the layer. Inner loop of a database engine, branchless tricks pay off. Wiring up business logic, the readability tradeoff usually isn't worth it. And yeah Fidoz nailed it, most of us aren't at Jane Street, the perf gap doesn't matter at our scale.
DrNybble@reddit
Look at the generated code with Godbolt; a lot of times I've seen that the compiler generates branchless code from an "if" statement anyway.
user_overflow@reddit
"On this benchmark, -O3 did not yield additional speedup, only larger generated code" - so this little maneuver is pointless if you use the compiler properly
tomvorlostriddle@reddit
I thought that was some vague inspirational phrase about there is no trying and not taking no for an answer etc.