ScottContini@reddit
Title is misleading. Yes you can beat it using parallelism built into SIMD architectures, assuming the values fit into words that can be processed in parallel. But the asymptotically best comparison-based search without parallelism is still Θ(log n).
Kwantuum@reddit
Even with parallelism
Slime0@reddit
The choice of function names like `vld1q_u16` and `_mm_loadu_si128` for SIMD instructions has got to be one of the biggest hurdles to their general adoption. The amount of mental energy you have to devote just to understanding what the code is doing is ridiculous. It's completely unreadable.

angelicosphosphoros@reddit
Just make aliases: `static constexpr auto LoadSseUnaligned = _mm_loadu_si128;`
ack_error@reddit
This doesn't work because the calls are intrinsics and may also be macros. ARM ACLE specifically says that whether the intrinsics are defined as macros or whether the address of the intrinsics can be taken is unspecified, and some compilers define them as macros with arguments.
angelicosphosphoros@reddit
Well, in that case, it's possible to just use defines.
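For what it's worth, a thin forwarding function sidesteps the macro problem entirely: it compiles whether the intrinsic is a real function or a macro, unlike taking the intrinsic's address. A sketch (the wrapper name is made up):

```cpp
#if defined(__SSE2__)
#include <emmintrin.h>

// `static constexpr auto alias = _mm_loadu_si128;` breaks if the
// intrinsic is defined as a macro; a forwarding function always works,
// and taking `const void *` also avoids the __m128i* pointer cast.
static inline __m128i LoadSseUnaligned(const void *src) {
    return _mm_loadu_si128(static_cast<const __m128i *>(src));
}
#endif
```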
Slime0@reddit
That's not gonna help when I'm looking at code someone else wrote, or reading something online for information about how to use the functions.
svick@reddit
If you use .NET, then the first one is `AdvSimd.LoadVector128` and the second one is `Sse2.LoadVector128`. And you can also use `Vector128.Load` if you want to have cross-platform code.

floodyberry@reddit
simd is generally adopted, it's used everywhere. the reason it's not in general purpose code is not the function names, it's that
a) it's not portable
b) not all cpus support all extensions, so code must either be compiled for a specific architecture, or dynamic dispatching must be used
c) subtle differences, even between extensions on the same architecture, can require completely different implementations
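Point (b) in practice: a minimal runtime-dispatch sketch, picking an implementation through a function pointer. The function names are made up, and `__builtin_cpu_supports` is a GCC/Clang x86 builtin; real dispatchers check for wider extensions (AVX2, AVX-512) the same way.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Portable scalar fallback.
static int64_t sum_scalar(const int32_t *v, size_t n) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++) s += v[i];
    return s;
}

#if defined(__SSE2__)
// 4-wide SSE2 version of the same loop.
static int64_t sum_sse2(const int32_t *v, size_t n) {
    __m128i acc = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_epi32(acc,
            _mm_loadu_si128(reinterpret_cast<const __m128i *>(v + i)));
    int32_t lane[4];
    _mm_storeu_si128(reinterpret_cast<__m128i *>(lane), acc);
    int64_t s = (int64_t)lane[0] + lane[1] + lane[2] + lane[3];
    for (; i < n; i++) s += v[i];  // scalar tail
    return s;
}
#endif

using sum_fn = int64_t (*)(const int32_t *, size_t);

// Chosen once at startup based on what the CPU actually supports.
static sum_fn pick_sum() {
#if defined(__SSE2__)
    if (__builtin_cpu_supports("sse2")) return sum_sse2;
#endif
    return sum_scalar;
}
```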
ack_error@reddit
As others have said, most names follow a pattern once you're used to it. The ARM NEON intrinsic names are better than Intel's. It's partly a consequence of trying to squish readable names into a global namespace for intrinsics that must be available from both C and C++ without overloading. And sometimes the names are obscure because the operations are obscure:
https://developer.arm.com/architectures/instruction-sets/intrinsics/vqrdmlah_s16
Wouldn't really help to have an intrinsic like `signed_saturating_rounding_doubling_multiply_accumulate_high_half()`. But there are other annoyances with common SIMD intrinsics, like:

- Having the IDE give unhelpful (src1, src2, src3) as the arguments for a madd intrinsic because that's how the intrinsic was defined in the compiler header, only to cross-reference it to the documentation which lists them as a, b, and c, which then has to be translated to Vd, Vn, and Vm registers in order to match them to V[n], V[m], and V[d] and then operand1/2/3 in the pseudocode (in a different order), in order to figure out which ones are the multiply terms and which one is the addend.
- Accidentally triggering an illegal instruction because on Intel, despite regularly named intrinsics, 16-bit, 32-bit, and 64-bit logical shifts left and right are SSE2, and 16-bit and 32-bit arithmetic shifts right are SSE2, but 64-bit arithmetic shifts right are AVX-512. Or similarly on ARM, finding out that a particular multiply intrinsic is ARMv8 while the multiply-add intrinsic is ARMv8.1, and nowhere on the intrinsic page for the latter does it tell you that the madd version requires ARMv8.1 / FEAT_RDM unless you also cross-check the assembly instruction.
- Having to type `vreinterpretq_f32_u32(vandq_u32(vreinterpretq_u32_f32(vals), sign_mask))` because ARM decided to use the unwieldy "reinterpret" instead of "cast".
- Having to do nasty pointer casts because Intel did silly things like have the 64-bit load intrinsic `_mm_loadl_epi64` take a `__m128i` pointer even though it only loads 64 bits.

cosmic-parsley@reddit
Seriously feels like a naming constraint was “doesn’t overflow a default excel column”.
You do get a feel for things after a while, but they keep adding more with even more obscure names. A few extra vowels and underscores would make a world of difference.
cdb_11@reddit
You could probably come up with a better naming convention, like maybe `u16x8_load` or something. But let's not make the names longer, because the majority of these are just basic operations, like basic arithmetic or memory loads/stores.

cdb_11@reddit
The names actually do make sense, so you can get used to it.
saf_e@reddit
It's just about the constant: you replace log2(x) with log4(x) and get a better absolute number, but asymptotically it's the same.
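For concreteness, the constant-factor saving is just a change of log base (a sketch, the helper name is made up):

```cpp
#include <cmath>

// Steps to search n elements with branching factor b: log_b(n).
// log4(n) = log2(n) / log2(4) = log2(n) / 2 — half the steps,
// same asymptotic class.
static double steps(double n, double b) {
    return std::log2(n) / std::log2(b);
}
```

For n = 10^6 that's roughly 19.9 steps 2-way versus roughly 10 steps 4-way.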
pepejovi@reddit
This is one of those articles that I can tell is high quality, but I need half the text to be hyperlinks to explanations of the words it's using.. It's going to take me the better part of a day just to halfway understand the terminology used in the explanation :D
melt4082@reddit
Can someone clear up a question for me, please? I'm thinking of enrolling my son at Kodland. The price is 3,900, paid in 12 installments of 331.00, for a game development and animation course. I'm a little unsure because, besides it being online, I saw a lot of complaints on Reclame Aqui, although with a high trust score. Does anyone here know of a child who finished the course there, was satisfied, and really learned?
ctafsiras@reddit
Sure, you can beat binary search with SIMD, but can you beat the existential dread of deciphering Intel intrinsic names six months later?
mr_birkenblatt@reddit
Why not use a btree with node size 16? Then you can load a single node in SIMD (with cache locality!) and do all comparisons at once to figure which node to load next
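The per-node step could look roughly like this (a sketch, using 8 sorted 16-bit keys per 128-bit register for simplicity, with a scalar fallback; `node_rank` is a made-up name): count how many of the node's keys are below the search key, and that count is the index of the child to descend into.

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// For a node of 8 sorted 16-bit keys: the number of keys < key
// is the index of the child subtree to descend into next.
static int node_rank(const int16_t keys[8], int16_t key) {
#if defined(__SSE2__)
    __m128i k  = _mm_set1_epi16(key);
    __m128i v  = _mm_loadu_si128(reinterpret_cast<const __m128i *>(keys));
    __m128i lt = _mm_cmplt_epi16(v, k);  // 0xFFFF in each lane where keys[i] < key
    int mask = _mm_movemask_epi8(lt);    // 2 mask bits per 16-bit lane
    return __builtin_popcount(mask) / 2; // GCC/Clang builtin
#else
    int r = 0;
    for (int i = 0; i < 8; i++) r += keys[i] < key;
    return r;
#endif
}
```

All eight comparisons happen in one instruction, and the node sits in a single cache line.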
trailingunderscore_@reddit
Because he already has the values in a sorted array.
monocasa@reddit
So, a hybrid radix tree?
Unfair-Sleep-3022@reddit
A good read in these trying times. What a rare sight to behold.