Same algorithm, 16x faster: optimizing a vector search engine’s hot path
Posted by BgA_stan@reddit | programming | View on Reddit | 23 comments
Anthony356@reddit
One tiny gripe about the post: the way it's formatted is like... Twitter longpost syndrome? I don't know if there's an existing name for it.
Why is every single sentence on its own line?
It makes it so annoying to read.
Each sentence is a continuation on the same subject.
But they are separated out.
It can help with emphasis when used sparingly.
But it gets REALLY tedious really quickly.
Spoken language naturally has pauses. Punctuation and line breaks represent that in text. If you wouldn't make a big dramatic pause after every single sentence, you shouldn't do it in text. Someone else mentioned that the tone feels condescending, and the line breaks are probably why.
BgA_stan@reddit (OP)
Sry man not a great writer 😅. Thanks for the feedback
unicodemonkey@reddit
I'm not sure whether I'm reading a true-to-form longpost or an LLM-rewritten text nowadays.
Dramatic_Turnover936@reddit
the thing that makes this kind of optimization possible is having the instrumentation to see where time is actually going. a lot of teams skip the profiling setup because it feels like overhead, then spend months guessing at bottlenecks. the flamegraph is doing more work in this post than the algorithmic change.
Lambda-Spira@reddit
Most optimizations focus on the algorithm itself. In practice, the biggest gains often come from how data flows through the system. Once you fix the hot path, the same algorithm can behave very differently.
fishthecomish@reddit
I’ve been working with vector databases for a couple years now, and this approach is genuinely impressive. A single flat array to minimize cache misses and eliminate pointer chasing is exactly the kind of SIMD‑friendly optimization that pays off big at scale. It’s damn smart. The only thing that eventually pushes back is hardware imo, once you scale into the hundreds of millions of vectors, memory limits become the real constraint.
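A minimal sketch of the flat-array layout being described, assuming vectors are packed back-to-back and addressed by index (all names here are illustrative, not from the post):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: every vector lives in one contiguous buffer at
// [i * dim, (i + 1) * dim), instead of being a separate heap allocation.
// Sequential scans then touch memory in order, which is what makes the
// access pattern cache- and SIMD-friendly.
struct FlatVectorStore {
    std::size_t dim;
    std::vector<float> data; // contiguous storage for all vectors

    explicit FlatVectorStore(std::size_t d) : dim(d) {}

    // Append one vector; returns its index.
    std::size_t add(const float* v) {
        data.insert(data.end(), v, v + dim);
        return data.size() / dim - 1;
    }

    // No pointer chasing: just base + offset arithmetic.
    const float* get(std::size_t i) const { return data.data() + i * dim; }
};
```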
BgA_stan@reddit (OP)
I'm modelling this after DiskANN. Currently its FAR from it, but it should scale well even for billon vectors (at least according to the research paper, idk if my implementation will :P)
Determinant@reddit
It can't scale into the billions because each vector uses about 1500 array elements so billions of vectors would require an array with a contiguous region of memory in the terabytes.
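Rough arithmetic behind that claim, assuming 4-byte elements (the ~1500-element figure comes from this comment, not the post):

```cpp
#include <cassert>
#include <cstdint>

// Back-of-the-envelope: one billion vectors at ~1500 four-byte elements
// each. The figures are this thread's assumptions, not measured values.
constexpr std::uint64_t elems_per_vector = 1500;
constexpr std::uint64_t bytes_per_elem   = 4;
constexpr std::uint64_t num_vectors      = 1'000'000'000;
constexpr std::uint64_t total_bytes =
    elems_per_vector * bytes_per_elem * num_vectors;
// 6e12 bytes, i.e. roughly 5.5 TiB in a single contiguous allocation.
```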
Game over
ants_a@reddit
You could just get a computer with terabytes of memory...
Determinant@reddit
It's not about having terabytes of memory, it's about requesting a single contiguous region of memory from the operating system that's in the terabytes.
You would need a supercomputer.
ants_a@reddit
There is 128 PiB of address space available, not that hard to memory map even storage volume sized address spaces. But even for in-memory stuff, you can get 12TB in a 2 socket 1U chassis, not something I am inclined to call a supercomputer.
VictoryMotel@reddit
Putting items in an array is table stakes for anything that needs to be fast, but there's still a lot to deal with. SIMD still can't do gather or scatter efficiently, so if the nearest-neighbor candidates you're scanning aren't clumped together in memory, it won't give you a speedup.
fiah84@reddit
the best algorithms often end up being limited by memory bandwidth, no?
funtimes-forall@reddit
This is very validating. I did almost the exact same thing about 20 years ago. At the time, the dev environment didn't support SIMD very well, so I emulated the instructions and register set in C. When I got it working, I assembled it and it worked perfectly. I didn't benchmark the speedup, but it went from being a dripping faucet to a firehose. It was an all-nighter and the sun was just coming up. It was a good day.
golgol12@reddit
Ooof. Just reading about that vector of shared pointers. That oozes slow.
BTW, what you did is effectively converting a small part of your program to DOD style (Data-Oriented Design).
Pointers and new: it's how to add seconds to loops.
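The before/after being described, as a rough sketch (the struct names are made up for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// The pattern being criticized: every node is its own heap allocation,
// so a traversal chases pointers scattered across the heap.
struct Node { float x, y; };
using SlowGraph = std::vector<std::shared_ptr<Node>>;

// A data-oriented rewrite: the same fields stored in parallel flat
// arrays, indexed by node id. No per-node allocations, and a scan over
// one field streams through contiguous memory.
struct FastGraph {
    std::vector<float> x;
    std::vector<float> y;

    std::size_t add(float xi, float yi) {
        x.push_back(xi);
        y.push_back(yi);
        return x.size() - 1;
    }
};
```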
Syncher_Pylon@reddit
Cache-friendly data layout is consistently the biggest performance win in hot paths. Moving from scattered structs to flat arrays sounds obvious in hindsight, but it's the kind of optimization most people skip because the algorithmic complexity doesn't change. The 16x improvement just from memory layout is a good reminder that big-O isn't everything.
programming-ModTeam@reddit
No content written mostly by an LLM. If you don't want to write it, we don't want to read it.
vini_2003@reddit
Ok, GPT. One of your last profile comments has an em dash, too.
sailing67@reddit
ngl i love posts like this because it reminds me 90% of 'scaling' is just staring at flamegraphs and deleting dumb work. 16x is wild.
TerrorBite@reddit
This seems like an interesting post, but the way it's written makes me feel talked down to.
captain_obvious_here@reddit
Very interesting
Minimum_Success5442A@reddit
The hot path optimization here is solid. For anyone building similar systems - worth also looking at SIMD intrinsics for the distance calculations themselves. AVX-512 can get you another 2-4x on top of cache-friendly access patterns. Also interesting: did you benchmark the difference between heap-allocated vectors vs. a flat buffer with index-based access? The former adds pointer chasing overhead that compounds at scale.
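For context, a scalar version of the distance kernel over a flat buffer looks like this (a sketch, not the post's code); with contiguous data a compiler can often auto-vectorize the loop, and AVX-512 intrinsics such as `_mm512_fmadd_ps` are the manual route for the extra 2-4x mentioned above:

```cpp
#include <cassert>
#include <cstddef>

// Squared L2 distance between two vectors stored contiguously.
// Because a and b are flat, aligned spans rather than per-element
// heap objects, the loop is a straightforward auto-vectorization
// target (or a base for hand-written AVX-512 intrinsics).
float squared_l2(const float* a, const float* b, std::size_t dim) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < dim; ++i) {
        float d = a[i] - b[i];
        acc += d * d;
    }
    return acc;
}
```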