KREP - A blazingly fast string search utility designed for performance-critical applications. It implements multiple optimized search algorithms and leverages modern hardware capabilities to deliver maximum throughput.
Posted by davidesantangelo@reddit | programming | 15 comments
levodelellis@reddit
I measured about a year ago, mmap is slow. IIRC calling read in a loop for 1gb of data was significantly faster than calling mmap on a 1gb file. IDR if it was 2x faster or 5x, or what kernel version I used, so I'd recommend measuring it if you really want the speed
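For anyone who wants to measure this themselves, here is a minimal sketch of the two approaches being compared: counting a byte string with a loop of read calls versus searching a memory map. It is not the benchmark behind the numbers above. The function names, the default chunk size and the timing helper are all assumptions made for illustration.

```python
import mmap
import os
import time


def count_with_read(path, needle, chunk_size=1 << 20):
    """Count non-overlapping matches of a non-empty needle via a read() loop."""
    count = 0
    tail = b""  # unconsumed bytes carried over from the previous chunk
    with open(path, "rb", buffering=0) as f:  # unbuffered: one read() per chunk
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf = tail + chunk
            pos = 0
            while True:
                i = buf.find(needle, pos)
                if i < 0:
                    break
                count += 1
                pos = i + len(needle)
            # Keep at most len(needle) - 1 trailing bytes so a match that
            # straddles two chunks is still found, but never re-scan bytes
            # that already belong to a counted match.
            keep_from = max(pos, len(buf) - (len(needle) - 1))
            tail = buf[keep_from:]
    return count


def count_with_mmap(path, needle):
    """Count non-overlapping matches of a non-empty needle via a memory map."""
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            return 0  # an empty file cannot be mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            count = 0
            pos = m.find(needle)
            while pos != -1:
                count += 1
                pos = m.find(needle, pos + len(needle))
            return count


def time_it(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Something like time_it(count_with_read, path, needle, 64 * 1024) next to time_it(count_with_mmap, path, needle) gives a rough comparison. The chunk size is worth varying, and the outcome will depend on the kernel, the filesystem and whether the file is already in the page cache, so repeat each run a few times.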
burntsushi@reddit
It depends on what you're doing. And in the case of krep, which attempts to parallelize searching a single file, one might imagine how memory mapping makes that much easier than alternatives from an implementation perspective.
But in terms of just single threaded search, you can test the impact of memory mapping with ripgrep:
Which shows a small but real improvement. But if you try this on a big code repository, the results not only flip, but get significantly better for the non-mmap case:
Which is why ripgrep, by default, uses a heuristic to determine when to use memory maps.
The last time I did a more thorough analysis here, the perf difference varied quite a bit by platform. It's why, for example, memory maps are totally disabled on macOS since my experiments revealed it was never an improvement.
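To illustrate the point about parallelizing a search over a single file: with a memory map, each worker can simply be handed a byte range of the same mapping, with no per-worker buffers or seek bookkeeping. The following is a rough sketch of that partitioning, not krep's or ripgrep's actual code. The function names and the worker count are assumptions, and it counts every match start position (overlapping matches included) to keep the boundary handling simple. Threads in CPython are used only to show the structure; no speed-up is claimed.

```python
import mmap
import os
from concurrent.futures import ThreadPoolExecutor


def count_in_range(m, needle, start, end):
    """Count matches whose first byte lies in [start, end)."""
    # Let the search window run past `end` by len(needle) - 1 bytes so a
    # match straddling the boundary is counted by the worker that owns
    # its first byte, and by no other worker.
    limit = min(end + len(needle) - 1, len(m))
    count = 0
    pos = m.find(needle, start, limit)
    while pos != -1:
        count += 1
        pos = m.find(needle, pos + 1, limit)
    return count


def parallel_count(path, needle, workers=4):
    """Split one mapped file into `workers` ranges and sum the counts."""
    size = os.path.getsize(path)
    if size == 0 or not needle:
        return 0
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        step = -(-size // workers)  # ceiling division
        ranges = [(s, min(s + step, size)) for s in range(0, size, step)]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            counts = pool.map(lambda r: count_in_range(m, needle, *r), ranges)
            return sum(counts)
```

Doing the same with read calls would mean each worker opening the file, seeking to its offset and managing its own buffer plus the overlap at each boundary, which is the extra implementation work the mapping avoids.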
levodelellis@reddit
Do you have any rule of thumb? I could have sworn 1gb was slower than read. I might test myself soon. I somewhat wonder if the kernel uses an alternative more optimized path when the file >= 2 or 4 gb. I know I didn't test with 12gb
I imagine calling mmap and then unmapping across many files could be expensive. Is one of your heuristics to see if there's a dozen or 100 files and switch to read in that case? I don't think I have a use case where I'd want to use mmap since I don't want the file system to change the data I have in memory
burntsushi@reddit
Yeah exactly. I think it's something like, if ripgrep can definitively tell that it's searching 10 or fewer files, then it uses memory maps. Otherwise it just uses regular read calls. There are other factors, like memory maps can't be used on certain kinds of special files (like /proc/cpuinfo).
I suspect a better heuristic would be to query the file size, and only memory map for very large files. But that's an extra stat call for every file.
Bottom line is that I've never personally seen memory maps lead to a huge speed-up. On large files, it's measurable and noticeable, but not that big of an advantage. So I honestly don't spend a ton of time trying to think of better heuristics.
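The two heuristics described above could be sketched roughly like this. This is not ripgrep's code. The function names are hypothetical, the file-count limit is taken from the "something like 10 or fewer" recollection above, and the size cut-off is an invented placeholder.

```python
import os
import stat

MAX_FILES_FOR_MMAP = 10              # "10 or fewer files", as recalled above
LARGE_FILE_BYTES = 64 * 1024 * 1024  # hypothetical cut-off, not a ripgrep value


def use_mmap_by_file_count(paths):
    """File-count heuristic: map only when the file set is known and small."""
    return len(paths) <= MAX_FILES_FOR_MMAP


def use_mmap_by_size(path, threshold=LARGE_FILE_BYTES):
    """Size heuristic: map only large regular files (one extra stat per file)."""
    try:
        st = os.stat(path)
    except OSError:
        return False  # fall back to read calls if the file cannot be queried
    # Files such as /proc/cpuinfo typically report a size of 0, so a size
    # threshold also keeps them on the read path.
    return stat.S_ISREG(st.st_mode) and st.st_size >= threshold
```

The file-count check costs nothing per file, while the size check adds the extra stat call mentioned above, which is the trade-off between the two.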
levodelellis@reddit
Random thought, I think I remember the numbers you said when I used read for full files, were you measuring one read for the entire file? My numbers came from checking 4k to 4MB. IIRC all OSes best number was something in between.
burntsushi@reddit
In the comment I posted above, the --no-mmap run does not read the entire file into memory (unless it is smaller than ripgrep's buffer size). For large files, this will result in multiple read calls.
There are some cases where ripgrep will read the entire contents of a file onto the heap with one read call in practice. But those cases are generally limited to multiline search when memory mapping can't be used. This case is not shown above.
Casalvieri3@reddit
So you’re posting a link to your own, apparently flawed (if the comments below are correct) utility? Why?
davidesantangelo@reddit (OP)
Oh no, you've discovered my shady plan! Post an open-source project for the programming community to test it, give feedback, and help improve it... What a crazy idea, right? Maybe I should keep it a secret and perfect it in a cave until it becomes magically perfect on its own!
burntsushi@reddit
Author of ripgrep here. I made a number of observations about this tool when it was posted to HN a few weeks ago: https://news.ycombinator.com/item?id=43334661
Perhaps most critically, it prints wrong results... and it does it way slower than both grep and ripgrep:
And... it doesn't even print matches?
NotImplemented@reddit
Good job testing and pointing this out. Thanks!
And really bad form by OP to re-post this without having addressed these obvious problems.
Performance is meaningless if there are no guarantees for correctness.
tommcdo@reddit
The phrase "leverages modern hardware capabilities" sounds like a marketable way to say "we didn't bother writing efficient code"
PrincessOfZephyr@reddit
Unfortunate name.
shinitakunai@reddit
So... why would I use this instead of notepad++
letsloosemoretime@reddit
r/ihadstroke
doyouevenliff@reddit
how does it compare to ripgrep?