Bloom filters: the niche trick behind a 16× faster API | Blog | incident.io
Posted by fagnerbrack@reddit | programming | View on Reddit | 69 comments
Immotommi@reddit
I always like stories of people optimising performance, because I think it doesn't get done enough. But, as soon as I saw jsonb in database fields, I knew that the starting point was dumb.
I don't love the "don't over-engineer at the start" sentiment because I think you end up with things like this. This is just obviously terrible. If you are deserialising JSON just to query it, you haven't "avoided over-engineering", you have just done no engineering.
SaxAppeal@reddit
Yeah the deeper I read into the article the more I became convinced this was just really fucking stupid from the start. They could have slapped TimescaleDB on top of their Postgres instance, and Timescale hypertables probably would have saved them enough time to skip the Bloom filter custom indexing gymnastics altogether. They’re literally building custom managed indexes on top of their database, that’s like using a drill to try and hammer in a screw.
But the whole problem was storing json blobs in Postgres for this use case in the first place. Like it was an interesting read, but I feel like the better solution would have been to take the time to get the hell off jsonb columns, not build a Rube Goldberg indexing machine around their fucked up data store.
ScientistComplex2020@reddit
Don’t know. They used the blobs because the customers can define the columns. Seems like a reasonable setup when you are starting out, and even if you have the budget to separate the customers at the schema level, you still have to fix the p95.
nomme@reddit
May I ask if you have any ideas what a more robust long-term solution would look like? That is, you want the customer to define parts of the schema, but you're past the "starting out" phase.
shared_ptr@reddit
100% this!
shared_ptr@reddit
(Am on team) we explicitly didn't want another technology. We're hiring really fast (2x'ing the team each year) and assign a really high value to having a very simple developer environment setup.
If we added TimescaleDB that is:
- Another thing that can go wrong in production for a service that has a 99.99% contractual SLA
- Another dependency that needs adding into the developer environment locally
- Another technology we need our engineers to know how to use and debug in production
Vs this, which is just a bitmap in Postgres and a small piece of Go code that everyone can understand in ~30 minutes.
SaxAppeal@reddit
TimescaleDB is a Postgres extension. It’s not a new technology really, or a deployment at all. It’s literally just Postgres and Postgres queries. You can install it on an existing Postgres DB; it’s just a library.
shared_ptr@reddit
Sadly not available in the stack we're using, so would've needed a replatform and a fairly substantial data migration to make work.
Compared against the project that implemented this (1 week for 2 engineers), it obviously didn't win out on the ROI.
SaxAppeal@reddit
Ah you’re using a managed postgres service. That makes sense.
shared_ptr@reddit
Yep, it's Cloud SQL in GCP. It's getting bigger nowadays (about 5TB) and has several replicas, but otherwise it's humming along fine.
We have alerts whenever CPU goes >30% and jump on them straight away to ensure we keep a sizeable buffer for bursts, such as when there is a global internet outage and all our customers jump on the platform.
We try and keep the infra as simple as possible so we have options for portability (we're planning out multi-cloud strategies this year), so a vanilla managed service is ideal.
crazedizzled@reddit
If you optimized everything at the start, you'd never ship a product. Sometimes it's like "yeah that's going to be an issue in a year" and you deal with it later.
DiscipleofDeceit666@reddit
I was thinking of doing a jsonb thing with the final result of porting over to mongodb. I felt the need to build some scaffolding so we can easily port over to mongodb but maybe I’m too early? Seemed like an easy way to diff related records.
shared_ptr@reddit
I'm on the team at incident (am a founding engineer) and am responsible for a bunch of how we store this stuff.
I don't really get this? I've scaled Postgres databases well into the 10TB+ range helping build multi-billion dollar businesses on them. JSONB has never been an impediment, and we extract things into columns when we need to.
The path to providing the product experience we want to (flexible schemas for querying and custom attributes with operations) isn't clear to me if you don't have a somewhat flexible column type. And I mean, I guess if all this approach can help us reach is a multi-billion dollar outcome, it doesn't feel so pressing to consider alternatives!
bguggs@reddit
You've posted a bunch here and it's coming off a tad defensive. Reminder not to take seriously criticisms from people who don't know the requirements or care about your business/users.
shared_ptr@reddit
That’s fair! I’m not that bothered by it in truth, our team have been chuckling at the comments describing us as “fucking idiots”
For all the reasons above we’re confident this was a sensible path to take. It’s a healthy nudge, appreciate it.
ptoki@reddit
Yeah. It gives me these vibes of that kid who does "look ma no hands" while riding the bike.
I can get it, ship the product fast, be able to push the change in a day when business has that new idea - then those fancy non rigid databases are ok.
But the moment you have many customers and two medium-sized PCs can't handle the load, you need to change the approach and the technology.
And, as you said, when you need to filter stuff in the app, or even worse in the user's frontend, you went wrong waaaay back....
shared_ptr@reddit
I'm on the team at incident. I wondered: how many customers is 'many' to you?
We have thousands of customers, some of them really big (e.g. Netflix, Vercel, HashiCorp, Lovable, etc.)
Our product has a contractual 99.99% SLA that we've upheld for the last year.
Is many customers for you a lot more than this?
RationalDialog@reddit
Yeah, maybe we're missing the details, but having jsonb and then not using it, instead querying it in-memory in code, seems so effing inefficient.
Technical debt, so to speak. That initial thinking and engineering is crucially important; done properly, they might not even need this crutch. jsonb screams "No idea how to deal with this now, so let's put everything in here".
SlinkyAvenger@reddit
90% of the time, you don't know the scope of the problem at the outset. Would you want to "engineer" for a constantly moving target with something more specialized (or, dare I say, niche) like the particulars of DBs or data theory, or would you rather get your product out to market to validate it, then bring on said niche resources when you run into better-defined obstacles?
Like, indexes are bog-standard in DBs but you aren't shit-talking OP for the one added to support the 30-day data set because there was no way the company would anticipate the need for it.
And, as always, premature optimization is the root of all evil.
visicalc_is_best@reddit
This is a bit dismissive — you don’t need to deserialize to query. JSONB fields can be easily indexed: https://www.crunchydata.com/blog/indexing-jsonb-in-postgres
oglokipierogi@reddit
Can anybody help me understand whether an entity-attribute-value pattern (with entity being the alert) would have been a better schema design for filterable user defined attributes than the JSONB column?
Dave3of5@reddit
EAV doesn't scale very well at all. The main problems are that if you allow a customer to put their own entities in and filter on these they might put like 20 different entities in. A "normal" EAV query would then consist of 20 self joins with a filter for each join on the specific attribute that the user wants to filter on.
Imagine Firstname is one and lastname is another. If they want to filter on firstname = David and lastname = Smith they do something like:
SELECT e1.entity_id FROM EntityAttributeValue e1 JOIN EntityAttributeValue e2 ON e1.entity_id = e2.entity_id WHERE e1.attribute = 'first_name' AND CAST(e1.value AS STRING) = 'David' AND e2.attribute = 'last_name' AND CAST(e2.value AS STRING) = 'Smith';
The more filters you add the more the whole thing becomes slow as traditional indexing doesn't work with this schema.
Note also that the casting here makes the whole thing slower, as you need to cast your values, which are generic, to something more specific (think age). This casting prevents index usage.
Also, EAV kind of prevents you putting anything you want in, as the value column needs to be sized to fit the things you want to store, and you need to know their type to be able to create a filter query.
I've worked with a very heavy EAV-style system where essentially the entire dataset was put into a single table. That company is dead; we couldn't scale for our customers' demands and it was too expensive to sell the software to our customers.
To me, a better approach than this JSON column is to make the types designed by the customer normal attributes on the notification table, but shard this per customer. So each customer has their own notifications table with its own indexes. You could also, instead of tables, shard the customer-specific data into its own database.
This scales better than a single multi-tenant table shared by all customers.
oglokipierogi@reddit
Thanks a lot for the discussion here.
When you say each customer should have their own notification tables with “normal attributes”. Are you suggesting that each notification table would have differing schemas?
That seems frightening in terms of managing migrations over time, but perhaps is a better route.
Dave3of5@reddit
Yeah, a core schema of core fields, then the rest are auto-generated based upon what the customer wants.
In terms of migrations, a giant table where every client's main data is stored is, to me, a much bigger problem. How would you change the schema on that table without downtime?
For the app that the OP is working on, one of its main functions is basically this table + dynamic schema.
Think of it this way: what if they wanted to change the way those custom attributes work? You'd have to run some sort of migration touching every row in that table whilst taking the system down.
EAV would be similar.
It's generally well known that EAV is not scalable: it works well for small-scale systems, but once you're in the millions of rows in that main table, it all falls apart. That's actually one of the reasons that NoSQL DBs came about.
IllIIllIIllIIll@reddit
In my opinion EAV is almost always an anti-pattern
Felkin@reddit
"niche trick" - literally one of the most important algorithms in databases, practically always taught in serious database classes in university and mentioned in every other important database paper.
I've sat in audience chairs listening to people from redshift, snowflake and others giving talks and this gets mentioned so often.
It's not niche in either database research communities or the big database providers. Just the author is in some odd bubble.
Cool read otherwise
gwillen@reddit
It's really well known in a specific field. I suspect that to the vast majority of programmers it's obscure. I could be wrong (I'm certainly very familiar with Bloom filters myself.)
Felkin@reddit
Well by that logic gradient descent is also obscure, because it's part of a specific subfield ;)
I think most software engineers have to interact with databases eventually and then this knowledge certainly should pop up as part of understanding how modern database query engines operate. At least in my view database theory is one of the more critical courses in CS curricula.
gwillen@reddit
Most programmers (especially for an expansive definition that includes anybody who writes code) didn't take a CS curriculum, though.
quetzalcoatl-pl@reddit
with AI tools the term "programmers" already expanded also to "those who don't write code but watch it written" :D
max123246@reddit
I wish more programmers were taught good engineering practices. Even most colleges seem to leave you with so few tools in your toolbox
Empanatacion@reddit
Mentoring a lot of freshly minted CS grads has taught me that learning good engineering practices involves unlearning a lot of wheel reinventing urges that their degree saddled them with.
max123246@reddit
Luckily not every school, but it does seem like most. It's why I urge everyone I know to take MIT's free courses online since it just gets you that practical experience of building stuff and actually teaches you the right concepts, at least in my experience so far
Fornicatinzebra@reddit
I have never heard of "logic gradient descent", so it is indeed obscure. (10+ years programming, non-comp sci BSc and MSc)
geofft@reddit
add a comma after "logic" and read again :)
Fornicatinzebra@reddit
Lol whoops, regardless "gradient descent" is new to me as a term
amejin@reddit
You think needing to know the specifics of how a database reduces your query to tell you something isn't there is important to using the database?
Not once in 14 years of writing high performance queries have I ever thought to myself "I wonder if the bloom filter will be the thing that speeds up this proc?"
I'm sorry, I don't agree with your assertion.
shared_ptr@reddit
Am on the team with Mike, it's exactly this! You can be an engineer building successful products and not have come across probabilistic algorithms like Bloom filters or HyperLogLog etc. if you didn't study computer science.
If you get really into the weeds which I have (as in, you've ended up spending time in the Postgres source code looking for bugs) then you'll come across it, but it's easy to miss even if you have a very long professional career.
happyscrappy@reddit
It's just clickbait.
"One weird trick you've never heard of."
Same as anyone else using this, the writer doesn't care that many people have heard of it. They just want to intrigue you into clicking.
mjec@reddit
Something I picked up from a colleague years ago: I have "bloom filter" as a slack alert word because either they're doing something very cool or making a terrible mistake
OMGItsCheezWTF@reddit
They feel so simple until you explain them to someone who hasn't met them before.
"No you have a large bit array in memory and then you paint bits based on your constraint and if any of the bits have already been painted then you may have a collision but if none of the bits are painted you definitely don't"
This feels so simple yet so many people can't see it.
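The "paint bits" description above can be sketched in a few lines. This is an illustrative toy, not anything from the article; the class name, sizes, and hashing scheme are all made up for the example:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: a bit array where we 'paint' k bit positions
    per item, derived from k salted hashes of the item."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k positions by salting the hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # Any bit unpainted -> definitely absent.
        # All bits painted  -> maybe present (or a collision).
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alert-123")
print(bf.might_contain("alert-123"))  # True: never a false negative
print(bf.might_contain("alert-999"))  # almost certainly False
```

The "definitely not here" answer is the useful one: it lets you skip expensive work with certainty, while the "maybe" answer just falls through to the slow path.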
nemec@reddit
The implementation is different, but conceptually I've found it handy to think about it like a hash table/set: you know how creating a hash of value, then looking it up in the hash array is super fast? And also that hash collisions exist, so just because the hash exists doesn't mean the value is actually in the set? Bloom Filters basically stop there, so if there is no hash match, it's definitely not in the backend, but if there is, it's either a true match or a collision. And depending on the array size (and input size) you can still get a very low probability of collisions which makes it worthwhile.
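That "hash set that stops at the hash" framing can be made concrete with a toy sketch; `remember`/`maybe_seen` and the tiny table size are made up for illustration:

```python
# Keep only small hash values, never the originals, so membership
# answers become probabilistic. The table is deliberately tiny (16
# slots) so collisions are plausible.
seen_hashes = set()
TABLE_SIZE = 16

def remember(value: str) -> None:
    seen_hashes.add(hash(value) % TABLE_SIZE)

def maybe_seen(value: str) -> bool:
    # False -> definitely never remembered (no false negatives).
    # True  -> remembered, or a hash collision (false positive).
    return (hash(value) % TABLE_SIZE) in seen_hashes
```

A real Bloom filter improves on this by using several hash functions and a bit array, which gives a much lower false-positive rate for the same memory.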
Wolfy87@reddit
I heard if you say "bloom filter" three times into a whiteboard, /u/mjec turns up to review your design document.
DivineSentry@reddit
Who’s that?
DigThatData@reddit
whisper to the whiteboard and find out
sai-kiran@reddit
Look at the comment username he responded to.
warpedspockclone@reddit
I downloaded an Orlando Bloom filter for my zoom meetings
SaxAppeal@reddit
I think in the case of this article it’s both tbh
Kjufka@reddit
webshit bubble
Gooch_Limdapl@reddit
What percentage of programmers in the world would you estimate work to provide big databases or conduct research on database implementation?
shared_ptr@reddit
Am on the team at incident: it's not that niche I agree, it is 'niche' if you don't come from a computer science background though, which is where Mike (author) is coming from!
SyntheticDuckFlavour@reddit
i.e. niche
CrossFloss@reddit
They are everywhere: networking, caching, distributed systems, ...
Felkin@reddit
In an article about speeding up a database :)
max123246@reddit
I'm curious, do bloom filters beat out quotient filters in practice? I had thought quotient filters were the better choice performance wise
ptoki@reddit
If done right, old-style SQL filtering is unbeatable.
If done poorly (lack of indexes, too-long rows, etc.), either can win.
TQuake@reddit
Sounds like it’s important in a niche…
max123246@reddit
Niche in the sense that most people don't know it. Not niche in the fact that nearly all web workloads involve a database
Empanatacion@reddit
I don't know how USB-C works either, and I use it every day.
Bayakoo@reddit
Not a common trick in web dev/SaaS dev though, I'd imagine.
jduartedj@reddit
honestly the real takeaway here isn't even the bloom filter itself, its the fact that they had jsonb columns being deserialized in app-level code for filtering. thats the kind of thing that works fine with 1000 rows and becomes a nightmare at scale.
the bloom filter is cool but its also kind of a band-aid for a schema problem they shouldve addressed earlier. like... postgres has GIN indexes on jsonb that work really well for containment queries. you dont need to pull data into memory to filter it
that said i've used bloom filters in a completely different context, deduplicating events in a high throughput pipeline where checking a set would eat too much ram. saved us from needing redis just for dedup. so they definitely have their place, just maybe not as a substitute for proper schema design lol
Lonsdale1086@reddit
Can I advise you review the section that addresses why they went with bloom over gin indexes?
https://incident.io/blog/bloom-filters#gin-vs-bloom
jduartedj@reddit
oh nice, hadnt scrolled to that section, thanks for pointing it out. just read it - their argument about gin index size on high cardinality data makes a lot of sense, especially the bit about the partial index pattern not scaling when you have many tenants. fair enough.
still think for most teams gin would be the right call, but for their specific shape of data (many small partial-style filters across a huge table) bloom does seem like the better fit. cool to see them actually walk through the tradeoff instead of just picking the trendy thing
shared_ptr@reddit
Yeah I definitely wouldn’t advise people run at this by default.
It only works for us because we have specific product features needing totally flexible schemas: you can add arbitrary attributes to your alert like Feature, Team, Product, etc which are backed by a flexible ‘catalog’ that is itself a database.
Finding a nice filtering pattern for that when you can’t know the structure in advance is the real thorny problem this tries solving. If you don’t have that shape problem, a gin index will be the easiest route by far.
shared_ptr@reddit
Am on the team at incident and yes, we couldn't use gin for this for reasons I hope the article is really clear about.
sailing67@reddit
bloom filters are one of those things that feel like cheating when you first learn about them. like yeah technically theres a small chance of false positives but in practice for the right use case the tradeoff is so obviously worth it. used one a few years ago for a url shortener to skip db lookups on never-seen urls and the perf difference was immediately noticeable. more people should know about these
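The URL-shortener pattern described above (consult the filter before touching the database) looks roughly like this; the single-hash filter, `record`/`lookup` names, and the dict standing in for the database are all illustrative assumptions:

```python
import hashlib

# A minimal single-hash Bloom filter guarding a (hypothetical) database
# lookup, as in the URL-shortener use case. One byte per bit for clarity.
BITS = bytearray(4096)

def _pos(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % len(BITS)

def record(short_code: str) -> None:
    BITS[_pos(short_code)] = 1

def lookup(short_code: str, db: dict):
    # Filter says "definitely not there": skip the expensive DB hit.
    if not BITS[_pos(short_code)]:
        return None
    # Filter says "maybe": fall through to the real lookup, which also
    # resolves any false positives.
    return db.get(short_code)

db = {"abc123": "https://example.com"}
record("abc123")
print(lookup("abc123", db))  # https://example.com
print(lookup("zzz999", db))  # None (usually without touching the DB)
```

Because the false-positive path just degrades to a normal lookup, the filter can only make things faster, never wrong.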
kiteboarderni@reddit
Ai slop
lurch303@reddit
Well written