Prolly Trees: The useful data structure that was independently invented four times (that we know of)
Posted by nick_at_dolt@reddit | programming | View on Reddit | 27 comments
Prolly trees, aka Merkle Search Trees, aka Content-Defined Merkle Trees, are a little-known but useful data structure for building Conflict-Free Replicated Data Types. They're so useful that there are at least four known instances of someone inventing them independently. I decided to dig deeper into their history.
punkpang@reddit
*discovered.
nick_at_dolt@reddit (OP)
I thought long and hard about whether to say "invented" or "discovered" in the article, and ultimately settled on using a mix of both. The distinction is a fascinating debate, but I didn't want it to distract from the main focus.
EntireBobcat1474@reddit
Quick question on the use of these data structures (e.g. how they improve over a plain ole Merkle tree): is the main benefit that preserving key order helps with locality, so that for things like sequential or rolling logs, updates to the MST are confined to a small region of the tree, and the diffs are usually small messages covering just those localized changes?
And just to double-check that I understand what the MST is (the INRIA design at least): given a totally ordered "key" set such as Z that we want to project into an MST:
Some easy-to-deduce properties: 1. Adding a node will usually change only O(layer(x)) hashes in the MST. 2. A diff of two MSTs can be localized to just the branches that contain differences, thanks to the ordering.
That #2 property seems like the main innovation on top of a plain ole Merkle tree hashing of an unordered set, though I'm not really sure what types of problems/contexts this would be very beneficial in.
nick_at_dolt@reddit (OP)
That's exactly the idea. Then because everything is stored in a hash table, you can store both the before-and-after versions of the tree without needing to copy the parts that didn't change.
The benefit of using a hash function to determine the tree shape is that if two clients have the same set of data, they'll produce the exact same tree with the exact same hashes. This isn't true of something like B-trees, where the shape of the tree depends on the order that the elements were inserted. This property is called history-independence, and it's critical in order to do an efficient diff of two trees: trees with the same data need to have the same hash.
Without Merkle Search Trees / Prolly Trees, the only way to achieve history-independence would be to make every node in the tree have a set number of children. But if you do that, inserting or deleting data at the beginning of the tree causes everything to shift, changing every hash that comes after it, which completely destroys the idea that changes should be localized.
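To make that concrete, here's a minimal Python sketch of content-defined chunk boundaries. The helper names are invented for this example (this is not Dolt's actual implementation): the point is only that the boundary decision depends on the key's own hash, never on insertion order, so two clients with the same data always split it the same way.

```python
import hashlib

def chunk_boundary(key: bytes, target_size: int = 4) -> bool:
    """A key ends a chunk when its hash falls in a small residue class.

    The decision depends only on the key itself, so two clients with
    the same sorted key set always pick the same boundaries, regardless
    of the order the keys were inserted (history-independence).
    """
    h = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return h % target_size == 0  # roughly 1-in-target_size keys end a chunk

def chunk_keys(sorted_keys):
    """Split a sorted key list into history-independent chunks."""
    chunks, current = [], []
    for k in sorted_keys:
        current.append(k)
        if chunk_boundary(k):
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks

# Inserting a key only affects the chunk it lands in; chunks on the
# far side of unchanged boundaries keep their bytes, and so their hashes.
```

Applying the same idea recursively to the chunk hashes themselves is what gives you the full tree.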
This is really useful for any situation where you need to solve what's called the "set reconciliation problem": when you have two different ordered sets and you want to quickly identify which items are in one but not the other. Most existing solutions require that the keys be uniformly distributed, but this approach doesn't have that requirement.
Some contexts where this is beneficial:
- Git uses a Merkle tree to represent directory structures, allowing clients to pull just the parts of the repo that have changed. But this scales poorly if the tree isn't balanced and some directories have a lot more files than others. If Git stored large directories as MSTs instead of single nodes, that would limit the size of nodes and lead to a more balanced tree. (In fact, this is exactly what Bup built on top of Git.)
- I develop Dolt, which is a SQL database with Git-style version control. Every table is stored as a MST, which makes it very compact to store a snapshot of the table at every commit, and very efficient to diff tables between commits.
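As a rough illustration of that kind of diffing, here's a hedged Python sketch of one level of set reconciliation over content-defined chunks; the function names are invented for this example. Each side summarizes its chunks by (last key, hash), and only mismatched chunks need to cross the wire. A real Prolly tree applies this pruning recursively at every level of the tree.

```python
import hashlib

def chunk_hash(chunk) -> str:
    """Content hash of one chunk (a list of byte-string keys)."""
    return hashlib.sha256(b"\0".join(chunk)).hexdigest()

def chunk_summary(chunks):
    """Summarize each chunk by (last key, content hash).

    This small summary is all the other side needs to find differences.
    """
    return [(c[-1], chunk_hash(c)) for c in chunks]

def reconcile(local_chunks, remote_summary):
    """Return the local chunks the remote side is missing or has changed.

    Chunks are matched up by their last key; a chunk whose hash already
    agrees with the remote's is pruned without sending its contents.
    """
    remote = dict(remote_summary)
    out = []
    for c in local_chunks:
        if remote.get(c[-1]) != chunk_hash(c):
            out.append(c)
    return out
```

Because chunk boundaries are history-independent, identical data always produces identical summaries, so two in-sync replicas exchange nothing but the summary itself.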
LargeHandsBigGloves@reddit
Why do I even follow this subreddit when it's just AI blog post spam? I'm not even giving you the click :/
msebast2@reddit
LOL. I understand the frustration but this article was interesting and clearly written by an actual human.
LargeHandsBigGloves@reddit
Guess I chose the wrong article to get annoyed over
Lachiko@reddit
it's always like that isn't it? (nah this time i'm going to say something!)
LargeHandsBigGloves@reddit
Yeah, I thought the downvotes would stop with all the positivity, but I don't care enough about Internet points to delete the comment.
nathan753@reddit
This is actually one of them that isn't from a bot scraper
LargeHandsBigGloves@reddit
You're right :)
itszor@reddit
The article didn't come across as AI spam to me, fwiw.
church-rosser@reddit
TBF they never read the article and refused to even give it a click.
LargeHandsBigGloves@reddit
Yeah, I specified I wasn't opening it specifically to avoid the "this 1 ain't ai" replies. Okay, but it's still personal-website spam, isn't it?
mccoyn@reddit
I read a few paragraphs, but because I didn't have any idea what a prolly tree was by that point, I gave up. Put the most important part at the beginning of the article, then go into details.
LargeHandsBigGloves@reddit
I went back and read it since y'all all complained. It was a good read. Definitely not the slop I expected. Still don't see why we allow so many links to blog style posts that have no context besides the title.
church-rosser@reddit
Kudos for reconsidering the article and noting here that you've altered your perspective. We need more of your type around here and in the world in general.
I loathe the AI spam on this sub, but this article does seem to have made some interesting points and it appears the author did some actual reading (even if an LLM was consulted in so doing).
nick_at_dolt@reddit (OP)
No LLM was consulted in the writing of this article.
I'm not ideologically opposed to the idea of using LLMs in research and editing, but I'm also realistic about what they can and can't do, and in my experience trying to incorporate LLM output into this kind of technical writing is almost always more effort and worse quality than just writing it myself.
nick_at_dolt@reddit (OP)
I totally get your feelings about just dumping links to blogs with no context. Nothing feels less like a community than a glorified dumping ground of self-promotion. Thanks for deciding to read after all!
nick_at_dolt@reddit (OP)
That's a fair criticism. Figuring out what is and isn't proper context before getting right to the point can be the hardest part of writing, but also the most important. Seems I may have missed the mark and spent too much time at the beginning on tangents. Thank you for the feedback!
Full-Spectral@reddit
They prolly just forgot they had already been invented.
NotATroll71106@reddit
As someone who hadn't heard of a prolly tree, I had to reread this a few times to be sure I hadn't created them myself. Instead of the chunks associated with nodes having uneven breakdowns, the "prolly" with mine is that you don't necessarily end up with an optimal result when you reach the bottom. (In my case, it was duct-taping two paths together at subgraph boundaries.)
FrostedBerryPop16@reddit
If reinventing the wheel was a programming contest, Prolly Trees would be the reigning champions
Page_197_Slaps@reddit
You're Prolly right
fireduck@reddit
This sounds like a thing I "invented". Minus the self-balancing part, I didn't have that.
https://wiki.snowblossom.org/index.php/Hashed_Trie
Yeah, it seemed like something that someone had done before. I'm sure I was not the first inventor. I think there are at least a few cryptocurrency projects using something very similar, especially when you want a cryptographically sound hash of an entire tree and all its data to make sure all nodes have the same data.
Then it gets encoded in the p2p protocol as something like:
- here are the DB changes for this block
- if you did it correctly, your old hash of the database was X and now it is Y
- We can use our shared agreement on 'Y' to prove that some data is or is not in the agreed tree. This allows us to prove to light clients checking things that we are not lying or omitting data.
nick_at_dolt@reddit (OP)
Thanks for sharing! I gave it a read.
The idea of storing all nodes in a hash-map and having nodes refer to each other by their hash is a common design, called a Merkle tree or a Merkle DAG. Git uses it to represent your file hierarchy, and it's used a lot in blockchains too, for pretty much all the reasons you laid out in your writeup.
Interestingly, yours is the first time I've seen someone make a Trie this way, so this specific application feels unique. I'm curious what Snowblossom is using Tries for, since it's not immediately obvious how a Trie would be useful for storing unspent transactions. Would you be willing to share more about what led you to this design?
fireduck@reddit
Yeah, I use a decent bit of straight merkle trees for other things.
So in Snowblossom the root unspent transaction output trie hash is stored in the block header. This way we can be sure all the nodes have the same idea of what is in that structure, which is the set of all things that can be spent.
This gets us a few super cool things.
1) Since the database is essentially append-only (new nodes are added as needed while all existing nodes stay put), we can easily read and mutate a past version of the tree. This is important for a cryptocurrency because of block re-orgs. Suppose we have blocks A->B->C->D and someone comes up with a better new set of blocks for some reason, B2->C2->D2->E2. We can read the database as it was at block A, see if B2 validates based on that, and store B2 in the database without yet throwing anything away. This lets us explore and validate block chains, and then, after they are all validated, decide which is the best chain to follow. In this example, we likely wouldn't know that the new chain was better until we had fully processed and validated E2.
2) The trie in Snowblossom is keyed by address. This allows us to give nice proofs to light clients. Suppose a light client is just monitoring block headers from multiple nodes (which isn't expensive, since you don't need to download the whole chain). Then when the client asks a full node what the available unspent outputs for an address are, the node can not only look into the tree and answer, but also prove the answer.
Like if the address in hex started with AABBCCDDEE, the node could respond: that address has nothing, and we will prove it by giving you the trie root hash (which you know from the block header), then node AA, node AABB, node AABBCC, and see, that node only has AABBCC07 and nothing for your address. So the node can prove it is giving correct and complete answers to the light clients. The light client can re-hash all these nodes and check that the hashes all match up to the UTXO trie root from the block header.
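To illustrate the verification side, here's a hypothetical Python sketch of checking such a non-membership proof. The node format and hashing here are invented for the example (not Snowblossom's actual wire encoding): the client walks the supplied path, re-hashing each node against the pointer its parent committed to, until the path provably dead-ends short of the queried address.

```python
import hashlib

def hash_node(children: dict) -> str:
    """Hash a trie node from its sorted (edge label -> child hash) entries."""
    data = "|".join(f"{k}:{v}" for k, v in sorted(children.items()))
    return hashlib.sha256(data.encode()).hexdigest()

def verify_absence(root_hash: str, path_nodes: list, address: str) -> bool:
    """Check a non-membership proof for `address`.

    Each node must hash to the pointer its parent claims, and the last
    node must have no edge continuing toward the address.
    """
    expected = root_hash
    consumed = ""
    for children in path_nodes:
        if hash_node(children) != expected:
            return False  # node doesn't match the committed hash
        nxt = next((lbl for lbl in children
                    if address.startswith(consumed + lbl)), None)
        if nxt is None:
            return True  # path dead-ends: address is provably absent
        consumed += nxt
        expected = children[nxt]
    return False  # the path reached a real child: absence not proven
```

Any tampering with the path changes a hash somewhere along the chain up to the root the client already trusts from the block header.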
I view Bitcoin as an excellent first project. It was missing key things like this that only become obvious after you work with it for a long time. This was one of the reasons I wanted to write Snowblossom, to do some of those things in a bit of a cleaner way.