Undergraduate Upends a 40-Year-Old Data Science Conjecture

[-]

nmmmnu@reddit

YouTube video with the author himself

https://youtu.be/ArQNyOU1hyE?feature=shared

[-]

New-Award1433@reddit

bt he didn't explain the last constantly increasing graph

[-]

Aendrin@reddit

Here's an explanation, to the best of my ability in a reddit post, and worked example that should help people wrap their heads around it.

The basic idea of this algorithm is to improve the long-term insertion performance of the hash tables by making the early insertions more expensive in exchange for making the later ones cheaper. This is specifically for making insertion into a hash table you know will get very full cheaper overall.

It's easiest to understand what this paper is doing by looking at an example. For this example, I'll use 0 to represent an open space, and X to represent a full space. I have spaces between each 4 slots in order to make it a little easier to keep track of position.

In each of these cases, we will be inserting 15 elements into a 16 element hash table, with each successive element denoted by 0-9, a-f. We know this ahead of time, that this table will be 93.75% full. The paper uses δ frequently to talk about fullness, where δ is defined as the proportion of the table that will be empty once all elements are inserted. In this case, δ is equal to 1-0.9375 = 0.0625.

Traditional Greedy Uniform Probing

The traditional (and previously believed to be optimal) approach is to randomly pick a sequence of slots to check for each element, and as soon as one is available, insert it in that location.

Let's walk through what that might look like:

OOOO OOOO OOOO OOOO

We insert our first element, 0, and get a random sequence of indices to check. For instance, it might look like [0, 6, 10, 4, ...]. We check index 0, see it is free, and insert 0 in that location. Now our hash table looks like

0OOO OOOO OOOO OOOO

Let's insert 3 more elements: 1, 2, 3. Element 1 gets a sequence starting with 4, and gets inserted there. Element 2 gets a sequence starting with 2, and gets inserted there. Element 3 has a sequence that looks like [2, 9, ...]. We check index 2, see it is occupied, and so then check index 9, which is free.

0O2O 1OOO O3OO OOOO

If we continue this process all the way up to e, we might get a table that looks like the below. Now, we need to insert e. The only available indices are 1 and 8. We generate a sequence for e, and it is [13, 12, 9, 7, 0, 4, 6, 8, 15, 14, 1, 5, 2, 3, 11, 10]. We needed to do 8 searches just to put a single element in!

0O24 15c8 O3b7 aO96

In this case, we had 2 out of 16 elements free. Each time we checked a new index, we had a 1/8[^1] chance to find a free space. On average, this will take 8 attempts.[^2] It turns out that when we have slightly larger examples that don't get quite so close to the last few elements of the hash table, the expected cost to insert the last few elements of the hash table is 1/δ. That is because each time you search for a place to insert an element, δ of the table will be empty.

For a long time, it was believed that was the best that you could do when you are inserting the last elements into the hash table without rearranging them.

The New Method

In this comment, I'm just going to go over the method that they call elastic probing in the paper. The basic idea of this approach is to split the array into sub-arrays of descending size, and only care about two of the sub-arrays at a time. We keep working on the first two sub-arrays until the first is twice as full as the eventual fullness, and then move our window. By doing extra work in early insertions, later insertions are much easier.

Unfortunately, it's difficult to fully demonstrate this with a toy example of size 16, but I will do my best to get the idea across. The paper works with 3/4 as their default fullness, but I will talk about 1/2 so that it works better. We have our first subarray of size 8, then 4, then 2 and 2. I've inserted | between each subarray. Label the arrays A1, A2, A3, A4.

OOOO OOOO|OOOO|OO|OO

First, we follow the normal rules to fill A1 1/2 of the way full:

OO03 21OO|OOOO|OO|OO

At any time, we are dealing with 2 successive subarrays. We say that A_i is e_1 free, and A_(i+1) is e_2 free.

At each insertion, we keep track of the current arrays we are inserting into, and follow these simple rules: 1. If A_i is less than half full, insert into A_i. 2. If A_i is more than twice as full as the eventual hash table, then move on to the next pair of subarrays, A_{i+1} and A_{i+2}. 3. If A_{i+1} is more than half full, insert into A_i. 4. Otherwise, try inserting into A_i a couple times, and if nothing is found, insert into A_{i+1}.

The first and second cases are relatively quick. The third case is actually very rare, which is specified in the paper. The fourth case is the common one, and the specific number of times to attempt insertion is dependent on both the eventual fullness of the table and how full A_i is at that point. Remember that all times I mention half full, the actual approach is 3/4 full.

Let's go through some insertions with this approach, and see how the array evolves:

OO03 21OO|OOOO|OO|OO

Insert 4: [3, 6] in A1, [9, ..] in A2

OO03 214O|OOOO|OO|OO

Insert 5: [2, 4] in A1, [8, ..] in A2

OO03 214O|5OOO|OO|OO

Insert 6: [1, 3] in A1, [10, ..] in A2

O603 214O|5OOO|OO|OO

Here we started to check more times in A1, because it is more full.

Insert 7: [4, 7, 6] in A1, [11, ..] in A2

O603 2147|5OOO|OO|OO

Insert 8: [4, 7, 6] in A1, [11, ..] in A2

O603 2147|5OOO|OO|OO

Insert 8: [5, 6, 1, 4] in A1, [8, 11, ..] in A2

O603 2147|5OO8|OO|OO

Insert 9: [3, 6, 0, 4] in A1, [9, ..] in A2

9603 2147|5OO8|OO|OO

We just finished filling up A1. In a real example, this wouldn't be all the way full, but with small numbers it works out this way.

Insert a: [8, 10] in A2, [13, ..] in A3

9603 2147|5Oa8|OO|OO

Summary

The real advantage of this approach is that not only does it reduce the worst case insertions at the end of filling up the array, it also reduces the average amount of work done to fill up the array. I recommend reading the section in the paper on "Bypassing the coupon-collecting bottleneck" to see the author's interpretation of why it works.

[1]: This is not quite true in this case, because we do not check indices more than once. However, for large hash tables, the contribution from previously checked values ends up being fairly negligible.

[2]: Inserting an element in a large hash table like this follows a geometric distribution, which is where we get the 1/p expected number of attempts.

[-]

jacksaccountonreddit@reddit

Thanks for the summary. The paper's author also has a YouTube video explaining the basics here.

I haven't had a chance to study the paper yet, but based on these summaries and in light of advancements in hash tables in the practical world, I'm a bit skeptical that this new probing scheme will lead to real improvements (although it's certainly interesting from a theoretical perspective). The background that I'm coming from is my work on benchmarking a range of C and C++ hash tables (including my own) last year.

Specifically, the probing scheme described seems to jump around the buckets array a lot, which is very bad for cache locality. This kind of jumping around is the reason that schemes like double hashing have become less popular than simpler and theoretically worse schemes likes linear probing.

SIMD hash tables, such as boost::unordered_flat_map and absl::flat_hash_map, probe 16 adjacent buckets at a time using very few branches and instructions (and non-SIMD variants based on the same design can probe eight at a time). When you can probe so many buckets at once, and with good cache locality, long probe lengths/key displacements — which this new scheme seems to be addressing — become a relatively insignificant issue. These tables can be pushed to high load factors without much performance deterioration.

And then there's my own hash table, which, during lookups, will never probe more buckets than the number of keys that hashed to the same bucket at the key being looked-up (although during insertion it does regular quadratic probing and does relocate keys, unlike this new scheme or the SIMD tables). If this new scheme cannot greatly reduce the number of buckets probed during lookups, then its practical usefulness seems limited (insertion is less than half of the story).

What I'd really like to see is an implementation that can be benchmarked against existing tables.

[-]

New-Award1433@reddit

bt he didnt explain the part about the last graph that looks like a constant increse

[-]

drulludanni@reddit

I like that his video is a "youtube kids" video, you can never start learning about optimal hash tables too early.

[-]

orig_cerberus1746@reddit

Somebody clicked seemingly clicked the wrong button, I was wondering why there were an ad for youtube kids.

[-]

seattle_q@reddit

Super late question here: thanks for the explanation and examples!

I dont understand the worst case complexity bounding part. If the lower arrays in the funnel are full, I assume you come back to upper arrays to fill it. Meaning won’t we just end up in O(N) complexity in the worst case?

[-]

flying-sheep@reddit

Your markdown formatting is broken.

The footnotes don’t show at all on new reddit, and a bunch of underscores are interpreted as italics on old reddit, so I don’t think there’s a way to see your post as intended at all.

[-]

orig_cerberus1746@reddit

One thing I've been wondering if it was possible to use multi threading to try to fit a value in the first array and in the second and then etc... At same time in sufficiently large hash tables.

[-]

Expert_Corner_667@reddit

It seems like the "tiny pointer" concept, which really ought to be called "structured pointers" is partly geared to support multi-threading, so instead of "(address)" a pointer can be "(owner_thread, address_in_owner_thread's_space)" .

[-]

Ok_Cap_7572@reddit

Did you write the "8" insert twice or is that intended?

[-]

Successful-Money4995@reddit

That's a great write up. So the idea is to intentionally spend extra time at the beginning in order to spend less time at the end? And overall, it comes out to less?

Is it good on lookup?

[-]

Aendrin@reddit

Thanks!

To your first two points, yes that is exactly the main idea.

As to lookup performance, I'm not 100% sure how a lookup would be implemented. As I understand it, the process is by checking all possible locations the key could be stored, but doing it in a way that is probabilistically quick based on the table's construction.

The 011, 101

Define A_{i, j} for a key k as the jth value in the sequence checked for key k in subarray A_i. Then, check in the order A_{1, 1}, A_{1, 2}, A_{2, 1}, A_{2, 2}, A_{1, 3}, A_{2, 3}, A_{3, 1}, A_{3, 2}, A_{3, 3}, A_{1, 4}, ... This sequence checks O(i * j^2 ) values for A_{i, j}, and I think it works out to be relatively small because of the sizes of the arrays.

But once again, I'm really not sure.

[-]

orig_cerberus1746@reddit

How lookup is normally done? Also randomly or just a flat for loop?

[-]

TwistedStack@reddit

The lookup is what I've been wondering about. I was thinking you'd get a hash collision between arrays so you need to know which array the data you want is actually in. Like which hash has the record you actually want, h2() on the 1st array or h5() on the 2nd array?

[-]

TL-PuLSe@reddit

Having read the paper (the article baited me until wanting to know what the tradeoff was) and your comment, I was wondering if you got something I missed.

How did they land on 0.75?

[-]

Aendrin@reddit

My understanding is that 0.75 is a fairly arbitrary cutoff. It's mentioned a few times in the paper, but I think any constant value above 2/3ish that is sufficiently smaller than the eventual fullness achieves the same asymptotic complexity, with a little bit of tweaking to the algorithm values.

The main important element about a fullness constant is that as long as it is not exceeded, the insertion into the secondary array is constant time. If they used 0.9 instead of 0.75, the insertion into the secondary array would take at most 10 insertions on average instead of at most 4, but regardless it is independent of the fullness or size of the table, which is what they care about.

[-]

TL-PuLSe@reddit

Thanks! Was just wondering if it was optimal or arbitrary but you explained it well.

[-]

PeaSlight6601@reddit

Am I understanding that this whole thing is tuned to a specific hash fullness. I want a hash with a million entries and constant time insertion when it is 80% full.

But if we go up to 90% full I might lose my constant time guarantee?

[-]

QVRedit@reddit

Interesting story about this, but no actual explanation of the algorithm. I guess you have to read the paper ‘tiny pointers’ ?

[-]

DauntingPrawn@reddit

Ggggggggggggg5g4gggggtg5ggggggg5gggg4gggggg4gggggtgggggggg4g4gg

[-]

MiscBrahBert@reddit

You just said the quiet part out loud.

[-]

DauntingPrawn@reddit

I said what I said.

[-]

Majik_Sheff@reddit

It's about time someone said it. Bravo!

[-]

omgFWTbear@reddit

He says what we are all thinking!

[-]

ChannelSorry5061@reddit

Yes, you have to read the actual paper...

[-]

QVRedit@reddit

Thanks, I was just confused by the text saying that they were inspired by the paper..

[-]

_Fibbles_@reddit

The 'Tiny Pointers' paper did inspire him. The paper that describes the hash table is 'Optimal Bounds for Open Addressing Without Reordering'

[-]

Skithiryx@reddit

Interesting that they found similar approaches already being used but no one had proven it broke Yao’s conjecture:

The basic structure of funnel hashing, the hash table that we use to prove Theorem 2, is quite simple, and subsequent to the initial version of this paper, the authors have also learned of several other hash tables that make use of the same high-level idea in different settings [7, 9]. Multi-level adaptive hashing [7] uses a similar structure (but only at low load factors) to obtain a hash table with O(log log n) levels that supports high parallelism in its queries – this idea was also applied subsequently to the design of contention-resolution schemes [4]. Filter hashing [9] applied the structure to high load factors to get an alternative to d-ary cuckoo hashing that, unlike known analyses of standard d-ary cuckoo hashing, can be implemented with constant-time polynomial hash functions. Filter hashing achieves the same O(log2⁡δ−1)-style bound as funnel hashing, and indeed, one way to think about funnel hashing is as a variation of filter hashing that is modified to be an instance of greedy open-addressing and to have optimal worst-case probe complexity, thereby leading to Theorem 2 and disproving Yao’s conjecture [21].

[-]

derinus@reddit

Since Yao's conjecture is false, Krapivin should get his Turing Award. Thats 40 years of false prove preventing potential improvement attempts.

[-]

Soar_Dev_Official@reddit

TL;DR for non CS people:

Hash tables are a very old, very fast and very efficient way of storing certain kinds of data. Because of these factors, they're widely considered to be the gold standard for what they do and have been studied very heavily and were thought to be as good as they could possibly be.

Pointers are bits in memory that refer to other locations in memory- Krapivin was part of a team that was developing 'tiny pointers', which are, well, tiny pointers- they use less data to represent the same concept by doing a number of clever tricks. Krapivin found that, by building a hash table using tiny pointers, he was able to improve their performance on all the key operations that a hash table does.

More research needs to be done to validate this paper, but if it holds, it'll represent a major shift in the way low-level data structures are implemented. Most likely, nobody outside of that field will notice except that some programs will run marginally faster, but, for those that it matters to, this would be a very exciting development.

As far as I can tell, Krapivin himself is being somewhat overstated by the article. His professor was a co-author on tiny pointers, he just took their work and implemented it into a hash table. This isn't to underrate him at all, it's extremely rare for an undergrad computer science student to be implementing any papers in their spare time, much less bleeding-edge research. That this then produced a novel result, again, speaks to his capabilities, but full credit should go to the paper's authors. As always, research progress is made by teams toiling away for years, not by a lone genius who just thought different from everyone else.

[-]

moltonel@reddit

Caveat: this paper applies to hash tables using uniform probing, which are a pretty rare use case. So don't expect this new algorithm to replace the one in your favorite language's stdlib anytime soon.

[-]

Decker108@reddit

I'm new to the concept of uniform probing. Could you give some examples of when you want to use uniform probing vs other forms of probing in the context of hash tables?

[-]

zeekar@reddit

For starters, we're talking about "open addressing". Some hash tables have the ability to store any number of entries in a single table slot ("bucket") by hanging a linked list or other data structure off of the top-level hash table. Open Addressing doesn't do that - it can only store one entry per table cell. So the question is, what do you do when your hash function tells you to put an entry in a slot that already has one? This is called "collision resolution".

Uniform probing isn't really a technique, but an assumption. The first assumption is that your hash function is uniform, which means a random key is equally likely to hash to any of the possible table slots. Given that assumption, the assumption of uniform probing is that when resolving a collision, the next choice for where to put the new item is also equally likely to be any of the slots. Most open addressing schemes don't do this - linear probing, for instance, just looks at the next slot over and keeps going one by one unti it finds a vacancy. Meaning the slot that the item winds up in is very highly correlated to the original one given by the hash function, and probing is not uniform.

[-]

Lumen_Co@reddit

I might be completely mistaken, but I don't think this summary is correct. I've just read the paper and watched Krapivin's talk. He was trying to find a way to improve tiny pointers, and to do so he needed a faster hash table. He tried to make one, and succeeded. Tiny pointers have nothing to do with this paper other than inspiring the research.

[-]

BodybuilderPatient89@reddit

You're right, the GP has no idea what they're talking about. It's so wrong it's baffling how one could even come to that conclusion - the quanta article says nothing like this.

[-]

DoNotFeedTheSnakes@reddit

Great TL DR. Thanks!

[-]

Not_good_scientist@reddit

well said.

Ideas often float around in the collective consciousness, along the way someone comes and connects the dots, its like a puzzle waiting to be assembled

[-]

helloiamsomeone@reddit

Where is the code at? Also, where are the benchmarks against boost::unordered_flat_map at?

[-]

argrejarg@reddit

Someone found a python repo with something approximately matching: https://www.reddit.com/r/programming/comments/1in5hkt/comment/mcxta8n/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

[-]

orig_cerberus1746@reddit

I found someone that implemented in Python: https://github.com/sternma/optopenhash

[-]

Expert_Corner_667@reddit

Good find. For clarity, the linked repo currently has two advanced hash implementations: "elastic hash" and "funnel hash".

[-]

Crazy_Firefly@reddit

this table is 50% full, that one’s 90% — but researchers often deal with much fuller tables.

Why are researchers spending their time with tables that are so full? Isn't it the case that most hash table implementations try to stay at most 30% full then get copied over to a bigger place once they reach it?

[-]

MorrisonLevi@reddit

I have not read the paper, so I do not know _their_ motivations. However, this is a theoretically interesting paper that could be practically interesting in some uses cases. On occasion you have fixed memory constraints, and being able to fit more items in the existing space could be quite useful.

[-]

PeaSlight6601@reddit

Academic research is often about solving problems that don't have obvious utility.

Normal programmers will resize the will just resize the table because that's easy to do. Over time that means consumers have to buy another stick of RAM. If an academic could solve this hard problem then maybe programmers would use a different implementation and we could all buy fewer RAM chips.

[-]

TL-PuLSe@reddit

Most breakthroughs in science and math don't come with immediate practical applications. It's not likely someone is going to just REALLY need to fill up a hash table quickly, but these techniques may lead to adaptations in other areas.

[-]

argrejarg@reddit

Article is short on detail. I read the actual preprint (thanks u/Fibbles) at https://arxiv.org/html/2501.02305v1 and now have the following vague summary accessible to someone like me who may have seen a computer before, but doesn't have any qualifications in CS. Please correct me if I am wrong:

1) A normal hash table works by mapping a query eg "hot moms in my area" pseudo-randomly to an address in a pre-allocated array, e.g. "127.0.0.1". If the array is mostly empty then this is constant-time operation to save or retrieve, but if it fills up then you have "collisions" as the pseudo-random hashing lands different queries onto the same address. In practical applications, such as python's builtin associative array datatype, this is resolved by just doubling the size of the allocated space as the array fills up, so that access time remains near-constant but with some occasional hiccups as the array has to be expanded, and with the problem that space use must always remain much less than 100% for the data structure to be fast.

2) A neat hack to get around this, known but not well-characterised or refined until the present article, is to pre-allocate the hash with subdivisions, so that instead of reallocating and reshuffling, the hash is pre-built to a fixed size but with log(N) subdivisions which fill in order, simulating somewhat the reallocate-every-doubling strategy which is already known. A hash address now is a structured object, something like, (p,q) where p is the number of the partition and (q) is the address within the partition.

3) this structured pointer is the "tiny pointer" concept which was antecedent to the present work. Making pointers structured but (potentially) smaller is an irrelevant side effect for day-to-day operations (64 bits is plenty of address space in most cases), and anyway the "tiny pointers" are not necessarily actually smaller unless you put some thought into making them so, but don't dismiss this aspect, it might be handy for example to run terabyte address spaces on 16bit GPUs.

Did I get it?

[-]

bert8128@reddit

Anyone written one of these hash tables in c++ yet as an example? Or testing it against current popular implementations?

[-]

noirknight@reddit

Reading through the article and skimming the paper, I am not sure if this would impact most implementations. Most implementations include an automatic resize when the table gets too full. This approach seems to only make a big difference when the hash table is almost completely full. For example if I double the size of the table every time the load factor reaches 50% then this algorithm would not come into play. Maybe someone who has done a closer reading than me has a different idea.

In any case the paper just came out, so would expect to see at least some experimental implementations soon.

[-]

larsga@reddit

This approach seems to only make a big difference when the hash table is almost completely full.

As far as I can tell the paper makes two contributions.

The first is, as you say, a more efficient way of inserting elements in a table that's nearly full.

The second one is constant average-time lookup. Literally on average O(1) for the lookup function, regardless of the size of the table.

Key para from the article:

Farach-Colton, Krapivin and Kuszmaul wanted to see if that same limit also applied to non-greedy hash tables. They showed that it did not by providing a counterexample, a non-greedy hash table with an average query time that’s much, much better than log x. In fact, it doesn’t depend on x at all. “You get a number,” Farach-Colton said, “something that is just a constant and doesn’t depend on how full the hash table is.” The fact that you can achieve a constant average query time, regardless of the hash table’s fullness, was wholly unexpected — even to the authors themselves.

[-]

bert8128@reddit

C++’s std::unordered_map is (according to the documentation) amortised constant lookup. (Of course, the constant can be large). Is this a feature of chained hash tables, and not a general feature of open addressed tables?

[-]

larsga@reddit

I'm sorry, my bad. It's not constant in the size of the table, because that's standard for hash tables. It's constant in x, which is this strange measure of how full the hash table is.

So it really is about being able to fill the tables up completely. You can have a 99.99% hash table and lookup and adding new elements is still very efficient.

[-]

araujoms@reddit

The worst-case complexity of the lookup goes down from 1/δ to log(1/δ)^2, but that still diverges. You really don't want to work with a 99.99% full table.

As for x, Quanta is making a mess of what is a simple thing. If a fraction f of your table is full, δ = 1-f, and x = 1/δ.

[-]

PeaSlight6601@reddit

So it's constant in something you can only do a few times (adding to a mostly full table)?!

I guess that's good for use cases where you add and remove items keeping the size roughly unchanged, but then you could just have a slightly bigger table.

[-]

thisisjustascreename@reddit

Well, if your hash table has 4.29 billion slots you can insert something like 40 million times to a 99.99% full table.

[-]

PeaSlight6601@reddit

But that's still only 0.01%. If you have billions of things and are concerned about the performance of adding them to a hash, then it stands to reason that soon you might have tens of billions of things.

[-]

aanzeijar@reddit

That's pretty huge if you know how many entries will be in the hash. Current implementations will resize when a threshold is passed. For this one - you need 1mio slots, you allocate a hash with 1mio slots and fill it up 100%.

[-]

Resident-Trouble-574@reddit

Well, now you can probably postpone the resizing until the table is fuller.

[-]

masklinn@reddit

Modern hash tables can already get fairly high load factors e.g. SwissTable reaches 87.5%.

[-]

nextbite12302@reddit

I would be happy if my games run with 10% less memory

[-]

wintrmt3@reddit

None of your games memory usage is dominated by hashtables.

[-]

nextbite12302@reddit

it was supposed to be a joke 🤡

[-]

binheap@reddit

It also applies to open addressed tables rather than the chained tables.

[-]

milliee-b@reddit

the only good kind of

[-]

TacoOblivion@reddit

I just finished creating a C++ implementation. However if you strictly follow the paper, it has pitfalls that cause it to be slower in practice. So then I created a hybrid approach taking from the better parts and it's significantly faster than the std libraries. Once I feel ready, I will release it today probably. I need more testing though to feel ready.

[-]

Sabotaber@reddit

Yes, but how's cache performance?

[-]

warbitlip@reddit

Turns out, not knowing the 'impossible' can be an advantage—an undergrad just rewrote 40 years of hash table theory by ignoring conventional wisdom

[-]

Sabotaber@reddit

When I run into trouble, conventional wisdom is the first thing I throw out.

[-]

wildjokers@reddit

Krapivin was not held back by the conventional wisdom for the simple reason that he was unaware of it.

Ignorance sometimes is truly bliss.

[-]

Pat_The_Hat@reddit

How topical. The George Dantzig homework story is on the frontpage today.

[-]

wildjokers@reddit

Dantzig arrived late and assumed that they were a homework assignment. According to Dantzig, they "seemed to be a little harder than usual",

LOL

[-]

silveryRain@reddit

...well they were

[-]

ZirePhiinix@reddit

For about a hundred years.

[-]

fractalife@reddit

As it is regularly lol.

[-]

azhder@reddit

That’s the story of how a student arriving late at class mistook the unsolvable math problems written on the board as homework assignment.

[-]

Full-Spectral@reddit

Then why am I not the happiest man alive? Dang...

[-]

Bupod@reddit

Dudes out here in their undergraduate degrees discovering entire new concepts.

I’m lucky if I can make a cup of coffee without breaking something.

[-]

Crazy_Firefly@reddit

Is it just me or did the writer confuse "data structure" with data science? Is research on hash tables really considered a data science problem?

[-]

ScottContini@reddit

That’s the best story I’ve read this year. Love how a curious undergrad found a better solution to a problem that the legendary Andrew Yao never imagined. It reminds me of a blog on Hamming’s advice on doing great research in many ways. For example on trusting a young researcher who does not have an in depth background in the area, Hamming advises “ It took me a few years to realize that people who did not know a lot of mathematics still could contribute. Just because they could not solve a quadratic equation immediately in their head did not mean I should ignore them. When someone's flavor of brains does not match yours may be more reason for paying attention to them.”

[-]