Finding a duplicated item in an array of N integers in the range 1 to N − 1

[-]

paperlantern-ai@reddit

The XOR approach is so satisfying for this one. XOR all elements together, then XOR with 1 through N-1, and the duplicate pops out. No extra space, single pass, no overflow risk like the sum trick has with large N.

[-]

pihkal@reddit

There's no guarantee there's only one dupe, though. From the first paragraph:

There might be more than one duplicated value; you merely have to find any duplicate.

[-]

somebodddy@reddit

Just XOR all the numbers and then XOR the result with a XOR of all the integers in the range from 1 to N-1?

[-]

sweetno@reddit (OP)

Your idea gives 0 for input 1 1 2 2, N = 4.

[-]

somebodddy@reddit

Good point - I was probably thinking of the original problem, where there is only one duplicate.

[-]

pihkal@reddit

From the first paragraph:

There might be more than one duplicated value; you merely have to find any duplicate.

[-]

pigeon768@reddit

That only works if your numbers are in a range [1,2^(k)-1], and if there is exactly one duplicate. If either of those conditions aren't met, the XOR trick doesn't work.

1^2^3^4^5^5 == 4. The range isn't up to one less than a power of two, so this method fails.

1^2^3^3^3^5^6^7 == 4. 3 is duplicated 3 times and 4 is skipped, so this method fails.

[-]

somebodddy@reddit

You are right about multiple duplicates (I was thinking of the original puzzle, where there is only one duplicate) but having a non-power-of-2 range is why I added that last part - "then XOR the result with a XOR of all the integers in the range from 1 to N-1"

In your example, N is 6 - so XORing all the numbers from 1 to 5 gives us 1^2^3^4^5 == 1, and XORing the 4 you got with that yields 4^1 == 5.

[-]

sweetno@reddit (OP)

There may be many duplicates.

[-]

CandiceWoo@reddit

u get the xor of all duplicates

[-]

mdabek@reddit

This is the way!

[-]

EarlMarshal@reddit

Hmm? The initial puzzle is about a duplicated item in an array. The other one about finding a cycle of values? Do I just don't get the initial puzzle and this title here?

[-]

sweetno@reddit (OP)

There is only one puzzle, but many solutions. It's possible to solve it in O(n) of time and O(1) of extra memory. This is where you detect cycles in the linked list that your input forms if you look at it in this way.

[-]

EarlMarshal@reddit

But how does the input looks like for the initial puzzle? Does it contain just a duplicated value at some point or does it contain cycles? Because I wouldn't get that at all from the problem description.

[-]

imachug@reddit

Say the input looks like {4, 1, 3, 2, 2}.

Starting with the last position, treating its value as an index, jumping there and then repeating the process, we get 5 -> 2 -> 1 -> 4 -> 2 -> 1 -> 4 -> .... Notice the structure of this sequence: first, we get unique values we haven't seen before, and then at some point a value repeats and the rest of the sequence is looped.

In this case, the first repeated value is 2. Since it's the first such value, the indexes before its first and second occurrences (5 and 4 in this case) must have been unique by that point, and thus distinct. In general, this gives us some a, b such that arr[a] = arr[b] and a != b, so arr[a] is a valid answer.

Here's another way to look at this sequence. Split it into a prefix and the repeated cycle. Here, the prefix is 5 and the cycle is 2 -> 1 -> 4, but in general the prefix could be longer. The first element of the cycle has two predecessors -- the last element of the prefix and the last element of the cycle, which are necessarily different.

So what we're looking for is the point where the prefix and the cycle meet. A simple way to do this in constant memory and linear time is to use tortoise-and-hare. (In a nutshell: maintain two pointers a and b. a jumps to the next element of the sequence on each step, b jumps two elements at a time. When they meet, you know you're in a cycle and jump(k) = jump(2k), thus the length of the cycle divides k. You can now reset from the start and find the first i such that jump(i) = jump(i+k), giving you the first repeated element.)

[-]

CryptoHorologist@reddit

4 -> 2 -> 1 -> 4 ->

Collatz!

[-]

EarlMarshal@reddit

Thank you. The first section really has helped.

[-]

CandiceWoo@reddit

its an array of 3,4,5,1,2,1 etc; the last solution is just easier to be thought of as a graph. index is the node, the content is the next node. interesting property this graph has a) nothing points to last node, so for it to complete, it has to join another cycle b) the point where it joins a cycle will be a node with 2 index pointing to it - this is a duplicate

[-]

EarlMarshal@reddit

So 345121 is a cycle that repeats itself in the array and if find this cycle or is that the whole array and I should find the indexes of the two 1?

I think I really only have a misunderstanding with the problem statement in the first place.

[-]

mio991@reddit

I think you understand that problem space quite well, to me it seems like you're lost during the mapping into another problem space that has a known solution.

You have an array like 4, 3, 1, 2, 3 how do we recognise that 3 is the duplicate? We interpret the values as 1-based indices into the same array, this way the array describes the following linked lists; - 4 -> 2 -> 3 -> 1 -> 4 -> ... - this is a cycle we entered at 1 - 3 -> 1 -> 4 -> 2 -> 3 -> ... - this is the same cycle but we entered through 5

By keeping track of where we came from we can detect when we arrive at the duplicate.

It's still the same data, just reinterpreted.

[-]

chyekk@reddit

I think the gist is you’re treating each value as an index, so if you encounter a cycle it’s because you’ve visited the same index twice and therefore that’s your duplicate. Roughly speaking…

[-]

crusoe@reddit

The two are the same. The values are in the range of 1 to N-1 for a N size array. So you can treat each value as an index into the array.

So for an array with 4 slots

You can have the values 1 2 1 3

Treating these values as indices transforms the array into a graph. You can then use a cycle detecting algorithm to find the duplicate index value.

[-]

ollerhll@reddit

I might be being dumb - why can't you solve this with a hashset and a single pass? That's less efficient than the solution given here, but definitely better than the original NlogN thought the author had, right?

[-]

soliss@reddit

You can solve this with a single integer of extra space using xor.

Xor has two useful properties:

a ^ a = 0 (any number xor'd with itself cancels out)
a ^ 0 = a (xor with zero is the identity)

So if you xor all the elements together, then xor that result with 1 ^ 2 ^ ... ^ N, every number that appears exactly once cancels to zero. The only thing left is the duplicate.

def find_duplicate(arr, n):
    xor = 0
    for num in arr:
        xor ^= num
    for i in range(1, n + 1):
        xor ^= i
    return xor

[-]

Intrexa@reddit

print(find_duplicate([1,2,3,3,3,3,3]))

>7

[-]

SCube18@reddit

Literally illegal under the constraints

[-]

fb39ca4@reddit

Your solution requires there to be exactly one duplicate pair, which the original problem does not specify.

[-]

rlbond86@reddit

every number that appears exactly once cancels to zero

Every number that appears an odd number of times.

[-]

taintedlead@reddit

Beautiful! But in the blog post they state:

There might be more than one duplicated value

So it won't work in cases like u/Intrexa points out.

[-]

skelterjohn@reddit

Adding to a hash set isn't a constant time operation.

In practice we often treat it as one, because we can make our hash space significantly larger than our input set pretty easily.

But when we're talking O(), it's dependent on how hash collisions are collected.

[-]

khorbin@reddit

Ok, then why not just use an array of size N-1?

I had the same reaction: without additional constraints that they aren’t listing, this seems pretty easy.

The unstated implication seems to be that this must be done in-place, but they never actually say that.

[-]

skelterjohn@reddit

Are the values in the array limited to N? I'd assume no.

[-]

khorbin@reddit

It’s part of the problem definition, so yes

[-]

skelterjohn@reddit

I had read it that the integers that may be duplicated were in the rance a[1] to a[n-1], which I concede would be an odd stilulation.

[-]

sprcow@reddit

Honestly this is the only sane response. It's trivial to solve with a bit array of size n, and that solution is accepted by leetcode and performant in top 1% of solutions. While it's novel to play around coming up with unique approaches, they seem unnecessary. Even if one wanted to be a stickler about O(1) memory, then fine, make your int[INTEGER_MAX_VALUE] and call it a day. :P

[-]

soldiercrabs@reddit

Ok, then why not just use an array of size N-1?

In fact, as Chen points out in footnote 2, that is basically what his attempt is, he's just doing a tricksy-dee by re-using the sign bit as a "free" array.

[-]

neuralbeans@reddit

You're right, a bit map would have been sufficient here.

[-]

DownvoteALot@reddit

I'm most settings, big O notation means average case, and for set insert that's O(1).

[-]

hextree@reddit

Average case for hash maps still isn't O(1).

[-]

skelterjohn@reddit

If we're talking specific cases, big oh and big omega don't seem very meaningful. It's more akin to changing the problem than analyzing it differently.

Big oh of the "worst case" is big oh of the problem that has fewer possible inputs, but it's still the worst case for that new problem.

I read it as "the worst case is no worse than the big oh". The distinction that you're drawing seems more like word semantics than a differentiation in result.

[-]

skelterjohn@reddit

Sorry, no, that's not correct. O means worst case.

[-]

Merry-Lane@reddit

In most settings, big O notation is the worst case

[-]

slaymaker1907@reddit

The sorting solution is better if you are memory constrained. Sorting can be done with constant memory by using heapsort. I think that’s what is missing here.

[-]

skelterjohn@reddit

Most sorts can be done without using non-trivial amounts of extra space. This problem is only interesting (to me) when you are focusing on run time.

[-]

slaymaker1907@reddit

That is absolutely not the case with most sorting algorithms. Most take at least O(log N) allocations. Even though that is small, if Raymond’s case was for a low level team, heap allocation and recursion might be forbidden. Regardless, the particular sorting algorithm isn’t terribly important for the question, just that sorting in general uses little to no extra memory. A hash set would use a whopping O(n) extra memory.

[-]

imachug@reddit

You can even use a bitset array to make it deterministic linear time. I think the intention was to solve the problem in constant space.

[-]

TommyTheTiger@reddit

Man, it's been a while, but I got my first internship coming up with the linear time, constant mem solution. Treat the integers as pointers, iterate through the array in that order, and find the cycle! You're guaranteed to be in it after going through N elements. All this is IIRC.

[-]

ElvishParsley123@reddit

I don't get this solution. If 0 points at 1 and 1 points at 0, that creates a cycle, but that doesn't tell you anything about what is a duplicate. Can you explain the logic of your solution better?

[-]

TommyTheTiger@reddit

I did skip a bit, and the article doesn't explain it that well, I'm not sure if my solution is the same and I don't know about any of these cycle detection algorithms.

Basically, we're going to get to a number inside the cycle, from there we're going to find the length of the cycle, and then, knowing the length of the list and the length of the cycle, we can find the start of the cycle, as long as the list only has 1 duplicate (which IIRC was also a property of this problem).

In your case there would need to be a 3rd number. Let's go with [1, 0, 1] at random. Hop 3 times from pos 2, end up at pos 1. number is 0, hop 2x, back to 0, so cycle len is 2. List len is 3, cycle len is 2, there for we need to hop 0 times from the N'th elt to get to the duplicate, so 1 is the duplicate.

[-]

Mysterious-Rent7233@reddit

Did you come up with it on the spot or had you heard of it before?

[-]

TommyTheTiger@reddit

It wasn't live - it was an optional "challenge problem" that you could solve outside of the normal application process, which I thought about in the back of my head for quite some time before figuring it out

[-]

lachlanhunt@reddit

They seem to be using 1-based indexing. It was very confusing what "the entry at index N" was referring to because there is usually no index N in an N-length array.

It makes much more sense if you keep indexes as 0 based, and convert values to indexes by subtracting 1.

[-]

neuralbeans@reddit

I don't like that you need to modify the sign bit of the numbers. That puts unnecessary assumptions on the input. Better to say that you create a boolean array instead. Still less space and time than a hash table.

[-]

Jonny0Than@reddit

The end of the article describes an algorithm that does not modify the array.

[-]

PurepointDog@reddit

When's the last time anyone wrote code with negative numbers though, honestly

[-]

neuralbeans@reddit

To be fair, the specification is that the input numbers are all between 1 and N-1. So far so good. But what if the numbers are unsigned and N-1 is big enough to use the most significant bit?

[-]

Iggyhopper@reddit

Similar to the 100 prisoner problem where the optimal solution is for each prisoner to follow the box they open to the next box, hopefully fidning a loop before bix 50.

[-]

Uristqwerty@reddit

I'd have to wonder if there's an input size where random access causes enough cache problems that you'd be better off counting prefix occurrences in a linear pass, picking an over-represented prefix, then doing a second scan that only pays attention to matching values. I imagine only for inputs so large that even a relatively-compact bitset is still far too large. Maybe never.

[-]

claimred@reddit

Also nicely explained in this blog post

This problem (reportedly) took CS legend Don Knuth twenty-four hours to solve and I have only met one person (Keith Amling) who could solve it in less time than this.

https://www.keithschwarz.com/interesting/code/?dir=find-duplicate

[-]

puredotaplayer@reddit

Clever solution except on a physical processor it is inefficient due to cache behaviour for a large N. A bitmap + iteration will be faster.

[-]

sviperll@reddit

Just XOR-ing everything is the most efficient solution

[-]

puredotaplayer@reddit

True but then it wont be simple to figure out how to do this with multiple duplicates.

[-]

Mysterious-Rent7233@reddit

By bitmap do you mean simply a bit array, but packed with no wasted space?

[-]

puredotaplayer@reddit

Yea exactly.

[-]

sweetno@reddit (OP)

Isn't the bitmap also random access?

[-]

puredotaplayer@reddit

Much more smaller in memory footprint and more likely to accommodate in L1

[-]

norude1@reddit

can't you xor all of them and then additionally xor all the numbers from 1 to N-1 there?

[-]

FuckOnion@reddit

The comments already make it clear, but this is a poorly posed problem. Solving this problem in O(n) time is trivial with no space restrictions. The author probably forgot to specify further restrictions.

[-]

TommyTheTiger@reddit

Yeah, the point of the solution is O(1) memory

[-]

digital_cucumber@reddit

If you mutate the input data, that shouldn't count as O(1) memory.

[-]

TommyTheTiger@reddit

You don't mutate the data

[-]

digital_cucumber@reddit

Ah yes, you are right, using the "two pointers" cycle detection it's indeed proper O(1), space-wise.

[-]

sweetno@reddit (OP)

It's an interview question. The fancier the solution the merrier.

[-]

sweetno@reddit (OP)

Good old LeetCode #287. One of those you think you know the solution, only to get dragged into the rabbit and tortoise hole.

[-]

m0j0m0j@reddit

It’s so funny to read “I thought this, and another guy thought that, but the interviewer found a completely mind-blowing solution!”

And it’s a classic coding puzzle that was solved and re-solved and discussed ad nauseam for probably decades at this point

[-]

trejj@reddit

I don't think the presented algorithm of walking through cycles, and requiring that one bit of space in the source array is available, is particularly good. It has data-dependent jumps all over memory, and will have to perform a cleanup pass at the end, to fix up memory.

Any algorithm that temp-mutates the original data source is bad, since if something goes wrong and the thread is interrupted, data will go into a bad state.

Far superior default code to the presented cycle finding algorithm is to just be simple, and create a zero-initialized contiguous bit array, e.g. uint8_t data[(N+7)/8] = {0}; (either calloc or alloca if you want to stress about a memory allocation) and scan through the data, binary ORring ones into the dataset.

Code with a variable size array for simplicity of presentation:

int find_duplicate(const int *data, int N) {
  uint8_t seen[(N+7)/8] = {0};
  for(int i = 0; i < N; ++i) {
    int byte = data[i] >> 3, mask = 1 << (data[i] & 7);
    if (seen[byte] & mask) return data[i];
    seen[byte] |= mask;
  }
  return -1;
}

(calloc / alloca impl. left as an exercise to the reader)

The above code will be a much more performant default implementation, than a "mutate the input array" solution, or a "Floyd's cycle detector" utilizing algorithm.

If N gets large, the above has the possibility to OpenMP #parallel for the code, which the tortoise&hare thing won't easily be able to.

Plus, the above code is dead naive for junior readers to understand as well.

[-]

ta9876543205@reddit

Why not just sum all the values and subtract the expected sum?

[-]

TommyTheTiger@reddit

What if they overflow?

[-]

tankerdudeucsc@reddit

Math problem really, assuming it’s 1 and only 1 duplicate. Summation problem. Add all numbers together. Calculate the summation of 1..N.

Subtract summation from what you calculated. You have your duplicate.

O(N) run time and O(1) space.

Multiple duplicates gets a lot more complicated snd. You can’t do the simple summation trick.

You could do it in constant time but O(N) space (bitmap) for each number, but very memory intensive for a large N.

[-]

Xenarthran47@reddit

The range of 1 to N-1 sums to (N*(N-1))//2, so just add up the array and subtract the expected sum for O(N) no additional space? Unless I am drastically misunderstanding...

[-]

skyware@reddit

Hashmap. If you want no extra space, use the existing input array. And do cycle sort. When you find index that already has n-1 value, dupe

[-]

poco@reddit

If you are willing to use the extra space of a hash map then it is even simpler to create an array of bools of size N and write true at the index that matches the value of the original array. As soon as you hit a value that is already true then you found your duplicate.

[-]