Finding a duplicated item in an array of N integers in the range 1 to N − 1
Posted by sweetno@reddit | programming | View on Reddit | 87 comments
Posted by sweetno@reddit | programming | View on Reddit | 87 comments
paperlantern-ai@reddit
The XOR approach is so satisfying for this one. XOR all elements together, then XOR with 1 through N-1, and the duplicate pops out. No extra space, single pass, no overflow risk like the sum trick has with large N.
pihkal@reddit
There's no guarantee there's only one dupe, though. From the first paragraph:
somebodddy@reddit
Just XOR all the numbers and then XOR the result with a XOR of all the integers in the range from 1 to N-1?
sweetno@reddit (OP)
Your idea gives 0 for input 1 1 2 2, N = 4.
somebodddy@reddit
Good point - I was probably thinking of the original problem, where there is only one duplicate.
pihkal@reddit
From the first paragraph:
pigeon768@reddit
That only works if your numbers are in a range [1,2^(k)-1], and if there is exactly one duplicate. If either of those conditions aren't met, the XOR trick doesn't work.
1^2^3^4^5^5 == 4. The range isn't up to one less than a power of two, so this method fails.1^2^3^3^3^5^6^7 == 4. 3 is duplicated 3 times and 4 is skipped, so this method fails.somebodddy@reddit
You are right about multiple duplicates (I was thinking of the original puzzle, where there is only one duplicate) but having a non-power-of-2 range is why I added that last part - "then XOR the result with a XOR of all the integers in the range from 1 to N-1"
In your example, N is 6 - so XORing all the numbers from 1 to 5 gives us
1^2^3^4^5 == 1, and XORing the 4 you got with that yields4^1 == 5.sweetno@reddit (OP)
There may be many duplicates.
CandiceWoo@reddit
u get the xor of all duplicates
mdabek@reddit
This is the way!
EarlMarshal@reddit
Hmm? The initial puzzle is about a duplicated item in an array. The other one about finding a cycle of values? Do I just don't get the initial puzzle and this title here?
sweetno@reddit (OP)
There is only one puzzle, but many solutions. It's possible to solve it in O(n) of time and O(1) of extra memory. This is where you detect cycles in the linked list that your input forms if you look at it in this way.
EarlMarshal@reddit
But how does the input looks like for the initial puzzle? Does it contain just a duplicated value at some point or does it contain cycles? Because I wouldn't get that at all from the problem description.
imachug@reddit
Say the input looks like {4, 1, 3, 2, 2}.
Starting with the last position, treating its value as an index, jumping there and then repeating the process, we get
5 -> 2 -> 1 -> 4 -> 2 -> 1 -> 4 -> .... Notice the structure of this sequence: first, we get unique values we haven't seen before, and then at some point a value repeats and the rest of the sequence is looped.In this case, the first repeated value is
2. Since it's the first such value, the indexes before its first and second occurrences (5and4in this case) must have been unique by that point, and thus distinct. In general, this gives us somea, bsuch thatarr[a] = arr[b]anda != b, soarr[a]is a valid answer.Here's another way to look at this sequence. Split it into a prefix and the repeated cycle. Here, the prefix is
5and the cycle is2 -> 1 -> 4, but in general the prefix could be longer. The first element of the cycle has two predecessors -- the last element of the prefix and the last element of the cycle, which are necessarily different.So what we're looking for is the point where the prefix and the cycle meet. A simple way to do this in constant memory and linear time is to use tortoise-and-hare. (In a nutshell: maintain two pointers
aandb.ajumps to the next element of the sequence on each step,bjumps two elements at a time. When they meet, you know you're in a cycle andjump(k) = jump(2k), thus the length of the cycle dividesk. You can now reset from the start and find the firstisuch thatjump(i) = jump(i+k), giving you the first repeated element.)CryptoHorologist@reddit
Collatz!
EarlMarshal@reddit
Thank you. The first section really has helped.
CandiceWoo@reddit
its an array of 3,4,5,1,2,1 etc; the last solution is just easier to be thought of as a graph. index is the node, the content is the next node. interesting property this graph has a) nothing points to last node, so for it to complete, it has to join another cycle b) the point where it joins a cycle will be a node with 2 index pointing to it - this is a duplicate
EarlMarshal@reddit
So 345121 is a cycle that repeats itself in the array and if find this cycle or is that the whole array and I should find the indexes of the two 1?
I think I really only have a misunderstanding with the problem statement in the first place.
mio991@reddit
I think you understand that problem space quite well, to me it seems like you're lost during the mapping into another problem space that has a known solution.
You have an array like 4, 3, 1, 2, 3 how do we recognise that 3 is the duplicate? We interpret the values as 1-based indices into the same array, this way the array describes the following linked lists; - 4 -> 2 -> 3 -> 1 -> 4 -> ... - this is a cycle we entered at 1 - 3 -> 1 -> 4 -> 2 -> 3 -> ... - this is the same cycle but we entered through 5
By keeping track of where we came from we can detect when we arrive at the duplicate.
It's still the same data, just reinterpreted.
chyekk@reddit
I think the gist is you’re treating each value as an index, so if you encounter a cycle it’s because you’ve visited the same index twice and therefore that’s your duplicate. Roughly speaking…
crusoe@reddit
The two are the same. The values are in the range of 1 to N-1 for a N size array. So you can treat each value as an index into the array.
So for an array with 4 slots
You can have the values 1 2 1 3
Treating these values as indices transforms the array into a graph. You can then use a cycle detecting algorithm to find the duplicate index value.
ollerhll@reddit
I might be being dumb - why can't you solve this with a hashset and a single pass? That's less efficient than the solution given here, but definitely better than the original NlogN thought the author had, right?
soliss@reddit
You can solve this with a single integer of extra space using xor.
Xor has two useful properties:
a ^ a = 0(any number xor'd with itself cancels out)a ^ 0 = a(xor with zero is the identity)So if you xor all the elements together, then xor that result with
1 ^ 2 ^ ... ^ N, every number that appears exactly once cancels to zero. The only thing left is the duplicate.Intrexa@reddit
SCube18@reddit
Literally illegal under the constraints
fb39ca4@reddit
Your solution requires there to be exactly one duplicate pair, which the original problem does not specify.
rlbond86@reddit
Every number that appears an odd number of times.
taintedlead@reddit
Beautiful! But in the blog post they state:
So it won't work in cases like u/Intrexa points out.
skelterjohn@reddit
Adding to a hash set isn't a constant time operation.
In practice we often treat it as one, because we can make our hash space significantly larger than our input set pretty easily.
But when we're talking O(), it's dependent on how hash collisions are collected.
khorbin@reddit
Ok, then why not just use an array of size N-1?
I had the same reaction: without additional constraints that they aren’t listing, this seems pretty easy.
The unstated implication seems to be that this must be done in-place, but they never actually say that.
skelterjohn@reddit
Are the values in the array limited to N? I'd assume no.
khorbin@reddit
It’s part of the problem definition, so yes
skelterjohn@reddit
I had read it that the integers that may be duplicated were in the rance a[1] to a[n-1], which I concede would be an odd stilulation.
sprcow@reddit
Honestly this is the only sane response. It's trivial to solve with a bit array of size n, and that solution is accepted by leetcode and performant in top 1% of solutions. While it's novel to play around coming up with unique approaches, they seem unnecessary. Even if one wanted to be a stickler about O(1) memory, then fine, make your int[INTEGER_MAX_VALUE] and call it a day. :P
soldiercrabs@reddit
In fact, as Chen points out in footnote 2, that is basically what his attempt is, he's just doing a tricksy-dee by re-using the sign bit as a "free" array.
neuralbeans@reddit
You're right, a bit map would have been sufficient here.
DownvoteALot@reddit
I'm most settings, big O notation means average case, and for set insert that's O(1).
hextree@reddit
Average case for hash maps still isn't O(1).
skelterjohn@reddit
If we're talking specific cases, big oh and big omega don't seem very meaningful. It's more akin to changing the problem than analyzing it differently.
Big oh of the "worst case" is big oh of the problem that has fewer possible inputs, but it's still the worst case for that new problem.
I read it as "the worst case is no worse than the big oh". The distinction that you're drawing seems more like word semantics than a differentiation in result.
skelterjohn@reddit
Sorry, no, that's not correct. O means worst case.
Merry-Lane@reddit
In most settings, big O notation is the worst case
slaymaker1907@reddit
The sorting solution is better if you are memory constrained. Sorting can be done with constant memory by using heapsort. I think that’s what is missing here.
skelterjohn@reddit
Most sorts can be done without using non-trivial amounts of extra space. This problem is only interesting (to me) when you are focusing on run time.
slaymaker1907@reddit
That is absolutely not the case with most sorting algorithms. Most take at least O(log N) allocations. Even though that is small, if Raymond’s case was for a low level team, heap allocation and recursion might be forbidden. Regardless, the particular sorting algorithm isn’t terribly important for the question, just that sorting in general uses little to no extra memory. A hash set would use a whopping O(n) extra memory.
imachug@reddit
You can even use a bitset array to make it deterministic linear time. I think the intention was to solve the problem in constant space.
TommyTheTiger@reddit
Man, it's been a while, but I got my first internship coming up with the linear time, constant mem solution. Treat the integers as pointers, iterate through the array in that order, and find the cycle! You're guaranteed to be in it after going through N elements. All this is IIRC.
ElvishParsley123@reddit
I don't get this solution. If 0 points at 1 and 1 points at 0, that creates a cycle, but that doesn't tell you anything about what is a duplicate. Can you explain the logic of your solution better?
TommyTheTiger@reddit
I did skip a bit, and the article doesn't explain it that well, I'm not sure if my solution is the same and I don't know about any of these cycle detection algorithms.
Basically, we're going to get to a number inside the cycle, from there we're going to find the length of the cycle, and then, knowing the length of the list and the length of the cycle, we can find the start of the cycle, as long as the list only has 1 duplicate (which IIRC was also a property of this problem).
In your case there would need to be a 3rd number. Let's go with [1, 0, 1] at random. Hop 3 times from pos 2, end up at pos 1. number is 0, hop 2x, back to 0, so cycle len is 2. List len is 3, cycle len is 2, there for we need to hop 0 times from the N'th elt to get to the duplicate, so 1 is the duplicate.
Mysterious-Rent7233@reddit
Did you come up with it on the spot or had you heard of it before?
TommyTheTiger@reddit
It wasn't live - it was an optional "challenge problem" that you could solve outside of the normal application process, which I thought about in the back of my head for quite some time before figuring it out
lachlanhunt@reddit
They seem to be using 1-based indexing. It was very confusing what "the entry at index N" was referring to because there is usually no index N in an N-length array.
It makes much more sense if you keep indexes as 0 based, and convert values to indexes by subtracting 1.
neuralbeans@reddit
I don't like that you need to modify the sign bit of the numbers. That puts unnecessary assumptions on the input. Better to say that you create a boolean array instead. Still less space and time than a hash table.
Jonny0Than@reddit
The end of the article describes an algorithm that does not modify the array.
PurepointDog@reddit
When's the last time anyone wrote code with negative numbers though, honestly
neuralbeans@reddit
To be fair, the specification is that the input numbers are all between 1 and N-1. So far so good. But what if the numbers are unsigned and N-1 is big enough to use the most significant bit?
Iggyhopper@reddit
Similar to the 100 prisoner problem where the optimal solution is for each prisoner to follow the box they open to the next box, hopefully fidning a loop before bix 50.
Uristqwerty@reddit
I'd have to wonder if there's an input size where random access causes enough cache problems that you'd be better off counting prefix occurrences in a linear pass, picking an over-represented prefix, then doing a second scan that only pays attention to matching values. I imagine only for inputs so large that even a relatively-compact bitset is still far too large. Maybe never.
claimred@reddit
Also nicely explained in this blog post
https://www.keithschwarz.com/interesting/code/?dir=find-duplicate
puredotaplayer@reddit
Clever solution except on a physical processor it is inefficient due to cache behaviour for a large N. A bitmap + iteration will be faster.
sviperll@reddit
Just XOR-ing everything is the most efficient solution
puredotaplayer@reddit
True but then it wont be simple to figure out how to do this with multiple duplicates.
Mysterious-Rent7233@reddit
By bitmap do you mean simply a bit array, but packed with no wasted space?
puredotaplayer@reddit
Yea exactly.
sweetno@reddit (OP)
Isn't the bitmap also random access?
puredotaplayer@reddit
Much more smaller in memory footprint and more likely to accommodate in L1
norude1@reddit
can't you xor all of them and then additionally xor all the numbers from 1 to N-1 there?
FuckOnion@reddit
The comments already make it clear, but this is a poorly posed problem. Solving this problem in O(n) time is trivial with no space restrictions. The author probably forgot to specify further restrictions.
TommyTheTiger@reddit
Yeah, the point of the solution is O(1) memory
digital_cucumber@reddit
If you mutate the input data, that shouldn't count as O(1) memory.
TommyTheTiger@reddit
You don't mutate the data
digital_cucumber@reddit
Ah yes, you are right, using the "two pointers" cycle detection it's indeed proper O(1), space-wise.
sweetno@reddit (OP)
It's an interview question. The fancier the solution the merrier.
sweetno@reddit (OP)
Good old LeetCode #287. One of those you think you know the solution, only to get dragged into the rabbit and tortoise hole.
m0j0m0j@reddit
It’s so funny to read “I thought this, and another guy thought that, but the interviewer found a completely mind-blowing solution!”
And it’s a classic coding puzzle that was solved and re-solved and discussed ad nauseam for probably decades at this point
trejj@reddit
I don't think the presented algorithm of walking through cycles, and requiring that one bit of space in the source array is available, is particularly good. It has data-dependent jumps all over memory, and will have to perform a cleanup pass at the end, to fix up memory.
Any algorithm that temp-mutates the original data source is bad, since if something goes wrong and the thread is interrupted, data will go into a bad state.
Far superior default code to the presented cycle finding algorithm is to just be simple, and create a zero-initialized contiguous bit array, e.g.
uint8_t data[(N+7)/8] = {0};(eithercallocorallocaif you want to stress about a memory allocation) and scan through the data, binary ORring ones into the dataset.Code with a variable size array for simplicity of presentation:
(
calloc/allocaimpl. left as an exercise to the reader)The above code will be a much more performant default implementation, than a "mutate the input array" solution, or a "Floyd's cycle detector" utilizing algorithm.
If N gets large, the above has the possibility to OpenMP
#parallel forthe code, which the tortoise&hare thing won't easily be able to.Plus, the above code is dead naive for junior readers to understand as well.
ta9876543205@reddit
Why not just sum all the values and subtract the expected sum?
TommyTheTiger@reddit
What if they overflow?
tankerdudeucsc@reddit
Math problem really, assuming it’s 1 and only 1 duplicate. Summation problem. Add all numbers together. Calculate the summation of 1..N.
Subtract summation from what you calculated. You have your duplicate.
O(N) run time and O(1) space.
Multiple duplicates gets a lot more complicated snd. You can’t do the simple summation trick.
You could do it in constant time but O(N) space (bitmap) for each number, but very memory intensive for a large N.
Xenarthran47@reddit
The range of 1 to N-1 sums to (N*(N-1))//2, so just add up the array and subtract the expected sum for O(N) no additional space? Unless I am drastically misunderstanding...
skyware@reddit
Hashmap. If you want no extra space, use the existing input array. And do cycle sort. When you find index that already has n-1 value, dupe
poco@reddit
If you are willing to use the extra space of a hash map then it is even simpler to create an array of bools of size N and write true at the index that matches the value of the original array. As soon as you hit a value that is already true then you found your duplicate.
Sopel97@reddit
the diagram for this algorithm on wikipedia is really bad and misleading
Supadoplex@reddit
Why don't you care if there is a duplicate in index 0?
confu2000@reddit
The values are in the range 1 to N-1. The indexes are from 0 to N-1.
Supadoplex@reddit
Ahh. This makes sense.
skyware@reddit
I think that's if there is only 1 dupe. This says there may be more than one. You just need to find one