Efficiently find an integer not in a set of size 40, 400, or 4000
Related to the classic problem find an integer not among four billion given ones but not exactly the same.
To clarify, by integers what I really mean is only a subset of its mathemtical definition. That is, assume there are only finite number of integers. Say in C++, they are int
in the range of [INT_MIN, INT_MAX]
.
Now given a std::vector<int>
(no duplicates) or std::unordered_set<int>
, whose size can be 40, 400, 4000 or so, but not too large, how to efficiently generate a number that is guaranteed to be not among the given ones?
If there is no worry for overflow, then I could multiply all nonzero ones together and add the product by 1. But there is. The adversary test cases could delibrately contain INT_MAX
.
I am more in favor of simple, non-random approaches. Is there any?
Thank you!
Update: to clear up ambiguity, let's say an unsorted std::vector<int>
which is guaranteed to have no duplicates. So I am asking if there is anything better than O(n log(n)). Also please note that test cases may contain both INT_MIN
and INT_MAX
.
c++ algorithm
|
show 13 more comments
Related to the classic problem find an integer not among four billion given ones but not exactly the same.
To clarify, by integers what I really mean is only a subset of its mathemtical definition. That is, assume there are only finite number of integers. Say in C++, they are int
in the range of [INT_MIN, INT_MAX]
.
Now given a std::vector<int>
(no duplicates) or std::unordered_set<int>
, whose size can be 40, 400, 4000 or so, but not too large, how to efficiently generate a number that is guaranteed to be not among the given ones?
If there is no worry for overflow, then I could multiply all nonzero ones together and add the product by 1. But there is. The adversary test cases could delibrately contain INT_MAX
.
I am more in favor of simple, non-random approaches. Is there any?
Thank you!
Update: to clear up ambiguity, let's say an unsorted std::vector<int>
which is guaranteed to have no duplicates. So I am asking if there is anything better than O(n log(n)). Also please note that test cases may contain both INT_MIN
and INT_MAX
.
c++ algorithm
2
sorting the vector can be done inO(n log(n))
, not sure if you can get anything more efficient
– user463035818
Dec 18 '18 at 14:41
2
Is the vector sorted? For the set it is trivial since it is sorted.
– NathanOliver
Dec 18 '18 at 14:41
1
@user463035818 if you can sort a vector inO(log(n))
, you should become very rich indeed.
– Walter
Dec 18 '18 at 14:43
1
@Walter meh it was a typo, no millions for me unfortunately
– user463035818
Dec 18 '18 at 14:44
1
Once sorted, you do: Last + 1?
– JVApen
Dec 18 '18 at 14:47
|
show 13 more comments
Related to the classic problem find an integer not among four billion given ones but not exactly the same.
To clarify, by integers what I really mean is only a subset of its mathemtical definition. That is, assume there are only finite number of integers. Say in C++, they are int
in the range of [INT_MIN, INT_MAX]
.
Now given a std::vector<int>
(no duplicates) or std::unordered_set<int>
, whose size can be 40, 400, 4000 or so, but not too large, how to efficiently generate a number that is guaranteed to be not among the given ones?
If there is no worry for overflow, then I could multiply all nonzero ones together and add the product by 1. But there is. The adversary test cases could delibrately contain INT_MAX
.
I am more in favor of simple, non-random approaches. Is there any?
Thank you!
Update: to clear up ambiguity, let's say an unsorted std::vector<int>
which is guaranteed to have no duplicates. So I am asking if there is anything better than O(n log(n)). Also please note that test cases may contain both INT_MIN
and INT_MAX
.
c++ algorithm
Related to the classic problem find an integer not among four billion given ones but not exactly the same.
To clarify, by integers what I really mean is only a subset of its mathemtical definition. That is, assume there are only finite number of integers. Say in C++, they are int
in the range of [INT_MIN, INT_MAX]
.
Now given a std::vector<int>
(no duplicates) or std::unordered_set<int>
, whose size can be 40, 400, 4000 or so, but not too large, how to efficiently generate a number that is guaranteed to be not among the given ones?
If there is no worry for overflow, then I could multiply all nonzero ones together and add the product by 1. But there is. The adversary test cases could delibrately contain INT_MAX
.
I am more in favor of simple, non-random approaches. Is there any?
Thank you!
Update: to clear up ambiguity, let's say an unsorted std::vector<int>
which is guaranteed to have no duplicates. So I am asking if there is anything better than O(n log(n)). Also please note that test cases may contain both INT_MIN
and INT_MAX
.
c++ algorithm
c++ algorithm
edited Dec 18 '18 at 15:17
asked Dec 18 '18 at 14:37
fleix
466110
466110
2
sorting the vector can be done inO(n log(n))
, not sure if you can get anything more efficient
– user463035818
Dec 18 '18 at 14:41
2
Is the vector sorted? For the set it is trivial since it is sorted.
– NathanOliver
Dec 18 '18 at 14:41
1
@user463035818 if you can sort a vector inO(log(n))
, you should become very rich indeed.
– Walter
Dec 18 '18 at 14:43
1
@Walter meh it was a typo, no millions for me unfortunately
– user463035818
Dec 18 '18 at 14:44
1
Once sorted, you do: Last + 1?
– JVApen
Dec 18 '18 at 14:47
|
show 13 more comments
2
sorting the vector can be done inO(n log(n))
, not sure if you can get anything more efficient
– user463035818
Dec 18 '18 at 14:41
2
Is the vector sorted? For the set it is trivial since it is sorted.
– NathanOliver
Dec 18 '18 at 14:41
1
@user463035818 if you can sort a vector inO(log(n))
, you should become very rich indeed.
– Walter
Dec 18 '18 at 14:43
1
@Walter meh it was a typo, no millions for me unfortunately
– user463035818
Dec 18 '18 at 14:44
1
Once sorted, you do: Last + 1?
– JVApen
Dec 18 '18 at 14:47
2
2
sorting the vector can be done in
O(n log(n))
, not sure if you can get anything more efficient– user463035818
Dec 18 '18 at 14:41
sorting the vector can be done in
O(n log(n))
, not sure if you can get anything more efficient– user463035818
Dec 18 '18 at 14:41
2
2
Is the vector sorted? For the set it is trivial since it is sorted.
– NathanOliver
Dec 18 '18 at 14:41
Is the vector sorted? For the set it is trivial since it is sorted.
– NathanOliver
Dec 18 '18 at 14:41
1
1
@user463035818 if you can sort a vector in
O(log(n))
, you should become very rich indeed.– Walter
Dec 18 '18 at 14:43
@user463035818 if you can sort a vector in
O(log(n))
, you should become very rich indeed.– Walter
Dec 18 '18 at 14:43
1
1
@Walter meh it was a typo, no millions for me unfortunately
– user463035818
Dec 18 '18 at 14:44
@Walter meh it was a typo, no millions for me unfortunately
– user463035818
Dec 18 '18 at 14:44
1
1
Once sorted, you do: Last + 1?
– JVApen
Dec 18 '18 at 14:47
Once sorted, you do: Last + 1?
– JVApen
Dec 18 '18 at 14:47
|
show 13 more comments
6 Answers
6
active
oldest
votes
You could just return the first of N+1
candidate integers not contained in your input. The simplest candidates are the numbers 0
to N
. This requires O(N)
space and time.
int find_not_contained(container<int> const&data)
{
const int N=data.size();
std::vector<char> known(N+1, 0); // one more candidates than data
for(int i=0; i< N; ++i)
if(data[i]>=0 && data[i]<=N)
known[data[i]]=1;
for(int i=0; i<=N; ++i)
if(!known[i])
return i;
assert(false); // should never be reached.
}
Random methods can be more space efficient, but may require more passes over the data in the worst case.
2
I like this. It avoids sorting and searching.
– fleix
Dec 18 '18 at 15:50
3
This is the standard algorithm for finding the smallest integer not in a given set
– DreamConspiracy
Dec 18 '18 at 22:20
@DreamConspiracy That makes sense. I didn't know that, though.
– Walter
Dec 19 '18 at 0:39
1
Fun fact: on a machine with scatter stores, like x86 with AVX512F, this algorithm can be vectorized. (At least with x86-style scatters where a mask controls which elements are actually scattered). And unlike a histogram problem where you have to do conflict detection for multiple vector elements hitting the same index, you're only ever storing a1
so it's fine. (x86 can only scatter with 32 or 64-bit granularity, so you'd have to promoteknown
toint
, which may not be worth it for largeN
. With large enoughN
, you'd want to use a bitmap instead of a char array.)
– Peter Cordes
Dec 19 '18 at 3:59
The search at the end can be vectorized easily, using whatever tricks are applicable for a hand-optimizedstrlen
. (e.g. modern x86 should be able to check 16 to 64 bytes per clock cycle, with a well-tuned loop using SSE2 or AVX2, depending on the hardware.)
– Peter Cordes
Dec 19 '18 at 4:03
add a comment |
Random methods are indeed very efficient here.
If we want to use a deterministic method and by assuming the size n is not too large, 4000 for example, then we can create a vector x of size m = n + 1
(or a little bit larger, 4096 for example to facilitate calculation), initialised with 0.
For each i
in the range, we just set x[array[i] modulo m] = 1.
Then a simple O(n) search in x will provide a value which is not in array
Note: the modulo operation is not exactly the "%" operation
Edit: I mentioned that calculations are made easier by selecting here a size of 4096. To be more concrete, this implies that the modulo operation is performed with a simple &
operation
add a comment |
You can find the smallest unused integer in O(N) time using O(1) auxiliary space if you are allowed to reorder the input vector, using the following algorithm. [Note 1] (The algorithm also works if the vector contains repeated data.)
size_t smallest_unused(std::vector<unsigned>& data) {
size_t N = data.size(), scan = 0;
while (scan < N) {
auto other = data[scan];
if (other < scan && data[other] != other) {
data[scan] = data[other];
data[other] = other;
}
else
++scan;
}
for (scan = 0; scan < N && data[scan] == scan; ++scan) { }
return scan;
}
The first pass guarantees that if some k
in the range [0, N)
was found after position k
, then it is now present at position k
. This rearrangement is done by swapping in order to avoid losing data. Once that scan is complete, the first entry whose value is not the same as its index is not referenced anywhere in the array.
That assertion may not be 100% obvious, since a entry could be referenced from an earlier index. However, in that case the entry could not be the first entry unequal to its index, since the earlier entry would be meet that criterion.
To see that this algorithm is O(N), it should be observed that the swap at lines 6 and 7 can only happen if the target entry is not equal to its index, and that after the swap the target entry is equal to its index. So at most N
swaps can be performed, and the if
condition at line 5 will be true
at most N
times. On the other hand, if the if
condition is false, scan
will be incremented, which can also only happen N
times. So the if
statement is evaluated at most 2N
times (which is O(N)).
Notes:
- I used unsigned integers here because it makes the code clearer. The algorithm can easily be adjusted for signed integers, for example by mapping signed integers from
[INT_MIN, 0)
onto unsigned integers[INT_MAX, INT_MAX - INT_MIN)
(The subtraction is mathematical, not according to C semantics which wouldn't allow the result to be represented.) In 2's-complement, that's the same bit pattern. That changes the order of the numbers, of course, which affects the semantics of "smallest unused integer"; an order-preserving mapping could also be used.
To handle OP's requirement smallest positive unused is sufficient, so an even easier way to handle signed integers is to not swap if the number is negative.
– Taemyr
Dec 19 '18 at 14:13
@taemyr: that's what happens if you use the "2's complement" mapping of negative integers, which doesn't require an extra test (assuming C in which the cast is a no-op). In general, if you want to find the smallest available number in some range, you can ignore any entries outside of that range and subtracting the beginning of the range from useful entries. I was going to add that observation but I was mostly interested in the elegance of the original algorithm.
– rici
Dec 19 '18 at 14:18
add a comment |
Make random x (INT_MIN..INT_MAX) and test it against all. Test x++ on failure (very rare case for 40/400/4000).
2
He is looking for an efficient way to do it, this would be O(n^2). Yes, it will be way better most times, but still very inefficient worst-case (and OP specifically asked for O, that's for worst case)
– dquijada
Dec 18 '18 at 14:54
1
@dquijada No, it isn't. Big O notation describes the asymptotic behavior of a function, but it does not mean it is the worst-case of anything. In other words, an algorithm may have different upper bounds depending on the input you are studying.
– Acorn
Dec 18 '18 at 15:15
1
@Acorn Typically with big-O notation you describe the worst-case time complexity of an algorithm, i.e. the worst time it can have for all inputs. Sure, a subset of the inputs may run in with a smaller complexity but worst-case means considering all possibilities. Big-O can also be used to describe the average-case complexity. However, as far as I know, the "default meaning" for the sentence "This algorithm takes O(f(n))" time is that O(f(n)) is the worst-case scenario, not the average case.
– Bakuriu
Dec 18 '18 at 23:31
1
@Bakuriu Big O notation has nothing to do with the type of complexity analyzed. It may not even be time complexity. While you can, of course, argue that one may assume we are talking about worst-case time complexity, dquijada said that "OP specifically asked for O, that's for worst case", which is simply not true. OP didn't specifically say anything about worst-case (quite the opposite: we can only assume), nor "O" means worst-case either (that is not true at all).
– Acorn
Dec 18 '18 at 23:35
3
Further, in this specific problem, this solution is actually very efficient in practice, because you almost never hit the worst case (and/or you can simply fallback to another solution if you detect you may be hitting it, e.g. after a few tries; giving you back a worst-case O(n) algorithm easily). So dismissing this solution by saying it is O(n^2) and claiming that OP wanted a good worst-case algorithm, is simply wrong.
– Acorn
Dec 18 '18 at 23:42
|
show 4 more comments
Step 1: Sort the vector.
That can be done in O(n log(n)), you can find a few different algorithms online, use the one you like the most.
Step 2: Find the first int not in the vector.
Easily iterate from INT_MIN to INT_MIN + 40/400/4000 checking if the vector has the current int:
Pseudocode:
SIZE = 40|400|4000 // The one you are using
for (int i = 0; i < SIZE; i++) {
if (array[i] != INT_MIN + i)
return INT_MIN + i;
The solution would be O(n log(n) + n) meaning: O(n log(n))
Edit: just read your edit asking for something better than O(n log(n)), sorry.
Thanks for the answer. A minor point: though it being pseudocode, it doesn't seemarray[i]
would make any sense wheni
starts fromINT_MIN
.
– fleix
Dec 18 '18 at 15:10
You are right, @fleix, it's fixed now
– dquijada
Dec 18 '18 at 15:14
1
You can do the search using binary search in O(log N).
– rici
Dec 18 '18 at 20:25
1
@rici: how would you binary search for "a missing number"? If you knew a number that was missing, you could find the position it belongs in a sorted array in log(n) time. I don't think you can do better than linear, although probably with a low constant factor. With a uniform distribution, a linear search will hit a gap inO(n / max_n)
time on average, but worst casen
(no gaps until the end.)
– Peter Cordes
Dec 18 '18 at 23:50
2
@peter: the same way you do a linear search, comparei + min
withvec[i]
. If they are unequal, there is a missing value beforei
, so you bisect down; otherwise you bisect up. Binary search works with any monotonic predicate.
– rici
Dec 19 '18 at 2:05
|
show 4 more comments
For the case in which the integers are provided in an std::unordered_set<int>
(as opposed to a std::vector<int>
), you could simply traverse the range of integer values until you come up against one integer value that is not present in the unordered_set<int>
. Searching for the presence of an integer in an std::unordered_set<int>
is quite straightforward, since std::unodered_set
does provide searching through its find()
member function.
The space complexity of this approach would be O(1).
If you start traversing at the lowest possible value for an int
(i.e., std::numeric_limits<int>::min()
), you will obtain the lowest int
not contained in the std::unordered_set<int>
:
int find_lowest_not_contained(const std::unordered_set<int>& set) {
for (auto i = std::numeric_limits<int>::min(); ; ++i) {
auto it = set.find(i); // search in set
if (it == set.end()) // integer not in set?
return *it;
}
}
Analogously, if you start traversing at the greatest possible value for an int
(i.e., std::numeric_limits<int>::max()
), you will obtain the lowest int
not contained in the std::unordered_set<int>
:
int find_greatest_not_contained(const std::unordered_set<int>& set) {
for (auto i = std::numeric_limits<int>::max(); ; --i) {
auto it = set.find(i); // search in set
if (it == set.end()) // integer not in set?
return *it;
}
}
Assuming that the int
s are uniformly mapped by the hash function into the unordered_set<int>
's buckets, a search operation on the unordered_set<int>
can be achieved in constant time. The run-time complexity would then be O(M ), where M is the size of the integer range you are looking for a non-contained value. M is upper-bounded by the size of the unordered_set<int>
(i.e., in your case M <= 4000).
Indeed, with this approach, selecting any integer range whose size is greater than the size of the unordered_set
, is guaranteed to come up against an integer value which is not present in the unordered_set<int>
.
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53835354%2fefficiently-find-an-integer-not-in-a-set-of-size-40-400-or-4000%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
6 Answers
6
active
oldest
votes
6 Answers
6
active
oldest
votes
active
oldest
votes
active
oldest
votes
You could just return the first of N+1
candidate integers not contained in your input. The simplest candidates are the numbers 0
to N
. This requires O(N)
space and time.
int find_not_contained(container<int> const&data)
{
const int N=data.size();
std::vector<char> known(N+1, 0); // one more candidates than data
for(int i=0; i< N; ++i)
if(data[i]>=0 && data[i]<=N)
known[data[i]]=1;
for(int i=0; i<=N; ++i)
if(!known[i])
return i;
assert(false); // should never be reached.
}
Random methods can be more space efficient, but may require more passes over the data in the worst case.
2
I like this. It avoids sorting and searching.
– fleix
Dec 18 '18 at 15:50
3
This is the standard algorithm for finding the smallest integer not in a given set
– DreamConspiracy
Dec 18 '18 at 22:20
@DreamConspiracy That makes sense. I didn't know that, though.
– Walter
Dec 19 '18 at 0:39
1
Fun fact: on a machine with scatter stores, like x86 with AVX512F, this algorithm can be vectorized. (At least with x86-style scatters where a mask controls which elements are actually scattered). And unlike a histogram problem where you have to do conflict detection for multiple vector elements hitting the same index, you're only ever storing a1
so it's fine. (x86 can only scatter with 32 or 64-bit granularity, so you'd have to promoteknown
toint
, which may not be worth it for largeN
. With large enoughN
, you'd want to use a bitmap instead of a char array.)
– Peter Cordes
Dec 19 '18 at 3:59
The search at the end can be vectorized easily, using whatever tricks are applicable for a hand-optimizedstrlen
. (e.g. modern x86 should be able to check 16 to 64 bytes per clock cycle, with a well-tuned loop using SSE2 or AVX2, depending on the hardware.)
– Peter Cordes
Dec 19 '18 at 4:03
add a comment |
You could just return the first of N+1
candidate integers not contained in your input. The simplest candidates are the numbers 0
to N
. This requires O(N)
space and time.
int find_not_contained(container<int> const&data)
{
const int N=data.size();
std::vector<char> known(N+1, 0); // one more candidates than data
for(int i=0; i< N; ++i)
if(data[i]>=0 && data[i]<=N)
known[data[i]]=1;
for(int i=0; i<=N; ++i)
if(!known[i])
return i;
assert(false); // should never be reached.
}
Random methods can be more space efficient, but may require more passes over the data in the worst case.
2
I like this. It avoids sorting and searching.
– fleix
Dec 18 '18 at 15:50
3
This is the standard algorithm for finding the smallest integer not in a given set
– DreamConspiracy
Dec 18 '18 at 22:20
@DreamConspiracy That makes sense. I didn't know that, though.
– Walter
Dec 19 '18 at 0:39
1
Fun fact: on a machine with scatter stores, like x86 with AVX512F, this algorithm can be vectorized. (At least with x86-style scatters where a mask controls which elements are actually scattered). And unlike a histogram problem where you have to do conflict detection for multiple vector elements hitting the same index, you're only ever storing a1
so it's fine. (x86 can only scatter with 32 or 64-bit granularity, so you'd have to promoteknown
toint
, which may not be worth it for largeN
. With large enoughN
, you'd want to use a bitmap instead of a char array.)
– Peter Cordes
Dec 19 '18 at 3:59
The search at the end can be vectorized easily, using whatever tricks are applicable for a hand-optimizedstrlen
. (e.g. modern x86 should be able to check 16 to 64 bytes per clock cycle, with a well-tuned loop using SSE2 or AVX2, depending on the hardware.)
– Peter Cordes
Dec 19 '18 at 4:03
add a comment |
You could just return the first of N+1
candidate integers not contained in your input. The simplest candidates are the numbers 0
to N
. This requires O(N)
space and time.
int find_not_contained(container<int> const&data)
{
const int N=data.size();
std::vector<char> known(N+1, 0); // one more candidates than data
for(int i=0; i< N; ++i)
if(data[i]>=0 && data[i]<=N)
known[data[i]]=1;
for(int i=0; i<=N; ++i)
if(!known[i])
return i;
assert(false); // should never be reached.
}
Random methods can be more space efficient, but may require more passes over the data in the worst case.
You could just return the first of N+1
candidate integers not contained in your input. The simplest candidates are the numbers 0
to N
. This requires O(N)
space and time.
int find_not_contained(container<int> const&data)
{
const int N=data.size();
std::vector<char> known(N+1, 0); // one more candidates than data
for(int i=0; i< N; ++i)
if(data[i]>=0 && data[i]<=N)
known[data[i]]=1;
for(int i=0; i<=N; ++i)
if(!known[i])
return i;
assert(false); // should never be reached.
}
Random methods can be more space efficient, but may require more passes over the data in the worst case.
edited Dec 18 '18 at 15:34
answered Dec 18 '18 at 15:15
Walter
27.1k1375153
27.1k1375153
2
I like this. It avoids sorting and searching.
– fleix
Dec 18 '18 at 15:50
3
This is the standard algorithm for finding the smallest integer not in a given set
– DreamConspiracy
Dec 18 '18 at 22:20
@DreamConspiracy That makes sense. I didn't know that, though.
– Walter
Dec 19 '18 at 0:39
1
Fun fact: on a machine with scatter stores, like x86 with AVX512F, this algorithm can be vectorized. (At least with x86-style scatters where a mask controls which elements are actually scattered). And unlike a histogram problem where you have to do conflict detection for multiple vector elements hitting the same index, you're only ever storing a1
so it's fine. (x86 can only scatter with 32 or 64-bit granularity, so you'd have to promoteknown
toint
, which may not be worth it for largeN
. With large enoughN
, you'd want to use a bitmap instead of a char array.)
– Peter Cordes
Dec 19 '18 at 3:59
The search at the end can be vectorized easily, using whatever tricks are applicable for a hand-optimizedstrlen
. (e.g. modern x86 should be able to check 16 to 64 bytes per clock cycle, with a well-tuned loop using SSE2 or AVX2, depending on the hardware.)
– Peter Cordes
Dec 19 '18 at 4:03
add a comment |
2
I like this. It avoids sorting and searching.
– fleix
Dec 18 '18 at 15:50
3
This is the standard algorithm for finding the smallest integer not in a given set
– DreamConspiracy
Dec 18 '18 at 22:20
@DreamConspiracy That makes sense. I didn't know that, though.
– Walter
Dec 19 '18 at 0:39
1
Fun fact: on a machine with scatter stores, like x86 with AVX512F, this algorithm can be vectorized. (At least with x86-style scatters where a mask controls which elements are actually scattered). And unlike a histogram problem where you have to do conflict detection for multiple vector elements hitting the same index, you're only ever storing a1
so it's fine. (x86 can only scatter with 32 or 64-bit granularity, so you'd have to promoteknown
toint
, which may not be worth it for largeN
. With large enoughN
, you'd want to use a bitmap instead of a char array.)
– Peter Cordes
Dec 19 '18 at 3:59
The search at the end can be vectorized easily, using whatever tricks are applicable for a hand-optimizedstrlen
. (e.g. modern x86 should be able to check 16 to 64 bytes per clock cycle, with a well-tuned loop using SSE2 or AVX2, depending on the hardware.)
– Peter Cordes
Dec 19 '18 at 4:03
2
2
I like this. It avoids sorting and searching.
– fleix
Dec 18 '18 at 15:50
I like this. It avoids sorting and searching.
– fleix
Dec 18 '18 at 15:50
3
3
This is the standard algorithm for finding the smallest integer not in a given set
– DreamConspiracy
Dec 18 '18 at 22:20
This is the standard algorithm for finding the smallest integer not in a given set
– DreamConspiracy
Dec 18 '18 at 22:20
@DreamConspiracy That makes sense. I didn't know that, though.
– Walter
Dec 19 '18 at 0:39
@DreamConspiracy That makes sense. I didn't know that, though.
– Walter
Dec 19 '18 at 0:39
1
1
Fun fact: on a machine with scatter stores, like x86 with AVX512F, this algorithm can be vectorized. (At least with x86-style scatters where a mask controls which elements are actually scattered). And unlike a histogram problem where you have to do conflict detection for multiple vector elements hitting the same index, you're only ever storing a
1
so it's fine. (x86 can only scatter with 32 or 64-bit granularity, so you'd have to promote known
to int
, which may not be worth it for large N
. With large enough N
, you'd want to use a bitmap instead of a char array.)– Peter Cordes
Dec 19 '18 at 3:59
Fun fact: on a machine with scatter stores, like x86 with AVX512F, this algorithm can be vectorized. (At least with x86-style scatters where a mask controls which elements are actually scattered). And unlike a histogram problem where you have to do conflict detection for multiple vector elements hitting the same index, you're only ever storing a
1
so it's fine. (x86 can only scatter with 32 or 64-bit granularity, so you'd have to promote known
to int
, which may not be worth it for large N
. With large enough N
, you'd want to use a bitmap instead of a char array.)– Peter Cordes
Dec 19 '18 at 3:59
The search at the end can be vectorized easily, using whatever tricks are applicable for a hand-optimized
strlen
. (e.g. modern x86 should be able to check 16 to 64 bytes per clock cycle, with a well-tuned loop using SSE2 or AVX2, depending on the hardware.)– Peter Cordes
Dec 19 '18 at 4:03
The search at the end can be vectorized easily, using whatever tricks are applicable for a hand-optimized
strlen
. (e.g. modern x86 should be able to check 16 to 64 bytes per clock cycle, with a well-tuned loop using SSE2 or AVX2, depending on the hardware.)– Peter Cordes
Dec 19 '18 at 4:03
add a comment |
Random methods are indeed very efficient here.
If we want to use a deterministic method and by assuming the size n is not too large, 4000 for example, then we can create a vector x of size m = n + 1
(or a little bit larger, 4096 for example to facilitate calculation), initialised with 0.
For each i
in the range, we just set x[array[i] modulo m] = 1.
Then a simple O(n) search in x will provide a value which is not in array
Note: the modulo operation is not exactly the "%" operation
Edit: I mentioned that calculations are made easier by selecting here a size of 4096. To be more concrete, this implies that the modulo operation is performed with a simple &
operation
add a comment |
Random methods are indeed very efficient here.
If we want to use a deterministic method and by assuming the size n is not too large, 4000 for example, then we can create a vector x of size m = n + 1
(or a little bit larger, 4096 for example to facilitate calculation), initialised with 0.
For each i
in the range, we just set x[array[i] modulo m] = 1.
Then a simple O(n) search in x will provide a value which is not in array
Note: the modulo operation is not exactly the "%" operation
Edit: I mentioned that calculations are made easier by selecting here a size of 4096. To be more concrete, this implies that the modulo operation is performed with a simple &
operation
add a comment |
Random methods are indeed very efficient here.
If we want to use a deterministic method and by assuming the size n is not too large, 4000 for example, then we can create a vector x of size m = n + 1
(or a little bit larger, 4096 for example to facilitate calculation), initialised with 0.
For each i
in the range, we just set x[array[i] modulo m] = 1.
Then a simple O(n) search in x will provide a value which is not in array
Note: the modulo operation is not exactly the "%" operation
Edit: I mentioned that calculations are made easier by selecting here a size of 4096. To be more concrete, this implies that the modulo operation is performed with a simple &
operation
Random methods are indeed very efficient here.
If we want to use a deterministic method and by assuming the size n is not too large, 4000 for example, then we can create a vector x of size m = n + 1
(or a little bit larger, 4096 for example to facilitate calculation), initialised with 0.
For each i
in the range, we just set x[array[i] modulo m] = 1.
Then a simple O(n) search in x will provide a value which is not in array
Note: the modulo operation is not exactly the "%" operation
Edit: I mentioned that calculations are made easier by selecting here a size of 4096. To be more concrete, this implies that the modulo operation is performed with a simple &
operation
edited Dec 18 '18 at 15:30
answered Dec 18 '18 at 15:12
Damien
62715
62715
add a comment |
add a comment |
You can find the smallest unused integer in O(N) time using O(1) auxiliary space if you are allowed to reorder the input vector, using the following algorithm. [Note 1] (The algorithm also works if the vector contains repeated data.)
size_t smallest_unused(std::vector<unsigned>& data) {
size_t N = data.size(), scan = 0;
while (scan < N) {
auto other = data[scan];
if (other < scan && data[other] != other) {
data[scan] = data[other];
data[other] = other;
}
else
++scan;
}
for (scan = 0; scan < N && data[scan] == scan; ++scan) { }
return scan;
}
The first pass guarantees that if some k
in the range [0, N)
was found after position k
, then it is now present at position k
. This rearrangement is done by swapping in order to avoid losing data. Once that scan is complete, the first entry whose value is not the same as its index is not referenced anywhere in the array.
That assertion may not be 100% obvious, since a entry could be referenced from an earlier index. However, in that case the entry could not be the first entry unequal to its index, since the earlier entry would be meet that criterion.
To see that this algorithm is O(N), it should be observed that the swap at lines 6 and 7 can only happen if the target entry is not equal to its index, and that after the swap the target entry is equal to its index. So at most N
swaps can be performed, and the if
condition at line 5 will be true
at most N
times. On the other hand, if the if
condition is false, scan
will be incremented, which can also only happen N
times. So the if
statement is evaluated at most 2N
times (which is O(N)).
Notes:
- I used unsigned integers here because it makes the code clearer. The algorithm can easily be adjusted for signed integers, for example by mapping signed integers from
[INT_MIN, 0)
onto unsigned integers[INT_MAX, INT_MAX - INT_MIN)
(The subtraction is mathematical, not according to C semantics which wouldn't allow the result to be represented.) In 2's-complement, that's the same bit pattern. That changes the order of the numbers, of course, which affects the semantics of "smallest unused integer"; an order-preserving mapping could also be used.
To handle OP's requirement smallest positive unused is sufficient, so an even easier way to handle signed integers is to not swap if the number is negative.
– Taemyr
Dec 19 '18 at 14:13
@taemyr: that's what happens if you use the "2's complement" mapping of negative integers, which doesn't require an extra test (assuming C in which the cast is a no-op). In general, if you want to find the smallest available number in some range, you can ignore any entries outside of that range and subtracting the beginning of the range from useful entries. I was going to add that observation but I was mostly interested in the elegance of the original algorithm.
– rici
Dec 19 '18 at 14:18
add a comment |
You can find the smallest unused integer in O(N) time using O(1) auxiliary space if you are allowed to reorder the input vector, using the following algorithm. [Note 1] (The algorithm also works if the vector contains repeated data.)
size_t smallest_unused(std::vector<unsigned>& data) {
size_t N = data.size(), scan = 0;
while (scan < N) {
auto other = data[scan];
if (other < scan && data[other] != other) {
data[scan] = data[other];
data[other] = other;
}
else
++scan;
}
for (scan = 0; scan < N && data[scan] == scan; ++scan) { }
return scan;
}
The first pass guarantees that if some k
in the range [0, N)
was found after position k
, then it is now present at position k
. This rearrangement is done by swapping in order to avoid losing data. Once that scan is complete, the first entry whose value is not the same as its index is not referenced anywhere in the array.
That assertion may not be 100% obvious, since a entry could be referenced from an earlier index. However, in that case the entry could not be the first entry unequal to its index, since the earlier entry would be meet that criterion.
To see that this algorithm is O(N), it should be observed that the swap at lines 6 and 7 can only happen if the target entry is not equal to its index, and that after the swap the target entry is equal to its index. So at most N
swaps can be performed, and the if
condition at line 5 will be true
at most N
times. On the other hand, if the if
condition is false, scan
will be incremented, which can also only happen N
times. So the if
statement is evaluated at most 2N
times (which is O(N)).
Notes:
- I used unsigned integers here because it makes the code clearer. The algorithm can easily be adjusted for signed integers, for example by mapping signed integers from
[INT_MIN, 0)
onto unsigned integers[INT_MAX, INT_MAX - INT_MIN)
(The subtraction is mathematical, not according to C semantics which wouldn't allow the result to be represented.) In 2's-complement, that's the same bit pattern. That changes the order of the numbers, of course, which affects the semantics of "smallest unused integer"; an order-preserving mapping could also be used.
To handle OP's requirement smallest positive unused is sufficient, so an even easier way to handle signed integers is to not swap if the number is negative.
– Taemyr
Dec 19 '18 at 14:13
@taemyr: that's what happens if you use the "2's complement" mapping of negative integers, which doesn't require an extra test (assuming C in which the cast is a no-op). In general, if you want to find the smallest available number in some range, you can ignore any entries outside of that range and subtracting the beginning of the range from useful entries. I was going to add that observation but I was mostly interested in the elegance of the original algorithm.
– rici
Dec 19 '18 at 14:18
add a comment |
You can find the smallest unused integer in O(N) time using O(1) auxiliary space if you are allowed to reorder the input vector, using the following algorithm. [Note 1] (The algorithm also works if the vector contains repeated data.)
size_t smallest_unused(std::vector<unsigned>& data) {
size_t N = data.size(), scan = 0;
while (scan < N) {
auto other = data[scan];
if (other < scan && data[other] != other) {
data[scan] = data[other];
data[other] = other;
}
else
++scan;
}
for (scan = 0; scan < N && data[scan] == scan; ++scan) { }
return scan;
}
The first pass guarantees that if some k
in the range [0, N)
was found after position k
, then it is now present at position k
. This rearrangement is done by swapping in order to avoid losing data. Once that scan is complete, the first entry whose value is not the same as its index is not referenced anywhere in the array.
That assertion may not be 100% obvious, since a entry could be referenced from an earlier index. However, in that case the entry could not be the first entry unequal to its index, since the earlier entry would be meet that criterion.
To see that this algorithm is O(N), it should be observed that the swap at lines 6 and 7 can only happen if the target entry is not equal to its index, and that after the swap the target entry is equal to its index. So at most N
swaps can be performed, and the if
condition at line 5 will be true
at most N
times. On the other hand, if the if
condition is false, scan
will be incremented, which can also only happen N
times. So the if
statement is evaluated at most 2N
times (which is O(N)).
Notes:
- I used unsigned integers here because it makes the code clearer. The algorithm can easily be adjusted for signed integers, for example by mapping signed integers from
[INT_MIN, 0)
onto unsigned integers[INT_MAX, INT_MAX - INT_MIN)
(The subtraction is mathematical, not according to C semantics which wouldn't allow the result to be represented.) In 2's-complement, that's the same bit pattern. That changes the order of the numbers, of course, which affects the semantics of "smallest unused integer"; an order-preserving mapping could also be used.
You can find the smallest unused integer in O(N) time using O(1) auxiliary space if you are allowed to reorder the input vector, using the following algorithm. [Note 1] (The algorithm also works if the vector contains repeated data.)
size_t smallest_unused(std::vector<unsigned>& data) {
size_t N = data.size(), scan = 0;
while (scan < N) {
auto other = data[scan];
if (other < scan && data[other] != other) {
data[scan] = data[other];
data[other] = other;
}
else
++scan;
}
for (scan = 0; scan < N && data[scan] == scan; ++scan) { }
return scan;
}
The first pass guarantees that if some k
in the range [0, N)
was found after position k
, then it is now present at position k
. This rearrangement is done by swapping in order to avoid losing data. Once that scan is complete, the first entry whose value is not the same as its index is not referenced anywhere in the array.
That assertion may not be 100% obvious, since a entry could be referenced from an earlier index. However, in that case the entry could not be the first entry unequal to its index, since the earlier entry would be meet that criterion.
To see that this algorithm is O(N), it should be observed that the swap at lines 6 and 7 can only happen if the target entry is not equal to its index, and that after the swap the target entry is equal to its index. So at most N
swaps can be performed, and the if
condition at line 5 will be true
at most N
times. On the other hand, if the if
condition is false, scan
will be incremented, which can also only happen N
times. So the if
statement is evaluated at most 2N
times (which is O(N)).
Notes:
- I used unsigned integers here because it makes the code clearer. The algorithm can easily be adjusted for signed integers, for example by mapping signed integers from
[INT_MIN, 0)
onto unsigned integers[INT_MAX, INT_MAX - INT_MIN)
(The subtraction is mathematical, not according to C semantics which wouldn't allow the result to be represented.) In 2's-complement, that's the same bit pattern. That changes the order of the numbers, of course, which affects the semantics of "smallest unused integer"; an order-preserving mapping could also be used.
edited Dec 19 '18 at 2:18
answered Dec 18 '18 at 20:13
rici
152k19132197
152k19132197
To handle OP's requirement smallest positive unused is sufficient, so an even easier way to handle signed integers is to not swap if the number is negative.
– Taemyr
Dec 19 '18 at 14:13
@taemyr: that's what happens if you use the "2's complement" mapping of negative integers, which doesn't require an extra test (assuming C in which the cast is a no-op). In general, if you want to find the smallest available number in some range, you can ignore any entries outside of that range and subtracting the beginning of the range from useful entries. I was going to add that observation but I was mostly interested in the elegance of the original algorithm.
– rici
Dec 19 '18 at 14:18
add a comment |
To handle OP's requirement smallest positive unused is sufficient, so an even easier way to handle signed integers is to not swap if the number is negative.
– Taemyr
Dec 19 '18 at 14:13
@taemyr: that's what happens if you use the "2's complement" mapping of negative integers, which doesn't require an extra test (assuming C in which the cast is a no-op). In general, if you want to find the smallest available number in some range, you can ignore any entries outside of that range and subtracting the beginning of the range from useful entries. I was going to add that observation but I was mostly interested in the elegance of the original algorithm.
– rici
Dec 19 '18 at 14:18
To handle OP's requirement smallest positive unused is sufficient, so an even easier way to handle signed integers is to not swap if the number is negative.
– Taemyr
Dec 19 '18 at 14:13
To handle OP's requirement smallest positive unused is sufficient, so an even easier way to handle signed integers is to not swap if the number is negative.
– Taemyr
Dec 19 '18 at 14:13
@taemyr: that's what happens if you use the "2's complement" mapping of negative integers, which doesn't require an extra test (assuming C in which the cast is a no-op). In general, if you want to find the smallest available number in some range, you can ignore any entries outside of that range and subtracting the beginning of the range from useful entries. I was going to add that observation but I was mostly interested in the elegance of the original algorithm.
– rici
Dec 19 '18 at 14:18
@taemyr: that's what happens if you use the "2's complement" mapping of negative integers, which doesn't require an extra test (assuming C in which the cast is a no-op). In general, if you want to find the smallest available number in some range, you can ignore any entries outside of that range and subtracting the beginning of the range from useful entries. I was going to add that observation but I was mostly interested in the elegance of the original algorithm.
– rici
Dec 19 '18 at 14:18
add a comment |
Make random x (INT_MIN..INT_MAX) and test it against all. Test x++ on failure (very rare case for 40/400/4000).
2
He is looking for an efficient way to do it, this would be O(n^2). Yes, it will be way better most times, but still very inefficient worst-case (and OP specifically asked for O, that's for worst case)
– dquijada
Dec 18 '18 at 14:54
1
@dquijada No, it isn't. Big O notation describes the asymptotic behavior of a function, but it does not mean it is the worst-case of anything. In other words, an algorithm may have different upper bounds depending on the input you are studying.
– Acorn
Dec 18 '18 at 15:15
1
@Acorn Typically with big-O notation you describe the worst-case time complexity of an algorithm, i.e. the worst time it can have for all inputs. Sure, a subset of the inputs may run in with a smaller complexity but worst-case means considering all possibilities. Big-O can also be used to describe the average-case complexity. However, as far as I know, the "default meaning" for the sentence "This algorithm takes O(f(n))" time is that O(f(n)) is the worst-case scenario, not the average case.
– Bakuriu
Dec 18 '18 at 23:31
1
@Bakuriu Big O notation has nothing to do with the type of complexity analyzed. It may not even be time complexity. While you can, of course, argue that one may assume we are talking about worst-case time complexity, dquijada said that "OP specifically asked for O, that's for worst case", which is simply not true. OP didn't specifically say anything about worst-case (quite the opposite: we can only assume), nor "O" means worst-case either (that is not true at all).
– Acorn
Dec 18 '18 at 23:35
3
Further, in this specific problem, this solution is actually very efficient in practice, because you almost never hit the worst case (and/or you can simply fallback to another solution if you detect you may be hitting it, e.g. after a few tries; giving you back a worst-case O(n) algorithm easily). So dismissing this solution by saying it is O(n^2) and claiming that OP wanted a good worst-case algorithm, is simply wrong.
– Acorn
Dec 18 '18 at 23:42
|
show 4 more comments
Make random x (INT_MIN..INT_MAX) and test it against all. Test x++ on failure (very rare case for 40/400/4000).
2
He is looking for an efficient way to do it, this would be O(n^2). Yes, it will be way better most times, but still very inefficient worst-case (and OP specifically asked for O, that's for worst case)
– dquijada
Dec 18 '18 at 14:54
1
@dquijada No, it isn't. Big O notation describes the asymptotic behavior of a function, but it does not mean it is the worst-case of anything. In other words, an algorithm may have different upper bounds depending on the input you are studying.
– Acorn
Dec 18 '18 at 15:15
1
@Acorn Typically with big-O notation you describe the worst-case time complexity of an algorithm, i.e. the worst time it can have for all inputs. Sure, a subset of the inputs may run in with a smaller complexity but worst-case means considering all possibilities. Big-O can also be used to describe the average-case complexity. However, as far as I know, the "default meaning" for the sentence "This algorithm takes O(f(n))" time is that O(f(n)) is the worst-case scenario, not the average case.
– Bakuriu
Dec 18 '18 at 23:31
1
@Bakuriu Big O notation has nothing to do with the type of complexity analyzed. It may not even be time complexity. While you can, of course, argue that one may assume we are talking about worst-case time complexity, dquijada said that "OP specifically asked for O, that's for worst case", which is simply not true. OP didn't specifically say anything about worst-case (quite the opposite: we can only assume), nor "O" means worst-case either (that is not true at all).
– Acorn
Dec 18 '18 at 23:35
3
Further, in this specific problem, this solution is actually very efficient in practice, because you almost never hit the worst case (and/or you can simply fallback to another solution if you detect you may be hitting it, e.g. after a few tries; giving you back a worst-case O(n) algorithm easily). So dismissing this solution by saying it is O(n^2) and claiming that OP wanted a good worst-case algorithm, is simply wrong.
– Acorn
Dec 18 '18 at 23:42
|
show 4 more comments
Make random x (INT_MIN..INT_MAX) and test it against all. Test x++ on failure (very rare case for 40/400/4000).
Make random x (INT_MIN..INT_MAX) and test it against all. Test x++ on failure (very rare case for 40/400/4000).
answered Dec 18 '18 at 14:52
Alexey Birukov
475213
475213
2
He is looking for an efficient way to do it, this would be O(n^2). Yes, it will be way better most times, but still very inefficient worst-case (and OP specifically asked for O, that's for worst case)
– dquijada
Dec 18 '18 at 14:54
1
@dquijada No, it isn't. Big O notation describes the asymptotic behavior of a function, but it does not mean it is the worst-case of anything. In other words, an algorithm may have different upper bounds depending on the input you are studying.
– Acorn
Dec 18 '18 at 15:15
1
@Acorn Typically with big-O notation you describe the worst-case time complexity of an algorithm, i.e. the worst time it can have for all inputs. Sure, a subset of the inputs may run in with a smaller complexity but worst-case means considering all possibilities. Big-O can also be used to describe the average-case complexity. However, as far as I know, the "default meaning" for the sentence "This algorithm takes O(f(n))" time is that O(f(n)) is the worst-case scenario, not the average case.
– Bakuriu
Dec 18 '18 at 23:31
1
@Bakuriu Big O notation has nothing to do with the type of complexity analyzed. It may not even be time complexity. While you can, of course, argue that one may assume we are talking about worst-case time complexity, dquijada said that "OP specifically asked for O, that's for worst case", which is simply not true. OP didn't specifically say anything about worst-case (quite the opposite: we can only assume), nor "O" means worst-case either (that is not true at all).
– Acorn
Dec 18 '18 at 23:35
3
Further, in this specific problem, this solution is actually very efficient in practice, because you almost never hit the worst case (and/or you can simply fallback to another solution if you detect you may be hitting it, e.g. after a few tries; giving you back a worst-case O(n) algorithm easily). So dismissing this solution by saying it is O(n^2) and claiming that OP wanted a good worst-case algorithm, is simply wrong.
– Acorn
Dec 18 '18 at 23:42
|
show 4 more comments
2
He is looking for an efficient way to do it, this would be O(n^2). Yes, it will be way better most times, but still very inefficient worst-case (and OP specifically asked for O, that's for worst case)
– dquijada
Dec 18 '18 at 14:54
1
@dquijada No, it isn't. Big O notation describes the asymptotic behavior of a function, but it does not mean it is the worst-case of anything. In other words, an algorithm may have different upper bounds depending on the input you are studying.
– Acorn
Dec 18 '18 at 15:15
1
@Acorn Typically with big-O notation you describe the worst-case time complexity of an algorithm, i.e. the worst time it can have for all inputs. Sure, a subset of the inputs may run in with a smaller complexity but worst-case means considering all possibilities. Big-O can also be used to describe the average-case complexity. However, as far as I know, the "default meaning" for the sentence "This algorithm takes O(f(n))" time is that O(f(n)) is the worst-case scenario, not the average case.
– Bakuriu
Dec 18 '18 at 23:31
1
@Bakuriu Big O notation has nothing to do with the type of complexity analyzed. It may not even be time complexity. While you can, of course, argue that one may assume we are talking about worst-case time complexity, dquijada said that "OP specifically asked for O, that's for worst case", which is simply not true. OP didn't specifically say anything about worst-case (quite the opposite: we can only assume), nor "O" means worst-case either (that is not true at all).
– Acorn
Dec 18 '18 at 23:35
3
Further, in this specific problem, this solution is actually very efficient in practice, because you almost never hit the worst case (and/or you can simply fallback to another solution if you detect you may be hitting it, e.g. after a few tries; giving you back a worst-case O(n) algorithm easily). So dismissing this solution by saying it is O(n^2) and claiming that OP wanted a good worst-case algorithm, is simply wrong.
– Acorn
Dec 18 '18 at 23:42
2
2
He is looking for an efficient way to do it, this would be O(n^2). Yes, it will be way better most times, but still very inefficient worst-case (and OP specifically asked for O, that's for worst case)
– dquijada
Dec 18 '18 at 14:54
He is looking for an efficient way to do it, this would be O(n^2). Yes, it will be way better most times, but still very inefficient worst-case (and OP specifically asked for O, that's for worst case)
– dquijada
Dec 18 '18 at 14:54
1
1
@dquijada No, it isn't. Big O notation describes the asymptotic behavior of a function, but it does not mean it is the worst-case of anything. In other words, an algorithm may have different upper bounds depending on the input you are studying.
– Acorn
Dec 18 '18 at 15:15
@dquijada No, it isn't. Big O notation describes the asymptotic behavior of a function, but it does not mean it is the worst-case of anything. In other words, an algorithm may have different upper bounds depending on the input you are studying.
– Acorn
Dec 18 '18 at 15:15
1
1
@Acorn Typically with big-O notation you describe the worst-case time complexity of an algorithm, i.e. the worst time it can have for all inputs. Sure, a subset of the inputs may run in with a smaller complexity but worst-case means considering all possibilities. Big-O can also be used to describe the average-case complexity. However, as far as I know, the "default meaning" for the sentence "This algorithm takes O(f(n))" time is that O(f(n)) is the worst-case scenario, not the average case.
– Bakuriu
Dec 18 '18 at 23:31
@Acorn Typically with big-O notation you describe the worst-case time complexity of an algorithm, i.e. the worst time it can have for all inputs. Sure, a subset of the inputs may run in with a smaller complexity but worst-case means considering all possibilities. Big-O can also be used to describe the average-case complexity. However, as far as I know, the "default meaning" for the sentence "This algorithm takes O(f(n))" time is that O(f(n)) is the worst-case scenario, not the average case.
– Bakuriu
Dec 18 '18 at 23:31
1
1
@Bakuriu Big O notation has nothing to do with the type of complexity analyzed. It may not even be time complexity. While you can, of course, argue that one may assume we are talking about worst-case time complexity, dquijada said that "OP specifically asked for O, that's for worst case", which is simply not true. OP didn't specifically say anything about worst-case (quite the opposite: we can only assume), nor "O" means worst-case either (that is not true at all).
– Acorn
Dec 18 '18 at 23:35
@Bakuriu Big O notation has nothing to do with the type of complexity analyzed. It may not even be time complexity. While you can, of course, argue that one may assume we are talking about worst-case time complexity, dquijada said that "OP specifically asked for O, that's for worst case", which is simply not true. OP didn't specifically say anything about worst-case (quite the opposite: we can only assume), nor "O" means worst-case either (that is not true at all).
– Acorn
Dec 18 '18 at 23:35
3
3
Further, in this specific problem, this solution is actually very efficient in practice, because you almost never hit the worst case (and/or you can simply fallback to another solution if you detect you may be hitting it, e.g. after a few tries; giving you back a worst-case O(n) algorithm easily). So dismissing this solution by saying it is O(n^2) and claiming that OP wanted a good worst-case algorithm, is simply wrong.
– Acorn
Dec 18 '18 at 23:42
Further, in this specific problem, this solution is actually very efficient in practice, because you almost never hit the worst case (and/or you can simply fallback to another solution if you detect you may be hitting it, e.g. after a few tries; giving you back a worst-case O(n) algorithm easily). So dismissing this solution by saying it is O(n^2) and claiming that OP wanted a good worst-case algorithm, is simply wrong.
– Acorn
Dec 18 '18 at 23:42
|
show 4 more comments
Step 1: Sort the vector.
That can be done in O(n log(n)), you can find a few different algorithms online, use the one you like the most.
Step 2: Find the first int not in the vector.
Easily iterate from INT_MIN to INT_MIN + 40/400/4000 checking if the vector has the current int:
Pseudocode:
SIZE = 40|400|4000 // The one you are using
for (int i = 0; i < SIZE; i++) {
if (array[i] != INT_MIN + i)
return INT_MIN + i;
The solution would be O(n log(n) + n) meaning: O(n log(n))
Edit: just read your edit asking for something better than O(n log(n)), sorry.
Thanks for the answer. A minor point: though it being pseudocode, it doesn't seemarray[i]
would make any sense wheni
starts fromINT_MIN
.
– fleix
Dec 18 '18 at 15:10
You are right, @fleix, it's fixed now
– dquijada
Dec 18 '18 at 15:14
1
You can do the search using binary search in O(log N).
– rici
Dec 18 '18 at 20:25
1
@rici: how would you binary search for "a missing number"? If you knew a number that was missing, you could find the position it belongs in a sorted array in log(n) time. I don't think you can do better than linear, although probably with a low constant factor. With a uniform distribution, a linear search will hit a gap inO(n / max_n)
time on average, but worst casen
(no gaps until the end.)
– Peter Cordes
Dec 18 '18 at 23:50
2
@peter: the same way you do a linear search, comparei + min
withvec[i]
. If they are unequal, there is a missing value beforei
, so you bisect down; otherwise you bisect up. Binary search works with any monotonic predicate.
– rici
Dec 19 '18 at 2:05
|
show 4 more comments
Step 1: Sort the vector.
That can be done in O(n log(n)), you can find a few different algorithms online, use the one you like the most.
Step 2: Find the first int not in the vector.
Easily iterate from INT_MIN to INT_MIN + 40/400/4000 checking if the vector has the current int:
Pseudocode:
SIZE = 40|400|4000 // The one you are using
for (int i = 0; i < SIZE; i++) {
if (array[i] != INT_MIN + i)
return INT_MIN + i;
The solution would be O(n log(n) + n) meaning: O(n log(n))
Edit: just read your edit asking for something better than O(n log(n)), sorry.
Thanks for the answer. A minor point: though it being pseudocode, it doesn't seemarray[i]
would make any sense wheni
starts fromINT_MIN
.
– fleix
Dec 18 '18 at 15:10
You are right, @fleix, it's fixed now
– dquijada
Dec 18 '18 at 15:14
1
You can do the search using binary search in O(log N).
– rici
Dec 18 '18 at 20:25
1
@rici: how would you binary search for "a missing number"? If you knew a number that was missing, you could find the position it belongs in a sorted array in log(n) time. I don't think you can do better than linear, although probably with a low constant factor. With a uniform distribution, a linear search will hit a gap inO(n / max_n)
time on average, but worst casen
(no gaps until the end.)
– Peter Cordes
Dec 18 '18 at 23:50
2
@peter: the same way you do a linear search, comparei + min
withvec[i]
. If they are unequal, there is a missing value beforei
, so you bisect down; otherwise you bisect up. Binary search works with any monotonic predicate.
– rici
Dec 19 '18 at 2:05
|
show 4 more comments
Step 1: Sort the vector.
That can be done in O(n log(n)), you can find a few different algorithms online, use the one you like the most.
Step 2: Find the first int not in the vector.
Easily iterate from INT_MIN to INT_MIN + 40/400/4000 checking if the vector has the current int:
Pseudocode:
SIZE = 40|400|4000 // The one you are using
for (int i = 0; i < SIZE; i++) {
if (array[i] != INT_MIN + i)
return INT_MIN + i;
The solution would be O(n log(n) + n) meaning: O(n log(n))
Edit: just read your edit asking for something better than O(n log(n)), sorry.
Step 1: Sort the vector.
That can be done in O(n log(n)), you can find a few different algorithms online, use the one you like the most.
Step 2: Find the first int not in the vector.
Easily iterate from INT_MIN to INT_MIN + 40/400/4000 checking if the vector has the current int:
Pseudocode:
SIZE = 40|400|4000 // The one you are using
for (int i = 0; i < SIZE; i++) {
if (array[i] != INT_MIN + i)
return INT_MIN + i;
The solution would be O(n log(n) + n) meaning: O(n log(n))
Edit: just read your edit asking for something better than O(n log(n)), sorry.
edited Dec 18 '18 at 15:14
answered Dec 18 '18 at 14:52
dquijada
1,062820
1,062820
Thanks for the answer. A minor point: though it being pseudocode, it doesn't seemarray[i]
would make any sense wheni
starts fromINT_MIN
.
– fleix
Dec 18 '18 at 15:10
You are right, @fleix, it's fixed now
– dquijada
Dec 18 '18 at 15:14
1
You can do the search using binary search in O(log N).
– rici
Dec 18 '18 at 20:25
1
@rici: how would you binary search for "a missing number"? If you knew a number that was missing, you could find the position it belongs in a sorted array in log(n) time. I don't think you can do better than linear, although probably with a low constant factor. With a uniform distribution, a linear search will hit a gap inO(n / max_n)
time on average, but worst casen
(no gaps until the end.)
– Peter Cordes
Dec 18 '18 at 23:50
2
@peter: the same way you do a linear search, comparei + min
withvec[i]
. If they are unequal, there is a missing value beforei
, so you bisect down; otherwise you bisect up. Binary search works with any monotonic predicate.
– rici
Dec 19 '18 at 2:05
|
show 4 more comments
Thanks for the answer. A minor point: though it being pseudocode, it doesn't seemarray[i]
would make any sense wheni
starts fromINT_MIN
.
– fleix
Dec 18 '18 at 15:10
You are right, @fleix, it's fixed now
– dquijada
Dec 18 '18 at 15:14
1
You can do the search using binary search in O(log N).
– rici
Dec 18 '18 at 20:25
1
@rici: how would you binary search for "a missing number"? If you knew a number that was missing, you could find the position it belongs in a sorted array in log(n) time. I don't think you can do better than linear, although probably with a low constant factor. With a uniform distribution, a linear search will hit a gap inO(n / max_n)
time on average, but worst casen
(no gaps until the end.)
– Peter Cordes
Dec 18 '18 at 23:50
2
@peter: the same way you do a linear search, comparei + min
withvec[i]
. If they are unequal, there is a missing value beforei
, so you bisect down; otherwise you bisect up. Binary search works with any monotonic predicate.
– rici
Dec 19 '18 at 2:05
Thanks for the answer. A minor point: though it being pseudocode, it doesn't seem
array[i]
would make any sense when i
starts from INT_MIN
.– fleix
Dec 18 '18 at 15:10
Thanks for the answer. A minor point: though it being pseudocode, it doesn't seem
array[i]
would make any sense when i
starts from INT_MIN
.– fleix
Dec 18 '18 at 15:10
You are right, @fleix, it's fixed now
– dquijada
Dec 18 '18 at 15:14
You are right, @fleix, it's fixed now
– dquijada
Dec 18 '18 at 15:14
1
1
You can do the search using binary search in O(log N).
– rici
Dec 18 '18 at 20:25
You can do the search using binary search in O(log N).
– rici
Dec 18 '18 at 20:25
1
1
@rici: how would you binary search for "a missing number"? If you knew a number that was missing, you could find the position it belongs in a sorted array in log(n) time. I don't think you can do better than linear, although probably with a low constant factor. With a uniform distribution, a linear search will hit a gap in
O(n / max_n)
time on average, but worst case n
(no gaps until the end.)– Peter Cordes
Dec 18 '18 at 23:50
@rici: how would you binary search for "a missing number"? If you knew a number that was missing, you could find the position it belongs in a sorted array in log(n) time. I don't think you can do better than linear, although probably with a low constant factor. With a uniform distribution, a linear search will hit a gap in
O(n / max_n)
time on average, but worst case n
(no gaps until the end.)– Peter Cordes
Dec 18 '18 at 23:50
2
2
@peter: the same way you do a linear search, compare
i + min
with vec[i]
. If they are unequal, there is a missing value before i
, so you bisect down; otherwise you bisect up. Binary search works with any monotonic predicate.– rici
Dec 19 '18 at 2:05
@peter: the same way you do a linear search, compare
i + min
with vec[i]
. If they are unequal, there is a missing value before i
, so you bisect down; otherwise you bisect up. Binary search works with any monotonic predicate.– rici
Dec 19 '18 at 2:05
|
show 4 more comments
For the case in which the integers are provided in an std::unordered_set<int>
(as opposed to a std::vector<int>
), you could simply traverse the range of integer values until you come up against one integer value that is not present in the unordered_set<int>
. Searching for the presence of an integer in an std::unordered_set<int>
is quite straightforward, since std::unodered_set
does provide searching through its find()
member function.
The space complexity of this approach would be O(1).
If you start traversing at the lowest possible value for an int
(i.e., std::numeric_limits<int>::min()
), you will obtain the lowest int
not contained in the std::unordered_set<int>
:
int find_lowest_not_contained(const std::unordered_set<int>& set) {
for (auto i = std::numeric_limits<int>::min(); ; ++i) {
auto it = set.find(i); // search in set
if (it == set.end()) // integer not in set?
return *it;
}
}
Analogously, if you start traversing at the greatest possible value for an int
(i.e., std::numeric_limits<int>::max()
), you will obtain the lowest int
not contained in the std::unordered_set<int>
:
int find_greatest_not_contained(const std::unordered_set<int>& set) {
for (auto i = std::numeric_limits<int>::max(); ; --i) {
auto it = set.find(i); // search in set
if (it == set.end()) // integer not in set?
return *it;
}
}
Assuming that the int
s are uniformly mapped by the hash function into the unordered_set<int>
's buckets, a search operation on the unordered_set<int>
can be achieved in constant time. The run-time complexity would then be O(M ), where M is the size of the integer range you are looking for a non-contained value. M is upper-bounded by the size of the unordered_set<int>
(i.e., in your case M <= 4000).
Indeed, with this approach, selecting any integer range whose size is greater than the size of the unordered_set
, is guaranteed to come up against an integer value which is not present in the unordered_set<int>
.
add a comment |
For the case in which the integers are provided in an std::unordered_set<int>
(as opposed to a std::vector<int>
), you could simply traverse the range of integer values until you come up against one integer value that is not present in the unordered_set<int>
. Searching for the presence of an integer in an std::unordered_set<int>
is quite straightforward, since std::unodered_set
does provide searching through its find()
member function.
The space complexity of this approach would be O(1).
If you start traversing at the lowest possible value for an int
(i.e., std::numeric_limits<int>::min()
), you will obtain the lowest int
not contained in the std::unordered_set<int>
:
int find_lowest_not_contained(const std::unordered_set<int>& set) {
for (auto i = std::numeric_limits<int>::min(); ; ++i) {
auto it = set.find(i); // search in set
if (it == set.end()) // integer not in set?
return *it;
}
}
Analogously, if you start traversing at the greatest possible value for an int
(i.e., std::numeric_limits<int>::max()
), you will obtain the lowest int
not contained in the std::unordered_set<int>
:
int find_greatest_not_contained(const std::unordered_set<int>& set) {
for (auto i = std::numeric_limits<int>::max(); ; --i) {
auto it = set.find(i); // search in set
if (it == set.end()) // integer not in set?
return *it;
}
}
Assuming that the int
s are uniformly mapped by the hash function into the unordered_set<int>
's buckets, a search operation on the unordered_set<int>
can be achieved in constant time. The run-time complexity would then be O(M ), where M is the size of the integer range you are looking for a non-contained value. M is upper-bounded by the size of the unordered_set<int>
(i.e., in your case M <= 4000).
Indeed, with this approach, selecting any integer range whose size is greater than the size of the unordered_set
, is guaranteed to come up against an integer value which is not present in the unordered_set<int>
.
add a comment |
For the case in which the integers are provided in an std::unordered_set<int>
(as opposed to a std::vector<int>
), you could simply traverse the range of integer values until you come up against one integer value that is not present in the unordered_set<int>
. Searching for the presence of an integer in an std::unordered_set<int>
is quite straightforward, since std::unodered_set
does provide searching through its find()
member function.
The space complexity of this approach would be O(1).
If you start traversing at the lowest possible value for an int
(i.e., std::numeric_limits<int>::min()
), you will obtain the lowest int
not contained in the std::unordered_set<int>
:
int find_lowest_not_contained(const std::unordered_set<int>& set) {
for (auto i = std::numeric_limits<int>::min(); ; ++i) {
auto it = set.find(i); // search in set
if (it == set.end()) // integer not in set?
return *it;
}
}
Analogously, if you start traversing at the greatest possible value for an int
(i.e., std::numeric_limits<int>::max()
), you will obtain the lowest int
not contained in the std::unordered_set<int>
:
int find_greatest_not_contained(const std::unordered_set<int>& set) {
for (auto i = std::numeric_limits<int>::max(); ; --i) {
auto it = set.find(i); // search in set
if (it == set.end()) // integer not in set?
return *it;
}
}
Assuming that the int
s are uniformly mapped by the hash function into the unordered_set<int>
's buckets, a search operation on the unordered_set<int>
can be achieved in constant time. The run-time complexity would then be O(M ), where M is the size of the integer range you are looking for a non-contained value. M is upper-bounded by the size of the unordered_set<int>
(i.e., in your case M <= 4000).
Indeed, with this approach, selecting any integer range whose size is greater than the size of the unordered_set
, is guaranteed to come up against an integer value which is not present in the unordered_set<int>
.
For the case in which the integers are provided in an std::unordered_set<int>
(as opposed to a std::vector<int>
), you could simply traverse the range of integer values until you come up against one integer value that is not present in the unordered_set<int>
. Searching for the presence of an integer in an std::unordered_set<int>
is quite straightforward, since std::unodered_set
does provide searching through its find()
member function.
The space complexity of this approach would be O(1).
If you start traversing at the lowest possible value for an int
(i.e., std::numeric_limits<int>::min()
), you will obtain the lowest int
not contained in the std::unordered_set<int>
:
int find_lowest_not_contained(const std::unordered_set<int>& set) {
for (auto i = std::numeric_limits<int>::min(); ; ++i) {
auto it = set.find(i); // search in set
if (it == set.end()) // integer not in set?
return *it;
}
}
Analogously, if you start traversing at the greatest possible value for an int
(i.e., std::numeric_limits<int>::max()
), you will obtain the lowest int
not contained in the std::unordered_set<int>
:
int find_greatest_not_contained(const std::unordered_set<int>& set) {
for (auto i = std::numeric_limits<int>::max(); ; --i) {
auto it = set.find(i); // search in set
if (it == set.end()) // integer not in set?
return *it;
}
}
Assuming that the int
s are uniformly mapped by the hash function into the unordered_set<int>
's buckets, a search operation on the unordered_set<int>
can be achieved in constant time. The run-time complexity would then be O(M ), where M is the size of the integer range you are looking for a non-contained value. M is upper-bounded by the size of the unordered_set<int>
(i.e., in your case M <= 4000).
Indeed, with this approach, selecting any integer range whose size is greater than the size of the unordered_set
, is guaranteed to come up against an integer value which is not present in the unordered_set<int>
.
edited Dec 19 '18 at 9:50
answered Dec 19 '18 at 9:18
ネロク
10.5k32140
10.5k32140
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53835354%2fefficiently-find-an-integer-not-in-a-set-of-size-40-400-or-4000%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
2
sorting the vector can be done in
O(n log(n))
, not sure if you can get anything more efficient– user463035818
Dec 18 '18 at 14:41
2
Is the vector sorted? For the set it is trivial since it is sorted.
– NathanOliver
Dec 18 '18 at 14:41
1
@user463035818 if you can sort a vector in
O(log(n))
, you should become very rich indeed.– Walter
Dec 18 '18 at 14:43
1
@Walter meh it was a typo, no millions for me unfortunately
– user463035818
Dec 18 '18 at 14:44
1
Once sorted, you do: Last + 1?
– JVApen
Dec 18 '18 at 14:47