AWK: get random lines of file satisfying a condition?
up vote
6
down vote
favorite
I am trying to get a set number of random lines that satisfy a condition.
For example, if my file were:
a 1 5
b 4 12
c 2 3
e 6 14
f 7 52
g 1 8
then I would like exactly two random lines where the difference between column 3 and column 2 is greater than 3 but less than 10 (i.e. only the lines starting with a, b, e, and g would qualify).
How would I approach this?
awk (if something and random) '{print $1,$2,$3}'
awk
edited Nov 25 at 23:37
Rui F Ribeiro
asked Jun 22 '17 at 20:22
SumNeuron
4 Answers
up vote
11
down vote
You can do this in awk, but getting the random selection of lines will be complex and will require writing quite a bit of code. I would instead use awk to get the lines that match your criteria and then use the standard tool shuf to choose a random selection:
$ awk '$3-$2>3 && $3-$2 < 10' file | shuf -n2
g 1 8
a 1 5
If you run this a few times, you'll see you get a random selection of lines:
$ for i in {1..5}; do awk '$3-$2>3 && $3-$2 < 10' file | shuf -n2; echo "--"; done
g 1 8
e 6 14
--
g 1 8
e 6 14
--
b 4 12
g 1 8
--
b 4 12
e 6 14
--
e 6 14
b 4 12
--
The shuf tool is part of GNU coreutils, so it should be installed by default on almost any Linux system and be easily available for almost any *nix.
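The filter-then-shuffle pipeline above can be wrapped in a small shell function so that the sample size and the bounds become parameters. This is only a sketch: the function name sample_matching, its argument order, and the temporary data file are made up here for illustration, and it assumes GNU shuf is on the PATH.

```shell
#!/bin/sh
# sample_matching N LO HI FILE  (hypothetical helper, not part of the answer)
# Print N random lines of FILE whose ($3 - $2) lies strictly between LO and HI.
sample_matching() {
    awk -v lo="$2" -v hi="$3" '$3 - $2 > lo && $3 - $2 < hi' "$4" |
        shuf -n "$1"
}

# Sample data from the question, written to a temporary file.
data=$(mktemp)
cat > "$data" <<'EOF'
a 1 5
b 4 12
c 2 3
e 6 14
f 7 52
g 1 8
EOF

# Two random lines whose column difference is between 3 and 10 (exclusive).
sample_matching 2 3 10 "$data"
```

Because awk does the filtering before shuf sees anything, shuf only ever chooses among qualifying lines.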
Selecting a single value is easy (condition && rand() < 1 / ++n { selection = $0 }), but I'm not sure how to adapt that to selecting N, so yeah, shuf is probably the way to go.
– Kevin
Jun 22 '17 at 20:39
@Kevin ah, clever trick with the / ++n! However, the problem is that i) you also need to use srand() to set a random seed, else you always get the same output, and ii) you can sometimes get no output at all for a small file. I tried extending it to print two lines, but I still can't get past ii): awk 'BEGIN{srand()}{if($3-$2>3 && $3-$2 <10 && rand() < 1 / ++n){a[k++]=$0}if(k>1){print a[0]"\n"a[1]; exit}}' file and, in any case, that's more effort than it's worth, as you said.
– terdon♦
Jun 22 '17 at 20:48
I believe awk -v count=2 'BEGIN { srand() } $3 - $2 > 3 && $3 - $2 < 10 && rand() < count / ++n { if (n <= count) { s[n] = $0 } else { s[int(rand()*count)] = $0 } } END { for (i = 1; i <= count; i++) print s[i] }' file should work... but I'm not a statistician.
– Kevin
Jun 22 '17 at 21:14
@Kevin looks like it. Nice one! May as well make it into an answer. If only to show why we tend to use existing utilities when possible :)
– terdon♦
Jun 22 '17 at 21:16
sure, done. I did notice a bug in the comment, fixed in my answer.
– Kevin
Jun 22 '17 at 21:32
up vote
4
down vote
If you want a pure awk answer that only iterates through the list once:
awk -v count=2 'BEGIN { srand() } $3 - $2 > 3 && $3 - $2 < 10 && rand() < count / ++n { if (n <= count) { s[n] = $0 } else { s[1+int(rand()*count)] = $0 } } END { for (i in s) print s[i] }' input.txt
Stored in a file for easier reading:
BEGIN { srand() }
$3 - $2 > 3 &&
$3 - $2 < 10 &&
rand() < count / ++n {
    if (n <= count) {
        s[n] = $0
    } else {
        s[1+int(rand()*count)] = $0
    }
}
END {
    for (i in s) print s[i]
}
The algorithm is a slight variation on Knuth's algorithm R; I'm pretty sure the change doesn't alter the distribution but I'm not a statistician so I can't guarantee it.
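A short induction suggests why the distribution is unchanged. Write k for count and n for the number of matching lines seen so far (both symbols are shorthand introduced here, not part of the script). The first k matching lines are always stored, because count / n is at least 1 and rand() is below 1. Assume that after n >= k matching lines each of them is in the list with probability k/n. When matching line n+1 arrives:

```latex
\begin{align*}
P(\text{line } n{+}1 \text{ enters the list}) &= \frac{k}{n+1},\\
P(\text{a stored line survives line } n{+}1) &= 1 - \frac{k}{n+1}\cdot\frac{1}{k} = \frac{n}{n+1},\\
P(\text{an earlier line is in the list afterwards}) &= \frac{k}{n}\cdot\frac{n}{n+1} = \frac{k}{n+1}.
\end{align*}
```

So every matching line seen so far is held with the same probability k/(n+1), which is the property Algorithm R provides. Drawing "accept with probability k/n, then pick a slot uniformly" is the same event as Algorithm R's "pick an index uniformly in 1..n and replace only if it is at most k".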
Commented for those less familiar with awk:
# Before the first line is read...
BEGIN {
    # ...seed the random number generator.
    srand()
}
# For each line:
# if the difference between the second and third columns is between 3 and 10 (exclusive)...
$3 - $2 > 3 &&
$3 - $2 < 10 &&
# ... with a probability of (total rows to select) / (total matching rows so far) ...
rand() < count / ++n {
    # ... If we haven't reached the number of rows we need, just add it to our list
    if (n <= count) {
        s[n] = $0
    } else {
        # otherwise, replace a random entry in our list with the current line.
        s[1+int(rand()*count)] = $0
    }
}
# After all lines have been processed...
END {
    # Print all lines in our list.
    for (i in s) print s[i]
}
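One practical detail when trying a script like this repeatedly: srand() with no argument seeds from the time of day, which in common awk implementations has one-second resolution, so runs started within the same second can return the same sample. A sketch that takes the seed from the caller instead is shown below; the wrapper name reservoir, the variable name seed, and the temporary data file are assumptions made for illustration.

```shell
#!/bin/sh
# Sample data from the question, written to a temporary file.
data=$(mktemp)
cat > "$data" <<'EOF'
a 1 5
b 4 12
c 2 3
e 6 14
f 7 52
g 1 8
EOF

# reservoir SEED COUNT FILE  (hypothetical wrapper around the awk program above)
reservoir() {
    awk -v seed="$1" -v count="$2" '
        BEGIN { srand(seed) }    # caller-supplied seed instead of the clock
        $3 - $2 > 3 && $3 - $2 < 10 && rand() < count / ++n {
            if (n <= count) s[n] = $0
            else            s[1 + int(rand() * count)] = $0
        }
        END { for (i in s) print s[i] }
    ' "$3"
}

# Different seeds can give different samples even within the same second.
for seed in 1 2 3 4 5; do
    reservoir "$seed" 2 "$data"
    echo "--"
done
```

Passing the same seed twice reproduces the same sample, which is convenient for debugging; for real use the seed could come from any source that changes between runs.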
Could you please explain this witchcraft for those uninitiated in awk :)
– SumNeuron
Jun 22 '17 at 22:49
Added some explanation.
– Kevin
Jun 22 '17 at 23:19
up vote
2
down vote
Here's one way to do it in GNU awk (which supports custom sort routines):
#!/usr/bin/gawk -f
function mycmp(ia, va, ib, vb) {
    return rand() < 0.5 ? 0 : 1;
}
BEGIN {
    srand();
}
$3 - $2 > 3 && $3 - $2 < 10 {
    a[NR]=$0;
}
END {
    asort(a, b, "mycmp");
    for (i = 1; i < 3; i++) print b[i];
}
Testing with the given data:
$ for i in {1..6}; do printf 'Try %d:\n' $i; ../randsel.awk file; sleep 2; done
Try 1:
g 1 8
e 6 14
Try 2:
a 1 5
b 4 12
Try 3:
b 4 12
a 1 5
Try 4:
e 6 14
a 1 5
Try 5:
b 4 12
a 1 5
Try 6:
e 6 14
b 4 12
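A comparison function that returns a random result gives the sort inconsistent answers, so the resulting order is not guaranteed to be a uniform shuffle and can depend on the sort implementation. As an alternative sketch that also runs in non-GNU awk, the matching lines can be collected and then partially shuffled with a Fisher-Yates pass. The wrapper name pick, the variables count and seed, and the temporary data file are assumptions introduced here, not part of the answer.

```shell
#!/bin/sh
# Sample data from the question, written to a temporary file.
data=$(mktemp)
cat > "$data" <<'EOF'
a 1 5
b 4 12
c 2 3
e 6 14
f 7 52
g 1 8
EOF

# pick COUNT SEED FILE  (hypothetical helper)
pick() {
    awk -v count="$1" -v seed="$2" '
        BEGIN { srand(seed) }
        # Keep every matching line in a[1..m].
        $3 - $2 > 3 && $3 - $2 < 10 { a[++m] = $0 }
        END {
            if (count > m) count = m    # fewer matches than requested
            # Partial Fisher-Yates: settle positions 1..count one at a time.
            for (i = 1; i <= count; i++) {
                j = i + int(rand() * (m - i + 1))    # uniform index in i..m
                t = a[i]; a[i] = a[j]; a[j] = t
                print a[i]
            }
        }
    ' "$3"
}

pick 2 42 "$data"
```

Unlike the single-pass reservoir approach, this keeps all matching lines in memory, which is the same trade-off the asort version makes.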
up vote
0
down vote
Posting a perl solution, as I don't see any reason why it must be in awk (except for the OP's wish):
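#!/usr/bin/perl
use strict;
use warnings;
my $N = 2;    # number of lines to select
my $k;        # number of matching lines seen so far
my @r;        # the reservoir
while(<>) {
    my @line = split(/\s+/);
    if ($line[2] - $line[1] > 3 && $line[2] - $line[1] < 10) {
        if(++$k <= $N) {
            # Fill the reservoir with the first $N matching lines.
            push @r, $_;
        } elsif(rand(1) <= ($N/$k)) {
            # Later matches replace a random slot with probability $N/$k.
            $r[rand(@r)] = $_;
        }
    }
}
print @r;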
This is a classic example of reservoir sampling. The algorithm was copied from here and modified by me to meet OP's specific wishes.
When saved in file reservoir.pl you run it with ./reservoir.pl file1 file2 file3 or cat file1 file2 file3 | ./reservoir.pl.
edited Jun 22 '17 at 21:17
answered Jun 22 '17 at 20:38
terdon♦
edited Jun 22 '17 at 23:19
answered Jun 22 '17 at 21:23
Kevin
answered Jun 22 '17 at 23:09
steeldriver
answered Jun 23 '17 at 11:00
styrofoam fly