AWK: get random lines of file satisfying a condition?


I am trying to get a set number of random lines that satisfy a condition.

e.g. if my file was:

a    1    5

b    4    12

c    2    3

e    6    14

f    7    52

g    1    8

then I would like exactly two random lines where the difference between column 3 and column 2 is greater than 3 but less than 10 (e.g. lines starting with a, b, e, and g would qualify)

How would I approach this?

awk (if something and random) '{print $1,$2,$3}'

edited Nov 25 at 23:37

Rui F Ribeiro

38.3k1477127

asked Jun 22 '17 at 20:22

SumNeuron

1312


4 Answers

You can do this in awk but getting the random selection of lines will be complex and will require writing quite a bit of code. I would instead use awk to get the lines that match your criteria and then use the standard tool shuf to choose a random selection:

$ awk '$3-$2>3 && $3-$2 < 10' file | shuf -n2

g    1    8

a    1    5

If you run this a few times, you'll see you get a random selection of lines:

$ for i in {1..5}; do awk '$3-$2>3 && $3-$2 < 10' file | shuf -n2; echo "--";  done

g    1    8

e    6    14

--

g    1    8

e    6    14

--

b    4    12

g    1    8

--

b    4    12

e    6    14

--

e    6    14

b    4    12

--

The shuf tool is part of the GNU coreutils, so it should be installed by default on most any Linux system and easily available for most any *nix.
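Where shuf is not available, the same filter-then-pick idea can be sketched with tools that are part of POSIX. This is only an illustration, not something from the answer above: the function name pick_random is made up, and the approach (prefix each matching line with a random key, sort on it, keep the first N, strip the key) is a swapped-in technique rather than what shuf does internally.

```sh
# pick_random N FILE (hypothetical helper): print N random lines of FILE
# whose ($3 - $2) lies strictly between 3 and 10.
pick_random() {
    n=$1
    file=$2
    # 1. awk keeps the matching lines and prefixes each with a random key and a tab.
    # 2. sort orders the lines by that key.
    # 3. head keeps the first N lines.
    # 4. cut drops the key again.
    awk '
        BEGIN { srand() }
        $3 - $2 > 3 && $3 - $2 < 10 { printf "%.17f\t%s\n", rand(), $0 }
    ' "$file" |
        LC_ALL=C sort -n |
        head -n "$n" |
        cut -f2-
}
```

Calling pick_random 2 file should print two of the a, b, e and g lines. Because srand() without an argument is seeded from the clock, two calls within the same second may return the same pair in some awk implementations.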

edited Jun 22 '17 at 21:17

answered Jun 22 '17 at 20:38

terdon♦

127k31244421


Selecting a single value is easy (condition && rand() < 1 / ++n { selection = $0 }), but I'm not sure how to adapt that to selecting N, so yeah, shuf is probably the way to go.
– Kevin
Jun 22 '17 at 20:39

@Kevin ah, clever trick with the / ++n! However, the problem is that i) you need to also use srand() to set a random seed else you always get the same output and ii) you can sometimes get no output at all for a small file. I tried extending it to print two lines, but I still can't get past ii): awk 'BEGIN{srand()}{if($3-$2>3 && $3-$2 <10 && rand() < 1 / ++n){a[k++]=$0}if(k>1){print a[0]"\n"a[1]; exit}}' file and, in any case, that's more effort than it's worth, as you said.
– terdon♦
Jun 22 '17 at 20:48


I believe awk -v count=2 'BEGIN { srand() } $3 - $2 > 3 && $3 - $2 < 10 && rand() < count / ++n { if (n <= count) { s[n] = $0 } else { s[int(rand()*count)] = $0 } } END { for (i = 1; i <= count; i++) print s[i] }' file should work... but I'm not a statistician.
– Kevin
Jun 22 '17 at 21:14

@Kevin looks like it. Nice one! May as well make it into an answer. If only to show why we tend to use existing utilities when possible :)
– terdon♦
Jun 22 '17 at 21:16

sure, done. I did notice a bug in the comment, fixed in my answer.
– Kevin
Jun 22 '17 at 21:32


If you want a pure awk answer that only iterates through the list once:

awk -v count=2 'BEGIN { srand() } $3 - $2 > 3 && $3 - $2 < 10 && rand() < count / ++n { if (n <= count) { s[n] = $0 } else { s[1+int(rand()*count)] = $0 } } END { for (i in s) print s[i] }' input.txt

Stored in a file for easier reading:

BEGIN { srand() }

$3 - $2 > 3 &&

$3 - $2 < 10 &&

rand() < count / ++n {

    if (n <= count) {

        s[n] = $0 

    } else { 

        s[1+int(rand()*count)] = $0 

    } 

} 

END { 

    for (i in s) print s[i] 

}

The algorithm is a slight variation on Knuth's algorithm R; I'm pretty sure the change doesn't alter the distribution but I'm not a statistician so I can't guarantee it.
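For anyone who would rather test the distribution than reason about it, here is a rough empirical check. It is a sketch under stated assumptions: the function name check_reservoir and the parameters count, m, trials and seed are invented for the illustration, and the m "lines" are just the numbers 1..m, all treated as matching. It replays the same accept-and-replace rule many times and reports how often each line survives; for a uniform sample each fraction should come out close to count / m.

```sh
# check_reservoir [COUNT] [M] [TRIALS] [SEED] (hypothetical helper):
# simulate the single-pass selection and print how often each of M
# matching lines ends up in the final sample of COUNT lines.
check_reservoir() {
    awk -v count="${1:-2}" -v m="${2:-4}" -v trials="${3:-20000}" -v seed="${4:-1}" '
    BEGIN {
        srand(seed)
        for (t = 1; t <= trials; t++) {
            n = 0
            split("", s)                      # empty the reservoir
            for (line = 1; line <= m; line++) {
                # Same rule as in the answer: accept with probability count / n ...
                if (rand() < count / ++n) {
                    if (n <= count)
                        s[n] = line                          # ... fill phase
                    else
                        s[1 + int(rand() * count)] = line    # ... replace a random slot
                }
            }
            for (i in s) hits[s[i]]++
        }
        # A uniform sample keeps every line in about count / m of the trials.
        for (line = 1; line <= m; line++)
            printf "line %d kept in %.3f of trials (expected %.3f)\n",
                   line, hits[line] / trials, count / m
    }'
}
```

Running check_reservoir 2 4 prints one row per line. A simulation like this cannot prove uniformity, but a fraction that sits far from the expected value would point to a bug.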

Commented for those less familiar with awk:

# Before the first line is read...

BEGIN { 

    # ...seed the random number generator.

    srand() 

}



# For each line:

# if the difference between the second and third columns is between 3 and 10 (exclusive)...

$3 - $2 > 3 &&

$3 - $2 < 10 &&

# ... with a probability of (total rows to select) / (total matching rows so far) ...

rand() < count / ++n {

    # ... If we haven't reached the number of rows we need, just add it to our list

    if (n <= count) {

        s[n] = $0 

    } else {

        # otherwise, replace a random entry in our list with the current line.

        s[1+int(rand()*count)] = $0 

    } 

} 



# After all lines have been processed...

END { 

    # Print all lines in our list.

    for (i in s) print s[i] 

}
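One practical caveat with srand() and no argument: POSIX awk seeds it from the time of day, which has a resolution of one second in common implementations, so two runs started within the same second may select the same lines. A possible workaround, sketched here with names that are not part of the answer (the wrapper sample_matching and its optional SEED argument are hypothetical, and reading /dev/urandom is an assumption about the system), is to pass the seed in explicitly:

```sh
# sample_matching COUNT FILE [SEED] (hypothetical wrapper around the
# single-pass script above, with an explicit seed).
sample_matching() {
    count=$1
    file=$2
    # Assumption: /dev/urandom exists; any other source of a fresh integer would do.
    seed=${3:-$(od -An -N4 -tu4 /dev/urandom | tr -dc '0-9')}
    awk -v count="$count" -v seed="$seed" '
        BEGIN { srand(seed) }
        $3 - $2 > 3 && $3 - $2 < 10 && rand() < count / ++n {
            if (n <= count)
                s[n] = $0                          # still filling the list
            else
                s[1 + int(rand() * count)] = $0    # replace a random entry
        }
        END { for (i in s) print s[i] }
    ' "$file"
}
```

sample_matching 2 input.txt draws a fresh seed on every call, while sample_matching 2 input.txt 42 repeats the same selection each time, which can be handy when debugging.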

edited Jun 22 '17 at 23:19

answered Jun 22 '17 at 21:23

Kevin

26.7k106198

Could you please explain this witchcraft for those uninitiated in awk :)
– SumNeuron
Jun 22 '17 at 22:49


Added some explanation.
– Kevin
Jun 22 '17 at 23:19


Here's one way to do it in GNU awk (which supports custom sort routines):

#!/usr/bin/gawk -f



function mycmp(ia, va, ib, vb) {

  return rand() < 0.5 ? 0 : 1;

}



BEGIN {

  srand();

}



$3 - $2 > 3 && $3 - $2 < 10 {

  a[NR]=$0;

} 



END {

  asort(a, b, "mycmp");

  for (i = 1; i < 3; i++) print b[i];

}

Testing with the given data:

$ for i in {1..6}; do printf 'Try %d:\n' $i; ../randsel.awk file; sleep 2; done

Try 1:

g    1    8

e    6    14

Try 2:

a    1    5

b    4    12

Try 3:

b    4    12

a    1    5

Try 4:

e    6    14

a    1    5

Try 5:

b    4    12

a    1    5

Try 6:

e    6    14

b    4    12
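A comparison function that answers at random does not describe a consistent ordering, so the resulting order depends on how the sort happens to be implemented and is not guaranteed to be a uniform shuffle. As an alternative sketch (not the method used in the script above), a partial Fisher-Yates shuffle picks the lines directly and needs no gawk extensions. The function name shuffle_pick and its optional SEED argument are made up for the illustration.

```sh
# shuffle_pick N FILE [SEED] (hypothetical helper): collect the matching
# lines, then draw N of them with a partial Fisher-Yates shuffle.
shuffle_pick() {
    awk -v want="$1" -v seed="${3:-}" '
        BEGIN {
            if (seed != "") srand(seed)
            else srand()
        }
        # Collect the matching lines in a[1..m].
        $3 - $2 > 3 && $3 - $2 < 10 { a[++m] = $0 }
        END {
            want += 0; m += 0
            if (want > m) want = m          # fewer matches than requested: print them all
            # Only positions 1..want need to be settled.
            for (i = 1; i <= want; i++) {
                j = i + int(rand() * (m - i + 1))   # uniform index in i..m
                tmp = a[i]; a[i] = a[j]; a[j] = tmp
                print a[i]
            }
        }
    ' "$2"
}
```

shuffle_pick 2 file prints two distinct matching lines. Like the gawk script, it holds every matching line in memory, so the single-pass approach in the other answer remains the better fit for very large inputs.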

answered Jun 22 '17 at 23:09

steeldriver

33.7k34983


Posting a perl solution, as I don't see any reason why it must be in awk (except for the OP's wish):

#!/usr/bin/perl



use strict;

use warnings;

my $N = 2;

my $k;

my @r;



while(<>) {

    my @line = split(/\s+/);

    if ($line[2] - $line[1] > 3 && $line[2] - $line[1] < 10) {

        if(++$k <= $N) {

            push @r, $_;

        } elsif(rand(1) <= ($N/$k)) {

            $r[rand(@r)] = $_;

        }

    }

}



print @r;

This is a classic example of reservoir sampling. The algorithm was copied from here and modified by me to meet OP's specific wishes.

When saved in file reservoir.pl you run it with ./reservoir.pl file1 file2 file3 or cat file1 file2 file3 | ./reservoir.pl.

answered Jun 23 '17 at 11:00

styrofoam fly

424311


