AWK: get random lines of file satisfying a condition?























I am trying to get a set number of random lines that satisfy a condition.



e.g. if my file was:



a    1    5
b 4 12
c 2 3
e 6 14
f 7 52
g 1 8


then I would like exactly two random lines where the difference between column 3 and column 2 is greater than 3 but less than 10 (e.g. lines starting with a, b, e, and g would qualify)



How would I approach this?



awk (if something and random) '{print $1,$2,$3}'










      asked Jun 22 '17 at 20:22 by SumNeuron
      edited Nov 25 at 23:37 by Rui F Ribeiro






















          4 Answers













          You can do this in awk but getting the random selection of lines will be complex and will require writing quite a bit of code. I would instead use awk to get the lines that match your criteria and then use the standard tool shuf to choose a random selection:



          $ awk '$3-$2>3 && $3-$2 < 10' file | shuf -n2
          g 1 8
          a 1 5


          If you run this a few times, you'll see you get a random selection of lines:



          $ for i in {1..5}; do awk '$3-$2>3 && $3-$2 < 10' file | shuf -n2; echo "--";  done
          g 1 8
          e 6 14
          --
          g 1 8
          e 6 14
          --
          b 4 12
          g 1 8
          --
          b 4 12
          e 6 14
          --
          e 6 14
          b 4 12
          --


          The shuf tool is part of GNU coreutils, so it should be installed by default on almost any Linux system and is easy to install on most other Unix-like systems.
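As a small sketch of the same pipeline with the thresholds and the sample size pulled out as parameters: the function name `sample_matching` and the variable names `lo` and `hi` are illustrative and not part of the answer above, and the sketch assumes GNU `shuf` is on the PATH.

```shell
# Hypothetical helper (not from the answer): print up to N random lines of
# FILE whose (column 3 - column 2) lies strictly between LO and HI.
# Usage: sample_matching N LO HI FILE
sample_matching() {
    n=$1 lo=$2 hi=$3 file=$4
    # awk filters the qualifying lines; shuf picks N of them at random.
    awk -v lo="$lo" -v hi="$hi" '($3 - $2) > lo && ($3 - $2) < hi' "$file" |
        shuf -n "$n"
}

# Example input taken from the question.
printf '%s\n' 'a 1 5' 'b 4 12' 'c 2 3' 'e 6 14' 'f 7 52' 'g 1 8' > sample.txt

sample_matching 2 3 10 sample.txt
```

If fewer than N lines qualify, `shuf -n` simply prints all of them, so the caller may want to check the line count when "exactly N" matters.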

























          • Selecting a single value is easy (condition && rand() < 1 / ++n { selection = $0 }), but I'm not sure how to adapt that to selecting N, so yeah, shuf is probably the way to go.
            – Kevin
            Jun 22 '17 at 20:39












          • @Kevin ah, clever trick with the / ++n! However, the problem is that i) you need to also use srand() to set a random seed else you always get the same output and ii) you can sometimes get no output at all for a small file. I tried extending it to print two lines, but I still can't get past ii): awk 'BEGIN{srand()}{if($3-$2>3 && $3-$2 <10 && rand() < 1 / ++n){a[k++]=$0}if(k>1){print a[0]"\n"a[1]; exit}}' file and, in any case, that's more effort than it's worth, as you said.
            – terdon
            Jun 22 '17 at 20:48








          • I believe awk -v count=2 'BEGIN { srand() } $3 - $2 > 3 && $3 - $2 < 10 && rand() < count / ++n { if (n <= count) { s[n] = $0 } else { s[int(rand()*count)] = $0 } } END { for (i = 1; i <= count; i++) print s[i] }' file should work... but I'm not a statistician.
            – Kevin
            Jun 22 '17 at 21:14












          • @Kevin looks like it. Nice one! May as well make it into an answer. If only to show why we tend to use existing utilities when possible :)
            – terdon
            Jun 22 '17 at 21:16










          • sure, done. I did notice a bug in the comment, fixed in my answer.
            – Kevin
            Jun 22 '17 at 21:32































          If you want a pure awk answer that only iterates through the list once:



          awk -v count=2 'BEGIN { srand() } $3 - $2 > 3 && $3 - $2 < 10 && rand() < count / ++n { if (n <= count) { s[n] = $0 } else { s[1+int(rand()*count)] = $0 } } END { for (i in s) print s[i] }' input.txt


          Stored in a file for easier reading:



          BEGIN { srand() }
          $3 - $2 > 3 &&
          $3 - $2 < 10 &&
          rand() < count / ++n {
              if (n <= count) {
                  s[n] = $0
              } else {
                  s[1+int(rand()*count)] = $0
              }
          }
          END {
              for (i in s) print s[i]
          }


          The algorithm is a slight variation on Knuth's algorithm R; I'm pretty sure the change doesn't alter the distribution but I'm not a statistician so I can't guarantee it.
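For what it's worth, a short induction argument suggests the per-line inclusion probability is unchanged. The notation is introduced here and is not in the answer: $k$ stands for `count` and $n$ for the number of matching lines seen so far. It assumes `rand()` is uniform on $[0,1)$ and that successive calls are independent.

```latex
% Claim: after n >= k matching lines, each of them is in the list with probability k/n.
% Base case n = k: all k lines are stored, so the probability is 1 = k/k.
% Inductive step from n to n+1:
\begin{align*}
P(\text{line } n{+}1 \text{ is stored}) &= \frac{k}{n+1}, \\
P(\text{a stored line survives step } n{+}1)
  &= 1 - \frac{k}{n+1}\cdot\frac{1}{k} = \frac{n}{n+1}, \\
P(\text{an earlier line is stored after step } n{+}1)
  &= \frac{k}{n}\cdot\frac{n}{n+1} = \frac{k}{n+1}.
\end{align*}
```

This only checks the marginal probability for each individual line; it is not a full proof that every set of $k$ lines is equally likely.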



          Commented for those less familiar with awk:



          # Before the first line is read...
          BEGIN {
              # ...seed the random number generator.
              srand()
          }

          # For each line:
          # if the difference between the second and third columns is between 3 and 10 (exclusive)...
          $3 - $2 > 3 &&
          $3 - $2 < 10 &&
          # ... with a probability of (total rows to select) / (total matching rows so far) ...
          rand() < count / ++n {
              # ... if we haven't reached the number of rows we need, just add it to our list
              if (n <= count) {
                  s[n] = $0
              } else {
                  # otherwise, replace a random entry in our list with the current line.
                  s[1+int(rand()*count)] = $0
              }
          }

          # After all lines have been processed...
          END {
              # Print all lines in our list.
              for (i in s) print s[i]
          }
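One practical caveat: `srand()` with no argument seeds from the time of day, so in many awk implementations two runs within the same second can return the same selection. A minimal sketch that passes the seed in explicitly follows; the wrapper name `reservoir` and the `seed` variable are illustrative additions, not part of the answer.

```shell
# Hypothetical wrapper around the reservoir-sampling program above.
# Usage: reservoir COUNT SEED FILE
reservoir() {
    awk -v count="$1" -v seed="$2" '
        # Use the caller-supplied seed instead of the time of day.
        BEGIN { srand(seed) }
        $3 - $2 > 3 && $3 - $2 < 10 && rand() < count / ++n {
            if (n <= count) {
                s[n] = $0                          # fill the list first
            } else {
                s[1 + int(rand() * count)] = $0    # then overwrite a random slot
            }
        }
        END { for (i in s) print s[i] }
    ' "$3"
}

# Example input taken from the question.
printf '%s\n' 'a 1 5' 'b 4 12' 'c 2 3' 'e 6 14' 'f 7 52' 'g 1 8' > sample.txt

# Different seeds may give different selections; the same seed with the
# same awk implementation reproduces the same one.
reservoir 2 1 sample.txt
reservoir 2 2 sample.txt
```

When the multi-line program is saved to its own file instead (say `reservoir.awk`, a name chosen here for illustration), it can be run with `awk -v count=2 -f reservoir.awk input.txt`.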




























          • Could you please explain this witchcraft for those uninitiated in awk :)
            – SumNeuron
            Jun 22 '17 at 22:49






          • Added some explanation.
            – Kevin
            Jun 22 '17 at 23:19































          Here's one way to do it in GNU awk (which supports custom sort routines):



          #!/usr/bin/gawk -f

          function mycmp(ia, va, ib, vb) {
              return rand() < 0.5 ? 0 : 1;
          }

          BEGIN {
              srand();
          }

          $3 - $2 > 3 && $3 - $2 < 10 {
              a[NR]=$0;
          }

          END {
              asort(a, b, "mycmp");
              for (i = 1; i < 3; i++) print b[i];
          }


          Testing with the given data:



          $ for i in {1..6}; do printf 'Try %d:\n' $i; ../randsel.awk file; sleep 2; done
          Try 1:
          g 1 8
          e 6 14
          Try 2:
          a 1 5
          b 4 12
          Try 3:
          b 4 12
          a 1 5
          Try 4:
          e 6 14
          a 1 5
          Try 5:
          b 4 12
          a 1 5
          Try 6:
          e 6 14
          b 4 12
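Sorting with a random comparator is compact, but it may not give every matching line an equal chance of being picked, and it depends on gawk. As a hedged alternative (a different technique from the one in this answer), the sketch below collects the matching lines and runs only the first few steps of a Fisher-Yates shuffle, which should work in any POSIX awk. The function name `pick_fy` and the `seed` parameter are illustrative.

```shell
# Hypothetical alternative to the random comparator: collect matching lines,
# then run the first COUNT steps of a Fisher-Yates shuffle.
# Usage: pick_fy COUNT SEED FILE
pick_fy() {
    awk -v count="$1" -v seed="$2" '
        BEGIN { srand(seed) }
        # Keep every matching line in m[1..n].
        $3 - $2 > 3 && $3 - $2 < 10 { m[++n] = $0 }
        END {
            # Cannot pick more lines than matched.
            if (count > n) count = n
            for (i = 1; i <= count; i++) {
                # Uniform index in i..n, then swap that line into position i.
                j = i + int(rand() * (n - i + 1))
                tmp = m[i]; m[i] = m[j]; m[j] = tmp
                print m[i]
            }
        }
    ' "$3"
}

# Example input taken from the question.
printf '%s\n' 'a 1 5' 'b 4 12' 'c 2 3' 'e 6 14' 'f 7 52' 'g 1 8' > sample.txt

pick_fy 2 42 sample.txt
```

Like the gawk version, this keeps every matching line in memory, so the single-pass reservoir approach is still the better fit for very large inputs.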


















            Posting a Perl solution, as I don't see any reason why it must be in awk (other than the OP's wish):



            #!/usr/bin/perl

            use strict;
            use warnings;
            my $N = 2;
            my $k;
            my @r;

            while(<>) {
                my @line = split(/\s+/);
                if ($line[2] - $line[1] > 3 && $line[2] - $line[1] < 10) {
                    if(++$k <= $N) {
                        push @r, $_;
                    } elsif(rand(1) <= ($N/$k)) {
                        $r[rand(@r)] = $_;
                    }
                }
            }

            print @r;


            This is a classic example of reservoir sampling. The algorithm was copied from here and modified by me to meet OP's specific wishes.



            When saved in file reservoir.pl you run it with ./reservoir.pl file1 file2 file3 or cat file1 file2 file3 | ./reservoir.pl.






              First answer (awk piped to shuf): answered Jun 22 '17 at 20:38 by terdon, edited Jun 22 '17 at 21:17








              • 2




                Selecting a single value is easy (condition && rand() < 1 / ++n { selection = $0 }), but I'm not sure how to adapt that to selecting N, so yeah, shuf is probably the way to go.
                – Kevin
                Jun 22 '17 at 20:39












              • @Kevin ah, clever trick with the / ++n! However, the problem is that i) you need to also use srand() to set a random seed else you always get the same output and ii) you can sometimes get no output at all for a small file. I tried extending it to print two lines, but I still can't get past ii): awk 'BEGIN{srand()}{if($3-$2>3 && $3-$2 <10 && rand() < 1 / ++n){a[k++]=$0}if(k>1){print a[0]"n"a[1]; exit}}' file and, in any case, that's more effort than it's worth, as you said.
                – terdon
                Jun 22 '17 at 20:48








              • 1




                I believe awk -v count=2 'BEGIN { srand() } $3 - $2 > 3 && $3 - $2 < 10 && rand() < count / ++n { if (n <= count) { s[n] = $0 } else { s[int(rand()*count)] = $0 } } END { for (i = 1; i <= count; i++) print s[i] }' file should work... but I'm not a statistician.
                – Kevin
                Jun 22 '17 at 21:14












              • @Kevin looks like it. Nice one! May as well make it into an answer. If only to show why we tend to use existing utilities when possible :)
                – terdon
                Jun 22 '17 at 21:16










              • sure, done. I did notice a bug in the comment, fixed in my answer.
                – Kevin
                Jun 22 '17 at 21:32














              • 2




                Selecting a single value is easy (condition && rand() < 1 / ++n { selection = $0 }), but I'm not sure how to adapt that to selecting N, so yeah, shuf is probably the way to go.
                – Kevin
                Jun 22 '17 at 20:39












              • @Kevin ah, clever trick with the / ++n! However, the problem is that i) you need to also use srand() to set a random seed else you always get the same output and ii) you can sometimes get no output at all for a small file. I tried extending it to print two lines, but I still can't get past ii): awk 'BEGIN{srand()}{if($3-$2>3 && $3-$2 <10 && rand() < 1 / ++n){a[k++]=$0}if(k>1){print a[0]"n"a[1]; exit}}' file and, in any case, that's more effort than it's worth, as you said.
                – terdon
                Jun 22 '17 at 20:48








              • 1




                I believe awk -v count=2 'BEGIN { srand() } $3 - $2 > 3 && $3 - $2 < 10 && rand() < count / ++n { if (n <= count) { s[n] = $0 } else { s[int(rand()*count)] = $0 } } END { for (i = 1; i <= count; i++) print s[i] }' file should work... but I'm not a statistician.
                – Kevin
                Jun 22 '17 at 21:14












              • @Kevin looks like it. Nice one! May as well make it into an answer. If only to show why we tend to use existing utilities when possible :)
                – terdon
                Jun 22 '17 at 21:16










              • sure, done. I did notice a bug in the comment, fixed in my answer.
                – Kevin
                Jun 22 '17 at 21:32








              2




              2




              Selecting a single value is easy (condition && rand() < 1 / ++n { selection = $0 }), but I'm not sure how to adapt that to selecting N, so yeah, shuf is probably the way to go.
              – Kevin
              Jun 22 '17 at 20:39






              Selecting a single value is easy (condition && rand() < 1 / ++n { selection = $0 }), but I'm not sure how to adapt that to selecting N, so yeah, shuf is probably the way to go.
              – Kevin
              Jun 22 '17 at 20:39














              @Kevin ah, clever trick with the / ++n! However, the problem is that i) you need to also use srand() to set a random seed else you always get the same output and ii) you can sometimes get no output at all for a small file. I tried extending it to print two lines, but I still can't get past ii): awk 'BEGIN{srand()}{if($3-$2>3 && $3-$2 <10 && rand() < 1 / ++n){a[k++]=$0}if(k>1){print a[0]"n"a[1]; exit}}' file and, in any case, that's more effort than it's worth, as you said.
              – terdon
              Jun 22 '17 at 20:48






              @Kevin ah, clever trick with the / ++n! However, the problem is that i) you need to also use srand() to set a random seed else you always get the same output and ii) you can sometimes get no output at all for a small file. I tried extending it to print two lines, but I still can't get past ii): awk 'BEGIN{srand()}{if($3-$2>3 && $3-$2 <10 && rand() < 1 / ++n){a[k++]=$0}if(k>1){print a[0]"n"a[1]; exit}}' file and, in any case, that's more effort than it's worth, as you said.
              – terdon
              Jun 22 '17 at 20:48






              1




              1




              I believe awk -v count=2 'BEGIN { srand() } $3 - $2 > 3 && $3 - $2 < 10 && rand() < count / ++n { if (n <= count) { s[n] = $0 } else { s[int(rand()*count)] = $0 } } END { for (i = 1; i <= count; i++) print s[i] }' file should work... but I'm not a statistician.
              – Kevin
              Jun 22 '17 at 21:14






              I believe awk -v count=2 'BEGIN { srand() } $3 - $2 > 3 && $3 - $2 < 10 && rand() < count / ++n { if (n <= count) { s[n] = $0 } else { s[int(rand()*count)] = $0 } } END { for (i = 1; i <= count; i++) print s[i] }' file should work... but I'm not a statistician.
              – Kevin
              Jun 22 '17 at 21:14














              @Kevin looks like it. Nice one! May as well make it into an answer. If only to show why we tend to use existing utilities when possible :)
              – terdon
              Jun 22 '17 at 21:16




              @Kevin looks like it. Nice one! May as well make it into an answer. If only to show why we tend to use existing utilities when possible :)
              – terdon
              Jun 22 '17 at 21:16












              sure, done. I did notice a bug in the comment, fixed in my answer.
              – Kevin
              Jun 22 '17 at 21:32




              sure, done. I did notice a bug in the comment, fixed in my answer.
              – Kevin
              Jun 22 '17 at 21:32












              up vote
              4
              down vote













              If you want a pure awk answer that only iterates through the list once:



              awk -v count=2 'BEGIN { srand() } $3 - $2 > 3 && $3 - $2 < 10 && rand() < count / ++n { if (n <= count) { s[n] = $0 } else { s[1+int(rand()*count)] = $0 } } END { for (i in s) print s[i] }' input.txt


              Stored in a file for easier reading:



              BEGIN { srand() }
              $3 - $2 > 3 &&
              $3 - $2 < 10 &&
              rand() < count / ++n {
              if (n <= count) {
              s[n] = $0
              } else {
              s[1+int(rand()*count)] = $0
              }
              }
              END {
              for (i in s) print s[i]
              }


              The algorithm is a slight variation on Knuth's algorithm R; I'm pretty sure the change doesn't alter the distribution but I'm not a statistician so I can't guarantee it.



              Commented for those less familiar with awk:



              # Before the first line is read...
              BEGIN {
              # ...seed the random number generator.
              srand()
              }

              # For each line:
              # if the difference between the second and third columns is between 3 and 10 (exclusive)...
              $3 - $2 > 3 &&
              $3 - $2 < 10 &&
              # ... with a probability of (total rows to select) / (total matching rows so far) ...
              rand() < count / ++n {
              # ... If we haven't reached the number of rows we need, just add it to our list
              if (n <= count) {
              s[n] = $0
              } else {
              # otherwise, replace a random entry in our list with the current line.
              s[1+int(rand()*count)] = $0
              }
              }

              # After all lines have been processed...
              END {
              # Print all lines in our list.
              for (i in s) print s[i]
              }





              share|improve this answer























              • Could you please explain this witchcraft for those uninitiated in awk :)
                – SumNeuron
                Jun 22 '17 at 22:49






              • 1




                Added some explanation.
                – Kevin
                Jun 22 '17 at 23:19















              up vote
              4
              down vote













              If you want a pure awk answer that only iterates through the list once:



              awk -v count=2 'BEGIN { srand() } $3 - $2 > 3 && $3 - $2 < 10 && rand() < count / ++n { if (n <= count) { s[n] = $0 } else { s[1+int(rand()*count)] = $0 } } END { for (i in s) print s[i] }' input.txt


              Stored in a file for easier reading:



              BEGIN { srand() }
              $3 - $2 > 3 &&
              $3 - $2 < 10 &&
              rand() < count / ++n {
              if (n <= count) {
              s[n] = $0
              } else {
              s[1+int(rand()*count)] = $0
              }
              }
              END {
              for (i in s) print s[i]
              }


              The algorithm is a slight variation on Knuth's algorithm R; I'm pretty sure the change doesn't alter the distribution but I'm not a statistician so I can't guarantee it.



              Commented for those less familiar with awk:



              # Before the first line is read...
              BEGIN {
              # ...seed the random number generator.
              srand()
              }

              # For each line:
              # if the difference between the second and third columns is between 3 and 10 (exclusive)...
              $3 - $2 > 3 &&
              $3 - $2 < 10 &&
              # ... with a probability of (total rows to select) / (total matching rows so far) ...
              rand() < count / ++n {
              # ... If we haven't reached the number of rows we need, just add it to our list
              if (n <= count) {
              s[n] = $0
              } else {
              # otherwise, replace a random entry in our list with the current line.
              s[1+int(rand()*count)] = $0
              }
              }

              # After all lines have been processed...
              END {
              # Print all lines in our list.
              for (i in s) print s[i]
              }





              share|improve this answer























              • Could you please explain this witchcraft for those uninitiated in awk :)
                – SumNeuron
                Jun 22 '17 at 22:49






              • 1




                Added some explanation.
                – Kevin
                Jun 22 '17 at 23:19





















              edited Jun 22 '17 at 23:19

























              answered Jun 22 '17 at 21:23









              Kevin























              up vote
              2
              down vote













              Here's one way to do it in GNU awk (which supports custom sort routines):



              #!/usr/bin/gawk -f

              function mycmp(ia, va, ib, vb) {
                  return rand() < 0.5 ? 0 : 1;
              }

              BEGIN {
                  srand();
              }

              $3 - $2 > 3 && $3 - $2 < 10 {
                  a[NR]=$0;
              }

              END {
                  asort(a, b, "mycmp");
                  for (i = 1; i < 3; i++) print b[i];
              }


              Testing with the given data:



              $ for i in {1..6}; do printf 'Try %d:\n' $i; ../randsel.awk file; sleep 2; done
              Try 1:
              g 1 8
              e 6 14
              Try 2:
              a 1 5
              b 4 12
              Try 3:
              b 4 12
              a 1 5
              Try 4:
              e 6 14
              a 1 5
              Try 5:
              b 4 12
              a 1 5
              Try 6:
              e 6 14
              b 4 12
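One caveat with the approach above: a comparison function that returns random results (and here never a negative value) is not guaranteed to make every ordering equally likely, since the outcome depends on how the sort algorithm happens to use the comparisons. As a hedged alternative, and not the method of the answer, here is a sketch that swaps in a partial Fisher-Yates shuffle. It needs no gawk extensions and prints fewer lines rather than blank ones when there are fewer matches than requested. The function name `pick_shuffled` and its argument order are assumptions made for this example.

```shell
# Hypothetical helper: pick_shuffled COUNT [FILE...]
# Prints up to COUNT randomly chosen lines that satisfy the condition.
pick_shuffled() {
    count=$1
    shift
    awk -v count="$count" '
        BEGIN { srand() }
        # Keep every line that satisfies the condition from the question.
        $3 - $2 > 3 && $3 - $2 < 10 { a[++n] = $0 }
        END {
            # Never ask for more lines than were collected.
            if (count > n) count = n
            # Partial Fisher-Yates shuffle: settle positions 1..count in turn.
            for (i = 1; i <= count; i++) {
                j = i + int(rand() * (n - i + 1))   # uniform index in i..n
                tmp = a[i]; a[i] = a[j]; a[j] = tmp
                print a[i]
            }
        }' "$@"
}

# Example run on the sample data from the question.
pick_shuffled 2 <<'EOF'
a    1    5
b 4 12
c 2 3
e 6 14
f 7 52
g 1 8
EOF
```

Like the asort version, this stores all matching lines in memory, so for very large inputs the single-pass reservoir approach is the better fit.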








































                  answered Jun 22 '17 at 23:09









                  steeldriver























                      up vote
                      0
                      down vote













                      Posting a perl solution, as I don't see any reason why it must be in awk (except for the OP's wish):



                      #!/usr/bin/perl

                      use strict;
                      use warnings;
                      my $N = 2;
                      my $k;
                      my @r;

                      while(<>) {
                          my @line = split(/\s+/);
                          if ($line[2] - $line[1] > 3 && $line[2] - $line[1] < 10) {
                              if(++$k <= $N) {
                                  push @r, $_;
                              } elsif(rand(1) <= ($N/$k)) {
                                  $r[rand(@r)] = $_;
                              }
                          }
                      }

                      print @r;


                      This is a classic example of reservoir sampling. The algorithm was copied from here and modified by me to meet OP's specific wishes.



                      Save it as reservoir.pl, make it executable (chmod +x reservoir.pl), and run it with ./reservoir.pl file1 file2 file3 or cat file1 file2 file3 | ./reservoir.pl.









































                          answered Jun 23 '17 at 11:00









                          styrofoam fly






























