Random mutagenesis with bash












4














I have a string e.g.



1234567890


and I want to replace random positions in that string with the character at the corresponding position of a randomly chosen sequence from another set of strings, e.g.



ABCDEFGHIJ
KLMNOPQRST
UVWXYZABCD
...


If I choose to make 3 replacements, the script should choose 3 random positions, e.g. 3, 7, 8, and 3 random sequences, e.g. 1, 1, 3, then make the replacements to generate the expected output:



12C456GB90


Is there a way to do this without significant looping? I wrote a simple bash script that generates a random position and a random sequence line, does one replacement, and then repeats the process on the output, again and again. This works perfectly, but in my real-life files (much larger than the examples) I want to make 10,000 or more replacements. I will also need to do this multiple times to generate multiple 'mutated' variant sequences.



EDIT: At the moment I am using something like this:



# choose a random number between 1 and the number of characters in the string
randomposition=$(jot -r 1 1 "$seqpositions")
# choose a random number between 1 and the number of lines in the set of potential replacement strings
randomline=$(jot -r 1 1 "$alignlines")
# find the character at randomline:randomposition
newAA=$(sed -n "${randomline}p" "$alignmentfile" | cut -c"$randomposition")
# replace the character at 'string:randomposition' with the character at 'randomline:randomposition'
sed "s/./$newAA/$randomposition" "$sequencefile"


(with some additional bits, obviously) and just looping through this thousands of times






























  • What is the application of this? Do you work in a field where this sort of thing may already be implemented by a field-specific standard tool?
    – Kusalananda
    Oct 8 at 18:34










  • Looks like a computational biology question: I agree with @Kusalananda: give us a real-world example.
    – Fabby
    Oct 8 at 18:37










  • Yes, it's indeed biology based. The idea is that I have a protein sequence, and I want to mutate that sequence randomly-ish; that is, I want to mutate it randomly, but only allow specific characters (amino acid residues) that have been observed previously at that specific position i.e. by swapping with characters in my alignment file at the same position. By doing it this way, I keep the protein sequence resembling something true, while also maintaining some information about the frequency of characters at each position (mutation to a rare character only occurs rarely etc). Does that help?
    – catchprj
    Oct 8 at 20:34










  • I edited the original question to include the relevant parts of the script I am currently using. Perhaps someone can see how I could make it change more than one position at a time?
    – catchprj
    Oct 8 at 20:50






  • Could you please include the expected output so we can work on it? At the moment it is not clear what the desired result is.
    – user90704
    Oct 8 at 22:11























bash scripting














edited Oct 10 at 16:15









Rui F Ribeiro











asked Oct 8 at 18:20









catchprj



















3 Answers


















0














Note:



This is strictly for amusement purposes; an equivalent program in C would be much simpler and orders of magnitude faster; as to bash, let's not even talk about it ;-)



The following Perl script mutates a list of ~1M sequences, using ~10k alignments, in about 10 seconds on my laptop.



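#! /usr/bin/perl
# usage: mutagen number_of_replacements alignment_file [ sequence_file ..]
use strict;
my $max = shift() - 1;              # replacements are indexed 0..$max
my $algf = shift;
open my $alg, '<', $algf or die "open $algf: $!";
my @alg = <$alg>;                   # all alignment lines, kept in memory

# return $max + 1 random integers in the range 0 .. $_[0] - 1
sub prand { map int(rand() * $_[0]), 0..$max }
while(<>){
    my @ip = prand length() - 1;    # random positions (the - 1 skips the newline)
    my @op = prand scalar @alg;     # random alignment lines
    for my $i (0..$max){
        my $p = $ip[$i];
        # overwrite position $p with the character at $p in the chosen alignment
        substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;
    }
    print;
}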


Usage example:



$ cat seq
1634870295
5684937021
2049163587
6598471230
$ cat alg
DPMBHZJEIO
INTMJZOYKQ
KNTXGLCJSR
GLJZRFVSEX
SYJVHEPNAZ
$ perl mutagen 3 alg seq
1L3V8702I5
5684HE7Y21
2049JZC587
6598H7C2E0
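
One way to sanity-check output like the above is to confirm that every changed character really occurs at the same position in some alignment line. The bash function below is a minimal sketch of such a check. The name check_mutation and its argument order are hypothetical, introduced here for illustration; they are not part of the Perl script.

```bash
#!/bin/bash
# check_mutation ORIGINAL MUTATED ALIGNMENT_FILE   (hypothetical helper)
# Succeeds only if MUTATED has the same length as ORIGINAL and every
# character that differs is found at the same position in at least one
# line of ALIGNMENT_FILE.
check_mutation() {
    local original=$1 mutated=$2 alignment=$3
    local i line ok
    [ "${#original}" -eq "${#mutated}" ] || return 1
    for ((i = 0; i < ${#original}; i++)); do
        # unchanged positions need no justification
        [ "${original:i:1}" = "${mutated:i:1}" ] && continue
        ok=0
        while IFS= read -r line; do
            if [ "${line:i:1}" = "${mutated:i:1}" ]; then
                ok=1
                break
            fi
        done < "$alignment"
        [ "$ok" -eq 1 ] || return 1
    done
    return 0
}
```

With the files from the usage example, check_mutation 1634870295 1L3V8702I5 alg succeeds, because L, V and I each appear in the matching column of alg.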


If the n generated random numbers must all be distinct, then prand should be changed to:



sub prand {
    my (@r, %h);
    die "more replacements than positions/alignments" if $max >= $_[0];
    for(0..$max){
        my $r = int(rand() * $_[0]);
        # on a collision, step forward (wrapping around) to the next unused value
        $r = ($r + 1) % $_[0] while $h{$r};
        $h{$r} = 1;
        push @r, $r;
    }
    @r;
}


A debug-enabled version that pretty-prints the mutations in color when given the -d switch:



#! /usr/bin/perl
# usage mutagen [-d] number_of_replacements alignment_file [ sequence_file ..]
use strict;

my $debug = $ARGV[0] eq '-d' ? shift : 0;
my $max = shift() - 1;
my $algf = shift;
open my $alg, $algf or die "open $algf: $!";
my @alg = <$alg>;

sub prand { map int(rand() * $_[0]), 0..$max }
while(<>){
    my @ip = prand length() - 1;
    my @op = prand scalar @alg;

    if($debug){
        my $t = ' ' x (length() - 1);
        substr $t, $ip[$_], 1, $ip[$_] for 0..$max;
        warn "@ip | @op\n $_ $t\n";
        for my $i (0..$max){
            my $t = $alg[$op[$i]];
            $t =~ s/(.{$ip[$i]})(.)/$1\e[1;31m$2\e[m/;
            printf STDERR " %2d %s", $op[$i], $t;
        }
    }
    for my $i (0..$max){
        my $p = $ip[$i];
        substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;
    }
    print;
    if($debug){
        my @t = split "", $_;
        for my $i (0..$max){
            $_ = "\e[1;31m$_\e[m" for $t[$ip[$i]];
        }
        warn " = ", @t, "\n";
    }
}




























  • Wow! I can't even begin to understand that, but it works perfectly. I bow down to you. Thanks so much
    – catchprj
    Oct 9 at 8:50










  • I notice the script breaks if you request more replacements than there are sequences in the alignment, which suggests that it will only access each alignment sequence once, right? Is it possible to choose a random alignment sequence each time, even if it has been chosen before? That way, in your usage example above, it would be possible to do >5 replacements (it's even possible, though unlikely, that they all come from the same alignment sequence).
    – catchprj
    Oct 9 at 9:07










  • Yes, that's very easy to do, and it will even simplify the script a lot -- replace the sub prand {...} with sub prand { map int(rand() * $_[0]), 0..$max }.
    – qubert
    Oct 9 at 9:32












  • @catchprj I've updated the answer.
    – qubert
    Oct 9 at 9:54



















0














This one-liner turns an endless stream of random bytes into random keys and prints the first one:



cat /dev/urandom | tr -dc 'A-Z0-9' | fold -w 10 | head -n 1


Sample output (five keys):



MB0JZZ85VI
2OKOY4JL61
2YN7B71Z6K
KH29TYCQ4K
B4N1XOFY5O


Explanation:



/dev/random, /dev/urandom and, on some systems, /dev/arandom are special files that serve as pseudorandom number generators. They provide access to environmental noise collected from device drivers and other sources.



The fold command is a command-line utility that wraps the lines of the specified files, or of standard input. By default it wraps lines at a maximum width of 80 columns. It also supports specifying the column width and wrapping by number of bytes. The -w flag sets the column width, which here determines how many characters each random key contains.



The character set passed to tr controls which characters may appear in the random keys: with -dc, tr deletes every byte that is not in the set.



head -n sets how many random keys are generated. For example, replacing -n 1 with -n 10000 would generate 10,000 keys.
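
The pipeline described above can be wrapped in a small function so that the number of keys and their width become parameters. This is only a sketch: random_keys is a made-up name, and LC_ALL=C is added on the assumption that tr should treat the random input as plain bytes rather than as text in the current locale.

```bash
#!/bin/bash
# random_keys COUNT WIDTH   (hypothetical helper)
# Print COUNT random keys, each WIDTH characters long, drawn from A-Z and 0-9.
random_keys() {
    local count=$1 width=$2
    # tr keeps only the wanted characters, fold cuts the stream into
    # fixed-width lines, and head stops the pipeline after COUNT lines.
    LC_ALL=C tr -dc 'A-Z0-9' < /dev/urandom | fold -w "$width" | head -n "$count"
}
```

For example, random_keys 5 10 prints five 10-character keys. As the comment below points out, these keys are purely random and do not mutate an existing sequence.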





























  • This would give a sequence of 10 random uppercase characters and/or digits, but it would not mutate a given string the way that the question asked about. Note that the strings in the question (the original one and the set of other strings) are not random. These (I assume) represent, for example, DNA that is mutated in different ways.
    – Kusalananda
    Oct 8 at 19:42












  • @Kusalananda, I appreciate your comment! Honestly, I could not find a simpler method to generate 10,000 replacements without looping. Let's wait for the OP's feedback and then we can go from there ;-)
    – user88036
    Oct 8 at 20:02





















0














Your original bash attempt was slow because of the number of external processes being started. Each random number required a call to jot, and each replacement ran sed twice and cut once.



As you're using bash, and not pure sh, you can benefit from the $RANDOM variable, Substring Expansion and Arrays. These make it possible to perform the replacements with no external commands -- not even any bash subshells.



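#!/bin/bash
# usage: script count sequence_file replacements_file

count=$1
read -r sequence < "$2"
# read every line of the replacements file into an array
IFS=$'\n' read -r -d '' -a replacements < "$3"
len=${#sequence}
choices=${#replacements[*]}

while ((count--)) ; do
    pos=$((RANDOM % len))
    choice=$((RANDOM % choices))
    replacement=${replacements[$choice]}
    # rebuild the sequence: prefix + replacement character + suffix
    sequence=${sequence:0:$pos}${replacement:$pos:1}${sequence:$((pos+1))}
done

echo "$sequence"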


Note that $RANDOM won't exceed 32767, so if your sequences are bigger than that (or even approaching that size), you will need something more complex than $RANDOM % maximum.
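
One possible way around the 32767 ceiling is to combine two $RANDOM draws into a single 30-bit value. The sketch below assumes bash. The helper name rand_below is hypothetical, and it returns its result in REPLY rather than on standard output so that no subshell is needed. Taking the remainder introduces a slight bias toward smaller values, which should be negligible as long as the maximum is far below 2^30.

```bash
#!/bin/bash
# rand_below MAX   (hypothetical helper)
# Store a pseudo-random integer in the range 0 .. MAX-1 in $REPLY.
# Two 15-bit $RANDOM draws are joined into one 30-bit number, so MAX may
# be as large as 2^30 (1073741824).
rand_below() {
    REPLY=$(( ((RANDOM << 15) | RANDOM) % $1 ))
}

# Example: pick a position in a sequence of 100000 characters.
rand_below 100000
pos=$REPLY
```

In the script above, pos=$(($RANDOM % $len)) would then become a call to rand_below "$len" followed by pos=$REPLY.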



This is still unlikely to beat a dedicated scripting language for speed, let alone a compiled language.
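
As a middle ground, the same replacement loop can be pushed into a single awk process, which starts no external command per replacement and avoids bash's string handling. This is a hedged sketch rather than a benchmarked solution. The function name mutate, its argument order and the optional seed argument are assumptions made for illustration, and the code assumes that every alignment line is at least as long as the sequence.

```bash
#!/bin/bash
# mutate COUNT SEQUENCE_FILE ALIGNMENT_FILE [SEED]   (hypothetical helper)
# Print each line of SEQUENCE_FILE after COUNT random replacements, each
# taken from the same position of a randomly chosen ALIGNMENT_FILE line.
mutate() {
    awk -v count="$1" -v seed="${4:-}" '
        BEGIN { if (seed != "") srand(seed); else srand() }
        NR == FNR { seqs[++nseq] = $0; next }   # first file: sequences
        { alg[++nalg] = $0 }                    # second file: alignment lines
        END {
            for (s = 1; s <= nseq; s++) {
                seq = seqs[s]
                len = length(seq)
                for (k = 0; k < count; k++) {
                    pos = int(rand() * len) + 1     # 1-based position
                    row = int(rand() * nalg) + 1    # 1-based alignment line
                    seq = substr(seq, 1, pos - 1) substr(alg[row], pos, 1) substr(seq, pos + 1)
                }
                print seq
            }
        }
    ' "$2" "$3"
}
```

Without a seed, awk seeds its generator from the time of day, so two runs started within the same second can produce identical output. When generating many variants in quick succession, passing a different SEED to each call avoids that.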






share|improve this answer





















    Your Answer








    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "106"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f474057%2frandom-mutagenesis-with-bash%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    3 Answers
    3






    active

    oldest

    votes








    3 Answers
    3






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    0














    Note:



    This is strictly for amusement purposes; an equivalent program in C would be much simpler and orders of magnitude faster; as to bash, let's not even talk about ;-)



    The following perl script will mutate a list of ~1M sequences, and ~10k alignments in about 10 seconds on my laptop.



    #! /usr/bin/perl
    # usage mutagen number_of_replacements alignment_file [ sequence_file ..]
    use strict;
    my $max = shift() - 1;
    my $algf = shift;
    open my $alg, $algf or die "open $algf: $!";
    my @alg = <$alg>;

    sub prand { map int(rand() * $_[0]), 0..$max }
    while(<>){
    my @ip = prand length() - 1;
    my @op = prand scalar @alg;
    for my $i (0..$max){
    my $p = $ip[$i];
    substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;
    }
    print;
    }


    Usage example:



    $ cat seq
    1634870295
    5684937021
    2049163587
    6598471230
    $ cat alg
    DPMBHZJEIO
    INTMJZOYKQ
    KNTXGLCJSR
    GLJZRFVSEX
    SYJVHEPNAZ
    $ perl mutagen 3 alg seq
    1L3V8702I5
    5684HE7Y21
    2049JZC587
    6598H7C2E0


    If the generated n random numbers have to be different between them, then prand should be changed to:



    sub prand {
    my (@r, $m, %h);
    die "more replacements than positions/alignments" if $max >= $_[0];
    for(0..$max){
    my $r = int(rand() * $_[0]);
    $r = ($r + 1) % $_[0] while $h{$r};
    $h{$r} = 1;
    push @r, $r;
    }
    @r;
    }


    A debug-enabled version, that will pretty-print the mutation with colors when given the -d switch:



    #! /usr/bin/perl
    # usage mutagen [-d] number_of_replacements alignment_file [ sequence_file ..]
    use strict;

    my $debug = $ARGV[0] eq '-d' ? shift : 0;
    my $max = shift() - 1;
    my $algf = shift;
    open my $alg, $algf or die "open $algf: $!";
    my @alg = <$alg>;

    sub prand { map int(rand() * $_[0]), 0..$max }
    while(<>){
    my @ip = prand length() - 1;
    my @op = prand scalar @alg;

    if($debug){
    my $t = ' ' x (length() - 1);
    substr $t, $ip[$_], 1, $ip[$_] for 0..$max;
    warn "@ip | @opn $_ $tn";
    for my $i (0..$max){
    my $t = $alg[$op[$i]];
    $t =~ s/(.{$ip[$i]})(.)/$1e[1;31m$2e[m/;
    printf STDERR " %2d %s", $op[$i], $t;
    }
    }
    for my $i (0..$max){
    my $p = $ip[$i];
    substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;
    }
    print;
    if($debug){
    my @t = split "", $_;
    for my $i (0..$max){
    $_ = "e[1;31m$_e[m" for $t[$ip[$i]];
    }
    warn " = ", @t, "n";
    }
    }





    share|improve this answer























    • Wow! I can't even begin to understand that, but it works perfectly. I bow down to you. Thanks so much
      – catchprj
      Oct 9 at 8:50










    • I notice the script breaks if you request more replacements than there are sequences in the alignment, which suggests that it will only access each alignment sequence once, right? Is it possible to chose a random alignment sequence each time, even if it has been chosen before? That way, in your usage example above, it would be possible to do >5 replacements (it's even possible, though unlikely, they all come from the same alignment sequence)
      – catchprj
      Oct 9 at 9:07










    • Yes that's very easy to do, it will even simplify the script a lot -- replace the sub prand {...} with sub sub prand { map int(rand() * $_[0]), 0..$max }.
      – qubert
      Oct 9 at 9:32












    • @catchprj I've updated the answer.
      – qubert
      Oct 9 at 9:54
















    0














    Note:



    This is strictly for amusement purposes; an equivalent program in C would be much simpler and orders of magnitude faster; as to bash, let's not even talk about ;-)



    The following perl script will mutate a list of ~1M sequences, and ~10k alignments in about 10 seconds on my laptop.



    #! /usr/bin/perl
    # usage mutagen number_of_replacements alignment_file [ sequence_file ..]
    use strict;
    my $max = shift() - 1;
    my $algf = shift;
    open my $alg, $algf or die "open $algf: $!";
    my @alg = <$alg>;

    sub prand { map int(rand() * $_[0]), 0..$max }
    while(<>){
    my @ip = prand length() - 1;
    my @op = prand scalar @alg;
    for my $i (0..$max){
    my $p = $ip[$i];
    substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;
    }
    print;
    }


    Usage example:



    $ cat seq
    1634870295
    5684937021
    2049163587
    6598471230
    $ cat alg
    DPMBHZJEIO
    INTMJZOYKQ
    KNTXGLCJSR
    GLJZRFVSEX
    SYJVHEPNAZ
    $ perl mutagen 3 alg seq
    1L3V8702I5
    5684HE7Y21
    2049JZC587
    6598H7C2E0


    If the generated n random numbers have to be different between them, then prand should be changed to:



    sub prand {
    my (@r, $m, %h);
    die "more replacements than positions/alignments" if $max >= $_[0];
    for(0..$max){
    my $r = int(rand() * $_[0]);
    $r = ($r + 1) % $_[0] while $h{$r};
    $h{$r} = 1;
    push @r, $r;
    }
    @r;
    }


    A debug-enabled version, that will pretty-print the mutation with colors when given the -d switch:



    #! /usr/bin/perl
    # usage mutagen [-d] number_of_replacements alignment_file [ sequence_file ..]
    use strict;

    my $debug = $ARGV[0] eq '-d' ? shift : 0;
    my $max = shift() - 1;
    my $algf = shift;
    open my $alg, $algf or die "open $algf: $!";
    my @alg = <$alg>;

    sub prand { map int(rand() * $_[0]), 0..$max }
    while(<>){
    my @ip = prand length() - 1;
    my @op = prand scalar @alg;

    if($debug){
    my $t = ' ' x (length() - 1);
    substr $t, $ip[$_], 1, $ip[$_] for 0..$max;
    warn "@ip | @opn $_ $tn";
    for my $i (0..$max){
    my $t = $alg[$op[$i]];
    $t =~ s/(.{$ip[$i]})(.)/$1e[1;31m$2e[m/;
    printf STDERR " %2d %s", $op[$i], $t;
    }
    }
    for my $i (0..$max){
    my $p = $ip[$i];
    substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;
    }
    print;
    if($debug){
    my @t = split "", $_;
    for my $i (0..$max){
    $_ = "e[1;31m$_e[m" for $t[$ip[$i]];
    }
    warn " = ", @t, "n";
    }
    }





    share|improve this answer























    • Wow! I can't even begin to understand that, but it works perfectly. I bow down to you. Thanks so much
      – catchprj
      Oct 9 at 8:50










    • I notice the script breaks if you request more replacements than there are sequences in the alignment, which suggests that it will only access each alignment sequence once, right? Is it possible to chose a random alignment sequence each time, even if it has been chosen before? That way, in your usage example above, it would be possible to do >5 replacements (it's even possible, though unlikely, they all come from the same alignment sequence)
      – catchprj
      Oct 9 at 9:07










    • Yes that's very easy to do, it will even simplify the script a lot -- replace the sub prand {...} with sub sub prand { map int(rand() * $_[0]), 0..$max }.
      – qubert
      Oct 9 at 9:32












    • @catchprj I've updated the answer.
      – qubert
      Oct 9 at 9:54














    0












    0








    0






    Note:



    This is strictly for amusement purposes; an equivalent program in C would be much simpler and orders of magnitude faster; as to bash, let's not even talk about ;-)



    The following perl script will mutate a list of ~1M sequences, and ~10k alignments in about 10 seconds on my laptop.



    #! /usr/bin/perl
    # usage mutagen number_of_replacements alignment_file [ sequence_file ..]
    use strict;
    my $max = shift() - 1;
    my $algf = shift;
    open my $alg, $algf or die "open $algf: $!";
    my @alg = <$alg>;

    sub prand { map int(rand() * $_[0]), 0..$max }
    while(<>){
    my @ip = prand length() - 1;
    my @op = prand scalar @alg;
    for my $i (0..$max){
    my $p = $ip[$i];
    substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;
    }
    print;
    }


    Usage example:



    $ cat seq
    1634870295
    5684937021
    2049163587
    6598471230
    $ cat alg
    DPMBHZJEIO
    INTMJZOYKQ
    KNTXGLCJSR
    GLJZRFVSEX
    SYJVHEPNAZ
    $ perl mutagen 3 alg seq
    1L3V8702I5
    5684HE7Y21
    2049JZC587
    6598H7C2E0


    If the generated n random numbers have to be different between them, then prand should be changed to:



    sub prand {
    my (@r, $m, %h);
    die "more replacements than positions/alignments" if $max >= $_[0];
    for(0..$max){
    my $r = int(rand() * $_[0]);
    $r = ($r + 1) % $_[0] while $h{$r};
    $h{$r} = 1;
    push @r, $r;
    }
    @r;
    }


    A debug-enabled version, that will pretty-print the mutation with colors when given the -d switch:



    #! /usr/bin/perl
    # usage: mutagen [-d] number_of_replacements alignment_file [ sequence_file ..]
    use strict;

    my $debug = $ARGV[0] eq '-d' ? shift : 0;
    my $max = shift() - 1;
    my $algf = shift;
    open my $alg, '<', $algf or die "open $algf: $!";
    my @alg = <$alg>;

    sub prand { map int(rand() * $_[0]), 0..$max }
    while(<>){
        my @ip = prand length() - 1;
        my @op = prand scalar @alg;

        if($debug){
            # show the chosen positions and alignment indices, then the input line
            my $t = ' ' x (length() - 1);
            substr $t, $ip[$_], 1, $ip[$_] for 0..$max;
            warn "@ip | @op\n $_ $t\n";
            for my $i (0..$max){
                my $t = $alg[$op[$i]];
                # highlight the donor character in bold red
                $t =~ s/(.{$ip[$i]})(.)/$1\e[1;31m$2\e[m/;
                printf STDERR " %2d %s", $op[$i], $t;
            }
        }
        for my $i (0..$max){
            my $p = $ip[$i];
            substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;
        }
        print;
        if($debug){
            # print the result again with the replaced characters highlighted
            my @t = split "", $_;
            for my $i (0..$max){
                $_ = "\e[1;31m$_\e[m" for $t[$ip[$i]];
            }
            warn " = ", @t, "\n";
        }
    }





    edited Oct 9 at 10:03

























    answered Oct 8 at 22:58









    qubert













    • Wow! I can't even begin to understand that, but it works perfectly. I bow down to you. Thanks so much
      – catchprj
      Oct 9 at 8:50










    • I notice the script breaks if you request more replacements than there are sequences in the alignment, which suggests that it will only access each alignment sequence once, right? Is it possible to choose a random alignment sequence each time, even if it has been chosen before? That way, in your usage example above, it would be possible to do >5 replacements (it's even possible, though unlikely, that they all come from the same alignment sequence)
      – catchprj
      Oct 9 at 9:07










    • Yes that's very easy to do, it will even simplify the script a lot -- replace the sub prand {...} with sub prand { map int(rand() * $_[0]), 0..$max }.
      – qubert
      Oct 9 at 9:32












    • @catchprj I've updated the answer.
      – qubert
      Oct 9 at 9:54
































    This one-liner draws random keys from an endless stream of random bytes; head -n sets how many are printed:



    cat /dev/urandom | tr -dc 'A-Z0-9' | fold -w 10 | head -n 1


    Sample output (with head -n 5):



    MB0JZZ85VI
    2OKOY4JL61
    2YN7B71Z6K
    KH29TYCQ4K
    B4N1XOFY5O


    Explanation:



    /dev/random, /dev/urandom or even /dev/arandom are special files that serve as pseudorandom number generators in the system. They allow access to environmental noise collected from device drivers and other sources.



    The fold command is a command-line utility that wraps the lines of the specified files, or of standard input. By default it wraps at a maximum width of 80 columns; it also supports setting the width and counting in bytes instead of columns. The -w flag sets the column width, which here determines how many characters each random key contains.



    The character set given to tr (not a regular expression) controls which characters end up in the random keys: with -dc, tr deletes every byte that is not in the set.



    head -n sets how many random keys are printed. For example, replacing -n 1 with -n 10000 would generate 10,000 keys.
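
The pieces described above can be wrapped into a small function, sketched here under a few assumptions. The name random_keys and its two parameters are made up for illustration. LC_ALL=C is added so that tr treats the input as raw bytes, which some tr implementations need when reading /dev/urandom. Like the one-liner, this only generates random keys; it does not mutate an existing sequence.

```shell
#!/bin/bash
# random_keys COUNT WIDTH: print COUNT random keys of WIDTH characters each.
# Hypothetical wrapper around the tr | fold | head pipeline.
random_keys() {
    local count=$1 width=$2
    # tr keeps only A-Z and 0-9, fold cuts the stream into WIDTH-character
    # lines, and head stops the endless stream after COUNT lines.
    LC_ALL=C tr -dc 'A-Z0-9' < /dev/urandom | fold -w "$width" | head -n "$count"
}
```

Calling random_keys 5 10 prints five keys of ten characters each.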






    edited Oct 9 at 10:08

























    answered Oct 8 at 19:02







    user88036



















    • This would give a sequence of 10 random uppercase characters and/or digits, but it would not mutate a given string the way that the question asked about. Note that the strings in the question (the original one and the set of other strings) are not random. These (I assume) represent, for example, DNA that is mutated in different ways.
      – Kusalananda
      Oct 8 at 19:42












    • @Kusalananda, I appreciate your comment! Honestly, I could not find a simpler method to generate 10,000 replacements without looping; let's wait for the OP's feedback and then we can go from there ;-)
      – user88036
      Oct 8 at 20:02


































    Your original bash attempt was slow because of the number of external processes being started. Each iteration ran jot twice for the random numbers, then sed twice and cut once for the string manipulation.



    As you're using bash, and not pure sh, you can benefit from the $RANDOM variable, Substring Expansion and Arrays. These make it possible to perform the replacements with no external commands -- not even any bash subshells.



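    #!/bin/bash
    # arguments: count sequence_file replacements_file

    count=$1
    read -r sequence < "$2"                           # first line of the sequence file
    IFS=$'\n' read -r -d '' -a replacements < "$3"    # one array element per line
    len=${#sequence}
    choices=${#replacements[*]}

    while ((count--)) ; do
        pos=$((RANDOM % len))
        choice=$((RANDOM % choices))
        replacement=${replacements[$choice]}
        # splice the donor character into the sequence at position pos
        sequence=${sequence:0:$pos}${replacement:$pos:1}${sequence:$((pos+1))}
    done

    echo "$sequence"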


    Note that $RANDOM won't exceed 32767, so if your sequences are bigger than that (or even approaching that size), you will need something more complex than $RANDOM % maximum.
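
One possible way around the 32767 limit, sketched here as an assumption rather than as part of the script above, is to glue two $RANDOM draws together into a 30-bit value. The helper name rand_below is made up for illustration. It returns its result in REPLY instead of printing it, so that calling it does not need a subshell. The modulo step still carries a small bias unless the maximum divides 2^30.

```shell
#!/bin/bash
# rand_below MAX: set REPLY to a random integer in 0..MAX-1, for MAX up to 2^30.
# Hypothetical helper; combines two 15-bit $RANDOM draws into one 30-bit value.
rand_below() {
    REPLY=$(( ((RANDOM << 15) | RANDOM) % $1 ))
}
```

In the loop, pos=$((RANDOM % len)) would then become rand_below "$len"; pos=$REPLY.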



    This is still unlikely to beat a dedicated scripting language for speed, let alone a compiled language.






        answered Oct 9 at 13:32









        JigglyNaga






























