Random mutagenesis with bash
I have a string e.g.
1234567890
and I want to replace random positions of that string with the character at the corresponding position of a random sequence from a set of other strings, e.g.
ABCDEFGHIJ
KLMNOPQRST
UVWXYZABCD
...
If I choose to make 3 replacements, the script should choose 3 random positions, e.g. 3, 7, 8, and 3 random sequences, e.g. 1, 1, 3, and make the replacements to generate the expected output:
12C456GB90
Is there a way to do this without significant looping? I wrote a simple bash script to generate a random position and a random sequence line, do 1 replacement, then repeat the process on the output, again and again. This works perfectly; however, in my real-life files (much larger than the examples), I want to generate 10,000 or more replacements. Oh, and I will need to do this multiple times to generate multiple 'mutated' variant sequences.
EDIT: At the moment I am using something like this:
# choose a random number between 1 and the number of characters in the string
randomposition=$(jot -r 1 1 "$seqpositions")
# choose a random number between 1 and the number of lines in the set of potential replacement strings
randomline=$(jot -r 1 1 "$alignlines")
# find the character at randomline:randomposition
newAA=$(sed -n "$randomline,$randomline p" "$alignmentfile" | cut -c"$randomposition")
# replace the character at 'string:randomposition' with the character at 'randomline:randomposition'
sed "s/./$newAA/$randomposition" "$sequencefile"
(with some additional bits, obviously) and just looping through this thousands of times
bash scripting
What is the application of this? Do you work in a field where this sort of thing may already be implemented by a field specific standard tool?
– Kusalananda
Oct 8 at 18:34
Looks like a computational biology question: I agree with @Kusalananda: give us a real-world example.
– Fabby
Oct 8 at 18:37
Yes, it's indeed biology based. The idea is that I have a protein sequence, and I want to mutate that sequence randomly-ish; that is, I want to mutate it randomly, but only allow specific characters (amino acid residues) that have been observed previously at that specific position i.e. by swapping with characters in my alignment file at the same position. By doing it this way, I keep the protein sequence resembling something true, while also maintaining some information about the frequency of characters at each position (mutation to a rare character only occurs rarely etc). Does that help?
– catchprj
Oct 8 at 20:34
I edited the original question to include the relevant parts of the script I am currently using. Perhaps someone can see how I could make it change more than one position at a time?
– catchprj
Oct 8 at 20:50
Could you please include the expected output so we can work on it? Currently, it is not clear what the desired result is.
– user90704
Oct 8 at 22:11
edited Oct 10 at 16:15 by Rui F Ribeiro
asked Oct 8 at 18:20 by catchprj
3 Answers
Note: This is strictly for amusement purposes; an equivalent program in C would be much simpler and orders of magnitude faster; as to bash, let's not even talk about it ;-)
The following perl script will mutate a list of ~1M sequences, using ~10k alignments, in about 10 seconds on my laptop.
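#! /usr/bin/perl
# usage: mutagen number_of_replacements alignment_file [ sequence_file ..]
use strict;
my $max = shift() - 1;                # index of the last replacement (count - 1)
my $algf = shift;
open my $alg, '<', $algf or die "open $algf: $!";
my @alg = <$alg>;                     # all alignment lines, kept in memory
# return $max + 1 random integers in 0 .. $_[0] - 1 (repeats allowed)
sub prand { map int(rand() * $_[0]), 0..$max }
while(<>){
    my @ip = prand length() - 1;      # positions in the line (trailing newline excluded)
    my @op = prand scalar @alg;       # alignment line used for each replacement
    for my $i (0..$max){
        my $p = $ip[$i];
        # overwrite position $p with the character at the same position of the chosen alignment
        substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;
    }
    print;
}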
Usage example:
$ cat seq
1634870295
5684937021
2049163587
6598471230
$ cat alg
DPMBHZJEIO
INTMJZOYKQ
KNTXGLCJSR
GLJZRFVSEX
SYJVHEPNAZ
$ perl mutagen 3 alg seq
1L3V8702I5
5684HE7Y21
2049JZC587
6598H7C2E0
If the n generated random numbers have to be distinct from each other, then prand should be changed to:
sub prand {
    my (@r, %h);
    die "more replacements than positions/alignments" if $max >= $_[0];
    for(0..$max){
        my $r = int(rand() * $_[0]);
        # on a collision, step forward to the next unused value
        $r = ($r + 1) % $_[0] while $h{$r};
        $h{$r} = 1;
        push @r, $r;
    }
    @r;
}
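For comparison, distinct positions can also be drawn in plain bash with a partial Fisher-Yates shuffle. This is only a minimal sketch and not part of the Perl script above: the function name distinct_positions and the result array positions are hypothetical, and it assumes the sequence length is at most 32768 so that a single $RANDOM draw covers the range. Unlike stepping forward on a collision, the shuffle does not favour values that sit next to already-chosen ones.

```bash
#!/bin/bash
# distinct_positions COUNT LEN
# Fills the global array `positions` (hypothetical name) with COUNT distinct
# 0-based indices below LEN, using a partial Fisher-Yates shuffle.
# Assumes LEN <= 32768, because one $RANDOM draw is at most 32767.
distinct_positions() {
    local count=$1 len=$2 i j tmp
    local -a pool
    if (( count > len )); then
        echo "more replacements than positions" >&2
        return 1
    fi
    for (( i = 0; i < len; i++ )); do
        pool[i]=$i
    done
    # Swap a random not-yet-used element into each of the first COUNT slots.
    for (( i = 0; i < count; i++ )); do
        j=$(( i + RANDOM % (len - i) ))
        tmp=${pool[i]}
        pool[i]=${pool[j]}
        pool[j]=$tmp
    done
    positions=("${pool[@]:0:count}")
}

# Example: three distinct positions in a 10-character sequence.
distinct_positions 3 10
echo "${positions[@]}"
```

The indices are 0-based, which matches both Perl's substr and bash's ${var:offset:length} expansion.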
A debug-enabled version that will pretty-print the mutation with colors when given the -d switch:
#! /usr/bin/perl
# usage: mutagen [-d] number_of_replacements alignment_file [ sequence_file ..]
use strict;
my $debug = $ARGV[0] eq '-d' ? shift : 0;
my $max = shift() - 1;
my $algf = shift;
open my $alg, '<', $algf or die "open $algf: $!";
my @alg = <$alg>;
sub prand { map int(rand() * $_[0]), 0..$max }
while(<>){
    my @ip = prand length() - 1;
    my @op = prand scalar @alg;
    if($debug){
        # print the chosen positions and alignment lines, then mark the positions under the sequence
        my $t = ' ' x (length() - 1);
        substr $t, $ip[$_], 1, $ip[$_] for 0..$max;
        warn "@ip | @op\n    $_    $t\n";
        for my $i (0..$max){
            my $t = $alg[$op[$i]];
            # highlight the donor character in bold red
            $t =~ s/(.{$ip[$i]})(.)/$1\e[1;31m$2\e[m/;
            printf STDERR " %2d %s", $op[$i], $t;
        }
    }
    for my $i (0..$max){
        my $p = $ip[$i];
        substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;
    }
    print;
    if($debug){
        # print the result again with the replaced characters highlighted
        my @t = split "", $_;
        for my $i (0..$max){
            $_ = "\e[1;31m$_\e[m" for $t[$ip[$i]];
        }
        warn "  = ", @t, "\n";
    }
}
Wow! I can't even begin to understand that, but it works perfectly. I bow down to you. Thanks so much
– catchprj
Oct 9 at 8:50
I notice the script breaks if you request more replacements than there are sequences in the alignment, which suggests that it will only access each alignment sequence once, right? Is it possible to choose a random alignment sequence each time, even if it has been chosen before? That way, in your usage example above, it would be possible to do >5 replacements (it's even possible, though unlikely, that they all come from the same alignment sequence).
– catchprj
Oct 9 at 9:07
Yes, that's very easy to do, and it will even simplify the script a lot: replace the sub prand {...} with sub prand { map int(rand() * $_[0]), 0..$max }.
– qubert
Oct 9 at 9:32
@catchprj I've updated the answer.
– qubert
Oct 9 at 9:54
This one-liner generates random keys (the underlying stream is endless; head limits how many keys are printed):
cat /dev/urandom | tr -dc 'A-Z0-9' | fold -w 10 | head -n 1
Sample output (five keys, e.g. with head -n 5):
MB0JZZ85VI
2OKOY4JL61
2YN7B71Z6K
KH29TYCQ4K
B4N1XOFY5O
Explanation:
/dev/random, /dev/urandom or even /dev/arandom are special files that serve as pseudorandom number generators in the system. They allow access to environmental noise collected from device drivers and other sources.
The fold command in UNIX is a command line utility for folding the contents of specified files, or standard input. By default it wraps lines at a maximum width of 80 columns. It also supports specifying the column width and wrapping by number of bytes. The -w flag of fold sets the column width, which here determines how many characters each random key contains.
The character set passed to tr controls which characters appear in the random keys: with -dc, every character outside the set is deleted.
head -n sets how many random keys are generated. For example, replacing -n 1 with -n 10000 generates 10,000 keys.
This would give a sequence of 10 random uppercase characters and/or digits, but it would not mutate a given string the way that the question asked about. Note that the strings in the question (the original one and the set of other strings) are not random. These (I assume) represent, for example, DNA that is mutated in different ways.
– Kusalananda
Oct 8 at 19:42
@Kusalananda, I appreciate your comment! Honestly, I could not find a simpler method to generate 10,000 replacements without looping; let's wait for the OP's feedback and then we can go from there ;-)
– user88036
Oct 8 at 20:02
Your original bash attempt was slow because of the number of external processes being started. Each random number called jot, and each string manipulation used two sed calls and a cut.
As you're using bash, and not pure sh, you can benefit from the $RANDOM variable, Substring Expansion and Arrays. These make it possible to perform the replacements with no external commands, not even any bash subshells.
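#!/bin/bash
# arguments: number_of_replacements sequence_file alignment_file
count=$1
read -r sequence < "$2"
# read every line of the alignment file into an array
IFS=$'\n' read -r -d '' -a replacements < "$3"
len=${#sequence}
choices=${#replacements[*]}
while ((count--)) ; do
    pos=$((RANDOM % len))
    choice=$((RANDOM % choices))
    replacement=${replacements[$choice]}
    # keep everything before pos, take one character from the replacement, keep the rest
    sequence=${sequence:0:$pos}${replacement:$pos:1}${sequence:$((pos+1))}
done
echo "$sequence"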
Note that $RANDOM won't exceed 32767, so if your sequences are longer than that (or even approaching that length), you will need something more complex than $RANDOM % maximum.
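One possible way around the 15-bit limit is to glue two $RANDOM draws together into a 30-bit value before taking the remainder. The following is a minimal sketch under that assumption; the function name rand_below and the result variable rand_result are hypothetical, and the result is returned through a variable rather than via command substitution so that no subshell is started. A small modulo bias remains unless the maximum divides 2^30 evenly.

```bash
#!/bin/bash
# rand_below N
# Stores a pseudo-random integer in 0 .. N-1 in the global variable
# `rand_result` (hypothetical name). Works for N up to 2^30 by combining
# two 15-bit $RANDOM draws into a single 30-bit value.
rand_below() {
    rand_result=$(( ((RANDOM << 15) | RANDOM) % $1 ))
}

# Example: pick a 0-based position in a sequence of 100000 characters.
rand_below 100000
echo "$rand_result"
```

In the script above, pos=$((RANDOM % len)) would then become rand_below "$len" followed by pos=$rand_result.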
This is still unlikely to beat a dedicated scripting language for speed, let alone a compiled language.
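Since the question also asks for several mutated variants, the loop above could be wrapped in a function and called repeatedly. This is a rough sketch rather than a tested tool: the function name mutate and the result variable mutated are hypothetical, the data is the example from the question, and it assumes every replacement string is at least as long as the sequence and that the sequence is short enough for a single $RANDOM draw.

```bash
#!/bin/bash
# mutate SEQUENCE COUNT REPLACEMENT...
# Stores SEQUENCE with COUNT random single-character replacements in the
# global variable `mutated` (hypothetical name). Each replacement character
# comes from the same position of a randomly chosen REPLACEMENT string.
mutate() {
    local sequence=$1 count=$2
    shift 2
    local -a replacements=("$@")
    local len=${#sequence} choices=${#replacements[@]}
    local pos choice replacement
    while (( count-- > 0 )); do
        pos=$(( RANDOM % len ))
        choice=$(( RANDOM % choices ))
        replacement=${replacements[choice]}
        sequence=${sequence:0:pos}${replacement:pos:1}${sequence:pos+1}
    done
    mutated=$sequence
}

# Emit 5 variants of the question's example, 3 replacements each.
for (( v = 0; v < 5; v++ )); do
    mutate 1234567890 3 ABCDEFGHIJ KLMNOPQRST UVWXYZABCD
    echo "$mutated"
done
```

Every variant starts from the original sequence, so the variants are independent of each other; as in the script above, the same position may be picked more than once within one variant.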
The Perl answer above: answered Oct 8 at 22:58 by qubert, edited Oct 9 at 10:03
Wow! I can't even begin to understand that, but it works perfectly. I bow down to you. Thanks so much
– catchprj
Oct 9 at 8:50
I notice the script breaks if you request more replacements than there are sequences in the alignment, which suggests that it will only access each alignment sequence once, right? Is it possible to chose a random alignment sequence each time, even if it has been chosen before? That way, in your usage example above, it would be possible to do >5 replacements (it's even possible, though unlikely, they all come from the same alignment sequence)
– catchprj
Oct 9 at 9:07
Yes that's very easy to do, it will even simplify the script a lot -- replace thesub prand {...}
with subsub prand { map int(rand() * $_[0]), 0..$max }
.
– qubert
Oct 9 at 9:32
@catchprj I've updated the answer.
– qubert
Oct 9 at 9:54
This one-liner generates random keys; as written it prints one, and head -n controls how many:
cat /dev/urandom | tr -dc 'A-Z0-9' | fold -w 10 | head -n 1
Sample output (five keys, as produced with head -n 5):
MB0JZZ85VI
2OKOY4JL61
2YN7B71Z6K
KH29TYCQ4K
B4N1XOFY5O
Explanation:
/dev/random, /dev/urandom or even /dev/arandom are special files that serve as pseudorandom number generators in the system. They allow access to environmental noise collected from device drivers and other sources.
The fold command is a command-line utility that wraps the contents of the specified files, or standard input, to a given width. By default it wraps lines at a maximum width of 80 columns. It also supports specifying the column width (-w) and counting bytes rather than columns (-b). Here, -w 10 sets the length of each generated key.
The character set passed to tr controls which characters are included in the random keys: -d deletes characters and -c complements the set, so everything except A-Z and 0-9 is removed.
head -n sets how many random keys are generated. For example, replacing -n 1 with -n 10000 would generate 10,000 keys.
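As a minimal sketch of how the width and count can be made adjustable, the pipeline can be wrapped in a function. The name random_keys is hypothetical, and LC_ALL=C is an added assumption so that tr treats /dev/urandom as raw bytes on systems with a multibyte locale. Like the one-liner, this produces random keys only; it does not mutate an existing sequence.

```shell
#!/bin/bash
# random_keys WIDTH COUNT: print COUNT random keys of WIDTH characters each.
# (random_keys is a hypothetical helper name, not part of the answer above.)
random_keys() {
    local width=$1 count=$2
    # tr keeps only A-Z and 0-9, fold cuts the stream into WIDTH-character
    # lines, and head stops the otherwise endless stream after COUNT lines.
    LC_ALL=C tr -dc 'A-Z0-9' < /dev/urandom | fold -w "$width" | head -n "$count"
}

random_keys 10 5
```

Reading /dev/urandom with a redirection also avoids the extra cat process.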
This would give a sequence of 10 random uppercase characters and/or digits, but it would not mutate a given string the way that the question asked about. Note that the strings in the question (the original one and the set of other strings) are not random. These (I assume) represent, for example, DNA that is mutated in different ways.
– Kusalananda
Oct 8 at 19:42
@Kusalananda, I appreciate your comment! Honestly, I could not find a simpler method to generate 10,000 replacements without looping. Let's wait for the OP's feedback and then we can go from there ;-)
– user88036
Oct 8 at 20:02
edited Oct 9 at 10:08
answered Oct 8 at 19:02
user88036
Your original bash attempt was slow because of the number of external processes being started. Each random number called jot, and each string manipulation used two sed and a cut.
As you're using bash, and not pure sh, you can benefit from the $RANDOM variable, Substring Expansion and Arrays. These make it possible to perform the replacements with no external commands -- not even any bash subshells.
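#!/bin/bash
# usage: script COUNT SEQUENCEFILE ALIGNMENTFILE
count=$1
read sequence < "$2"
# read the alignment file into an array, one line per element
IFS=$'\n' read -d '' -a replacements < "$3"
len=${#sequence}
choices=${#replacements[*]}
while ((count--)) ; do
    pos=$(($RANDOM % $len))
    choice=$(($RANDOM % $choices))
    replacement=${replacements[$choice]}
    # keep everything before and after pos, swap in the donor character at pos
    sequence=${sequence:0:$pos}${replacement:$pos:1}${sequence:$((pos+1))}
done
echo "$sequence"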
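To see the Substring Expansion step in isolation, here is a small worked sketch that replays the fixed picks from the question instead of drawing random ones. The helper name replace_at is hypothetical, and unlike the script above it uses a command substitution for readability, so it is meant as an illustration rather than a faster variant.

```shell
#!/bin/bash
# replace_at SEQUENCE DONOR POS: print SEQUENCE with the character at the
# 0-based index POS swapped for the character at the same index in DONOR.
# (replace_at is a hypothetical helper, named here for illustration.)
replace_at() {
    local sequence=$1 donor=$2 pos=$3
    printf '%s\n' "${sequence:0:pos}${donor:pos:1}${sequence:pos+1}"
}

sequence=1234567890
replacements=(ABCDEFGHIJ KLMNOPQRST UVWXYZABCD)

# The question's picks, converted from 1-based to 0-based:
# positions 3,7,8 -> 2,6,7 and sequences 1,1,3 -> 0,0,2.
positions=(2 6 7)
choices=(0 0 2)

for i in "${!positions[@]}"; do
    sequence=$(replace_at "$sequence" "${replacements[${choices[i]}]}" "${positions[i]}")
done
echo "$sequence"   # prints 12C456GB90
```

The result matches the expected output given in the question.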
Note that $RANDOM won't exceed 32767, so if your sequences are bigger than that (or even approaching that size), you will need something more complex than $RANDOM % maximum.
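One possible way around the 32767 limit, offered as a sketch rather than the only option, is to combine two 15-bit $RANDOM values into a single 30-bit value before taking the remainder. The helper name rand_below is hypothetical; it returns its result in REPLY so that no subshell is needed. A slight modulo bias remains, which is small as long as the maximum is far below 2^30.

```shell
#!/bin/bash
# rand_below MAX: set REPLY to a pseudorandom integer in 0..MAX-1.
# Two 15-bit $RANDOM values are combined into one 30-bit value, so MAX
# can be as large as 2^30. (rand_below is a hypothetical helper name.)
rand_below() {
    local max=$1
    REPLY=$(( ((RANDOM << 15) | RANDOM) % max ))
}

len=100000          # example length, larger than $RANDOM can reach on its own
rand_below "$len"
pos=$REPLY
echo "$pos"
```

In the script above, pos=$(($RANDOM % $len)) would then become rand_below "$len"; pos=$REPLY.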
This is still unlikely to beat a dedicated scripting language for speed, let alone a compiled language.
answered Oct 9 at 13:32
JigglyNaga
What is the application of this? Do you work in a field where this sort of thing may already be implemented by a field specific standard tool?
– Kusalananda
Oct 8 at 18:34
Looks like a computational biology question: I agree with @Kusalananda: give us a real-world example.
– Fabby
Oct 8 at 18:37
Yes, it's indeed biology based. The idea is that I have a protein sequence, and I want to mutate that sequence randomly-ish; that is, I want to mutate it randomly, but only allow specific characters (amino acid residues) that have been observed previously at that specific position i.e. by swapping with characters in my alignment file at the same position. By doing it this way, I keep the protein sequence resembling something true, while also maintaining some information about the frequency of characters at each position (mutation to a rare character only occurs rarely etc). Does that help?
– catchprj
Oct 8 at 20:34
I edited the original question to include the relevant parts of the script I am currently using. Perhaps someone can see how I could make it change more than one position at a time?
– catchprj
Oct 8 at 20:50
Could you please include the expected output so we can work on it? Currently, it is not clear what the desired result is.
– user90704
Oct 8 at 22:11