Extract text from brackets
I have a file like this:
input file:
Evigen1000005_c0_g1_i1 0.240 1.212 1.408 3.784 2.029 0.963 -1.22409810298695 1 NA NA NA NA PF04597.13;Ribophorin_I;4.6e-148 NA 1;21;0.875 len=569;ExpAA=32.49;First60=12.82;PredHel=1;Topology=o433-450i Q9SFX3 OST1A_ARATH reviewed Dolichyl-diphosphooligosaccharide--protein_glycosyltransferase_subunit_1A_(EC_2.4.99.18)_(Ribophorin_IA)_(RPN-IA)_(Ribophorin-1A) OST1A_RPN1A_At1g76400_F15M4.10 Arabidopsis_thaliana_(Mouse-ear_cress) 614 protein_N-linked_glycosylation_via_asparagine_[GO:0018279] endoplasmic_reticulum_[GO:0005783];_integral_component_of_membrane_[GO:0016021];_membrane_[GO:0016020];_oligosaccharyltransferase_complex_[GO:0008250] dolichyl-diphosphooligosaccharide-protein_glycotransferase_activity_[GO:0004579] 3702.AT1G76400.1; PF04597; IPR007676; 3702 ath:AT1G76400; F15M4.10 2.4.99.18 SUBCELLULAR_LOCATION:_Endoplasmic_reticulum_membrane_{ECO:0000250};_Single-pass_type_I_membrane_protein_{ECO:0000250}. SIGNAL_1_25_{ECO:0000255}. AT1G76400.1; NA NA NA NA NA NA NA NA NA NA NA
Evigen1000006_c0_g1_i1 0.358 0.179 0.000 0.424 0.139 0.183 NA NA NA NA NA NA PF07767.10;Nop53_(60S_ribosomal_biogenesis);5.2e-21 NA 1;31;0.588 len=170;ExpAA=14.33;First60=14.27;PredHel=0;Topology=o O22892 NOP53_ARATH reviewed Ribosome_biogenesis_protein_NOP53 At2g40430_T2P4.22 Arabidopsis_thaliana_(Mouse-ear_cress) 442 ribosomal_large_subunit_assembly_[GO:0000027];_ribosomal_large_subunit_export_from_nucleus_[GO:0000055] nucleolus_[GO:0005730];_nucleoplasm_[GO:0005654] rRNA_binding_[GO:0019843] 3702.AT2G40430.2; PF07767; IPR011687; 3702 ath:AT2G40430; T2P4.22 SUBCELLULAR_LOCATION:_Nucleus,_nucleolus_{ECO:0000250|UniProtKB:Q9NZM5}._Nucleus,_nucleoplasm_{ECO:0000250|UniProtKB:Q9NZM5}. AT2G40430.1_[O22892-1]; NA NA NA NA NA NA NA NA NA NA NA NA NA
I want to extract the text from the brackets, but only where it starts with "GO:". Each "GO:" is followed by 7 digits, e.g. "GO:0018279". These are GO terms. The number of GO terms differs from row to row. The output must be a file in which the first column holds the unitranscript IDs (e.g. Evigen1000005_c0_g1_i1) and the remaining columns hold the GO terms. I want an output file like this:
output file:
Evigen1000005_c0_g1_i1 GO:0018279 GO:0005783 GO:0016021 GO:0016020 GO:0008250 GO:0004579
Evigen1000006_c0_g1_i1 GO:0000027 GO:0000055 GO:0005730 GO:0005654 GO:0019843
shell-script awk sed grep
Should [GO:0016021] and [O22892-1] be excluded from the output? In that case why? Please specify in detail how to select the output!
– sudodus
Nov 24 at 0:22
Yes, it should be excluded.
– Mehdi
Nov 24 at 7:06
If you explain why to exclude these data (what makes them 'wrong'), it should be possible to design a method to exclude them automatically. Otherwise we can only guess or leave the exclusion to manual methods.
– sudodus
Nov 24 at 10:54
The bracketed text starting with GO: is the Gene Ontology annotation for each unitranscript (e.g. Evigen1000005_c0_g1_i1). To do GO categorization with the WEGO tool, we need a file in which the first column holds the unitranscript IDs and the remaining columns hold the GO terms.
– Mehdi
Nov 24 at 11:34
No, it should not be extended. There was a mistake in the output file; I have edited the output file.
– Mehdi
Nov 24 at 17:30
Question asked Nov 23 at 22:09 by Mehdi, edited Nov 25 at 9:49
3 Answers
accepted
Suggested script that matches your final specification:
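#!/bin/bash
# Usage: ./script input
# Prints the first field of each line, then every bracketed GO term on that line.
while IFS= read -r line
do
    name=${line%% *}    # first space-separated field: the ID
    # grep -o emits each [GO:nnnnnnn] on its own line; tr joins them with spaces;
    # sed strips the brackets and the trailing space
    data=$(<<< "$line" grep -o '\[GO:[0-9]\{7\}]' | tr '\n' ' ' | sed -e 's/\[//g' -e 's/]//g' -e 's/ $//')
    echo "$name $data"
done < "$1"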
Testing:
$ ./script input
Evigen1000005_c0_g1_i1 GO:0018279 GO:0005783 GO:0016021 GO:0016020 GO:0008250 GO:0004579
Evigen1000006_c0_g1_i1 GO:0000027 GO:0000055 GO:0005730 GO:0005654 GO:0019843
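The same extraction can also be sketched as a single awk process, which avoids starting grep, tr and sed once per input line. This is an alternative under stated assumptions, not the script above: the function name extract_go and the shortened sample rows are made up for illustration, and the digit class is written out seven times so the sketch does not rely on interval expressions.

```shell
#!/bin/bash
# Hypothetical helper: print the first field, then every bracketed GO term.
extract_go() {
    awk '{
        out = $1
        rest = $0
        # match() sets RSTART/RLENGTH for the leftmost [GO:nnnnnnn] in rest
        while (match(rest, /\[GO:[0-9][0-9][0-9][0-9][0-9][0-9][0-9]\]/)) {
            # drop the surrounding brackets before appending the term
            out = out " " substr(rest, RSTART + 1, RLENGTH - 2)
            # continue searching after the term just consumed
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out
    }' "$@"
}

# Shortened, made-up rows in the same shape as the input file
sample=$(mktemp)
cat > "$sample" <<'EOF'
Evigen1000005_c0_g1_i1 0.240 NA glycosylation_[GO:0018279] reticulum_[GO:0005783];_membrane_[GO:0016020] AT1G76400.1; NA
Evigen1000006_c0_g1_i1 0.358 NA assembly_[GO:0000027] rRNA_binding_[GO:0019843] AT2G40430.1_[O22892-1]; NA
EOF

extract_go "$sample"
rm -f "$sample"
```

Because the pattern requires "GO:" plus seven digits inside the brackets, entries such as [O22892-1] are skipped without a separate clean-up step. A row with no GO terms prints only its ID.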
That is great, it works. Thanks
– Mehdi
Nov 24 at 9:50
@Mehdi, You are welcome :-)
– sudodus
Nov 24 at 11:14
How about
sed -r 's/(^[^[:space:]]* )[^[]*\[/\1/; s/\][^[]*(\[|$)/ /g' file
Evigen1000005_c0_g1_i1 GO:0018279 GO:0005783 GO:0016021 GO:0016020 GO:0008250 GO:0004579
Evigen1000006_c0_g1_i1 GO:0000027 GO:0000055 GO:0005730 GO:0005654 GO:0019843 O22892-1
Your desired output does NOT reflect the processed input sample.
EDIT: or even
sed -r 's/((^[^ ]* )|\])[^[]*(\[|$)/\2 /g' file
EDIT: with your question and desired output three times revised, try
sed -r 's/((^[^ ]* )|\])[^[]*(\[GO)/\2 GO/g; s/\].*$//' file
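To see how the first and the last command differ, here is a small sketch that runs both on one row. The row is a shortened, made-up version of the second input line, the variable names are hypothetical, and the backslashes in the patterns are my reading of the commands as they were presumably posted. `-E` is used as the more portable spelling of `-r`.

```shell
#!/bin/bash
# Made-up, shortened row: one non-GO bracket follows the GO terms
row='Evigen1000006_c0_g1_i1 0.358 assembly_[GO:0000027] binding_[GO:0019843] AT2G40430.1_[O22892-1]; NA'

# First command: keep the content of every bracket pair
all_brackets=$(printf '%s\n' "$row" |
    sed -E 's/(^[^[:space:]]* )[^[]*\[/\1/; s/\][^[]*(\[|$)/ /g')

# Last command: keep only brackets opening with "[GO",
# then cut the line at the first "]" that is left over
go_only=$(printf '%s\n' "$row" |
    sed -E 's/((^[^ ]* )|\])[^[]*(\[GO)/\2 GO/g; s/\].*$//')

printf '%s\n' "$all_brackets" "$go_only"
```

The first result still carries O22892-1, as in the sample output above; the second stops after the last GO term. Spacing differs slightly between the two (a trailing or doubled space), so squeeze spaces if the consumer is strict about separators.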
Faster in sed:
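# Needs a shell with $'...' quoting (bash, ksh, zsh); \1 and \2 are control bytes used as markers
start='\[GO:' end=']'
sed -e 's,'"$start"$',\1,g' -e 's,'"$end"$',\2,g' \
    -e $'s, [^\1]*, ,' -e $'s,\1\\([^\2]*\\)\2[^\1]*,GO:\\1 ,g' \
    infile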
or awk:
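awk -v one=$'\1' -v two=$'\3' -v start='[[]GO:' -v end='[]]' '
{
    printf("%s ", $1)                     # print the ID first
    gsub(start, one)                      # mark every "[GO:" with byte \001
    gsub(end, two)                        # mark every "]" with byte \003
    sub("^[^" one "]*" one, "GO:")        # drop everything up to the first marker
    gsub(two "[^" one "]*" one, " GO:")   # collapse the text between two terms
    sub(two ".*$", "")                    # drop everything after the last term
}
1' infile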
Accepted answer (bash script): answered Nov 23 at 23:55 by sudodus, edited Nov 24 at 18:50
Answer "How about" (sed): answered Nov 23 at 22:51 by RudiC, edited Nov 25 at 22:16
Answer "Faster in sed" (sed and awk): answered Nov 26 at 0:48 by Isaac