Extract text from brackets











I have a file like this:



input file:



Evigen1000005_c0_g1_i1  0.240   1.212   1.408   3.784   2.029   0.963   -1.22409810298695       1       NA      NA      NA      NA      PF04597.13;Ribophorin_I;4.6e-148        NA      1;21;0.875      len=569;ExpAA=32.49;First60=12.82;PredHel=1;Topology=o433-450i     Q9SFX3  OST1A_ARATH     reviewed        Dolichyl-diphosphooligosaccharide--protein_glycosyltransferase_subunit_1A_(EC_2.4.99.18)_(Ribophorin_IA)_(RPN-IA)_(Ribophorin-1A)  OST1A_RPN1A_At1g76400_F15M4.10  Arabidopsis_thaliana_(Mouse-ear_cress)  614     protein_N-linked_glycosylation_via_asparagine_[GO:0018279]      endoplasmic_reticulum_[GO:0005783];_integral_component_of_membrane_[GO:0016021];_membrane_[GO:0016020];_oligosaccharyltransferase_complex_[GO:0008250]     dolichyl-diphosphooligosaccharide-protein_glycotransferase_activity_[GO:0004579]        3702.AT1G76400.1;  PF04597;        IPR007676;      3702    ath:AT1G76400;  F15M4.10        2.4.99.18       SUBCELLULAR_LOCATION:_Endoplasmic_reticulum_membrane_{ECO:0000250};_Single-pass_type_I_membrane_protein_{ECO:0000250}.  SIGNAL_1_25_{ECO:0000255}. AT1G76400.1;    NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA 
Evigen1000006_c0_g1_i1 0.358 0.179 0.000 0.424 0.139 0.183 NA NA NA NA NA NA PF07767.10;Nop53_(60S_ribosomal_biogenesis);5.2e-21 NA 1;31;0.588 len=170;ExpAA=14.33;First60=14.27;PredHel=0;Topology=o O22892 NOP53_ARATH reviewed Ribosome_biogenesis_protein_NOP53 At2g40430_T2P4.22 Arabidopsis_thaliana_(Mouse-ear_cress) 442 ribosomal_large_subunit_assembly_[GO:0000027];_ribosomal_large_subunit_export_from_nucleus_[GO:0000055] nucleolus_[GO:0005730];_nucleoplasm_[GO:0005654] rRNA_binding_[GO:0019843] 3702.AT2G40430.2; PF07767; IPR011687; 3702 ath:AT2G40430; T2P4.22 SUBCELLULAR_LOCATION:_Nucleus,_nucleolus_{ECO:0000250|UniProtKB:Q9NZM5}._Nucleus,_nucleoplasm_{ECO:0000250|UniProtKB:Q9NZM5}. AT2G40430.1_[O22892-1]; NA NA NA NA NA NA NA NA NA NA NA NA NA


I want to extract the text from the brackets, but only from brackets whose content starts with "GO:". Each "GO:" is followed by 7 digits, e.g. "GO:0018279". These are GO terms. The number of GO terms differs from row to row. The output must be a file whose first column contains the unitranscript IDs (e.g. Evigen1000005_c0_g1_i1), followed by the GO terms of that row. I want an output file like this:



output file:



Evigen1000005_c0_g1_i1 GO:0018279 GO:0005783 GO:0016021 GO:0016020 GO:0008250 GO:0004579
Evigen1000006_c0_g1_i1 GO:0000027 GO:0000055 GO:0005730 GO:0005654 GO:0019843






shell-script awk sed grep






edited Nov 25 at 9:49

asked Nov 23 at 22:09
Mehdi












  • Should [GO:0016021] and [O22892-1] be excluded from the output? If so, why? Please specify in detail how to select the output!
    – sudodus
    Nov 24 at 0:22

  • Yes, it should be excluded.
    – Mehdi
    Nov 24 at 7:06

  • If you explain why these data should be excluded (what makes them 'wrong'), it should be possible to design a method to exclude them automatically. Otherwise we can only guess or leave the exclusion to manual methods.
    – sudodus
    Nov 24 at 10:54

  • The bracketed text starting with GO: is the Gene Ontology annotation for each unitranscript (e.g. Evigen1000005_c0_g1_i1). To do GO categorization with the WEGO tool, we need a file whose first column is the unitranscript ID, followed by the GO terms.
    – Mehdi
    Nov 24 at 11:34

  • No, it should not be extended. There was a mistake in the output file. I have edited the output file.
    – Mehdi
    Nov 24 at 17:30




























3 Answers
accepted






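Suggested script that matches your final specification.



#!/bin/bash

# Print the first field of each line followed by its bracketed GO terms.
while IFS= read -r line
do
    # echo "$line"
    name=${line%%[[:space:]]*}    # first field: everything before the first whitespace
    echo -n "$name "
    # Select every [GO:xxxxxxx], join the matches with spaces, strip the brackets.
    data=$(<<< "$line" grep -o '\[GO:.\{7\}\]' | tr '\n' ' ' | sed -e 's/\[//g' -e 's/\]//g')
    echo "$data"
done < "$1"


Testing:



$ ./script input
Evigen1000005_c0_g1_i1 GO:0018279 GO:0005783 GO:0016021 GO:0016020 GO:0008250 GO:0004579
Evigen1000006_c0_g1_i1 GO:0000027 GO:0000055 GO:0005730 GO:0005654 GO:0019843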
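
A script of this shape starts grep, tr and sed once per input line, which may get slow on large files. Below is a minimal sketch of the same selection done in a single awk pass, assuming whitespace-separated fields and GO terms of exactly seven digits as described in the question. The function name extract_go and the sample lines are made up for illustration and are not part of the answer.

```shell
#!/bin/sh
# Hypothetical helper: print the first field of each line followed by
# every bracketed GO term found on that line.
extract_go() {
    awk '{
        out = $1
        rest = $0
        # Repeatedly find the next "[GO:" + 7 digits + "]".
        while (match(rest, /\[GO:[0-9][0-9][0-9][0-9][0-9][0-9][0-9]\]/)) {
            # Skip the "[" and drop the "]", keeping "GO:nnnnnnn".
            out = out " " substr(rest, RSTART + 1, RLENGTH - 2)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out
    }' "$@"
}

# Made-up sample in the same shape as the input in the question.
printf '%s\n' \
    'id1 0.1 NA foo_[GO:0000001];_bar_[GO:0000002] baz_[X12345-1]; NA' \
    'id2 0.2 NA NA' |
    extract_go
```

The digits are spelled out as seven [0-9] classes instead of an interval expression because some awk implementations do not support {7}. A line without any GO term is printed as the bare ID.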





edited Nov 24 at 18:50

answered Nov 23 at 23:55
sudodus












  • That is great, it works. Thanks
    – Mehdi
    Nov 24 at 9:50

  • @Mehdi, You are welcome :-)
    – sudodus
    Nov 24 at 11:14


















How about



sed -r 's/(^[^[:space:]]* )[^[]*\[/\1/; s/\][^[]*(\[|$)/ /g' file
Evigen1000005_c0_g1_i1 GO:0018279 GO:0005783 GO:0016021 GO:0016020 GO:0008250 GO:0004579
Evigen1000006_c0_g1_i1 GO:0000027 GO:0000055 GO:0005730 GO:0005654 GO:0019843 O22892-1


Your desired output does NOT reflect the processed input sample.



EDIT: or even



sed -r 's/((^[^ ]* )|\])[^[]*(\[|$)/\2 /g' file


EDIT: now that your question and desired output have been revised three times, try



sed -r 's/((^[^ ]* )|\])[^[]*(\[GO)/\2 GO/g; s/\].*$//' file
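
The revisions in this answer come down to which brackets get matched: any bracketed text, or only brackets holding a GO term. Here is a small sketch of that difference using grep -o on a made-up line; the function names any_brackets and go_brackets are hypothetical and only serve the illustration.

```shell
#!/bin/sh
# Made-up line with two GO terms and one non-GO bracket.
line='id2 a_[GO:0000027];_b_[GO:0000055] c_[O22892-1]; NA'

# Any bracketed text: this also picks up [O22892-1].
any_brackets() { printf '%s\n' "$1" | grep -o '\[[^]]*\]'; }

# Only brackets holding "GO:" followed by exactly seven digits.
go_brackets() { printf '%s\n' "$1" | grep -o '\[GO:[0-9]\{7\}\]'; }

echo "any bracket:"
any_brackets "$line"
echo "GO brackets only:"
go_brackets "$line"
```

The first pattern lists three matches for this line and the second lists two, which is why a pattern anchored on GO: is needed once [O22892-1] must stay out of the output.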






edited Nov 25 at 22:16

answered Nov 23 at 22:51
RudiC






















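Faster in sed:



start='\[GO:' end='\]'
# Mark each "[GO:" with the control character \1 and each "]" with \2,
# drop everything between the first field and the first mark,
# then keep only the text between each pair of marks.
sed -e 's,'"$start"$',\1,g' -e 's,'"$end"$',\2,g' \
    -e $'s, [^\1]*, ,' -e $'s,\1\\([^\2]*\\)\2[^\1]*,GO:\\1 ,g' \
    infile


or awk:



# -v processes escape sequences, so the backslashes are doubled.
awk -vone=$'\1' -vtwo=$'\3' -vstart='\\[GO:' -v end='\\]' '
{
    printf("%s ",$1);
    gsub(start,one);
    gsub(end,two);
    sub("^[^"one"]*"one,"GO:")
    gsub(two"[^"one"]*"one," GO:")
    sub(two".*$" ,"")
}
1' infile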
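
Whichever command is used, the result can be checked against the format described in the question: an ID in the first column, then only GO terms. This is a minimal sketch of such a check, assuming whitespace-separated output and seven-digit GO terms. The function name check_go_table is hypothetical, and the check says nothing about what the WEGO tool itself accepts.

```shell
#!/bin/sh
# Hypothetical checker: succeeds only if every field after the first
# has the form GO:nnnnnnn. Offending fields are reported per line.
check_go_table() {
    awk '
    {
        for (i = 2; i <= NF; i++)
            if ($i !~ /^GO:[0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/) {
                printf "line %d: unexpected field %s\n", NR, $i
                bad = 1
            }
    }
    END { exit (bad ? 1 : 0) }' "$@"
}

# Made-up output: the second line carries a stray non-GO field.
if printf '%s\n' 'id1 GO:0018279 GO:0005783' 'id2 GO:0019843 O22892-1' |
    check_go_table
then
    echo "format looks fine"
else
    echo "format problems found"
fi
```

Run on a real result file as check_go_table output_file, the exit status alone tells whether a stray bracket such as O22892-1 slipped through.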






answered Nov 26 at 0:48
Isaac





























