translate nucleic sequence in it's corresponding amino acid sequence

up vote
0
down vote

favorite

Goal of the program

The goal of the program is to translate a nucleic acid sequence into it's corresponding amino acid sequence. The nucleic sequences
have to be formatted in a specific format called fasta. There is an existing implementation of this program in C here: emboss transeq

Fasta format

A fasta file looks like this:

>sequenceId comment

nucleic sequence

for example:

>Seq1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAA

CCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC



>Seq2 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

GGTAGGTACCGCCCTAAGNCTCCTAATCCGAGCAGAACTANGCCAACCCGGAGCCCTTCTGGGAGACGAC

AATCAACATAAAA

A nucleic sequence is a string composed of the letters A, C, T, G, U, N

Expected output

A combinaison of 3 nucleic acid, named codon, gives a specif amino acid, for exemple GCT is the code for Alanine, symbolised by the letter A

With the Seq1 define above:

codon      CCT TTA TCT AAT CTT TGG AGC ATG ...

amino acid  P   L   S   N   L   W   S   M  ...

The expected output is:

>Seq1_1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

PLSNLWSMSWHSWNRPQPPHPCRTQSCTNTSSDSSAIQKSISSFY

>Seq2_1 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

GRYRPKXPNPSRTXPTRSPSGRRQST*X

Specific rules

The program takes 4 parameters:

clean : if true, write STOP codon as X instead of *

trim : if true, remove all X and * chars from the right end of the amino - acid sequence

alternative: different way to compute reverse frame

frame: a string value in ["1", "2", "3", "F", "-1", "-2", "-3", "R", "6" ]

The frame define the position in the nucleic sequence to start from:

-frame 1

|-frame 2

||-frame 3

|||

CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAA

CCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC

                                                              |||

                                                              ||-frame -1

                                                              |-frame -2

                                                              -frame -3





frame "F" = "1", "2", "3"

frame "R" = "-1", "-2", "-3"

frame "6" = "1", "2", "3", "-1", "-2", "-3"

In the output file, we add the frame used to the sequenceId: sequenceId = sequenceId_frame

For example if the program is used with frame=6, the expected output is

>Seq1_1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

PLSNLWSMSWHSWNRPQPPHPCRTQSCTNTSSDSSAIQKSISSFY

>Seq1_2 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

LYLIFGA*AGIVGTALSLLIRAEPSPVPTPLLILRPSRSLYPHFT

>Seq1_3 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

FI*SLEHELA*LEPPSASSSVQNPVLYQHLF*FFGHPEVYILILX

>Seq1_4 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

VK*GYRLLDGRRIRRGVGTGLGSARMRRLRAVPTMPAHAPKIR*R

>Seq1_5 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

KMRI*TSGWPKNQKRCWYRTGFCTDEEAEGGSNYASSCSKD*IKX

>Seq1_6 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

*NEDIDFWMAEESEEVLVQDWVLHG*GG*GRFQLCQLMLQRLDKG

>Seq2_1 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

GRYRPKXPNPSRTXPTRSPSGRRQST*X

>Seq2_2 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

VGTALXLLIRAELXQPGALLGDDNQHKX

>Seq2_3 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

*VPP*XS*SEQNXANPEPFWETTINIK

>Seq2_4 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

LC*LSSPRRAPGWXSSARIRXLRAVPT

>Seq2_5 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

FMLIVVSQKGSGLA*FCSD*EX*GGTYX

>Seq2_6 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

FYVDCRLPEGLRVGXVLLGLGXLGRYLP

Improvements

Is the code easy to read / understand ?

Is the code idiomatic ?

Any other improvements (performance, doc...)

( full project with tests + flags handling is avaible on github: gotranseq)

code :

package transeq



import (

    "bufio"

    "bytes"

    "context"

    "encoding/binary"

    "fmt"

    "io"

    "runtime"

    "sync"

)



var (

    letterCode = map[byte]uint8{

        'A': aCode,

        'C': cCode,

        'T': tCode,

        'G': gCode,

        'N': nCode,

        'U': uCode,

    }

    standard = map[string]byte{

        "TTT": 'F',

        "TCT": 'S',

        "TAT": 'Y',

        "TGT": 'C',

        "TTC": 'F',

        "TCC": 'S',

        "TAC": 'Y',

        "TGC": 'C',

        "TTA": 'L',

        "TCA": 'S',

        "TAA": '*',

        "TGA": '*',

        "TTG": 'L',

        "TCG": 'S',

        "TAG": '*',

        "TGG": 'W',

        "CTT": 'L',

        "CCT": 'P',

        "CAT": 'H',

        "CGT": 'R',

        "CTC": 'L',

        "CCC": 'P',

        "CAC": 'H',

        "CGC": 'R',

        "CTA": 'L',

        "CCA": 'P',

        "CAA": 'Q',

        "CGA": 'R',

        "CTG": 'L',

        "CCG": 'P',

        "CAG": 'Q',

        "CGG": 'R',

        "ATT": 'I',

        "ACT": 'T',

        "AAT": 'N',

        "AGT": 'S',

        "ATC": 'I',

        "ACC": 'T',

        "AAC": 'N',

        "AGC": 'S',

        "ATA": 'I',

        "ACA": 'T',

        "AAA": 'K',

        "AGA": 'R',

        "ATG": 'M',

        "ACG": 'T',

        "AAG": 'K',

        "AGG": 'R',

        "GTT": 'V',

        "GCT": 'A',

        "GAT": 'D',

        "GGT": 'G',

        "GTC": 'V',

        "GCC": 'A',

        "GAC": 'D',

        "GGC": 'G',

        "GTA": 'V',

        "GCA": 'A',

        "GAA": 'E',

        "GGA": 'G',

        "GTG": 'V',

        "GCG": 'A',

        "GAG": 'E',

        "GGG": 'G',

    }

)



const (

    nCode = uint8(0)

    aCode = uint8(1)

    cCode = uint8(2)

    tCode = uint8(3)

    uCode = uint8(3)

    gCode = uint8(4)



    stopByte = '*'

    unknown  = 'X'

    // Length of the array to store code/bytes

    // uses gCode because it's the biggest uint8 of all codes

    arrayCodeSize = (uint32(gCode) | uint32(gCode)<<8 | uint32(gCode)<<16) + 1

)



func createCodeArray(clean bool) byte {



    resultMap := map[uint32]byte{}

    twoLetterMap := map[string]byte{}



    tmpCode := make(uint8, 4)



    for codon, aaCode := range standard {

        // generate 3 letter code

        for i := 0; i < 3; i++ {

            tmpCode[i] = letterCode[codon[i]]

        }

        // each codon is represented by an unique uint32:

        // each possible nucleotide is represented by an uint8 (255 possibility)

        // the three first bytes are the the code for each nucleotide

        // last byte is unused ( eq to uint8(0) )

        // example:

        // codon 'ACG' ==> uint32(aCode) | uint32(cCode)<<8 | uint32(gCode)<<16

        uint32Code := uint32(tmpCode[0]) | uint32(tmpCode[1])<<8 | uint32(tmpCode[2])<<16

        resultMap[uint32Code] = aaCode



        // generate 2 letter code

        codes, ok := twoLetterMap[codon[0:2]]

        if !ok {

            twoLetterMap[codon[0:2]] = byte{aaCode}

        } else {

            twoLetterMap[codon[0:2]] = append(codes, aaCode)

        }

    }

    for twoLetterCodon, codes := range twoLetterMap {

        uniqueAA := true

        for i := 0; i < len(codes); i++ {



            if codes[i] != codes[0] {

                uniqueAA = false

            }

        }

        if uniqueAA {

            first := letterCode[twoLetterCodon[0]]

            second := letterCode[twoLetterCodon[1]]



            uint32Code := uint32(first) | uint32(second)<<8

            resultMap[uint32Code] = codes[0]

        }

    }

    // if clean is specified, we want to replace all '*' by 'X' in the output

    // sequence, so replace all occurrences of '*' directly in the ref map

    if clean {

        for k, v := range resultMap {

            if v == stopByte {

                resultMap[k] = unknown

            }

        }

    }

    r := make(byte, arrayCodeSize)

    for k, v := range resultMap {

        r[k] = v

    }

    return r

}



func computeFrames(frameName string) (frames int, reverse bool, err error) {



    frames = make(int, 6)

    reverse = false



    switch frameName {

    case "1":

        frames[0] = 1

    case "2":

        frames[1] = 1

    case "3":

        frames[2] = 1

    case "F":

        for i := 0; i < 3; i++ {

            frames[i] = 1

        }

    case "-1":

        frames[3] = 1

        reverse = true

    case "-2":

        frames[4] = 1

        reverse = true

    case "-3":

        frames[5] = 1

        reverse = true

    case "R":

        for i := 3; i < 6; i++ {

            frames[i] = 1

        }

        reverse = true

    case "6":

        for i := range frames {

            frames[i] = 1

        }

        reverse = true

    default:

        err = fmt.Errorf("wrong value for -f | --frame parameter: %s", frameName)

    }

    return frames, reverse, err

}



type writer struct {

    buf            *bytes.Buffer

    currentLineLen int

    bytesToTrim    int

}



func (w *writer) addByte(b byte) {

    w.buf.WriteByte(b)

    w.currentLineLen++

    if b == stopByte || b == unknown {

        w.bytesToTrim++

    } else {

        w.bytesToTrim = 0

    }

}



func (w *writer) addUnknown() {

    w.buf.WriteByte(unknown)

    w.currentLineLen++

    w.bytesToTrim++

}



func (w *writer) newLine() {

    w.buf.WriteByte('n')

    w.currentLineLen = 0

    w.bytesToTrim++

}



const (

    // size of the buffer for writing to file

    maxBufferSize = 1024 * 1024 * 30

    // max line size for sequence

    maxLineSize = 60

    // suffixes ta add to sequence id for each frame

    suffixes = "123456"

)



// Translate read a fata file, translate each sequence to the corresponding prot sequence in the specified frame

func Translate(inputSequence io.Reader, out io.Writer, frame string, clean, trim, alternative bool) error {



    arrayCode := createCodeArray(clean)

    framesToGenerate, reverse, err := computeFrames(frame)

    if err != nil {

        return err

    }



    fnaSequences := make(chan encodedSequence, 10)

    errs := make(chan error, 1)



    var wg sync.WaitGroup

    ctx, cancel := context.WithCancel(context.Background())

    defer cancel()



    for nWorker := 0; nWorker < runtime.NumCPU(); nWorker++ {



        wg.Add(1)



        go func() {



            defer wg.Done()



            startPosition := make(int, 3)

            w := &writer{

                buf:            bytes.NewBuffer(nil),

                bytesToTrim:    0,

                currentLineLen: 0,

            }



            for sequence := range fnaSequences {



                select {

                case <-ctx.Done():

                    return

                default:

                }



                frameIndex := 0

                startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2



                idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))

                nuclSeqLength := len(sequence) - idSize



            Translate:

                for _, startPos := range startPosition {



                    if framesToGenerate[frameIndex] == 0 {

                        frameIndex++

                        continue

                    }



                    // sequence id should look like

                    // >sequenceID_<frame> comment

                    idEnd := bytes.IndexByte(sequence[4:idSize], ' ')

                    if idEnd != -1 {

                        w.buf.Write(sequence[4 : 4+idEnd])

                        w.buf.WriteByte('_')

                        w.buf.WriteByte(suffixes[frameIndex])

                        w.buf.Write(sequence[4+idEnd : idSize])

                    } else {

                        w.buf.Write(sequence[4:idSize])

                        w.buf.WriteByte('_')

                        w.buf.WriteByte(suffixes[frameIndex])

                    }

                    w.newLine()



                    // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'

                    // from right end of the sequence)

                    w.bytesToTrim = 0

                    w.currentLineLen = 0



                    // read the sequence 3 letters at a time, starting at a specific position

                    // corresponding to the frame

                    for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {



                        if w.currentLineLen == maxLineSize {

                            w.newLine()

                        }

                        // create an uint32 from the codon, to retrieve the corresponding

                        // AA from the map

                        codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16



                        b := arrayCode[codonCode]

                        if b != byte(0) {

                            w.addByte(b)

                        } else {

                            w.addUnknown()

                        }

                    }



                    // the last codon is only 2 nucleotid long, try to guess

                    // the corresponding AA

                    if (nuclSeqLength-startPos)%3 == 2 {



                        if w.currentLineLen == maxLineSize {

                            w.newLine()

                        }

                        codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8



                        b := arrayCode[codonCode]

                        if b != byte(0) {

                            w.addByte(b)

                        } else {

                            w.addUnknown()

                        }

                    }



                    // the last codon is only 1 nucleotid long, no way to guess

                    // the corresponding AA

                    if (nuclSeqLength-startPos)%3 == 1 {

                        if w.currentLineLen == maxLineSize {

                            w.newLine()

                        }

                        w.addUnknown()

                    }



                    if trim && w.bytesToTrim > 0 {

                        // remove the last bytesToTrim bytes of the buffer

                        // as they are 'X', '*' or 'n'

                        w.buf.Truncate(w.buf.Len() - w.bytesToTrim)

                        w.currentLineLen -= w.bytesToTrim

                    }



                    if w.currentLineLen != 0 {

                        w.newLine()

                    }

                    frameIndex++

                }



                if reverse && frameIndex < 6 {



                    // get the complementary sequence.

                    // Basically, switch

                    //   A <-> T

                    //   C <-> G

                    // N is not modified

                    for i, n := range sequence[idSize:] {



                        switch n {

                        case aCode:

                            sequence[i+idSize] = tCode

                        case tCode:

                            // handle both tCode and uCode

                            sequence[i+idSize] = aCode

                        case cCode:

                            sequence[i+idSize] = gCode

                        case gCode:

                            sequence[i+idSize] = cCode

                        default:

                            //case N -> leave it

                        }

                    }

                    // reverse the sequence

                    for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {

                        sequence[i], sequence[j] = sequence[j], sequence[i]

                    }



                    if !alternative {

                        // Staden convention: Frame -1 is the reverse-complement of the sequence

                        // having the same codon phase as frame 1. Frame -2 is the same phase as

                        // frame 2. Frame -3 is the same phase as frame 3

                        //

                        // use the matrix to keep track of the forward frame as it depends on the

                        // length of the sequence

                        switch nuclSeqLength % 3 {

                        case 0:

                            startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1

                        case 1:

                            startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2

                        case 2:

                            startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0

                        }

                    }

                    // run the same loop, but with the reverse-complemented sequence

                    goto Translate

                }



                if w.buf.Len() > maxBufferSize {

                    _, err := out.Write(w.buf.Bytes())

                    if err != nil {

                        select {

                        case errs <- fmt.Errorf("fail to write to output file: %v", err):

                        default:

                        }

                        cancel()

                        return

                    }

                    w.buf.Reset()

                }

                pool.Put(sequence)

            }



            if w.buf.Len() > 0 {

                _, err := out.Write(w.buf.Bytes())

                if err != nil {

                    select {

                    case errs <- fmt.Errorf("fail to write to output file: %v", err):

                    default:

                    }

                    cancel()

                    return

                }

            }

        }()

    }

    readSequenceFromFasta(ctx, inputSequence, fnaSequences)



    wg.Wait()

    select {

    case err, ok := <-errs:

        if ok {

            return err

        }

    default:

    }

    return nil

}



func readSequenceFromFasta(ctx context.Context, inputSequence io.Reader, fnaSequences chan encodedSequence) {



    feeder := &fastaChannelFeeder{

        idBuffer:       bytes.NewBuffer(nil),

        commentBuffer:  bytes.NewBuffer(nil),

        sequenceBuffer: bytes.NewBuffer(nil),

        fastaChan:      fnaSequences,

    }

    // fasta format is:

    //

    // >sequenceID some comments on sequence

    // ACAGGCAGAGACACGACAGACGACGACACAGGAGCAGACAGCAGCAGACGACCACATATT

    // TTTGCGGTCACATGACGACTTCGGCAGCGA

    //

    // see https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp

    // section 1 for details

    scanner := bufio.NewScanner(inputSequence)

Loop:

    for scanner.Scan() {



        line := scanner.Bytes()

        if len(line) == 0 {

            continue

        }

        if line[0] == '>' {



            if feeder.idBuffer.Len() > 0 {

                select {

                case <-ctx.Done():

                    break Loop

                default:

                }

                feeder.sendFasta()

            }

            feeder.reset()



            // parse the ID of the sequence. ID is formatted like this:

            // >sequenceID comments

            seqID := bytes.SplitN(line, byte{' '}, 2)

            feeder.idBuffer.Write(seqID[0])



            if len(seqID) > 1 {

                feeder.commentBuffer.WriteByte(' ')

                feeder.commentBuffer.Write(seqID[1])

            }

        } else {

            // if the line doesn't start with '>', then it's a part of the

            // nucleotide sequence, so write it to the buffer

            feeder.sequenceBuffer.Write(line)

        }

    }

    // don't forget to push last sequence

    select {

    case <-ctx.Done():

    default:

        feeder.sendFasta()

    }

    close(fnaSequences)

}



// a type to hold an encoded fasta sequence

//

//  s[0:4] stores the size of the sequence id + the size of the comment as an uint32 (little endian)

//  s[4:idSize] stores the sequence id, and the comment id there is one

//  s[idSize:] stores the nucl sequence

type encodedSequence byte



var pool = sync.Pool{

    New: func() interface{} {

        return make(encodedSequence, 512)

    },

}



func getSizedSlice(idSize, requiredSize int) encodedSequence {

    s := pool.Get().(encodedSequence)

    binary.LittleEndian.PutUint32(s[0:4], uint32(idSize))



    for len(s) < requiredSize {

        s = append(s, byte(0))

    }

    return s[0:requiredSize]

}



func (f *fastaChannelFeeder) sendFasta() {



    idSize := 4 + f.idBuffer.Len() + f.commentBuffer.Len()

    requiredSize := idSize + f.sequenceBuffer.Len()



    s := getSizedSlice(idSize, requiredSize)



    if f.commentBuffer.Len() > 0 {

        copy(s[idSize-f.commentBuffer.Len():idSize], f.commentBuffer.Bytes())

    }



    copy(s[4:4+f.idBuffer.Len()], f.idBuffer.Bytes())



    // convert the sequence of bytes to an array of uint8 codes,

    // so a codon (3 nucleotides | 3 bytes ) can be represented

    // as an uint32

    for i, b := range f.sequenceBuffer.Bytes() {



        switch b {

        case 'A':

            s[i+idSize] = aCode

        case 'C':

            s[i+idSize] = cCode

        case 'G':

            s[i+idSize] = gCode

        case 'T', 'U':

            s[i+idSize] = tCode

        case 'N':

            s[i+idSize] = nCode

        default:

            fmt.Printf("WARNING: invalid char in sequence %s: %s, ignoring", s[4:4+idSize], string(b))

        }

    }

    f.fastaChan <- s

}



type fastaChannelFeeder struct {

    idBuffer, commentBuffer, sequenceBuffer *bytes.Buffer

    fastaChan                               chan encodedSequence

}



func (f *fastaChannelFeeder) reset() {

    f.idBuffer.Reset()

    f.sequenceBuffer.Reset()

    f.commentBuffer.Reset()

}

asked 9 mins ago

felix

69839

add a comment |

up vote
0
down vote

favorite

Goal of the program

Fasta format

A fasta file looks like this:

>sequenceId comment

nucleic sequence

for example:

>Seq1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAA

CCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC



>Seq2 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

GGTAGGTACCGCCCTAAGNCTCCTAATCCGAGCAGAACTANGCCAACCCGGAGCCCTTCTGGGAGACGAC

AATCAACATAAAA

A nucleic sequence is a string composed of the letters A, C, T, G, U, N

Expected output

A combinaison of 3 nucleic acid, named codon, gives a specif amino acid, for exemple GCT is the code for Alanine, symbolised by the letter A

With the Seq1 define above:

codon      CCT TTA TCT AAT CTT TGG AGC ATG ...

amino acid  P   L   S   N   L   W   S   M  ...

The expected output is:

>Seq1_1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

PLSNLWSMSWHSWNRPQPPHPCRTQSCTNTSSDSSAIQKSISSFY

>Seq2_1 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

GRYRPKXPNPSRTXPTRSPSGRRQST*X

Specific rules

The program takes 4 parameters:

clean : if true, write STOP codon as X instead of *

trim : if true, remove all X and * chars from the right end of the amino - acid sequence

alternative: different way to compute reverse frame

frame: a string value in ["1", "2", "3", "F", "-1", "-2", "-3", "R", "6" ]

The frame define the position in the nucleic sequence to start from:

-frame 1

|-frame 2

||-frame 3

|||

CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAA

CCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC

                                                              |||

                                                              ||-frame -1

                                                              |-frame -2

                                                              -frame -3





frame "F" = "1", "2", "3"

frame "R" = "-1", "-2", "-3"

frame "6" = "1", "2", "3", "-1", "-2", "-3"

In the output file, we add the frame used to the sequenceId: sequenceId = sequenceId_frame

For example if the program is used with frame=6, the expected output is

>Seq1_1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

PLSNLWSMSWHSWNRPQPPHPCRTQSCTNTSSDSSAIQKSISSFY

>Seq1_2 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

LYLIFGA*AGIVGTALSLLIRAEPSPVPTPLLILRPSRSLYPHFT

>Seq1_3 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

FI*SLEHELA*LEPPSASSSVQNPVLYQHLF*FFGHPEVYILILX

>Seq1_4 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

VK*GYRLLDGRRIRRGVGTGLGSARMRRLRAVPTMPAHAPKIR*R

>Seq1_5 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

KMRI*TSGWPKNQKRCWYRTGFCTDEEAEGGSNYASSCSKD*IKX

>Seq1_6 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

*NEDIDFWMAEESEEVLVQDWVLHG*GG*GRFQLCQLMLQRLDKG

>Seq2_1 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

GRYRPKXPNPSRTXPTRSPSGRRQST*X

>Seq2_2 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

VGTALXLLIRAELXQPGALLGDDNQHKX

>Seq2_3 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

*VPP*XS*SEQNXANPEPFWETTINIK

>Seq2_4 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

LC*LSSPRRAPGWXSSARIRXLRAVPT

>Seq2_5 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

FMLIVVSQKGSGLA*FCSD*EX*GGTYX

>Seq2_6 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

FYVDCRLPEGLRVGXVLLGLGXLGRYLP

Improvements

Is the code easy to read / understand ?

Is the code idiomatic ?

Any other improvements (performance, doc...)

( full project with tests + flags handling is avaible on github: gotranseq)

code :

package transeq



import (

    "bufio"

    "bytes"

    "context"

    "encoding/binary"

    "fmt"

    "io"

    "runtime"

    "sync"

)



var (

    letterCode = map[byte]uint8{

        'A': aCode,

        'C': cCode,

        'T': tCode,

        'G': gCode,

        'N': nCode,

        'U': uCode,

    }

    standard = map[string]byte{

        "TTT": 'F',

        "TCT": 'S',

        "TAT": 'Y',

        "TGT": 'C',

        "TTC": 'F',

        "TCC": 'S',

        "TAC": 'Y',

        "TGC": 'C',

        "TTA": 'L',

        "TCA": 'S',

        "TAA": '*',

        "TGA": '*',

        "TTG": 'L',

        "TCG": 'S',

        "TAG": '*',

        "TGG": 'W',

        "CTT": 'L',

        "CCT": 'P',

        "CAT": 'H',

        "CGT": 'R',

        "CTC": 'L',

        "CCC": 'P',

        "CAC": 'H',

        "CGC": 'R',

        "CTA": 'L',

        "CCA": 'P',

        "CAA": 'Q',

        "CGA": 'R',

        "CTG": 'L',

        "CCG": 'P',

        "CAG": 'Q',

        "CGG": 'R',

        "ATT": 'I',

        "ACT": 'T',

        "AAT": 'N',

        "AGT": 'S',

        "ATC": 'I',

        "ACC": 'T',

        "AAC": 'N',

        "AGC": 'S',

        "ATA": 'I',

        "ACA": 'T',

        "AAA": 'K',

        "AGA": 'R',

        "ATG": 'M',

        "ACG": 'T',

        "AAG": 'K',

        "AGG": 'R',

        "GTT": 'V',

        "GCT": 'A',

        "GAT": 'D',

        "GGT": 'G',

        "GTC": 'V',

        "GCC": 'A',

        "GAC": 'D',

        "GGC": 'G',

        "GTA": 'V',

        "GCA": 'A',

        "GAA": 'E',

        "GGA": 'G',

        "GTG": 'V',

        "GCG": 'A',

        "GAG": 'E',

        "GGG": 'G',

    }

)



const (

    nCode = uint8(0)

    aCode = uint8(1)

    cCode = uint8(2)

    tCode = uint8(3)

    uCode = uint8(3)

    gCode = uint8(4)



    stopByte = '*'

    unknown  = 'X'

    // Length of the array to store code/bytes

    // uses gCode because it's the biggest uint8 of all codes

    arrayCodeSize = (uint32(gCode) | uint32(gCode)<<8 | uint32(gCode)<<16) + 1

)



func createCodeArray(clean bool) byte {



    resultMap := map[uint32]byte{}

    twoLetterMap := map[string]byte{}



    tmpCode := make(uint8, 4)



    for codon, aaCode := range standard {

        // generate 3 letter code

        for i := 0; i < 3; i++ {

            tmpCode[i] = letterCode[codon[i]]

        }

        // each codon is represented by an unique uint32:

        // each possible nucleotide is represented by an uint8 (255 possibility)

        // the three first bytes are the the code for each nucleotide

        // last byte is unused ( eq to uint8(0) )

        // example:

        // codon 'ACG' ==> uint32(aCode) | uint32(cCode)<<8 | uint32(gCode)<<16

        uint32Code := uint32(tmpCode[0]) | uint32(tmpCode[1])<<8 | uint32(tmpCode[2])<<16

        resultMap[uint32Code] = aaCode



        // generate 2 letter code

        codes, ok := twoLetterMap[codon[0:2]]

        if !ok {

            twoLetterMap[codon[0:2]] = byte{aaCode}

        } else {

            twoLetterMap[codon[0:2]] = append(codes, aaCode)

        }

    }

    for twoLetterCodon, codes := range twoLetterMap {

        uniqueAA := true

        for i := 0; i < len(codes); i++ {



            if codes[i] != codes[0] {

                uniqueAA = false

            }

        }

        if uniqueAA {

            first := letterCode[twoLetterCodon[0]]

            second := letterCode[twoLetterCodon[1]]



            uint32Code := uint32(first) | uint32(second)<<8

            resultMap[uint32Code] = codes[0]

        }

    }

    // if clean is specified, we want to replace all '*' by 'X' in the output

    // sequence, so replace all occurrences of '*' directly in the ref map

    if clean {

        for k, v := range resultMap {

            if v == stopByte {

                resultMap[k] = unknown

            }

        }

    }

    r := make(byte, arrayCodeSize)

    for k, v := range resultMap {

        r[k] = v

    }

    return r

}



func computeFrames(frameName string) (frames int, reverse bool, err error) {



    frames = make(int, 6)

    reverse = false



    switch frameName {

    case "1":

        frames[0] = 1

    case "2":

        frames[1] = 1

    case "3":

        frames[2] = 1

    case "F":

        for i := 0; i < 3; i++ {

            frames[i] = 1

        }

    case "-1":

        frames[3] = 1

        reverse = true

    case "-2":

        frames[4] = 1

        reverse = true

    case "-3":

        frames[5] = 1

        reverse = true

    case "R":

        for i := 3; i < 6; i++ {

            frames[i] = 1

        }

        reverse = true

    case "6":

        for i := range frames {

            frames[i] = 1

        }

        reverse = true

    default:

        err = fmt.Errorf("wrong value for -f | --frame parameter: %s", frameName)

    }

    return frames, reverse, err

}



type writer struct {

    buf            *bytes.Buffer

    currentLineLen int

    bytesToTrim    int

}



func (w *writer) addByte(b byte) {

    w.buf.WriteByte(b)

    w.currentLineLen++

    if b == stopByte || b == unknown {

        w.bytesToTrim++

    } else {

        w.bytesToTrim = 0

    }

}



func (w *writer) addUnknown() {

    w.buf.WriteByte(unknown)

    w.currentLineLen++

    w.bytesToTrim++

}



func (w *writer) newLine() {

    w.buf.WriteByte('n')

    w.currentLineLen = 0

    w.bytesToTrim++

}



const (

    // size of the buffer for writing to file

    maxBufferSize = 1024 * 1024 * 30

    // max line size for sequence

    maxLineSize = 60

    // suffixes ta add to sequence id for each frame

    suffixes = "123456"

)



// Translate read a fata file, translate each sequence to the corresponding prot sequence in the specified frame

func Translate(inputSequence io.Reader, out io.Writer, frame string, clean, trim, alternative bool) error {



    arrayCode := createCodeArray(clean)

    framesToGenerate, reverse, err := computeFrames(frame)

    if err != nil {

        return err

    }



    fnaSequences := make(chan encodedSequence, 10)

    errs := make(chan error, 1)



    var wg sync.WaitGroup

    ctx, cancel := context.WithCancel(context.Background())

    defer cancel()



    for nWorker := 0; nWorker < runtime.NumCPU(); nWorker++ {



        wg.Add(1)



        go func() {



            defer wg.Done()



            startPosition := make(int, 3)

            w := &writer{

                buf:            bytes.NewBuffer(nil),

                bytesToTrim:    0,

                currentLineLen: 0,

            }



            for sequence := range fnaSequences {



                select {

                case <-ctx.Done():

                    return

                default:

                }



                frameIndex := 0

                startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2



                idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))

                nuclSeqLength := len(sequence) - idSize



            Translate:

                for _, startPos := range startPosition {



                    if framesToGenerate[frameIndex] == 0 {

                        frameIndex++

                        continue

                    }



                    // sequence id should look like

                    // >sequenceID_<frame> comment

                    idEnd := bytes.IndexByte(sequence[4:idSize], ' ')

                    if idEnd != -1 {

                        w.buf.Write(sequence[4 : 4+idEnd])

                        w.buf.WriteByte('_')

                        w.buf.WriteByte(suffixes[frameIndex])

                        w.buf.Write(sequence[4+idEnd : idSize])

                    } else {

                        w.buf.Write(sequence[4:idSize])

                        w.buf.WriteByte('_')

                        w.buf.WriteByte(suffixes[frameIndex])

                    }

                    w.newLine()



                    // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'

                    // from right end of the sequence)

                    w.bytesToTrim = 0

                    w.currentLineLen = 0



                    // read the sequence 3 letters at a time, starting at a specific position

                    // corresponding to the frame

                    for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {



                        if w.currentLineLen == maxLineSize {

                            w.newLine()

                        }

                        // create an uint32 from the codon, to retrieve the corresponding

                        // AA from the map

                        codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16



                        b := arrayCode[codonCode]

                        if b != byte(0) {

                            w.addByte(b)

                        } else {

                            w.addUnknown()

                        }

                    }



                    // the last codon is only 2 nucleotid long, try to guess

                    // the corresponding AA

                    if (nuclSeqLength-startPos)%3 == 2 {



                        if w.currentLineLen == maxLineSize {

                            w.newLine()

                        }

                        codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8



                        b := arrayCode[codonCode]

                        if b != byte(0) {

                            w.addByte(b)

                        } else {

                            w.addUnknown()

                        }

                    }



                    // the last codon is only 1 nucleotid long, no way to guess

                    // the corresponding AA

                    if (nuclSeqLength-startPos)%3 == 1 {

                        if w.currentLineLen == maxLineSize {

                            w.newLine()

                        }

                        w.addUnknown()

                    }



                    if trim && w.bytesToTrim > 0 {

                        // remove the last bytesToTrim bytes of the buffer

                        // as they are 'X', '*' or 'n'

                        w.buf.Truncate(w.buf.Len() - w.bytesToTrim)

                        w.currentLineLen -= w.bytesToTrim

                    }



                    if w.currentLineLen != 0 {

                        w.newLine()

                    }

                    frameIndex++

                }



                if reverse && frameIndex < 6 {



                    // get the complementary sequence.

                    // Basically, switch

                    //   A <-> T

                    //   C <-> G

                    // N is not modified

                    for i, n := range sequence[idSize:] {



                        switch n {

                        case aCode:

                            sequence[i+idSize] = tCode

                        case tCode:

                            // handle both tCode and uCode

                            sequence[i+idSize] = aCode

                        case cCode:

                            sequence[i+idSize] = gCode

                        case gCode:

                            sequence[i+idSize] = cCode

                        default:

                            //case N -> leave it

                        }

                    }

                    // reverse the sequence

                    for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {

                        sequence[i], sequence[j] = sequence[j], sequence[i]

                    }



                    if !alternative {

                        // Staden convention: Frame -1 is the reverse-complement of the sequence

                        // having the same codon phase as frame 1. Frame -2 is the same phase as

                        // frame 2. Frame -3 is the same phase as frame 3

                        //

                        // use the matrix to keep track of the forward frame as it depends on the

                        // length of the sequence

                        switch nuclSeqLength % 3 {

                        case 0:

                            startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1

                        case 1:

                            startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2

                        case 2:

                            startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0

                        }

                    }

                    // run the same loop, but with the reverse-complemented sequence

                    goto Translate

                }



                if w.buf.Len() > maxBufferSize {

                    _, err := out.Write(w.buf.Bytes())

                    if err != nil {

                        select {

                        case errs <- fmt.Errorf("fail to write to output file: %v", err):

                        default:

                        }

                        cancel()

                        return

                    }

                    w.buf.Reset()

                }

                pool.Put(sequence)

            }



            if w.buf.Len() > 0 {

                _, err := out.Write(w.buf.Bytes())

                if err != nil {

                    select {

                    case errs <- fmt.Errorf("fail to write to output file: %v", err):

                    default:

                    }

                    cancel()

                    return

                }

            }

        }()

    }

    readSequenceFromFasta(ctx, inputSequence, fnaSequences)



    wg.Wait()

    select {

    case err, ok := <-errs:

        if ok {

            return err

        }

    default:

    }

    return nil

}



func readSequenceFromFasta(ctx context.Context, inputSequence io.Reader, fnaSequences chan encodedSequence) {



    feeder := &fastaChannelFeeder{

        idBuffer:       bytes.NewBuffer(nil),

        commentBuffer:  bytes.NewBuffer(nil),

        sequenceBuffer: bytes.NewBuffer(nil),

        fastaChan:      fnaSequences,

    }

    // fasta format is:

    //

    // >sequenceID some comments on sequence

    // ACAGGCAGAGACACGACAGACGACGACACAGGAGCAGACAGCAGCAGACGACCACATATT

    // TTTGCGGTCACATGACGACTTCGGCAGCGA

    //

    // see https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp

    // section 1 for details

    scanner := bufio.NewScanner(inputSequence)

Loop:

    for scanner.Scan() {



        line := scanner.Bytes()

        if len(line) == 0 {

            continue

        }

        if line[0] == '>' {



            if feeder.idBuffer.Len() > 0 {

                select {

                case <-ctx.Done():

                    break Loop

                default:

                }

                feeder.sendFasta()

            }

            feeder.reset()



            // parse the ID of the sequence. ID is formatted like this:

            // >sequenceID comments

            seqID := bytes.SplitN(line, byte{' '}, 2)

            feeder.idBuffer.Write(seqID[0])



            if len(seqID) > 1 {

                feeder.commentBuffer.WriteByte(' ')

                feeder.commentBuffer.Write(seqID[1])

            }

        } else {

            // if the line doesn't start with '>', then it's a part of the

            // nucleotide sequence, so write it to the buffer

            feeder.sequenceBuffer.Write(line)

        }

    }

    // don't forget to push last sequence

    select {

    case <-ctx.Done():

    default:

        feeder.sendFasta()

    }

    close(fnaSequences)

}



// a type to hold an encoded fasta sequence

//

//  s[0:4] stores the size of the sequence id + the size of the comment as an uint32 (little endian)

//  s[4:idSize] stores the sequence id, and the comment id there is one

//  s[idSize:] stores the nucl sequence

type encodedSequence byte



var pool = sync.Pool{

    New: func() interface{} {

        return make(encodedSequence, 512)

    },

}



func getSizedSlice(idSize, requiredSize int) encodedSequence {

    s := pool.Get().(encodedSequence)

    binary.LittleEndian.PutUint32(s[0:4], uint32(idSize))



    for len(s) < requiredSize {

        s = append(s, byte(0))

    }

    return s[0:requiredSize]

}



func (f *fastaChannelFeeder) sendFasta() {



    idSize := 4 + f.idBuffer.Len() + f.commentBuffer.Len()

    requiredSize := idSize + f.sequenceBuffer.Len()



    s := getSizedSlice(idSize, requiredSize)



    if f.commentBuffer.Len() > 0 {

        copy(s[idSize-f.commentBuffer.Len():idSize], f.commentBuffer.Bytes())

    }



    copy(s[4:4+f.idBuffer.Len()], f.idBuffer.Bytes())



    // convert the sequence of bytes to an array of uint8 codes,

    // so a codon (3 nucleotides | 3 bytes ) can be represented

    // as an uint32

    for i, b := range f.sequenceBuffer.Bytes() {



        switch b {

        case 'A':

            s[i+idSize] = aCode

        case 'C':

            s[i+idSize] = cCode

        case 'G':

            s[i+idSize] = gCode

        case 'T', 'U':

            s[i+idSize] = tCode

        case 'N':

            s[i+idSize] = nCode

        default:

            fmt.Printf("WARNING: invalid char in sequence %s: %s, ignoring", s[4:4+idSize], string(b))

        }

    }

    f.fastaChan <- s

}



type fastaChannelFeeder struct {

    idBuffer, commentBuffer, sequenceBuffer *bytes.Buffer

    fastaChan                               chan encodedSequence

}



func (f *fastaChannelFeeder) reset() {

    f.idBuffer.Reset()

    f.sequenceBuffer.Reset()

    f.commentBuffer.Reset()

}

asked 9 mins ago

felix

69839

add a comment |

up vote
0
down vote

favorite

Goal of the program

Fasta format

A fasta file looks like this:

>sequenceId comment

nucleic sequence

for example:

>Seq1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAA

CCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC



>Seq2 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

GGTAGGTACCGCCCTAAGNCTCCTAATCCGAGCAGAACTANGCCAACCCGGAGCCCTTCTGGGAGACGAC

AATCAACATAAAA

A nucleic sequence is a string composed of the letters A, C, T, G, U, N

Expected output

A combinaison of 3 nucleic acid, named codon, gives a specif amino acid, for exemple GCT is the code for Alanine, symbolised by the letter A

With the Seq1 define above:

codon      CCT TTA TCT AAT CTT TGG AGC ATG ...

amino acid  P   L   S   N   L   W   S   M  ...

The expected output is:

>Seq1_1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

PLSNLWSMSWHSWNRPQPPHPCRTQSCTNTSSDSSAIQKSISSFY

>Seq2_1 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

GRYRPKXPNPSRTXPTRSPSGRRQST*X

Specific rules

The program takes 4 parameters:

clean : if true, write STOP codon as X instead of *

trim : if true, remove all X and * chars from the right end of the amino - acid sequence

alternative: different way to compute reverse frame

frame: a string value in ["1", "2", "3", "F", "-1", "-2", "-3", "R", "6" ]

The frame define the position in the nucleic sequence to start from:

-frame 1

|-frame 2

||-frame 3

|||

CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAA

CCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC

                                                              |||

                                                              ||-frame -1

                                                              |-frame -2

                                                              -frame -3





frame "F" = "1", "2", "3"

frame "R" = "-1", "-2", "-3"

frame "6" = "1", "2", "3", "-1", "-2", "-3"

In the output file, we add the frame used to the sequenceId: sequenceId = sequenceId_frame

For example if the program is used with frame=6, the expected output is

>Seq1_1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

PLSNLWSMSWHSWNRPQPPHPCRTQSCTNTSSDSSAIQKSISSFY

>Seq1_2 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

LYLIFGA*AGIVGTALSLLIRAEPSPVPTPLLILRPSRSLYPHFT

>Seq1_3 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

FI*SLEHELA*LEPPSASSSVQNPVLYQHLF*FFGHPEVYILILX

>Seq1_4 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

VK*GYRLLDGRRIRRGVGTGLGSARMRRLRAVPTMPAHAPKIR*R

>Seq1_5 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

KMRI*TSGWPKNQKRCWYRTGFCTDEEAEGGSNYASSCSKD*IKX

>Seq1_6 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

*NEDIDFWMAEESEEVLVQDWVLHG*GG*GRFQLCQLMLQRLDKG

>Seq2_1 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

GRYRPKXPNPSRTXPTRSPSGRRQST*X

>Seq2_2 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

VGTALXLLIRAELXQPGALLGDDNQHKX

>Seq2_3 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

*VPP*XS*SEQNXANPEPFWETTINIK

>Seq2_4 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

LC*LSSPRRAPGWXSSARIRXLRAVPT

>Seq2_5 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

FMLIVVSQKGSGLA*FCSD*EX*GGTYX

>Seq2_6 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

FYVDCRLPEGLRVGXVLLGLGXLGRYLP

Improvements

Is the code easy to read / understand ?

Is the code idiomatic ?

Any other improvements (performance, doc...)

( full project with tests + flags handling is avaible on github: gotranseq)

code :

package transeq



import (

    "bufio"

    "bytes"

    "context"

    "encoding/binary"

    "fmt"

    "io"

    "runtime"

    "sync"

)



var (

    letterCode = map[byte]uint8{

        'A': aCode,

        'C': cCode,

        'T': tCode,

        'G': gCode,

        'N': nCode,

        'U': uCode,

    }

    standard = map[string]byte{

        "TTT": 'F',

        "TCT": 'S',

        "TAT": 'Y',

        "TGT": 'C',

        "TTC": 'F',

        "TCC": 'S',

        "TAC": 'Y',

        "TGC": 'C',

        "TTA": 'L',

        "TCA": 'S',

        "TAA": '*',

        "TGA": '*',

        "TTG": 'L',

        "TCG": 'S',

        "TAG": '*',

        "TGG": 'W',

        "CTT": 'L',

        "CCT": 'P',

        "CAT": 'H',

        "CGT": 'R',

        "CTC": 'L',

        "CCC": 'P',

        "CAC": 'H',

        "CGC": 'R',

        "CTA": 'L',

        "CCA": 'P',

        "CAA": 'Q',

        "CGA": 'R',

        "CTG": 'L',

        "CCG": 'P',

        "CAG": 'Q',

        "CGG": 'R',

        "ATT": 'I',

        "ACT": 'T',

        "AAT": 'N',

        "AGT": 'S',

        "ATC": 'I',

        "ACC": 'T',

        "AAC": 'N',

        "AGC": 'S',

        "ATA": 'I',

        "ACA": 'T',

        "AAA": 'K',

        "AGA": 'R',

        "ATG": 'M',

        "ACG": 'T',

        "AAG": 'K',

        "AGG": 'R',

        "GTT": 'V',

        "GCT": 'A',

        "GAT": 'D',

        "GGT": 'G',

        "GTC": 'V',

        "GCC": 'A',

        "GAC": 'D',

        "GGC": 'G',

        "GTA": 'V',

        "GCA": 'A',

        "GAA": 'E',

        "GGA": 'G',

        "GTG": 'V',

        "GCG": 'A',

        "GAG": 'E',

        "GGG": 'G',

    }

)



const (

    nCode = uint8(0)

    aCode = uint8(1)

    cCode = uint8(2)

    tCode = uint8(3)

    uCode = uint8(3)

    gCode = uint8(4)



    stopByte = '*'

    unknown  = 'X'

    // Length of the array to store code/bytes

    // uses gCode because it's the biggest uint8 of all codes

    arrayCodeSize = (uint32(gCode) | uint32(gCode)<<8 | uint32(gCode)<<16) + 1

)



func createCodeArray(clean bool) byte {



    resultMap := map[uint32]byte{}

    twoLetterMap := map[string]byte{}



    tmpCode := make(uint8, 4)



    for codon, aaCode := range standard {

        // generate 3 letter code

        for i := 0; i < 3; i++ {

            tmpCode[i] = letterCode[codon[i]]

        }

        // each codon is represented by an unique uint32:

        // each possible nucleotide is represented by an uint8 (255 possibility)

        // the three first bytes are the the code for each nucleotide

        // last byte is unused ( eq to uint8(0) )

        // example:

        // codon 'ACG' ==> uint32(aCode) | uint32(cCode)<<8 | uint32(gCode)<<16

        uint32Code := uint32(tmpCode[0]) | uint32(tmpCode[1])<<8 | uint32(tmpCode[2])<<16

        resultMap[uint32Code] = aaCode



        // generate 2 letter code

        codes, ok := twoLetterMap[codon[0:2]]

        if !ok {

            twoLetterMap[codon[0:2]] = byte{aaCode}

        } else {

            twoLetterMap[codon[0:2]] = append(codes, aaCode)

        }

    }

    for twoLetterCodon, codes := range twoLetterMap {

        uniqueAA := true

        for i := 0; i < len(codes); i++ {



            if codes[i] != codes[0] {

                uniqueAA = false

            }

        }

        if uniqueAA {

            first := letterCode[twoLetterCodon[0]]

            second := letterCode[twoLetterCodon[1]]



            uint32Code := uint32(first) | uint32(second)<<8

            resultMap[uint32Code] = codes[0]

        }

    }

    // if clean is specified, we want to replace all '*' by 'X' in the output

    // sequence, so replace all occurrences of '*' directly in the ref map

    if clean {

        for k, v := range resultMap {

            if v == stopByte {

                resultMap[k] = unknown

            }

        }

    }

    r := make(byte, arrayCodeSize)

    for k, v := range resultMap {

        r[k] = v

    }

    return r

}



func computeFrames(frameName string) (frames int, reverse bool, err error) {



    frames = make(int, 6)

    reverse = false



    switch frameName {

    case "1":

        frames[0] = 1

    case "2":

        frames[1] = 1

    case "3":

        frames[2] = 1

    case "F":

        for i := 0; i < 3; i++ {

            frames[i] = 1

        }

    case "-1":

        frames[3] = 1

        reverse = true

    case "-2":

        frames[4] = 1

        reverse = true

    case "-3":

        frames[5] = 1

        reverse = true

    case "R":

        for i := 3; i < 6; i++ {

            frames[i] = 1

        }

        reverse = true

    case "6":

        for i := range frames {

            frames[i] = 1

        }

        reverse = true

    default:

        err = fmt.Errorf("wrong value for -f | --frame parameter: %s", frameName)

    }

    return frames, reverse, err

}



type writer struct {

    buf            *bytes.Buffer

    currentLineLen int

    bytesToTrim    int

}



func (w *writer) addByte(b byte) {

    w.buf.WriteByte(b)

    w.currentLineLen++

    if b == stopByte || b == unknown {

        w.bytesToTrim++

    } else {

        w.bytesToTrim = 0

    }

}



func (w *writer) addUnknown() {

    w.buf.WriteByte(unknown)

    w.currentLineLen++

    w.bytesToTrim++

}



func (w *writer) newLine() {

    w.buf.WriteByte('n')

    w.currentLineLen = 0

    w.bytesToTrim++

}



const (

    // size of the buffer for writing to file

    maxBufferSize = 1024 * 1024 * 30

    // max line size for sequence

    maxLineSize = 60

    // suffixes ta add to sequence id for each frame

    suffixes = "123456"

)



// Translate read a fata file, translate each sequence to the corresponding prot sequence in the specified frame

func Translate(inputSequence io.Reader, out io.Writer, frame string, clean, trim, alternative bool) error {



    arrayCode := createCodeArray(clean)

    framesToGenerate, reverse, err := computeFrames(frame)

    if err != nil {

        return err

    }



    fnaSequences := make(chan encodedSequence, 10)

    errs := make(chan error, 1)



    var wg sync.WaitGroup

    ctx, cancel := context.WithCancel(context.Background())

    defer cancel()



    for nWorker := 0; nWorker < runtime.NumCPU(); nWorker++ {



        wg.Add(1)



        go func() {



            defer wg.Done()



            startPosition := make(int, 3)

            w := &writer{

                buf:            bytes.NewBuffer(nil),

                bytesToTrim:    0,

                currentLineLen: 0,

            }



            for sequence := range fnaSequences {



                select {

                case <-ctx.Done():

                    return

                default:

                }



                frameIndex := 0

                startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2



                idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))

                nuclSeqLength := len(sequence) - idSize



            Translate:

                for _, startPos := range startPosition {



                    if framesToGenerate[frameIndex] == 0 {

                        frameIndex++

                        continue

                    }



                    // sequence id should look like

                    // >sequenceID_<frame> comment

                    idEnd := bytes.IndexByte(sequence[4:idSize], ' ')

                    if idEnd != -1 {

                        w.buf.Write(sequence[4 : 4+idEnd])

                        w.buf.WriteByte('_')

                        w.buf.WriteByte(suffixes[frameIndex])

                        w.buf.Write(sequence[4+idEnd : idSize])

                    } else {

                        w.buf.Write(sequence[4:idSize])

                        w.buf.WriteByte('_')

                        w.buf.WriteByte(suffixes[frameIndex])

                    }

                    w.newLine()



                    // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'

                    // from right end of the sequence)

                    w.bytesToTrim = 0

                    w.currentLineLen = 0



                    // read the sequence 3 letters at a time, starting at a specific position

                    // corresponding to the frame

                    for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {



                        if w.currentLineLen == maxLineSize {

                            w.newLine()

                        }

                        // create an uint32 from the codon, to retrieve the corresponding

                        // AA from the map

                        codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16



                        b := arrayCode[codonCode]

                        if b != byte(0) {

                            w.addByte(b)

                        } else {

                            w.addUnknown()

                        }

                    }



                    // the last codon is only 2 nucleotid long, try to guess

                    // the corresponding AA

                    if (nuclSeqLength-startPos)%3 == 2 {



                        if w.currentLineLen == maxLineSize {

                            w.newLine()

                        }

                        codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8



                        b := arrayCode[codonCode]

                        if b != byte(0) {

                            w.addByte(b)

                        } else {

                            w.addUnknown()

                        }

                    }



                    // the last codon is only 1 nucleotid long, no way to guess

                    // the corresponding AA

                    if (nuclSeqLength-startPos)%3 == 1 {

                        if w.currentLineLen == maxLineSize {

                            w.newLine()

                        }

                        w.addUnknown()

                    }



                    if trim && w.bytesToTrim > 0 {

                        // remove the last bytesToTrim bytes of the buffer

                        // as they are 'X', '*' or 'n'

                        w.buf.Truncate(w.buf.Len() - w.bytesToTrim)

                        w.currentLineLen -= w.bytesToTrim

                    }



                    if w.currentLineLen != 0 {

                        w.newLine()

                    }

                    frameIndex++

                }



                if reverse && frameIndex < 6 {



                    // get the complementary sequence.

                    // Basically, switch

                    //   A <-> T

                    //   C <-> G

                    // N is not modified

                    for i, n := range sequence[idSize:] {



                        switch n {

                        case aCode:

                            sequence[i+idSize] = tCode

                        case tCode:

                            // handle both tCode and uCode

                            sequence[i+idSize] = aCode

                        case cCode:

                            sequence[i+idSize] = gCode

                        case gCode:

                            sequence[i+idSize] = cCode

                        default:

                            //case N -> leave it

                        }

                    }

                    // reverse the sequence

                    for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {

                        sequence[i], sequence[j] = sequence[j], sequence[i]

                    }



                    if !alternative {

                        // Staden convention: Frame -1 is the reverse-complement of the sequence

                        // having the same codon phase as frame 1. Frame -2 is the same phase as

                        // frame 2. Frame -3 is the same phase as frame 3

                        //

                        // use the matrix to keep track of the forward frame as it depends on the

                        // length of the sequence

                        switch nuclSeqLength % 3 {

                        case 0:

                            startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1

                        case 1:

                            startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2

                        case 2:

                            startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0

                        }

                    }

                    // run the same loop, but with the reverse-complemented sequence

                    goto Translate

                }



                if w.buf.Len() > maxBufferSize {

                    _, err := out.Write(w.buf.Bytes())

                    if err != nil {

                        select {

                        case errs <- fmt.Errorf("fail to write to output file: %v", err):

                        default:

                        }

                        cancel()

                        return

                    }

                    w.buf.Reset()

                }

                pool.Put(sequence)

            }



            if w.buf.Len() > 0 {

                _, err := out.Write(w.buf.Bytes())

                if err != nil {

                    select {

                    case errs <- fmt.Errorf("fail to write to output file: %v", err):

                    default:

                    }

                    cancel()

                    return

                }

            }

        }()

    }

    readSequenceFromFasta(ctx, inputSequence, fnaSequences)



    wg.Wait()

    select {

    case err, ok := <-errs:

        if ok {

            return err

        }

    default:

    }

    return nil

}



func readSequenceFromFasta(ctx context.Context, inputSequence io.Reader, fnaSequences chan encodedSequence) {



    feeder := &fastaChannelFeeder{

        idBuffer:       bytes.NewBuffer(nil),

        commentBuffer:  bytes.NewBuffer(nil),

        sequenceBuffer: bytes.NewBuffer(nil),

        fastaChan:      fnaSequences,

    }

    // fasta format is:

    //

    // >sequenceID some comments on sequence

    // ACAGGCAGAGACACGACAGACGACGACACAGGAGCAGACAGCAGCAGACGACCACATATT

    // TTTGCGGTCACATGACGACTTCGGCAGCGA

    //

    // see https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp

    // section 1 for details

    scanner := bufio.NewScanner(inputSequence)

Loop:

    for scanner.Scan() {



        line := scanner.Bytes()

        if len(line) == 0 {

            continue

        }

        if line[0] == '>' {



            if feeder.idBuffer.Len() > 0 {

                select {

                case <-ctx.Done():

                    break Loop

                default:

                }

                feeder.sendFasta()

            }

            feeder.reset()



            // parse the ID of the sequence. ID is formatted like this:

            // >sequenceID comments

            seqID := bytes.SplitN(line, byte{' '}, 2)

            feeder.idBuffer.Write(seqID[0])



            if len(seqID) > 1 {

                feeder.commentBuffer.WriteByte(' ')

                feeder.commentBuffer.Write(seqID[1])

            }

        } else {

            // if the line doesn't start with '>', then it's a part of the

            // nucleotide sequence, so write it to the buffer

            feeder.sequenceBuffer.Write(line)

        }

    }

    // don't forget to push last sequence

    select {

    case <-ctx.Done():

    default:

        feeder.sendFasta()

    }

    close(fnaSequences)

}



// a type to hold an encoded fasta sequence

//

//  s[0:4] stores the size of the sequence id + the size of the comment as an uint32 (little endian)

//  s[4:idSize] stores the sequence id, and the comment id there is one

//  s[idSize:] stores the nucl sequence

type encodedSequence byte



var pool = sync.Pool{

    New: func() interface{} {

        return make(encodedSequence, 512)

    },

}



func getSizedSlice(idSize, requiredSize int) encodedSequence {

    s := pool.Get().(encodedSequence)

    binary.LittleEndian.PutUint32(s[0:4], uint32(idSize))



    for len(s) < requiredSize {

        s = append(s, byte(0))

    }

    return s[0:requiredSize]

}



func (f *fastaChannelFeeder) sendFasta() {



    idSize := 4 + f.idBuffer.Len() + f.commentBuffer.Len()

    requiredSize := idSize + f.sequenceBuffer.Len()



    s := getSizedSlice(idSize, requiredSize)



    if f.commentBuffer.Len() > 0 {

        copy(s[idSize-f.commentBuffer.Len():idSize], f.commentBuffer.Bytes())

    }



    copy(s[4:4+f.idBuffer.Len()], f.idBuffer.Bytes())



    // convert the sequence of bytes to an array of uint8 codes,

    // so a codon (3 nucleotides | 3 bytes ) can be represented

    // as an uint32

    for i, b := range f.sequenceBuffer.Bytes() {



        switch b {

        case 'A':

            s[i+idSize] = aCode

        case 'C':

            s[i+idSize] = cCode

        case 'G':

            s[i+idSize] = gCode

        case 'T', 'U':

            s[i+idSize] = tCode

        case 'N':

            s[i+idSize] = nCode

        default:

            fmt.Printf("WARNING: invalid char in sequence %s: %s, ignoring", s[4:4+idSize], string(b))

        }

    }

    f.fastaChan <- s

}



type fastaChannelFeeder struct {

    idBuffer, commentBuffer, sequenceBuffer *bytes.Buffer

    fastaChan                               chan encodedSequence

}



func (f *fastaChannelFeeder) reset() {

    f.idBuffer.Reset()

    f.sequenceBuffer.Reset()

    f.commentBuffer.Reset()

}

asked 9 mins ago

felix

69839

Goal of the program

Fasta format

A fasta file looks like this:

>sequenceId comment

nucleic sequence

for example:

>Seq1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAA

CCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC



>Seq2 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

GGTAGGTACCGCCCTAAGNCTCCTAATCCGAGCAGAACTANGCCAACCCGGAGCCCTTCTGGGAGACGAC

AATCAACATAAAA

A nucleic sequence is a string composed of the letters A, C, T, G, U, N

Expected output

A combinaison of 3 nucleic acid, named codon, gives a specif amino acid, for exemple GCT is the code for Alanine, symbolised by the letter A

With the Seq1 define above:

codon      CCT TTA TCT AAT CTT TGG AGC ATG ...

amino acid  P   L   S   N   L   W   S   M  ...

The expected output is:

>Seq1_1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

PLSNLWSMSWHSWNRPQPPHPCRTQSCTNTSSDSSAIQKSISSFY

>Seq2_1 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

GRYRPKXPNPSRTXPTRSPSGRRQST*X

Specific rules

The program takes 4 parameters:

clean : if true, write STOP codon as X instead of *

trim : if true, remove all X and * chars from the right end of the amino - acid sequence

alternative: different way to compute reverse frame

frame: a string value in ["1", "2", "3", "F", "-1", "-2", "-3", "R", "6" ]

The frame define the position in the nucleic sequence to start from:

-frame 1

|-frame 2

||-frame 3

|||

CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAA

CCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC

                                                              |||

                                                              ||-frame -1

                                                              |-frame -2

                                                              -frame -3





frame "F" = "1", "2", "3"

frame "R" = "-1", "-2", "-3"

frame "6" = "1", "2", "3", "-1", "-2", "-3"

In the output file, we add the frame used to the sequenceId: sequenceId = sequenceId_frame

For example if the program is used with frame=6, the expected output is

>Seq1_1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

PLSNLWSMSWHSWNRPQPPHPCRTQSCTNTSSDSSAIQKSISSFY

>Seq1_2 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

LYLIFGA*AGIVGTALSLLIRAEPSPVPTPLLILRPSRSLYPHFT

>Seq1_3 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

FI*SLEHELA*LEPPSASSSVQNPVLYQHLF*FFGHPEVYILILX

>Seq1_4 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

VK*GYRLLDGRRIRRGVGTGLGSARMRRLRAVPTMPAHAPKIR*R

>Seq1_5 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

KMRI*TSGWPKNQKRCWYRTGFCTDEEAEGGSNYASSCSKD*IKX

>Seq1_6 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

*NEDIDFWMAEESEEVLVQDWVLHG*GG*GRFQLCQLMLQRLDKG

>Seq2_1 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

GRYRPKXPNPSRTXPTRSPSGRRQST*X

>Seq2_2 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

VGTALXLLIRAELXQPGALLGDDNQHKX

>Seq2_3 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

*VPP*XS*SEQNXANPEPFWETTINIK

>Seq2_4 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

LC*LSSPRRAPGWXSSARIRXLRAVPT

>Seq2_5 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

FMLIVVSQKGSGLA*FCSD*EX*GGTYX

>Seq2_6 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

FYVDCRLPEGLRVGXVLLGLGXLGRYLP

Improvements

Is the code easy to read / understand ?

Is the code idiomatic ?

Any other improvements (performance, doc...)

( full project with tests + flags handling is avaible on github: gotranseq)

code :

package transeq



import (

    "bufio"

    "bytes"

    "context"

    "encoding/binary"

    "fmt"

    "io"

    "runtime"

    "sync"

)



var (

    letterCode = map[byte]uint8{

        'A': aCode,

        'C': cCode,

        'T': tCode,

        'G': gCode,

        'N': nCode,

        'U': uCode,

    }

    standard = map[string]byte{

        "TTT": 'F',

        "TCT": 'S',

        "TAT": 'Y',

        "TGT": 'C',

        "TTC": 'F',

        "TCC": 'S',

        "TAC": 'Y',

        "TGC": 'C',

        "TTA": 'L',

        "TCA": 'S',

        "TAA": '*',

        "TGA": '*',

        "TTG": 'L',

        "TCG": 'S',

        "TAG": '*',

        "TGG": 'W',

        "CTT": 'L',

        "CCT": 'P',

        "CAT": 'H',

        "CGT": 'R',

        "CTC": 'L',

        "CCC": 'P',

        "CAC": 'H',

        "CGC": 'R',

        "CTA": 'L',

        "CCA": 'P',

        "CAA": 'Q',

        "CGA": 'R',

        "CTG": 'L',

        "CCG": 'P',

        "CAG": 'Q',

        "CGG": 'R',

        "ATT": 'I',

        "ACT": 'T',

        "AAT": 'N',

        "AGT": 'S',

        "ATC": 'I',

        "ACC": 'T',

        "AAC": 'N',

        "AGC": 'S',

        "ATA": 'I',

        "ACA": 'T',

        "AAA": 'K',

        "AGA": 'R',

        "ATG": 'M',

        "ACG": 'T',

        "AAG": 'K',

        "AGG": 'R',

        "GTT": 'V',

        "GCT": 'A',

        "GAT": 'D',

        "GGT": 'G',

        "GTC": 'V',

        "GCC": 'A',

        "GAC": 'D',

        "GGC": 'G',

        "GTA": 'V',

        "GCA": 'A',

        "GAA": 'E',

        "GGA": 'G',

        "GTG": 'V',

        "GCG": 'A',

        "GAG": 'E',

        "GGG": 'G',

    }

)



const (

    nCode = uint8(0)

    aCode = uint8(1)

    cCode = uint8(2)

    tCode = uint8(3)

    uCode = uint8(3)

    gCode = uint8(4)



    stopByte = '*'

    unknown  = 'X'

    // Length of the array to store code/bytes

    // uses gCode because it's the biggest uint8 of all codes

    arrayCodeSize = (uint32(gCode) | uint32(gCode)<<8 | uint32(gCode)<<16) + 1

)



func createCodeArray(clean bool) byte {



    resultMap := map[uint32]byte{}

    twoLetterMap := map[string]byte{}



    tmpCode := make(uint8, 4)



    for codon, aaCode := range standard {

        // generate 3 letter code

        for i := 0; i < 3; i++ {

            tmpCode[i] = letterCode[codon[i]]

        }

        // each codon is represented by an unique uint32:

        // each possible nucleotide is represented by an uint8 (255 possibility)

        // the three first bytes are the the code for each nucleotide

        // last byte is unused ( eq to uint8(0) )

        // example:

        // codon 'ACG' ==> uint32(aCode) | uint32(cCode)<<8 | uint32(gCode)<<16

        uint32Code := uint32(tmpCode[0]) | uint32(tmpCode[1])<<8 | uint32(tmpCode[2])<<16

        resultMap[uint32Code] = aaCode



        // generate 2 letter code

        codes, ok := twoLetterMap[codon[0:2]]

        if !ok {

            twoLetterMap[codon[0:2]] = byte{aaCode}

        } else {

            twoLetterMap[codon[0:2]] = append(codes, aaCode)

        }

    }

    for twoLetterCodon, codes := range twoLetterMap {

        uniqueAA := true

        for i := 0; i < len(codes); i++ {



            if codes[i] != codes[0] {

                uniqueAA = false

            }

        }

        if uniqueAA {

            first := letterCode[twoLetterCodon[0]]

            second := letterCode[twoLetterCodon[1]]



            uint32Code := uint32(first) | uint32(second)<<8

            resultMap[uint32Code] = codes[0]

        }

    }

    // if clean is specified, we want to replace all '*' by 'X' in the output

    // sequence, so replace all occurrences of '*' directly in the ref map

    if clean {

        for k, v := range resultMap {

            if v == stopByte {

                resultMap[k] = unknown

            }

        }

    }

    r := make(byte, arrayCodeSize)

    for k, v := range resultMap {

        r[k] = v

    }

    return r

}



func computeFrames(frameName string) (frames int, reverse bool, err error) {



    frames = make(int, 6)

    reverse = false



    switch frameName {

    case "1":

        frames[0] = 1

    case "2":

        frames[1] = 1

    case "3":

        frames[2] = 1

    case "F":

        for i := 0; i < 3; i++ {

            frames[i] = 1

        }

    case "-1":

        frames[3] = 1

        reverse = true

    case "-2":

        frames[4] = 1

        reverse = true

    case "-3":

        frames[5] = 1

        reverse = true

    case "R":

        for i := 3; i < 6; i++ {

            frames[i] = 1

        }

        reverse = true

    case "6":

        for i := range frames {

            frames[i] = 1

        }

        reverse = true

    default:

        err = fmt.Errorf("wrong value for -f | --frame parameter: %s", frameName)

    }

    return frames, reverse, err

}



type writer struct {

    buf            *bytes.Buffer

    currentLineLen int

    bytesToTrim    int

}



func (w *writer) addByte(b byte) {

    w.buf.WriteByte(b)

    w.currentLineLen++

    if b == stopByte || b == unknown {

        w.bytesToTrim++

    } else {

        w.bytesToTrim = 0

    }

}



func (w *writer) addUnknown() {

    w.buf.WriteByte(unknown)

    w.currentLineLen++

    w.bytesToTrim++

}



func (w *writer) newLine() {

    w.buf.WriteByte('n')

    w.currentLineLen = 0

    w.bytesToTrim++

}



const (

    // size of the buffer for writing to file

    maxBufferSize = 1024 * 1024 * 30

    // max line size for sequence

    maxLineSize = 60

    // suffixes ta add to sequence id for each frame

    suffixes = "123456"

)



// Translate read a fata file, translate each sequence to the corresponding prot sequence in the specified frame

func Translate(inputSequence io.Reader, out io.Writer, frame string, clean, trim, alternative bool) error {



    arrayCode := createCodeArray(clean)

    framesToGenerate, reverse, err := computeFrames(frame)

    if err != nil {

        return err

    }



    fnaSequences := make(chan encodedSequence, 10)

    errs := make(chan error, 1)



    var wg sync.WaitGroup

    ctx, cancel := context.WithCancel(context.Background())

    defer cancel()



    for nWorker := 0; nWorker < runtime.NumCPU(); nWorker++ {



        wg.Add(1)



        go func() {



            defer wg.Done()



            startPosition := make(int, 3)

            w := &writer{

                buf:            bytes.NewBuffer(nil),

                bytesToTrim:    0,

                currentLineLen: 0,

            }



            for sequence := range fnaSequences {



                select {

                case <-ctx.Done():

                    return

                default:

                }



                frameIndex := 0

                startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2



                idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))

                nuclSeqLength := len(sequence) - idSize



            Translate:

                for _, startPos := range startPosition {



                    if framesToGenerate[frameIndex] == 0 {

                        frameIndex++

                        continue

                    }



                    // sequence id should look like

                    // >sequenceID_<frame> comment

                    idEnd := bytes.IndexByte(sequence[4:idSize], ' ')

                    if idEnd != -1 {

                        w.buf.Write(sequence[4 : 4+idEnd])

                        w.buf.WriteByte('_')

                        w.buf.WriteByte(suffixes[frameIndex])

                        w.buf.Write(sequence[4+idEnd : idSize])

                    } else {

                        w.buf.Write(sequence[4:idSize])

                        w.buf.WriteByte('_')

                        w.buf.WriteByte(suffixes[frameIndex])

                    }

                    w.newLine()



                    // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'

                    // from right end of the sequence)

                    w.bytesToTrim = 0

                    w.currentLineLen = 0



                    // read the sequence 3 letters at a time, starting at a specific position

                    // corresponding to the frame

                    for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {



                        if w.currentLineLen == maxLineSize {

                            w.newLine()

                        }

                        // create an uint32 from the codon, to retrieve the corresponding

                        // AA from the map

                        codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16



                        b := arrayCode[codonCode]

                        if b != byte(0) {

                            w.addByte(b)

                        } else {

                            w.addUnknown()

                        }

                    }



                    // the last codon is only 2 nucleotid long, try to guess

                    // the corresponding AA

                    if (nuclSeqLength-startPos)%3 == 2 {



                        if w.currentLineLen == maxLineSize {

                            w.newLine()

                        }

                        codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8



                        b := arrayCode[codonCode]

                        if b != byte(0) {

                            w.addByte(b)

                        } else {

                            w.addUnknown()

                        }

                    }



                    // the last codon is only 1 nucleotid long, no way to guess

                    // the corresponding AA

                    if (nuclSeqLength-startPos)%3 == 1 {

                        if w.currentLineLen == maxLineSize {

                            w.newLine()

                        }

                        w.addUnknown()

                    }



                    if trim && w.bytesToTrim > 0 {

                        // remove the last bytesToTrim bytes of the buffer

                        // as they are 'X', '*' or 'n'

                        w.buf.Truncate(w.buf.Len() - w.bytesToTrim)

                        w.currentLineLen -= w.bytesToTrim

                    }



                    if w.currentLineLen != 0 {

                        w.newLine()

                    }

                    frameIndex++

                }



                if reverse && frameIndex < 6 {



                    // get the complementary sequence.

                    // Basically, switch

                    //   A <-> T

                    //   C <-> G

                    // N is not modified

                    for i, n := range sequence[idSize:] {



                        switch n {

                        case aCode:

                            sequence[i+idSize] = tCode

                        case tCode:

                            // handle both tCode and uCode

                            sequence[i+idSize] = aCode

                        case cCode:

                            sequence[i+idSize] = gCode

                        case gCode:

                            sequence[i+idSize] = cCode

                        default:

                            //case N -> leave it

                        }

                    }

                    // reverse the sequence

                    for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {

                        sequence[i], sequence[j] = sequence[j], sequence[i]

                    }



                    if !alternative {

                        // Staden convention: Frame -1 is the reverse-complement of the sequence

                        // having the same codon phase as frame 1. Frame -2 is the same phase as

                        // frame 2. Frame -3 is the same phase as frame 3

                        //

                        // use the matrix to keep track of the forward frame as it depends on the

                        // length of the sequence

                        switch nuclSeqLength % 3 {

                        case 0:

                            startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1

                        case 1:

                            startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2

                        case 2:

                            startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0

                        }

                    }

                    // run the same loop, but with the reverse-complemented sequence

                    goto Translate

                }



                if w.buf.Len() > maxBufferSize {

                    _, err := out.Write(w.buf.Bytes())

                    if err != nil {

                        select {

                        case errs <- fmt.Errorf("fail to write to output file: %v", err):

                        default:

                        }

                        cancel()

                        return

                    }

                    w.buf.Reset()

                }

                pool.Put(sequence)

            }



            if w.buf.Len() > 0 {

                _, err := out.Write(w.buf.Bytes())

                if err != nil {

                    select {

                    case errs <- fmt.Errorf("fail to write to output file: %v", err):

                    default:

                    }

                    cancel()

                    return

                }

            }

        }()

    }

    readSequenceFromFasta(ctx, inputSequence, fnaSequences)



    wg.Wait()

    select {

    case err, ok := <-errs:

        if ok {

            return err

        }

    default:

    }

    return nil

}



func readSequenceFromFasta(ctx context.Context, inputSequence io.Reader, fnaSequences chan encodedSequence) {



    feeder := &fastaChannelFeeder{

        idBuffer:       bytes.NewBuffer(nil),

        commentBuffer:  bytes.NewBuffer(nil),

        sequenceBuffer: bytes.NewBuffer(nil),

        fastaChan:      fnaSequences,

    }

    // fasta format is:

    //

    // >sequenceID some comments on sequence

    // ACAGGCAGAGACACGACAGACGACGACACAGGAGCAGACAGCAGCAGACGACCACATATT

    // TTTGCGGTCACATGACGACTTCGGCAGCGA

    //

    // see https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp

    // section 1 for details

    scanner := bufio.NewScanner(inputSequence)

Loop:

    for scanner.Scan() {



        line := scanner.Bytes()

        if len(line) == 0 {

            continue

        }

        if line[0] == '>' {



            if feeder.idBuffer.Len() > 0 {

                select {

                case <-ctx.Done():

                    break Loop

                default:

                }

                feeder.sendFasta()

            }

            feeder.reset()



            // parse the ID of the sequence. ID is formatted like this:

            // >sequenceID comments

            seqID := bytes.SplitN(line, byte{' '}, 2)

            feeder.idBuffer.Write(seqID[0])



            if len(seqID) > 1 {

                feeder.commentBuffer.WriteByte(' ')

                feeder.commentBuffer.Write(seqID[1])

            }

        } else {

            // if the line doesn't start with '>', then it's a part of the

            // nucleotide sequence, so write it to the buffer

            feeder.sequenceBuffer.Write(line)

        }

    }

    // don't forget to push last sequence

    select {

    case <-ctx.Done():

    default:

        feeder.sendFasta()

    }

    close(fnaSequences)

}



// a type to hold an encoded fasta sequence

//

//  s[0:4] stores the size of the sequence id + the size of the comment as an uint32 (little endian)

//  s[4:idSize] stores the sequence id, and the comment id there is one

//  s[idSize:] stores the nucl sequence

type encodedSequence byte



var pool = sync.Pool{

    New: func() interface{} {

        return make(encodedSequence, 512)

    },

}



func getSizedSlice(idSize, requiredSize int) encodedSequence {

    s := pool.Get().(encodedSequence)

    binary.LittleEndian.PutUint32(s[0:4], uint32(idSize))



    for len(s) < requiredSize {

        s = append(s, byte(0))

    }

    return s[0:requiredSize]

}



func (f *fastaChannelFeeder) sendFasta() {



    idSize := 4 + f.idBuffer.Len() + f.commentBuffer.Len()

    requiredSize := idSize + f.sequenceBuffer.Len()



    s := getSizedSlice(idSize, requiredSize)



    if f.commentBuffer.Len() > 0 {

        copy(s[idSize-f.commentBuffer.Len():idSize], f.commentBuffer.Bytes())

    }



    copy(s[4:4+f.idBuffer.Len()], f.idBuffer.Bytes())



    // convert the sequence of bytes to an array of uint8 codes,

    // so a codon (3 nucleotides | 3 bytes ) can be represented

    // as an uint32

    for i, b := range f.sequenceBuffer.Bytes() {



        switch b {

        case 'A':

            s[i+idSize] = aCode

        case 'C':

            s[i+idSize] = cCode

        case 'G':

            s[i+idSize] = gCode

        case 'T', 'U':

            s[i+idSize] = tCode

        case 'N':

            s[i+idSize] = nCode

        default:

            fmt.Printf("WARNING: invalid char in sequence %s: %s, ignoring", s[4:4+idSize], string(b))

        }

    }

    f.fastaChan <- s

}



type fastaChannelFeeder struct {

    idBuffer, commentBuffer, sequenceBuffer *bytes.Buffer

    fastaChan                               chan encodedSequence

}



func (f *fastaChannelFeeder) reset() {

    f.idBuffer.Reset()

    f.sequenceBuffer.Reset()

    f.commentBuffer.Reset()

}

performance go bioinformatics multiprocessing

asked 9 mins ago

felix

69839

asked 9 mins ago

felix

69839

asked 9 mins ago

felix

69839

asked 9 mins ago

felix

69839

asked 9 mins ago

felix

69839

add a comment |

active

oldest

votes

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "196"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f209750%2ftranslate-nucleic-sequence-in-its-corresponding-amino-acid-sequence%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

active

oldest

votes

draft saved

draft discarded

Thanks for contributing an answer to Code Review Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Cfrtjryk