Simple flex-based lexer











I am trying to learn flex and have created this simple program. The rule for comments works correctly for single line comments such as:



// this is a comment


and this:



/* this is also a comment */


My code:



ID [A-Z][A-Za-z0-9]
KEYWORD if|else|then|for|fi|loop|pool|proc|func
OPERATOR "+"|"-"|"/"|"*"|"&"|"%"
PUNCTUATION ":"|","
%%
{ID} printf("An Id found:%s\n",yytext);
{KEYWORD} printf("An Keyword found:%s\n",yytext);
{OPERATOR} printf("An Operator found:%s\n",yytext);
{PUNCTUATION} printf("An Punctuation found:%s\n",yytext);
[/][/].*

"/*".*"*/"

%%
int yywrap(){
return 1;
}
main(){
yylex();
}


I'd be interested in comments on this, particularly on ways to improve the code, such as detecting multi-line comments.










  • Hello @setu. We realize you are pretty new to this Stack. Please make sure that your code works before posting here, in order to avoid your question being closed/deleted. As it stands I think you could post to Stack Overflow instead to get help with the bugs.
    – Phrancis
    Dec 16 '14 at 16:54










  • My code worked, but I just wanted to know why the trick worked. I have edited my post.
    – Setu Basak
    Dec 16 '14 at 16:54










  • I've edited to try to bring the question into line with site guidelines. Please make sure I haven't omitted too much for the question to still be useful.
    – Edward
    Dec 18 '14 at 17:56















Tags: compiler, lexer






edited Nov 16 at 15:03 by MCCCS
asked Dec 16 '14 at 16:41 by Setu Basak








1 Answer
The code looks OK for what it does so far, but there are some things you might want to do to improve it:



Always use {} for rule actions



It's not technically wrong to simply have printf(...) to the right of a pattern, but when your lexer gets more complex (and when you start also using a parser) you may find it easier to troubleshoot if you always use {} to enclose each rule's action -- even an empty one.



Think about explicitly handling whitespace



It's very common for a parser to need to ignore whitespace. If that's the case, it's usually good to do so explicitly with a rule just above the error-handling rule(s) I mention below.



[ \t\n]+   { /* ignore whitespace */ }


Consider a "catch-all" rule for illegal tokens



Right now, any character that matches none of your rules is handled by flex's default rule, which silently copies it to the output. This might be fine, but especially while you're learning, you may find it useful to put a catch-all rule at the bottom of your list of rules:



.   { printf("Bad character: %s\n", yytext); }
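
To make the rule ordering concrete, here is a small hand-written C sketch of the same idea: specific character classes are checked first and the catch-all comes last. This is not flex output. The helper name classify_char and the enum are hypothetical, chosen only for illustration, and the character sets are copied from the OPERATOR and PUNCTUATION definitions in the question.

```c
#include <string.h>

// Character classes, checked in the same order as the lexer rules:
// specific classes first, catch-all last.
enum char_class { CC_OPERATOR, CC_PUNCTUATION, CC_WHITESPACE, CC_BAD };

// classify_char is a hypothetical helper for illustration, not part of flex.
enum char_class classify_char(int c)
{
    // strchr also matches the terminating NUL, so reject it up front.
    if (c == '\0')
        return CC_BAD;
    if (strchr("+-/*&%", c) != NULL)
        return CC_OPERATOR;
    if (strchr(":,", c) != NULL)
        return CC_PUNCTUATION;
    if (strchr(" \t\n", c) != NULL)
        return CC_WHITESPACE;
    return CC_BAD;   // plays the role of the final "." rule
}
```

Because the fallback is the last test, it only fires for characters that nothing else claimed, which is the same reason the "." rule belongs at the bottom of the rules section.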


Consider adding support for multiline comments



As your original (pre-edit) code had it, handling multiline comments is different but not too difficult. You can add this to your definitions (the first part of a flex file):



%x c_comment


Then add these rules to the rules section (second part of a flex file):



"/*"   { BEGIN(c_comment); }
<c_comment>[^*]* { }
<c_comment>"*"+[^*/]* { }
<c_comment>"*"+"/" { printf("Ignored a multiline comment\n"); BEGIN(INITIAL); }


This defines a start condition called c_comment and switches into that condition when it finds the opening pair of characters for a comment. The next rule ignores everything that is not a * character. The rule after that ignores runs of * characters that are not followed by a /. The point of these two rules is to match as many characters as possible. For performance reasons, you would generally want to write your lexer so that each rule matches strings that are as long as possible, because the lexer then runs fewer actions for the same input.



Finally, the last rule finds the closing pair of characters and switches back into the initial context. You will also often see BEGIN(0) for that -- the statements are identical in function, but I prefer the more verbose BEGIN(INITIAL) form because I think it's easier to understand.
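
If it helps to see the start condition as a plain state machine, here is a hand-written C sketch of roughly what those rules do. It is not what flex generates, and the function name strip_c_comments is hypothetical. It assumes the whole input is in memory as a NUL-terminated string, and it treats one or more * followed by / as the end of a comment.

```c
#include <stddef.h>
#include <string.h>

// Scanner states, mirroring flex's INITIAL and an exclusive c_comment
// start condition.
enum scan_state { ST_INITIAL, ST_C_COMMENT };

// Copy `in` to `out` with every comment dropped. `out` must have room for
// strlen(in) + 1 bytes. Returns the number of comments that were closed.
// (Hypothetical helper for illustration only.)
int strip_c_comments(const char *in, char *out)
{
    enum scan_state st = ST_INITIAL;
    int closed = 0;

    while (*in != '\0') {
        if (st == ST_INITIAL) {
            if (in[0] == '/' && in[1] == '*') {
                st = ST_C_COMMENT;          // comment opener: BEGIN(c_comment)
                in += 2;
            } else {
                *out++ = *in++;             // ordinary text is kept
            }
        } else {
            size_t stars;
            in += strcspn(in, "*");         // skip a whole run of non-stars
            stars = strspn(in, "*");        // then a whole run of stars
            in += stars;
            if (stars > 0 && *in == '/') {  // stars followed by '/' close it
                st = ST_INITIAL;            // BEGIN(INITIAL)
                closed++;
                in++;
            }
        }
    }
    *out = '\0';
    return closed;
}
```

For example, calling it on "x/* l1\n * l2\n */y" leaves "xy" in the output buffer and returns 1. An unterminated comment simply swallows the rest of the input and is not counted. Skipping whole runs with strcspn and strspn is the hand-written counterpart of making each rule match as much as it can.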






answered Dec 18 '14 at 18:38 by Edward






























             
