Remove all html special chars












Before saving the input from users in the database, I am using the below function to replace all HTML special chars. All my users will always be using the English Language.



Option 1



function clean($var) {
    $regEx = "/[^a-zA-Z0-9 -_]/";
    $var = preg_replace($regEx, "", $var);
    return $var;
}


I want users to be able to store only:




  • Letters (a to z) (Case Insensitive)

  • Numbers (0 to 9)

  • Space, Dash, Underscore


Is the above function good for this job, or should I be using a more efficient/inbuilt function in PHP?



This is how I am using the function.



$userInput = htmlspecialchars(clean($userInput));


Option 2



function h($str_to_encode = "") {
    // preg_replace will replace all HTML entities with an empty string
    return preg_replace("/&#?[a-z0-9]{2,8};/i", "", htmlspecialchars($str_to_encode));
}


Regex from: https://stackoverflow.com/a/657670/4050261







php html regex comparative-review






edited Feb 2 '18 at 14:56 by 200_success

asked Feb 2 '18 at 9:56 by Adarsh

  • "All my users will always be using the English language." doesn't imply that only the letters from a-z are valid. Take a look at this: English words with diacritics.

    – insertusernamehere
    Feb 2 '18 at 10:32

  • You haven't told us much about why you're doing this, but I get the sense that this is almost certainly the wrong thing to do.

    – 200_success
    Feb 2 '18 at 15:03

  • @200_success, since it is a small project, I want to reduce functionality for better security.

    – Adarsh
    Feb 2 '18 at 19:42

2 Answers
Your regex [^a-zA-Z0-9 -_] matches everything that is not a to z, A to Z, 0 to 9, or in the range from space to _. That last range covers every character between hex 20 and hex 5F (for example !, ", #, $, % and many others, including <, > and &). In a character class, - must be escaped or placed at the beginning or at the end, like:

  • [^a-zA-Z0-9 \-_]

  • [^a-zA-Z0-9 _-]

  • [^-a-zA-Z0-9 _]
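
To see the effect of that accidental range, here is a small sketch. The sample input and the variable names are made up for illustration; only the two patterns come from the discussion above.

```php
<?php
// Pattern from the question: " -_" is read as a range from space (0x20) to "_" (0x5F).
$buggy = "/[^a-zA-Z0-9 -_]/";
// Same intent, with the hyphen moved to the end so it is a literal hyphen.
$fixed = "/[^a-zA-Z0-9 _-]/";

// Illustrative input only.
$input = '<script>alert(1)</script>';

// "<", ">", "(", ")" and "/" all fall inside 0x20-0x5F, so nothing is removed.
$buggyOut = preg_replace($buggy, "", $input);
// With the literal hyphen, only letters, digits, space, "_" and "-" survive.
$fixedOut = preg_replace($fixed, "", $input);

echo $buggyOut, PHP_EOL; // <script>alert(1)</script>
echo $fixedOut, PHP_EOL; // scriptalert1script
```

With the original pattern the markup passes through untouched, which is the opposite of what the question is after.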


That said, you can simplify a bit:

[a-zA-Z0-9_] can be written as \w (depending on locale), so your regex becomes [^\w -].

If you want to be Unicode compatible, use:

[^\pL\pN_ -] where \pL stands for any letter in any language and \pN for any numeric character (in PHP, add the u modifier so that the subject is treated as UTF-8).
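
A minimal sketch of the Unicode-aware variant, assuming UTF-8 input. The function name clean_unicode() is hypothetical and the sample string is only for illustration.

```php
<?php
// Hypothetical helper: keeps letters and numbers from any script,
// plus space, underscore and hyphen.
function clean_unicode(string $var): string
{
    // The "u" modifier makes PCRE treat both pattern and subject as UTF-8.
    $result = preg_replace('/[^\pL\pN_ -]+/u', '', $var);
    // preg_replace() returns null on error (for example, malformed UTF-8),
    // so fall back to an empty string in that case.
    return $result ?? '';
}

// "\u{EF}" is "ï" and "\u{E9}" is "é" (precomposed code points).
$unicodeOut = clean_unicode("na\u{EF}ve caf\u{E9} <b>");
echo $unicodeOut, PHP_EOL;
```

Accented letters written as a base letter plus a combining mark would lose the mark with this pattern, since combining marks belong to \pM rather than \pL.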







answered Feb 2 '18 at 13:05 by Toto

  • You can also put the hyphen after a range or a shorthand character class: [^a-z-A-Z0-9 _], [^a-zA-Z-0-9 _], [^\pL-\pN_ ]. It's ugly, but possible...

    – Casimir et Hippolyte
    Feb 3 '18 at 20:52














Similar to Toto's answer, I recommend:

function clean($var) {
    return preg_replace("~[^\w -]+~", "", $var);
}


This will replace all occurrences of one or more consecutive forbidden characters.

Adding the "one or more" (+) quantifier means longer potential matches and fewer total replacements. In other words, imagine a carton of a dozen eggs on the ground. If the task was to pick up 12 eggs, you could squat 12 times picking them up one at a time, or just squat once and pick up the carton.
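
The difference can be checked with the optional count argument of preg_replace(). This is only a sketch; the sample input and variable names are invented for the demonstration.

```php
<?php
// Illustrative input: two runs of three forbidden characters each.
$input = "a!!!b???c";

// Without "+": every forbidden character is a separate match.
preg_replace('~[^\w -]~', '', $input, -1, $single);

// With "+": each run of forbidden characters is a single match.
$out = preg_replace('~[^\w -]+~', '', $input, -1, $grouped);

echo "$out: $single vs $grouped replacements", PHP_EOL; // abc: 6 vs 2 replacements
```

The resulting string is the same either way; only the number of replacements performed changes.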



I have eliminated the single-use variables, as there is no readability benefit in retaining them.

Following this custom function call, the call to htmlspecialchars() is useless because there won't be any characters left to convert.

On the other hand, if you wanted to call htmlspecialchars_decode() prior to clean(), there is reasonable logic to that decision, but it depends on the input that you are expecting.
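
Both points can be illustrated with a short sketch. The sample input is an assumption: it pretends the incoming string already contains an encoded entity.

```php
<?php
function clean($var) {
    return preg_replace("~[^\w -]+~", "", $var);
}

// Illustrative input that already contains an HTML entity.
$raw = 'Tom&amp;Jerry<b>';

// After clean(), nothing is left for htmlspecialchars() to convert.
$cleaned = clean($raw);
$unchanged = (htmlspecialchars($cleaned) === $cleaned);

// Decoding first turns "&amp;" back into "&", so no stray "amp" survives.
$decodedFirst = clean(htmlspecialchars_decode($raw));

echo $cleaned, PHP_EOL;      // TomampJerryb
echo $decodedFirst, PHP_EOL; // TomJerryb
```

Whether the decode step is worth having depends on whether encoded entities can actually reach clean() in your application.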







edited 11 mins ago

answered 16 mins ago by mickmackusa





























