Remove all html special chars

Before saving user input to the database, I use the function below to remove all HTML special characters. All my users will always be using the English language.

Option 1

function clean($var) {
  $regEx = "/[^a-zA-Z0-9 -_]/";
  $var = preg_replace($regEx, "", $var);
  return $var;
}

I want users to be able to store only:

Letters (a to z, case-insensitive)

Numbers (0 to 9)

Space, dash, underscore

Is the above function good for this job, or should I be using a more efficient/built-in PHP function?

This is how I am using the function.

$userInput = htmlspecialchars(clean($userInput));

Option 2

function h($str_to_encode = "") {
    // preg_replace() strips the HTML entities produced by htmlspecialchars()
    return preg_replace("/&#?[a-z0-9]{2,8};/i", "", htmlspecialchars($str_to_encode));
}

Regex from : https://stackoverflow.com/a/657670/4050261

edited Feb 2 '18 at 14:56 by 200_success

asked Feb 2 '18 at 9:56 by Adarsh

3

"All my users will always be using the English language." doesn't imply that only the letters from a-z are valid. Take a look at this: English words with diacritics.

– insertusernamehere
Feb 2 '18 at 10:32

You haven't told us much about why you're doing this, but I get the sense that this is almost certainly the wrong thing to do.

– 200_success
Feb 2 '18 at 15:03

@200_success, since it is a small project, I want to reduce functionality for better security.

– Adarsh
Feb 2 '18 at 19:42

Tags: php, html, regex, comparative-review

2 Answers

Your regex [^a-zA-Z0-9 -_] matches everything that is not a to z, A to Z, 0 to 9, or in the range from space to _. That last range covers every character between hex 20 and hex 5F (for example !, ", #, $, %, and many others), so none of those are removed. Inside a character class, - must be escaped or placed at the beginning or at the end, like:

[^a-zA-Z0-9 \-_]

[^a-zA-Z0-9 _-]

[^-a-zA-Z0-9 _]
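
To make the range problem concrete, here is a minimal sketch comparing the original class with one of the corrected forms. The function names cleanBuggy and cleanFixed and the sample input are made up for illustration; they are not from the question.

```php
<?php
// Hypothetical helpers for illustration only.

// Original pattern: " -_" is read as a range from space (0x20) to underscore (0x5F),
// so characters such as <, >, ! and / are inside the allowed set.
function cleanBuggy(string $var): string {
    return preg_replace('/[^a-zA-Z0-9 -_]/', '', $var);
}

// Hyphen moved to the end of the class, so it is a literal hyphen.
function cleanFixed(string $var): string {
    return preg_replace('/[^a-zA-Z0-9 _-]/', '', $var);
}

$input = '<b>Hi!</b> a-b_c {x}';

echo cleanBuggy($input), PHP_EOL; // <b>Hi!</b> a-b_c x
echo cleanFixed($input), PHP_EOL; // bHib a-b_c x
```

With the original pattern only the braces are stripped, because they lie above 0x5F; the tag characters pass through untouched.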

That said, you can simplify a bit:

[a-zA-Z0-9_] can be written as \w (depending on locale), so your regex becomes [^\w -].

If you want to be Unicode compatible, use:

[^\pL\pN_ -] where \pL stands for any letter in any language and \pN for any number.
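
A short sketch of the Unicode variant in PHP follows. It assumes the input and the source file are UTF-8, and it adds the u pattern modifier so that multibyte sequences are treated as characters rather than bytes. The function name cleanUnicode is hypothetical.

```php
<?php
// Hypothetical helper for illustration only; assumes UTF-8 input.
// With the /u modifier, preg_replace() returns null when the subject
// is not valid UTF-8, hence the nullable return type.
function cleanUnicode(string $var): ?string {
    return preg_replace('/[^\pL\pN_ -]/u', '', $var);
}

echo cleanUnicode('café naïve <résumé> 42!'), PHP_EOL; // café naïve résumé 42
```

Accented letters survive, while the angle brackets and the exclamation mark are removed.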

answered Feb 2 '18 at 13:05 by Toto

1

You can also put the hyphen after a range or a shorthand character class: [^a-z-A-Z0-9 _], [^a-zA-Z-0-9 _], [^\pL-\pN_ ]. It's ugly, but possible...

– Casimir et Hippolyte
Feb 3 '18 at 20:52

Similar to Toto's answer, I recommend:

function clean($var) {
    return preg_replace("~[^\w -]+~", "", $var);
}

This will replace all occurrences of one or more consecutive forbidden characters.

Adding the "one or more" (+) quantifier means longer potential matches and fewer total replacements. In other words, imagine a carton of a dozen eggs on the ground. If the task were to pick up 12 eggs, you could squat 12 times, picking them up one at a time, or squat once and pick up the carton.
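
The effect of the quantifier can be observed with the optional count argument of preg_replace(). This is only a sketch with a made-up input string; the variable names are illustrative.

```php
<?php
$input = 'a!!!b???c';

// Without "+": every forbidden character is a separate match.
$resultSingle = preg_replace('~[^\w -]~', '', $input, -1, $single);

// With "+": each run of forbidden characters is one match.
$resultGrouped = preg_replace('~[^\w -]+~', '', $input, -1, $grouped);

echo $resultSingle, ' ', $single, PHP_EOL;   // abc 6
echo $resultGrouped, ' ', $grouped, PHP_EOL; // abc 2
```

Both calls return the same string; only the number of replacements performed differs.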

I have eliminated the unnecessary inclusion of "single-use variables" as there is no benefit in retaining them for readability.

Following this custom function call, the call of htmlspecialchars() is useless because there won't be any chars to convert.

On the other hand, if you wanted to call htmlspecialchars_decode() prior to clean() there is reasonable logic to that decision, but it depends on the input that you are expecting.
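
The last two points can be sketched as follows. The sample strings are made up, and whether decoding first makes sense depends on whether your input can arrive already entity-encoded.

```php
<?php
function clean(string $var): string {
    return preg_replace('~[^\w -]+~', '', $var);
}

// After clean(), nothing is left for htmlspecialchars() to convert.
$cleaned = clean('Tom & Jerry <3');
var_dump(htmlspecialchars($cleaned) === $cleaned); // bool(true)

// Already-encoded input: only the "&" and ";" of the entity are stripped,
// which leaves the entity name behind.
$encoded = 'Tom &amp; Jerry';
echo clean($encoded), PHP_EOL;                          // Tom amp Jerry

// Decoding first turns the entity back into "&", which clean() then removes.
echo clean(htmlspecialchars_decode($encoded)), PHP_EOL; // Tom  Jerry
```

Note that removing the ampersand leaves two consecutive spaces in the last result.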

answered 16 mins ago (edited 11 mins ago) by mickmackusa
