Difference between [0-9], [[:digit:]] and \d
In the Wikipedia article on Regular expressions, it seems that [[:digit:]] = [0-9] = \d.
What are the circumstances where they are not equal? What is the difference?
After some research, I think one difference is that the bracket expression [:expr:] is locale dependent.
regular-expression wildcards
Doesn't the Wikipedia article that you linked to answer your question? Different regular expression processors/engines support different syntaxes for character classes (among other things).
– igal
Jan 2 at 3:34
@igal The wiki says there is a difference but doesn't give much detail. I'm asking for the details, something like what isaac and thrig said. I'm particularly interested in the differences in grep, sed, awk..., whether the GNU version or not.
– harbinn
Jan 2 at 7:01
edited Jan 2 at 14:28 by muru
asked Jan 2 at 3:01 by harbinn
4 Answers
Yes, [[:digit:]] ~ [0-9] ~ \d (where ~ means "approximately equal").
In most programming languages (where it is supported), \d ≡ [[:digit:]] (identical).
\d is less common than [[:digit:]] (it is not in POSIX, but it is available in GNU grep -P).
There are many digits in Unicode, for example:
0123456789 # Hindu-Arabic (Arabic numerals)
٠١٢٣٤٥٦٧٨٩ # ARABIC-INDIC
۰۱۲۳۴۵۶۷۸۹ # EXTENDED ARABIC-INDIC/PERSIAN
߀߁߂߃߄߅߆߇߈߉ # NKO DIGIT
०१२३४५६७८९ # DEVANAGARI
All of these may be included in [[:digit:]] or \d.
By contrast, [0-9] is generally only the ASCII digits 0123456789.
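As a rough illustration of the two notions of "digit" contrasted above, the following Python sketch checks each of the listed digit runs against Unicode general category Nd and against the plain ASCII range. The helper names (`digit_run`, `is_decimal_digit`, `is_ascii_digit`) are made up for this example, and the runs are built from their starting code points rather than typed literally.

```python
import unicodedata

def digit_run(start):
    """Return the ten consecutive characters beginning at code point `start`."""
    return "".join(chr(start + i) for i in range(10))

# Starting code points of the digit runs listed above.
samples = {
    "ASCII": digit_run(0x0030),
    "ARABIC-INDIC": digit_run(0x0660),
    "EXTENDED ARABIC-INDIC": digit_run(0x06F0),
    "NKO": digit_run(0x07C0),
    "DEVANAGARI": digit_run(0x0966),
}

def is_decimal_digit(ch):
    """True if ch is in Unicode general category Nd (decimal number)."""
    return unicodedata.category(ch) == "Nd"

def is_ascii_digit(ch):
    """True only for the ten ASCII characters 0 through 9."""
    return "0" <= ch <= "9"

for name, run in samples.items():
    all_nd = all(is_decimal_digit(c) for c in run)
    all_ascii = all(is_ascii_digit(c) for c in run)
    print(f"{name}: category Nd={all_nd}, ASCII 0-9={all_ascii}")
```

Every run is reported as category Nd, but only the first one falls in the ASCII range, which is the gap between a Unicode-aware digit class and a literal 0 through 9.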
In many languages (Perl, Java, Python, C), [[:digit:]] (and \d) can take on this extended meaning. For example, this Perl code matches all the digits from above:
$ a='0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९'
$ echo "$a" | perl -C -pe 's/[^\d]//g;' ; echo
0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९
This is equivalent to selecting all characters that have the Unicode general category Nd (Number, decimal digit):
$ echo "$a" | perl -C -pe 's/[^\p{Nd}]//g;' ; echo
0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९
grep can reproduce this (the specific version of PCRE may have a different internal list of numeric code points than Perl):
$ echo "$a" | grep -oP '\p{Nd}+'
0123456789
٠١٢٣٤٥٦٧٨٩
۰۱۲۳۴۵۶۷۸۹
߀߁߂߃߄߅߆߇߈߉
०१२३४५६७८९
Change it to [0-9] to see the difference:
$ echo "$a" | grep -o '[0-9]\+'
0123456789
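The same contrast can be sketched in Python, whose built-in `re` module treats \d as Unicode-aware for text (str) patterns. This is only an analogue of the Perl and grep commands above, not a statement about how those tools are implemented; the `digit_run` helper and the variable names are invented for the example.

```python
import re

def digit_run(start):
    """Return the ten consecutive characters beginning at code point `start`."""
    return "".join(chr(start + i) for i in range(10))

# Same five runs as the shell variable $a above, separated by spaces.
sample = " ".join(digit_run(s) for s in (0x0030, 0x0660, 0x06F0, 0x07C0, 0x0966))

# For str patterns, \d matches any Unicode decimal digit (category Nd).
unicode_runs = re.findall(r"\d+", sample)

# An explicit range only covers the ten ASCII digits.
ascii_runs = re.findall(r"[0-9]+", sample)

print(len(unicode_runs))  # 5 runs, one per script
print(ascii_runs)         # ['0123456789']
```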
POSIX
For the specific POSIX BRE or ERE:
\d is not supported (it is not in POSIX, but is in GNU grep -P).
[[:digit:]] is required by POSIX to correspond to the digit character class, which in turn is required by ISO C to be the characters 0 through 9 and nothing else. So only in the C locale do [0-9], [0123456789], \d and [[:digit:]] all mean exactly the same. [0123456789] has no possible misinterpretations; [[:digit:]] is available in more utilities and commonly means only [0123456789]. \d is supported by few utilities.
As for [0-9], the meaning of range expressions is only defined by POSIX in the C locale; in other locales it might be different (might be codepoint order or collation order or something else).
shells
Some implementations may understand a range to be something different than plain ASCII order (ksh93 for example):
$ LC_ALL=en_US.utf8 ksh -c 'a="'"$a"'";echo "${a//[0-9]}"'
۹ ߀߁߂߃߄߅߆߇߈߉ ९
And that is a sure source of bugs waiting to happen.
In practice on POSIX systems (iswctype() and BRE/ERE/wildcards in POSIX utilities), [0-9] and [[:digit:]] match on 0123456789 only. And that will be made explicit in the next revision of the standard.
– Stéphane Chazelas
May 15 at 19:39
I wasn't aware that perl's \d in Unicode mode matched on decimal digits from other scripts. Thanks for that. With PCRE, see (*UCP), as in GNU grep -Po '(*UCP)\d' or grep -Po '(*UCP)[[:digit:]]', for classes to be based on Unicode properties.
– Stéphane Chazelas
May 15 at 19:46
I agree that the [:digit:] syntax would suggest that you want to use localization, that is, whatever the user considers as being a digit. I never use [:digit:] because in practice that's the same as [0-9], and in any case, invariably I want to match on 0123456789. I never mean to match on ٠١٢٣٤٥٦٧٨٩, and I can't think of a use case where one would want to match on a decimal digit in any script with POSIX utilities. See also the current discussion about [:blank:] on the zsh ML. Those character classes are a bit of a mess.
– Stéphane Chazelas
May 15 at 20:38
This depends on how you define a digit; [0-9] tends to be just the ASCII ones (or possibly something else that is neither ASCII nor a superset of ASCII but has the same 10 digits with different bit representations, as in EBCDIC); \d, on the other hand, could either be just the plain digits (old versions of Perl, or modern versions of Perl with the /a regular expression flag enabled) or it could be a Unicode match of \p{Digit}, which is a rather larger set of digits than [0-9] or a /\d/a match.
$ perl -E 'say "match" if 42 =~ m/\d/'
match
$ perl -E 'say "match" if "\N{U+09EA}" =~ m/\d/'
match
$ perl -E 'say "match" if "\N{U+09EA}" =~ m/\d/a'
$ perl -E 'say "match" if "\N{U+09EA}" =~ m/[0-9]/'
$
See perldoc perlrecharclass for more information, or consult the documentation for the language in question to see how it behaves.
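For comparison, Python's `re` module offers a similar switch: for text patterns \d is Unicode-aware by default, and the `re.ASCII` flag restricts it to 0 through 9. The sketch below mirrors the four Perl one-liners above as an analogy only; Perl's /a and Python's `re.ASCII` are separate features of separate engines.

```python
import re

bengali_four = "\u09ea"  # U+09EA BENGALI DIGIT FOUR, the character used above

# ASCII digits match \d in every mode.
print(bool(re.search(r"\d", "42")))                    # True

# Without flags, \d also matches decimal digits from other scripts.
print(bool(re.search(r"\d", bengali_four)))            # True

# re.ASCII narrows \d to the ASCII digits only.
print(bool(re.search(r"\d", bengali_four, re.ASCII)))  # False

# An explicit range never matched the Bengali digit in the first place.
print(bool(re.search(r"[0-9]", bengali_four)))         # False
```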
But wait, there's more! The locale may also vary what \d matches, so \d could match fewer digits than the complete Unicode set of such, and (hopefully, usually) also includes [0-9]. This is similar to the difference in C between isdigit(3) ([0-9]) and isnumber(3) ([0-9] plus whatever else from the locale).
There may be calls that can be made to obtain the value of the digit, even if it is not [0-9]:
$ perl -MUnicode::UCD=num -E 'say num(4)'
4
$ perl -MUnicode::UCD=num -E 'say num("\N{U+09EA}")'
4
$
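Python has comparable calls in its standard library. As a small sketch of the same idea as the `num` examples above, `unicodedata.decimal()` returns the value assigned to a decimal digit character, and `int()` accepts Unicode decimal digits as well as ASCII ones.

```python
import unicodedata

bengali_four = "\u09ea"  # same code point as in the Perl example above

# The character's official Unicode name.
print(unicodedata.name(bengali_four))     # BENGALI DIGIT FOUR

# The decimal value the Unicode database assigns to the character.
print(unicodedata.decimal(bengali_four))  # 4

# int() parses Unicode decimal digits, not just ASCII ones.
print(int(bengali_four))                  # 4
print(int("4"))                           # 4
```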
I think isnumber() is a BSD thing, at least based on the man page it seems so
– ilkkachu
Jan 2 at 18:06
I do have something of a BSD bias, yes
– thrig
Jan 2 at 19:18
The /a flag is a specific limiter to reduce the list of Unicode digits matched: "…the /a modifier can be used to force \d to match just the ASCII 0 through 9". As such, it forces \d to match exactly and only [0-9].
– Isaac
Jun 4 at 22:16
The different meanings of [0-9], [[:digit:]] and \d are presented in other answers. Here I would like to add the differences between regex engine implementations.
           [[:digit:]]   \d
grep -E    ✓             ×
grep -P    ✓             ✓
sed        ✓             ×
sed -E     ✓             ×
So [[:digit:]] always works, while \d depends on the tool. grep's manual mentions that [[:digit:]] is just 0-9 in the C locale.
PS1: If you know more, please expand the table.
PS2: GNU grep 3.1 and GNU sed 4.4 were used for the test.
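One caveat to "always works": engines outside the POSIX family may not recognise the class at all. As a hedged illustration, Python's built-in `re` module does not implement POSIX character classes; it reads `[[:digit:]]` as an ordinary set containing the characters `[`, `:`, `d`, `i`, `g`, `t`, followed by a literal `]`. Current Python versions emit a FutureWarning about a possible nested set when compiling it, which the sketch silences; that parsing could change in a future release.

```python
import re
import warnings

# Compile the POSIX-style pattern; Python warns that "[[" looks like a nested set.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    posix_style = re.compile(r"[[:digit:]]")

# A real digit is NOT matched, because no digit is in the set "[:digt".
print(posix_style.search("7"))             # None

# The pattern actually matches one of "[:digt" followed by a literal "]".
print(bool(posix_style.fullmatch("d]")))   # True

# In Python's re, spell the class out instead.
print(bool(re.fullmatch(r"[0-9]", "7")))   # True
```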
1) There are many versions of grep and sed, with the biggest difference probably between the GNU versions vs. others. This answer might be more useful if it mentioned which version of grep and sed it refers to. Or what the source of that table is, for that matter. 2) That table might as well be transcribed to text, since it doesn't contain anything that requires it to be an image.
– ilkkachu
Jan 2 at 14:01
@ilkkachu 1) The latest GNU grep 3.1 and GNU sed 4.4 were used for the test. 2) I don't know how to create a table. It seems that @muru has converted the table to a pretty text form.
– harbinn
Jan 2 at 14:39
@harbinn Please edit that into your answer.
– Dan D.
Jan 3 at 4:56
@DanD. The version info has been added. Thanks for the attention.
– harbinn
Jan 4 at 0:43
Note that the Python built-in re module does not support [[:digit:]], but the add-in library regex does support it, so I would niggle a little at the "always works". It always works in POSIX-compliant situations.
– Steve Barnes
Jan 5 at 19:08
The theoretical differences have already been pretty well explained in the other answers, so it remains to explain the practical differences.
Here are some of the more common use cases for matching a digit:
One-shot data extraction
Often, when you want to crunch some numbers, the numbers themselves are in an awkwardly formatted text file. You want to extract them for use in your program. You can probably tell the number format (by looking at the file) and your current locale, so it's ok to use any of the forms, as long as it gets the job done. \d requires the fewest keystrokes, so it's very commonly used.
Input sanitizing
You have some untrusted user input (maybe from a web form), and you need to make certain it doesn't contain any surprises. Maybe you want to store it in a numeric field in a database, or use as a parameter to a shell command to run on a server. In this case, you really want [0-9]
, since it's the most restrictive and predictable one.
Data validation
You have a bit of data that you are not going to use for anything "dangerous", but it would be nice to know if it's a number. For example, your program allows the user to input an address, and you want to highlight a possible typo if the input doesn't contain a house number. In this case, you probably want to be as broad as possible, so [[:digit:]] is the way to go.
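The sanitizing and validation cases above can be sketched in Python as follows. The function names are hypothetical, and Python's Unicode-aware \d stands in for a broad digit class here because its built-in `re` module has no [[:digit:]]. The strict check accepts only ASCII digits across the whole string; the lenient check only asks whether any decimal digit from any script is present.

```python
import re

ASCII_NUMBER = re.compile(r"[0-9]+")

def is_plain_ascii_number(text):
    """Strict check for input sanitizing: the whole string is ASCII digits."""
    return ASCII_NUMBER.fullmatch(text) is not None

def contains_any_digit(text):
    """Lenient check for validation hints: any Unicode decimal digit counts."""
    return re.search(r"\d", text) is not None

arabic_indic_42 = "\u0664\u0662"  # ARABIC-INDIC DIGIT FOUR, then TWO

print(is_plain_ascii_number("42"))                          # True
print(is_plain_ascii_number(arabic_indic_42))               # False
print(is_plain_ascii_number("42; rm -rf /"))                # False
print(contains_any_digit(arabic_indic_42 + " Main Street")) # True
print(contains_any_digit("Main Street"))                    # False
```

Using `fullmatch` rather than `search` in the strict check matters: a partial match would let trailing shell metacharacters through.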
Those would seem to be the three most common use cases for digit matching. If you think I missed an important one, please drop a comment.
Nice job. Are there related security problems, such as ReDoS or others?
– frams
Jan 4 at 0:56
The answer beginning "Yes, [[:digit:]] ~ [0-9] ~ \d" was answered Jan 2 at 3:44 by Isaac and edited Nov 23 at 21:57.
In practice on POSIX systems,iswctype()
and BRE/ERE/wildcards in POSIX utilities, [0-9] and [[:digit:]] match on 0123456789 only. And that will be made explicit in the next revision of the standard
– Stéphane Chazelas
May 15 at 19:39
I wasn't aware thatperl
'sd
in Unicode mode matched on decimal digits from other scripts. Thanks for that. With PCRE, see(*UCP)
as in GNUgrep -Po '(*UCP)d'
orgrep -Po '(*UCP)[[:digit:]]
for classes to be based on Unicode properties.
– Stéphane Chazelas
May 15 at 19:46
I agree that the[:digit:]
syntax would suggest that you want to use localization, that is whatever the user considers as being a digit. I never use[:digit:]
because in practice that's the same as[0-9]
and in any case, invariably I want to match on 0123456789, I never mean to match on٠١٢٣٤٥٦٧٨٩
, and I can't think of a use case where one would want to match on a decimal digit in any script with POSIX utilities. See also the current discussion about[:blank:]
on the zsh ML. Those character classes are a bit of a mess.
– Stéphane Chazelas
May 15 at 20:38
add a comment |
In practice on POSIX systems,iswctype()
and BRE/ERE/wildcards in POSIX utilities, [0-9] and [[:digit:]] match on 0123456789 only. And that will be made explicit in the next revision of the standard
– Stéphane Chazelas
May 15 at 19:39
I wasn't aware thatperl
'sd
in Unicode mode matched on decimal digits from other scripts. Thanks for that. With PCRE, see(*UCP)
as in GNUgrep -Po '(*UCP)d'
orgrep -Po '(*UCP)[[:digit:]]
for classes to be based on Unicode properties.
– Stéphane Chazelas
May 15 at 19:46
I agree that the[:digit:]
syntax would suggest that you want to use localization, that is whatever the user considers as being a digit. I never use[:digit:]
because in practice that's the same as[0-9]
and in any case, invariably I want to match on 0123456789, I never mean to match on٠١٢٣٤٥٦٧٨٩
, and I can't think of a use case where one would want to match on a decimal digit in any script with POSIX utilities. See also the current discussion about[:blank:]
on the zsh ML. Those character classes are a bit of a mess.
– Stéphane Chazelas
May 15 at 20:38
In practice on POSIX systems,
iswctype()
and BRE/ERE/wildcards in POSIX utilities, [0-9] and [[:digit:]] match on 0123456789 only. And that will be made explicit in the next revision of the standard– Stéphane Chazelas
May 15 at 19:39
In practice on POSIX systems,
iswctype()
and BRE/ERE/wildcards in POSIX utilities, [0-9] and [[:digit:]] match on 0123456789 only. And that will be made explicit in the next revision of the standard– Stéphane Chazelas
May 15 at 19:39
I wasn't aware that
perl
's d
in Unicode mode matched on decimal digits from other scripts. Thanks for that. With PCRE, see (*UCP)
as in GNU grep -Po '(*UCP)d'
or grep -Po '(*UCP)[[:digit:]]
for classes to be based on Unicode properties.– Stéphane Chazelas
May 15 at 19:46
I wasn't aware that
perl
's d
in Unicode mode matched on decimal digits from other scripts. Thanks for that. With PCRE, see (*UCP)
as in GNU grep -Po '(*UCP)d'
or grep -Po '(*UCP)[[:digit:]]
for classes to be based on Unicode properties.– Stéphane Chazelas
May 15 at 19:46
I agree that the
[:digit:]
syntax would suggest that you want to use localization, that is whatever the user considers as being a digit. I never use [:digit:]
because in practice that's the same as [0-9]
and in any case, invariably I want to match on 0123456789, I never mean to match on ٠١٢٣٤٥٦٧٨٩
, and I can't think of a use case where one would want to match on a decimal digit in any script with POSIX utilities. See also the current discussion about [:blank:]
on the zsh ML. Those character classes are a bit of a mess.– Stéphane Chazelas
May 15 at 20:38
I agree that the
[:digit:]
syntax would suggest that you want to use localization, that is whatever the user considers as being a digit. I never use [:digit:]
because in practice that's the same as [0-9]
and in any case, invariably I want to match on 0123456789, I never mean to match on ٠١٢٣٤٥٦٧٨٩
, and I can't think of a use case where one would want to match on a decimal digit in any script with POSIX utilities. See also the current discussion about [:blank:]
on the zsh ML. Those character classes are a bit of a mess.– Stéphane Chazelas
May 15 at 20:38
add a comment |
up vote
12
down vote
This depends on how you define a digit; [0-9]
tends to be just the ASCII ones (or possibly something else that is neither ASCII nor a superset of ASCII but the same 10 digits as in ASCII only with different bit representations (EBCDIC)); d
on the other hand could either be just the plain digits (old versions of Perl, or modern versions of Perl with the /a
regular expression flag enabled) or it could be a Unicode match of p{Digit}
which is rather a larger set of digits than [0-9]
or /d/a
match.
$ perl -E 'say "match" if 42 =~ m/d/'
match
$ perl -E 'say "match" if "N{U+09EA}" =~ m/d/'
match
$ perl -E 'say "match" if "N{U+09EA}" =~ m/d/a'
$ perl -E 'say "match" if "N{U+09EA}" =~ m/[0-9]/'
$
perldoc perlrecharclass
for more information, or consult the documentation for the language in question to see how it behaves.
But wait, there's more! The locale may also vary what d
matches, so d
could match fewer digits than the complete Unicode set of such, and (hopefully, usually) also includes [0-9]
. This is similar to the difference in C between isdigit(3)
([0-9]
) and isnumber(3)
([0-9
plus whatever else from the locale).
There may be calls that can be made to obtain the value of the digit, even if it is not [0-9]
:
$ perl -MUnicode::UCD=num -E 'say num(4)'
4
$ perl -MUnicode::UCD=num -E 'say num("N{U+09EA}")'
4
$
I thinkisnumber()
is a BSD thing, at least based on the man page it seems so
– ilkkachu
Jan 2 at 18:06
I do have something of a BSD bias, yes
– thrig
Jan 2 at 19:18
The /a flag is an specific limiter to reduce the list of Unicode digits to match only …the /a modifier can be used to force d to match just the ASCII 0 through 9.. As such, it is forcing to match exactly the same and only[0-9]
.
– Isaac
Jun 4 at 22:16
add a comment |
up vote
12
down vote
This depends on how you define a digit; [0-9]
tends to be just the ASCII ones (or possibly something else that is neither ASCII nor a superset of ASCII but the same 10 digits as in ASCII only with different bit representations (EBCDIC)); d
on the other hand could either be just the plain digits (old versions of Perl, or modern versions of Perl with the /a
regular expression flag enabled) or it could be a Unicode match of p{Digit}
which is rather a larger set of digits than [0-9]
or /d/a
match.
$ perl -E 'say "match" if 42 =~ m/d/'
match
$ perl -E 'say "match" if "N{U+09EA}" =~ m/d/'
match
$ perl -E 'say "match" if "N{U+09EA}" =~ m/d/a'
$ perl -E 'say "match" if "N{U+09EA}" =~ m/[0-9]/'
$
perldoc perlrecharclass
for more information, or consult the documentation for the language in question to see how it behaves.
But wait, there's more! The locale may also vary what d
matches, so d
could match fewer digits than the complete Unicode set of such, and (hopefully, usually) also includes [0-9]
. This is similar to the difference in C between isdigit(3)
([0-9]
) and isnumber(3)
([0-9
plus whatever else from the locale).
There may be calls that can be made to obtain the value of the digit, even if it is not [0-9]
:
$ perl -MUnicode::UCD=num -E 'say num(4)'
4
$ perl -MUnicode::UCD=num -E 'say num("N{U+09EA}")'
4
$
I thinkisnumber()
is a BSD thing, at least based on the man page it seems so
– ilkkachu
Jan 2 at 18:06
I do have something of a BSD bias, yes
– thrig
Jan 2 at 19:18
The /a flag is an specific limiter to reduce the list of Unicode digits to match only …the /a modifier can be used to force d to match just the ASCII 0 through 9.. As such, it is forcing to match exactly the same and only[0-9]
.
– Isaac
Jun 4 at 22:16
edited Jan 2 at 15:18
answered Jan 2 at 3:42
thrig
The different meanings of [0-9], [[:digit:]] and \d are presented in other answers. Here I would like to add the differences between regex engine implementations.
          [[:digit:]]   \d
grep -E   ✓             ×
grep -P   ✓             ✓
sed       ✓             ×
sed -E    ✓             ×
So [[:digit:]] always works, while \d depends on the tool. grep's manual mentions that [[:digit:]] is just 0-9 in the C locale.
PS1: If you know more, please expand the table.
PS2: GNU grep 3.1 and GNU sed 4.4 were used for testing.
1) There are many versions of grep and sed, with the biggest difference probably between the GNU versions vs. others. This answer might be more useful if it mentioned which version of grep and sed it refers to. Or what the source of that table is, for that matter. 2) That table might as well be transcribed to text, since it doesn't contain anything that requires it to be an image.
– ilkkachu
Jan 2 at 14:01
@ilkkachu 1) The latest GNU grep 3.1 and GNU sed 4.4 were used for the test. 2) I don't know how to create a table. It seems that @muru has converted the table to a pretty text form.
– harbinn
Jan 2 at 14:39
@harbinn Please edit that into your answer.
– Dan D.
Jan 3 at 4:56
@DanD. The version info has been added. Thanks for the attention.
– harbinn
Jan 4 at 0:43
Note that the Python built-in re module does not support [[:digit:]], but the add-on library regex does support it, so I would niggle a little at the "always works". It always works in POSIX-compliant situations.
– Steve Barnes
Jan 5 at 19:08
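The point about Python's re module can be checked with a short sketch. This assumes CPython 3 as it currently behaves, and a later release could change it: the pattern is not rejected, it is silently read as the character set [, :, d, i, g, t followed by a literal ], so it fails to match a digit. The helper name posix_class_search is made up for this example.

```python
import re
import warnings

def posix_class_search(text):
    """Search text with the POSIX-style pattern [[:digit:]] using the built-in re module.

    posix_class_search is a hypothetical helper. Python may emit a FutureWarning
    ("Possible nested set") for this pattern, so it is silenced here.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", FutureWarning)
        return re.search(r"[[:digit:]]", text)

# No digit is matched: re does not know the [:digit:] class.
print(posix_class_search("5"))

# What it does match: one of "[:digt" followed by a literal "]".
print(posix_class_search("d]").group())

# The explicit range behaves as expected.
print(re.search(r"[0-9]", "5").group())
```

So in an engine without POSIX classes the pattern can fail quietly rather than loudly, which is worth keeping in mind when porting a regex from grep or sed.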
edited Jan 4 at 0:40
answered Jan 2 at 13:45
harbinn
The theoretical differences have already been pretty well explained in the other answers, so it remains to explain the practical differences.
Here are some of the more common use cases for matching a digit:
One-shot data extraction
Often, when you want to crunch some numbers, the numbers themselves are in an awkwardly formatted text file. You want to extract them for use in your program. You can probably tell the number format (by looking at the file) and your current locale, so it's OK to use any of the forms, as long as it gets the job done. \d requires the fewest keystrokes, so it's very commonly used.
Input sanitizing
You have some untrusted user input (maybe from a web form), and you need to make certain it doesn't contain any surprises. Maybe you want to store it in a numeric field in a database, or use it as a parameter to a shell command to run on a server. In this case, you really want [0-9], since it's the most restrictive and predictable one.
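As a minimal sketch of that kind of check, assuming Python 3 (the name is_ascii_number is made up for this example): match the whole input against an explicit [0-9] range rather than relying on \d or str.isdigit(), both of which accept non-ASCII digits.

```python
import re

# Explicit ASCII range; fullmatch() requires the entire string to match.
ASCII_NUMBER = re.compile(r"[0-9]+")

def is_ascii_number(text: str) -> bool:
    """Return True only if text consists entirely of the ASCII digits 0-9."""
    return ASCII_NUMBER.fullmatch(text) is not None

bengali_42 = "\u09ea\u09e8"  # BENGALI DIGIT FOUR, BENGALI DIGIT TWO

print(is_ascii_number("42"))       # plain ASCII digits are accepted
print(is_ascii_number(bengali_42)) # Unicode digits are rejected
print(is_ascii_number("42; ls"))   # trailing junk is rejected
print(bengali_42.isdigit())        # str.isdigit() would have let this through
```

Using fullmatch() rather than search() matters here: a partial match would accept input that merely contains a digit somewhere.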
Data validation
You have a bit of data that you are not going to use for anything "dangerous", but it would be nice to know if it's a number. For example, your program allows the user to input an address, and you want to highlight a possible typo if the input doesn't contain a house number. In this case, you probably want to be as broad as possible, so [[:digit:]] is the way to go.
Those would seem to be the three most common use cases for digit matching. If you think I missed an important one, please drop a comment.
Nice job. Is there a related security problem, such as ReDoS or others?
– frams
Jan 4 at 0:56
answered Jan 3 at 7:18
Bass