Text cleaning script, producing lowercase words with minimal punctuation
I created the following script to clean text that I scraped. Ideally, the cleaned text would consist of lowercase words without numbers, keeping at most commas and a dot at the end of a sentence. It should have only single spaces between words, with all "\n" characters removed.
In particular, I'm interested in feedback on the following code:
def cleaning(text):
    import string
    exclude = set(string.punctuation)
    import re
    # remove new lines and digits with regular expressions
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\d', '', text)
    # remove patterns matching url format
    url_pattern = r'((http|ftp|https)://)?[\w\-_]+(\.[\w\-_]+)+([\w\-.,@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?'
    text = re.sub(url_pattern, ' ', text)
    # remove non-ascii characters
    text = ''.join(character for character in text if ord(character) < 128)
    # remove punctuation
    text = ''.join(character for character in text if character not in exclude)
    # standardize white space
    text = re.sub(r'\s+', ' ', text)
    # drop capitalization
    text = text.lower()
    # remove leading/trailing white space
    text = text.strip()
    return text
The text is then cleaned via
cleaner = lambda x: cleaning(x)
df['text_clean'] = df['text'].apply(cleaner)
# Replace and remove empty rows
df['text_clean'] = df['text_clean'].replace('', np.nan)
df = df.dropna(how='any')
So far, the script does the job, which is great. However, how could it be improved, or written more cleanly?
What seems unclear to me is the difference between

text = re.sub(r'\n', '', text)
text = re.sub('\n', '', text)

and whether

text = re.sub(r'\s+', ' ', text)
...
text = text.strip()

makes sense.
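A short demonstration of both points (my own sketch, not part of the original question): for \n the raw and non-raw strings behave identically, and strip() still has work to do after the \s+ substitution, because that substitution leaves a single leading/trailing space behind.

import re

text = "one\ntwo"
# '\n' is a literal newline character; r'\n' is backslash + 'n',
# which the regex engine interprets as a newline escape.
# For this particular sequence, both calls behave identically.
print(re.sub(r'\n', '', text))   # onetwo
print(re.sub('\n', '', text))    # onetwo

# The distinction matters for sequences like \b: in a non-raw
# string, '\b' is a single backspace character, not a regex
# word boundary.
print(len('\b'), len(r'\b'))     # 1 2

# strip() after the \s+ substitution still makes sense: the sub
# collapses whitespace runs to one space but keeps a single
# leading/trailing space, which strip() then removes.
print(repr(re.sub(r'\s+', ' ', '  hello  world  ')))          # ' hello world '
print(repr(re.sub(r'\s+', ' ', '  hello  world  ').strip()))  # 'hello world'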
Tags: python, strings, parsing, regex

asked Feb 2 at 16:08 by TensorJoe, edited Feb 2 at 18:44 by 200_success
Also, maybe text = ''.join(character for character in text if ord(character) < 128) can be replaced by .decode('utf-8')? – TensorJoe, Feb 2 at 21:01
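As a note on that suggestion: .decode('utf-8') does not remove non-ASCII characters; a closer drop-in for the join-based filter is an encode/decode round trip through ASCII with errors='ignore'. A minimal sketch (the example string is my own):

text = "café – naïve"

# Character-by-character filter, as in the original script
ascii_filter = ''.join(c for c in text if ord(c) < 128)

# Round trip: encoding to ASCII with errors='ignore' drops every
# character that can't be represented, then decodes back to str
ascii_round_trip = text.encode('ascii', errors='ignore').decode('ascii')

print(ascii_filter)                      # caf  nave
print(ascii_filter == ascii_round_trip)  # True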
cleaner is not necessary. Just do .apply(cleaning). – Dair, Feb 2 at 22:34
Good suggestion. I actually wondered, too, why another script did it. Performance? – TensorJoe, Feb 3 at 8:14
1 Answer
Actually, your approach consists of removing, or replacing with a space, everything that isn't a word (urls and characters that are not ASCII letters). Then you finish the job by removing duplicate spaces and spaces at the beginning or end of the string, and converting everything to lower-case.
The idea makes sense.
But concretely, what is the result of this script? It returns all the words in lower-case, separated by single spaces.
Described like that, you can easily see that it suffices to extract the words and join them with a space. A simple re.findall(r'[a-z]+', text) does that, but you have to remove the urls first if you don't want to catch letter sequences contained in them.
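To make that pitfall concrete (the input string is my own illustration, not from the answer):

import re

text = "Check https://example.com/page-1 for DETAILS"
# Without removing the URL first, findall harvests fragments of it
print(re.findall(r'[a-z]+', text.lower()))
# ['check', 'https', 'example', 'com', 'page', 'for', 'details']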
The url pattern
If you read your url pattern, you can see that the only part that isn't optional is in fact [\w-]+(?:\.[\w-]+)+ (written [\w\-_]+(\.[\w\-_]+)+ in your script: _ is already inside \w, you can put - at the end of a character class without escaping it, and the capture group is useless).
All that comes after this part of the pattern doesn't require a precise description and can be replaced with \S* (zero or more non-white-space characters). Even if it catches a closing parenthesis or a comma, that isn't important for what you want to do (we will see how to handle commas and dots later).
One of the weaknesses of the url pattern is that it starts with an alternation in an optional group. This means that at each failing position in the string, the regex engine has to test the three alternatives (http|ftp|https), and then the rest of the pattern without the group, for nothing.
It's possible to improve that a little by starting the pattern with a word boundary and replacing the last alternative (https) with an optional s in the first one.
The url pattern can be rewritten like this:
\b(?:(?:https?|ftp)://)?\w[\w-]*(?:\.[\w-]+)+\S*
and the whole function:

import re

def cleaning2(text):
    text = re.sub(r'\b(?:(?:https?|ftp)://)?\w[\w-]*(?:\.[\w-]+)+\S*', ' ', text.lower())
    words = re.findall(r'[a-z]+', text)
    return ' '.join(words)
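A quick usage check (sample input mine):

raw = "Visit https://example.com/page-1 for 3 GREAT tips!\nBye."
print(cleaning2(raw))   # visit for great tips bye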
Note that URL syntax can be particularly complex, and that it isn't always possible to reliably extract a URL from unformatted text.
If you want to keep commas and dots:
Few changes are needed: you only have to ensure that the \S* in the url pattern doesn't eat a comma or a dot at the end of the url, using a negative lookbehind (?<!...), and to add those two characters to the character class in the re.findall pattern:
import re

def cleaning2(text):
    text = re.sub(r'\b(?:(?:https?|ftp)://)?\w[\w-]*(?:\.[\w-]+)+\S*(?<![.,])', ' ', text.lower())
    words = re.findall(r'[a-z.,]+', text)
    return ' '.join(words)
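Usage sketch for this variant (sample input mine). Note that the comma that followed the url survives as its own token, since the lookbehind deliberately leaves it outside the match:

raw = "Read https://example.com/page-1, then STOP.\nThanks."
print(cleaning2(raw))   # read , then stop. thanks.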
answered Feb 4 at 17:39 by Casimir et Hippolyte