Text cleaning script, producing lowercase words with minimal punctuation
I created the following script to clean text that I scraped. Ideally, the cleaned text would consist of lowercase words without numbers, keeping at most commas and a dot at the end of a sentence. It should have only single spaces between words, with all "\n" characters removed.
In particular, I'm interested in feedback on the following code:
def cleaning(text):
    import string
    exclude = set(string.punctuation)
    import re
    # remove new lines and digits with regular expressions
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\d', '', text)
    # remove patterns matching url format
    url_pattern = r'((http|ftp|https)://)?[\w\-_]+(\.[\w\-_]+)+([\w\-.,@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?'
    text = re.sub(url_pattern, ' ', text)
    # remove non-ascii characters
    text = ''.join(character for character in text if ord(character) < 128)
    # remove punctuation
    text = ''.join(character for character in text if character not in exclude)
    # standardize white space
    text = re.sub(r'\s+', ' ', text)
    # drop capitalization
    text = text.lower()
    # remove leading/trailing white space
    text = text.strip()
    return text
The text is then cleaned via
cleaner = lambda x: cleaning(x)
df['text_clean'] = df['text'].apply(cleaner)
# Replace and remove empty rows
df['text_clean'] = df['text_clean'].replace('', np.nan)
df = df.dropna(how='any')
So far, the script does the job, which is great. However, how could it be improved, or written more cleanly?
What seems unclear to me is the difference between

text = re.sub(r'\n', '', text)
text = re.sub('\n', '', text)

and whether

text = re.sub(r'\s+', ' ', text)
...
text = text.strip()

makes sense.
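A short demonstration of both points (my own sketch, not part of the original question): for \n the raw and non-raw strings behave identically, and strip() still has work to do after the \s+ substitution, because that substitution leaves a single leading/trailing space behind.

import re

text = "one\ntwo"
# '\n' is a literal newline character; r'\n' is backslash + 'n',
# which the regex engine interprets as a newline escape.
# For this particular sequence, both calls behave identically.
print(re.sub(r'\n', '', text))   # onetwo
print(re.sub('\n', '', text))    # onetwo

# The distinction matters for sequences like \b: in a non-raw
# string, '\b' is a single backspace character, not a regex
# word boundary.
print(len('\b'), len(r'\b'))     # 1 2

# strip() after the \s+ substitution still makes sense: the sub
# collapses whitespace runs to one space but keeps a single
# leading/trailing space, which strip() then removes.
print(repr(re.sub(r'\s+', ' ', '  hello  world  ')))          # ' hello world '
print(repr(re.sub(r'\s+', ' ', '  hello  world  ').strip()))  # 'hello world'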
Tags: python, strings, parsing, regex

asked Feb 2 at 16:08 by TensorJoe, edited Feb 2 at 18:44 by 200_success
Also, maybe text = ''.join(character for character in text if ord(character) < 128) can be replaced by .decode('utf-8')? – TensorJoe, Feb 2 at 21:01
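As a note on that suggestion: .decode('utf-8') does not remove non-ASCII characters; a closer drop-in for the join-based filter is an encode/decode round trip through ASCII with errors='ignore'. A minimal sketch (the example string is my own):

text = "café – naïve"

# Character-by-character filter, as in the original script
ascii_filter = ''.join(c for c in text if ord(c) < 128)

# Round trip: encoding to ASCII with errors='ignore' drops every
# character that can't be represented, then decodes back to str
ascii_round_trip = text.encode('ascii', errors='ignore').decode('ascii')

print(ascii_filter)                      # caf  nave
print(ascii_filter == ascii_round_trip)  # True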
cleaner is not necessary. Just do .apply(cleaning). – Dair, Feb 2 at 22:34
Good suggestion. I actually wondered, too, why another script did it. Performance? – TensorJoe, Feb 3 at 8:14
1 Answer
Actually, your approach consists of removing, or replacing with a space, everything that isn't a word (urls and characters that are not ASCII letters). Then you finish the job by removing duplicate spaces and spaces at the beginning or end of the string, and converting everything to lower-case.
The idea makes sense.
But concretely, what is the result of this script? It returns all the words in lower-case, separated by single spaces.
Described like that, you can easily see that it suffices to extract the words and join them with a space. A simple re.findall(r'[a-z]+', text) does that, but you have to remove the urls first if you don't want to catch letter sequences contained in them.
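To make that pitfall concrete (the input string is my own illustration, not from the answer):

import re

text = "Check https://example.com/page-1 for DETAILS"
# Without removing the URL first, findall harvests fragments of it
print(re.findall(r'[a-z]+', text.lower()))
# ['check', 'https', 'example', 'com', 'page', 'for', 'details']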
The url pattern
If you read your url pattern, you can see that the only part that isn't optional is in fact [\w-]+(?:\.[\w-]+)+ (written [\w\-_]+(\.[\w\-_]+)+ in your script: _ is already inside \w, you can put - at the end of a character class without escaping it, and the capture group is useless).
All that comes after this part of the pattern doesn't require a precise description and can be replaced with \S* (zero or more non-white-space characters). Even if it catches a closing parenthesis or a comma, that isn't important for what you want to do (we will see how to handle commas and dots later).
One of the weaknesses of the url pattern is that it starts with an alternation in an optional group. This means that at each failing position in the string, the regex engine has to test the three alternatives (http|ftp|https), and then the rest of the pattern without the group, for nothing.
It's possible to improve that a little by starting the pattern with a word boundary and replacing the last alternative (https) with an optional s in the first one.
The url pattern can be rewritten like this:
\b(?:(?:https?|ftp)://)?\w[\w-]*(?:\.[\w-]+)+\S*
and the whole function:

import re

def cleaning2(text):
    text = re.sub(r'\b(?:(?:https?|ftp)://)?\w[\w-]*(?:\.[\w-]+)+\S*', ' ', text.lower())
    words = re.findall(r'[a-z]+', text)
    return ' '.join(words)
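A quick usage check (sample input mine):

raw = "Visit https://example.com/page-1 for 3 GREAT tips!\nBye."
print(cleaning2(raw))   # visit for great tips bye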
Note that URL syntax can be particularly complex, and that it isn't always possible to reliably extract a URL from unformatted text.
If you want to keep commas and dots:
Few changes are needed: you only have to ensure that the \S* in the url pattern doesn't eat a comma or a dot at the end of the url, using a negative lookbehind (?<!...), and to add those two characters to the character class in the re.findall pattern:
import re

def cleaning2(text):
    text = re.sub(r'\b(?:(?:https?|ftp)://)?\w[\w-]*(?:\.[\w-]+)+\S*(?<![.,])', ' ', text.lower())
    words = re.findall(r'[a-z.,]+', text)
    return ' '.join(words)
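Usage sketch for this variant (sample input mine). Note that the comma that followed the url survives as its own token, since the lookbehind deliberately leaves it outside the match:

raw = "Read https://example.com/page-1, then STOP.\nThanks."
print(cleaning2(raw))   # read , then stop. thanks.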
answered Feb 4 at 17:39 by Casimir et Hippolyte