Showing posts from November 22, 2018

Text cleaning script, producing lowercase words with minimal punctuation

I created the following script to clean text that I scraped. Ideally, the clean text would be lowercase words without numbers, with at most commas and a dot at the end of a sentence. It should have only white-space between words, and all "\n" elements should be removed from the text.

In particular, I'm interested in feedback on the following code:

    def cleaning(text):

        import string
        exclude = set(string.punctuation)

        import re
        # remove new line and digits with regular expression
        text = re.sub(r'\n', '', text)
        text = re.sub(r'\d', '', text)
        # remove patterns matching url format
        url_pattern = r'((http|ftp|https)://)?[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?'
        text = re.sub(url_pattern, ' '
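For comparison, here is a minimal sketch of the cleaning goal described above (lowercase words, no digits, only commas and dots kept, single spaces between words). It is not the script from the question: the name `clean_text` and the simplified `URL_RE` pattern are assumptions made for illustration, and `URL_RE` only matches URLs that start with an explicit scheme.

```python
import re

# Hypothetical, simplified URL pattern (an assumption, not the pattern from
# the question): it only matches URLs with an explicit http/https/ftp scheme.
URL_RE = re.compile(r'(?:https?|ftp)://\S+')


def clean_text(text):
    """Return lowercase text without URLs or digits, keeping only commas and dots."""
    text = URL_RE.sub(' ', text)              # drop URLs first, while '/' and ':' still exist
    text = text.lower()
    text = re.sub(r'\d+', '', text)           # remove digits
    text = re.sub(r'[^a-z,.\s]', '', text)    # drop everything except letters, commas, dots, whitespace
    text = re.sub(r'\s+', ' ', text)          # collapse newlines, tabs and repeated spaces
    text = re.sub(r'\s+([,.])', r'\1', text)  # no space before a comma or dot
    return text.strip()


print(clean_text("Hello World!\nVisit https://example.com/page now, 3 times."))
```

Replacing newlines with a space and collapsing whitespace afterwards, rather than deleting "\n" outright, keeps the last word of one line from fusing with the first word of the next. Note that `\S+` also swallows punctuation that directly follows a URL, so a sentence-ending dot attached to a link is lost in this sketch.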