I have a text (string) and I want to perform this task in Python:

I use scikit-learn's CountVectorizer to build a bag of words. Its documentation is here: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

It includes stop-word removal, which works fine. It strips punctuation and splits the text into words. But besides real words, it also returns a lot of junk such as single letters and numbers.

CountVectorizer does, however, have a parameter called "token_pattern" that takes a regex string, which could give me better results.

What I want to do is: a) exclude any words that start with, end with, or contain numbers; b) exclude any numbers from the text; c) exclude any words of 2 letters or fewer; d) exclude all the http links.
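For reference, as far as I understand the scikit-learn docs, the default token_pattern is (?u)\b\w\w+\b, and the vectorizer applies it to the lowercased text much like re.findall. The sketch below uses plain re rather than CountVectorizer itself; the sample string and the letters-only pattern are my own illustration, not something taken from the docs.

```python
import re

# Documented default token_pattern: tokens of 2 or more word characters
# (letters, digits, underscore).
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical alternative: 3 or more ASCII letters between word boundaries.
LETTERS_PATTERN = r"\b[a-zA-Z]{3,}\b"

# Made-up sample text for illustration only.
sample = "on average, 20 billion empty miles by trucks a1b2"

default_tokens = re.findall(DEFAULT_PATTERN, sample)
letter_tokens = re.findall(LETTERS_PATTERN, sample)

print(default_tokens)  # keeps "on", "20", "by" and "a1b2"
print(letter_tokens)   # keeps only the letter-only words of 3+ letters
```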

For example, this regex should give me this:

text = "It can be dangerous to take Fido for a ride: http://t.co/eR2WfAnZBI http://t.co/RF3bhPNPwR',each year, on average, 20 billion empty miles are incurred by trucks, which costs the economy billions"

final_text = "can dangerous take Fido for ride each year average billion empty miles are incurred trucks which costs the economy billions"

Thanks in advance for your time and attention :)

    Could you show what you have tried so far? Commented Aug 5, 2015 at 13:10

2 Answers

Here is a regex that matches any run of 3 or more letters. Note that without word boundaries it also matches letter runs inside tokens that contain digits, such as "WfAnZBI" in "eR2WfAnZBI".

[a-zA-Z]{3,}

Here is a piece of regex that grabs any line without a URL in it.

^((?!(https?:\/\/)+([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w=?$#% \.-]*)).)*$

I haven't figured out how to combine the two yet, but at the very least this is a step in the right direction. You could put each word on its own line, then remove the URLs, then match words of 3 or more letters. Ugly, but it would work.
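One possible way to combine the two steps in Python, as a minimal sketch: strip the URLs first, then collect the letter-only words. The names URL_RE, WORD_RE and clean_tokens are my own, and the URL pattern is a deliberately simple assumption (it stops at whitespace, quotes and commas, so it will not cover every valid URL). The \b anchors are what keep tokens containing digits out, since there is no word boundary between a letter and a digit.

```python
import re

# Assumed, simplified URL pattern: scheme, then everything up to whitespace,
# an apostrophe or a comma.
URL_RE = re.compile(r"https?://[^\s',]+")

# 3 or more ASCII letters, bounded on both sides by non-word characters.
WORD_RE = re.compile(r"\b[a-zA-Z]{3,}\b")


def clean_tokens(text):
    """Return words of 3+ letters, skipping URLs and tokens containing digits."""
    without_urls = URL_RE.sub(" ", text)  # replace with a space so words never merge
    return WORD_RE.findall(without_urls)


text = ("It can be dangerous to take Fido for a ride: "
        "http://t.co/eR2WfAnZBI http://t.co/RF3bhPNPwR',each year, on average, "
        "20 billion empty miles are incurred by trucks, which costs the economy billions")

final_text = " ".join(clean_tokens(text))
print(final_text)
```

On the question's text this prints the final_text the asker wanted. CountVectorizer also accepts a callable through its tokenizer parameter, so clean_tokens could probably be passed there instead of a token_pattern, though I have not tested that here.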

3 Comments

For your first regex, wouldn't that be easier with this regex: [a-zA-Z]{3,}
Yep, I thought there was a solution like that but didn't know about the comma functionality. Edited.
Just to explain, {x,y} means match at least x and at most y times. Omitting one bound leaves only the other: {x,} means at least x times.
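A small sketch of those quantifiers in Python's re, using a made-up word list. re.fullmatch requires the whole string to match, so each list shows exactly which word lengths a quantifier accepts.

```python
import re

# Made-up sample words of length 1, 2, 3, 4 and 8.
words = ["a", "to", "the", "Fido", "billions"]

# {3,}  : at least 3 letters, no upper bound
at_least_3 = [w for w in words if re.fullmatch(r"[a-zA-Z]{3,}", w)]

# {2,4} : between 2 and 4 letters
between_2_and_4 = [w for w in words if re.fullmatch(r"[a-zA-Z]{2,4}", w)]

# {3}   : exactly 3 letters
exactly_3 = [w for w in words if re.fullmatch(r"[a-zA-Z]{3}", w)]

print(at_least_3)
print(between_2_and_4)
print(exactly_3)
```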

I don't know Python, but regex is the same in any programming language, so my answer is:

"(\s?\w+[0-9]+\w+\s?)|([0-9]+)|(\s\w\w\s)|(http://t.co/)"g

2 Comments

Could you show a live example of your regex? I tested yours on regex101 against OP's text, and it doesn't work at all.
You can see in this image that it works fine for me: i.sstatic.net/sTVoo.jpg
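One way to reproduce the check in Python, as a sketch. It assumes the pattern is meant for removal (every match replaced by an empty string), which the answer does not actually state. The surrounding quotes and the trailing g are dropped because Python has no g flag; re.sub replaces every match by default.

```python
import re

# Pattern from the answer, without the surrounding quotes and "g" flag.
PATTERN = r"(\s?\w+[0-9]+\w+\s?)|([0-9]+)|(\s\w\w\s)|(http://t.co/)"

text = ("It can be dangerous to take Fido for a ride: "
        "http://t.co/eR2WfAnZBI http://t.co/RF3bhPNPwR',each year, on average, "
        "20 billion empty miles are incurred by trucks, which costs the economy billions")

# Assumption: matches are deleted outright rather than replaced by a space.
stripped = re.sub(PATTERN, "", text)
print(stripped)
```

Under that assumption the digits and the http://t.co/ links do disappear, but the \s\w\w\s alternative also consumes the spaces on both sides of a two-letter word, so neighbouring words get glued together (for example "can be dangerous" becomes "candangerous"). Punctuation and single letters are left in place too.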
