Extracting words from text using python regex

Question

I have a text (string) and I want to perform this task in python:

I perform the CountVectorizer method in order to make a bag of words. You may find this method here: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

This method includes stop-word removal and it works fine. It removes any punctuation and splits the text into words. But besides the words, it returns lots of trash like single letters and numbers.

This method, though, has one parameter called "token_pattern" that takes a string (regex), which could give me better results.

What I want to do is: a) exclude any words that start with, end with, or include numbers; b) exclude any numbers from the text; c) exclude any words of <= 2 letters; d) exclude all the http URLs.

For example, this regex should give me this:

text = "It can be dangerous to take Fido for a ride: http://t.co/eR2WfAnZBI http://t.co/RF3bhPNPwR',each year, on average, 20 billion empty miles are incurred by trucks, which costs the economy billions"

final_text = "can dangerous take Fido for ride each year average billion empty miles are incurred trucks which costs the economy billions"

Thanks in advance for your time and attention :)

Could you show what you have tried so far?

– Cleb, Commented Aug 5, 2015 at 13:10

Charlie Haley · Accepted Answer · 2015-08-05 14:21:05Z

1

Here is a piece of regex that grabs any word made up solely of letters, with a length of 3 or more.

[a-zA-Z]{3,}
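As a quick check of what this pattern does, here is a minimal sketch using Python's `re.findall`. The sample string is made up for illustration. Note that without word boundaries the pattern also pulls letter runs out of mixed tokens; the `\b` variant below is an assumed stricter alternative, not part of the answer.

```python
import re

sample = "a ride on 20 big trucks"  # made-up sample text
mixed = "eR2WfAnZBI"                # mixed letter/digit token from the question's URL

letters = re.compile(r"[a-zA-Z]{3,}")          # the pattern from the answer
whole_words = re.compile(r"\b[a-zA-Z]{3,}\b")  # assumed variant with word boundaries

print(letters.findall(sample))     # ['ride', 'big', 'trucks']
print(letters.findall(mixed))      # ['WfAnZBI']  (a letter run inside the token)
print(whole_words.findall(mixed))  # []           (the digit blocks the boundary)
```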

Here is a piece of regex that grabs any line without a URL in it.

^((?!(https?:\/\/)+([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w=?$#% \.-]*)).)*$
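To see how this pattern behaves in Python, here is a small sketch that applies it to one line at a time with `re.match`. The list of lines is a made-up example; the pattern itself is copied unchanged from above.

```python
import re

# The pattern from the answer: a line matches only if no position in it starts a URL.
NO_URL_LINE = re.compile(
    r"^((?!(https?:\/\/)+([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w=?$#% \.-]*)).)*$"
)

# Made-up input: one entry per line.
lines = ["take Fido for a ride", "http://t.co/eR2WfAnZBI", "see http://t.co/x now", "each year"]

# Keep only the lines the pattern accepts.
kept = [line for line in lines if NO_URL_LINE.match(line)]
print(kept)  # ['take Fido for a ride', 'each year']
```

Because the negative lookahead is checked at every character, a URL anywhere in the line rejects the whole line, not just the URL itself.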

I haven't figured out how to combine the two yet. But at the very least, this is a step in the right direction. You could put each word on its own line, then remove the URLs, then match words of 3 or more letters. Ugly, but it would work.
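The two-step idea (strip URLs, then keep words of three or more letters) can be sketched in Python roughly as follows. This is only one possible approach under stated assumptions: `URL_RE` is a simpler, hypothetical URL pattern swapped in for the one above, `WORD_RE` adds word boundaries so that tokens containing digits are dropped entirely, and `extract_words` is an illustrative helper name.

```python
import re

# Assumed, simplified URL pattern (not the regex from the answer): it stops at
# characters such as the apostrophe, so "',each" after the second URL survives.
URL_RE = re.compile(r"https?://[\w./-]+")

# Whole words made only of letters, at least 3 characters long.
WORD_RE = re.compile(r"\b[a-zA-Z]{3,}\b")


def extract_words(text):
    """Remove URLs first, then collect the letter-only words of length >= 3."""
    without_urls = URL_RE.sub(" ", text)
    return WORD_RE.findall(without_urls)


text = ("It can be dangerous to take Fido for a ride: "
        "http://t.co/eR2WfAnZBI http://t.co/RF3bhPNPwR',each year, on average, "
        "20 billion empty miles are incurred by trucks, which costs the economy billions")

final_text = " ".join(extract_words(text))
print(final_text)
```

On the question's sample text this reproduces the requested `final_text`. With CountVectorizer, the URL removal could presumably be done on the raw strings beforehand and the word regex passed as `token_pattern`; that combination is not tested here.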

edited Aug 5, 2015 at 14:21

answered Aug 5, 2015 at 13:53

Charlie Haley


3 Comments

Asunez Over a year ago

For your first regex, wouldn't that be easier with this regex: [a-zA-Z]{3,}

Charlie Haley Over a year ago

Yep, I thought there was a solution like that but didn't know about the comma functionality. Edited.

Asunez Over a year ago

Just to explain, {x,y} means to match a minimum of x times and a maximum of y times. When either part is omitted, only the remaining bound applies, so {3,} means "at least 3 times".
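A short Python check of the quantifier forms, as a sketch with made-up inputs (`re.fullmatch` requires the whole string to match):

```python
import re

# {3,}  : at least 3 repetitions, no upper bound
at_least_three_short = re.fullmatch(r"[a-z]{3,}", "ab")       # too short
at_least_three_long = re.fullmatch(r"[a-z]{3,}", "abcdef")    # long enough

# {2,4} : between 2 and 4 repetitions
two_to_four_ok = re.fullmatch(r"[a-z]{2,4}", "abc")           # within range
two_to_four_long = re.fullmatch(r"[a-z]{2,4}", "abcde")       # too long

# {,2}  : at most 2 repetitions (lower bound defaults to 0)
at_most_two = re.fullmatch(r"[a-z]{,2}", "ab")

print(bool(at_least_three_short), bool(at_least_three_long))  # False True
print(bool(two_to_four_ok), bool(two_to_four_long))           # True False
print(bool(at_most_two))                                      # True
```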

Claudiu-Florin Stroe · Accepted Answer · 2015-08-05 13:25:16Z

0

I don't know Python, but regex is the same for any programming language, so my answer is:

"(\s?\w+[0-9]+\w+\s?)|([0-9]+)|(\s\w\w\s)|(http://t.co/)"g

answered Aug 5, 2015 at 13:25

Claudiu-Florin Stroe


2 Comments

Aserre Over a year ago

Could you show a live example of your regex ? I tested yours on regex101 against OP's text, and it doesn't work at all

Claudiu-Florin Stroe Over a year ago

You can see in this image that it works fine for me: i.sstatic.net/sTVoo.jpg
