Edit regex in python script

Question

The following python script allows me to scrape email addresses from a given file using regular expressions.

I'm trying to add phone numbers to the regular expression also. I created this regex and seems to work on 7 and 10 digit numbers:

(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})

Can this just be added to my existing regular expression? I figure I need to edit where I use re.compile but not completely sure how to do this in python. Any help would be appreciated.

# filename variables
filename = 'file.txt'
newfilename = 'result.txt'

# read the file
if os.path.exists(filename):
        data = open(filename,'r')
        bulkemails = data.read()
else:
        print "File not found."
        raise SystemExit

# regex = [email protected]
r = re.compile(r'(\b[\w.]+@+[\w.]+.+[\w.]\b)')
results = r.findall(bulkemails)
emails = ""
for x in results:
        emails += str(x)+"\n"

# function to write file
def writefile():
        f = open(newfilename, 'w')
        f.write(emails)
        f.close()
        print "File written."

EDIT When running on http://en.wikipedia.org/wiki/Telephone_number It produces the following output:

2678400
2678400
2678400
2678400
2678400
2678400
2678400
2678400
2678400
8790468
9664261
555-1212
555-9225
555-1212
869-1234
555-5555
555-1212
867-5309
867-5309
867-5309
(267) 867-5309
(212) 736-5000
243-3460
2977743
1000000
2048000
2048000
8790468
9070412
9664261
9664261
9664261

pjmorse · Accepted Answer · 2010-10-10 17:08:18Z

1

I would not advise combining the two regexes. It's possible, but it will make for code which is harder to understand and maintain down the road.

(Also, leaving the regexes separate will let you handle emails and phone numbers differently down the line, which you're likely to want to do.)

answered Oct 10, 2010 at 17:08

pjmorse

9,3349 gold badges57 silver badges126 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Tim Pietzcker · Accepted Answer · 2010-10-11 06:16:15Z

0

For one, I would simplify your regex:

(?:\(?\b\d{3}\)?[-.\s]*)?\d{3}[-.\s]*\d{4}\b

will match the same correct numbers as before and have fewer false hits.

Second, your e-mail regex will miss a lot of valid e-mail addresses and have many false positives, too (it would match aaaa@@@@aaaa, for example). While you can never match e-mail address with 100 % reliability using regex, the following one is better, too:

\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\b

(Use the case insensitive option when compiling it).

To restrict yourself to some few TLDs, you can use

\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+(?:asia|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[A-Z]{2})\b

edited Oct 11, 2010 at 6:16

answered Oct 10, 2010 at 17:16

Tim Pietzcker

337k59 gold badges520 silver badges572 bronze badges

10 Comments

Aaron Over a year ago

Thanks for the modified regex. How do you specify case insensitive option when compiling?

Aaron Over a year ago

And, you happen to know of a simple way to specify only TLD's for the email address?

Tim Pietzcker Over a year ago

re.compile("regex", re.I), and why would you want to limit your regex to TLDs?

Aaron Over a year ago

Cool, I was just thinking to help verify the emails even more.

Tim Pietzcker Over a year ago

Not a good idea. You'll have to send an email to a potential address anyway to verify - no regex and no parser can find out if an address actually exists.

|

Collectives™ on Stack Overflow

Edit regex in python script

2 Answers 2

Comments

10 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

10 Comments

Your Answer

Sign up or log in

Post as a guest

Related