0

The following Python script allows me to scrape email addresses from a given file using regular expressions.

I'm also trying to add phone numbers to the regular expression. I created this regex, and it seems to work on 7- and 10-digit numbers:

(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})

Can this just be added to my existing regular expression? I figure I need to edit the line where I use re.compile, but I'm not completely sure how to do this in Python. Any help would be appreciated.

import os
import re

# filename variables
filename = 'file.txt'
newfilename = 'result.txt'

# read the file
if os.path.exists(filename):
        # the with block closes the file once it has been read
        with open(filename, 'r') as data:
                bulkemails = data.read()
else:
        print("File not found.")
        raise SystemExit

# regex for email addresses
r = re.compile(r'(\b[\w.]+@+[\w.]+.+[\w.]\b)')
results = r.findall(bulkemails)
emails = ""
for x in results:
        emails += str(x)+"\n"

# function to write file
def writefile():
        f = open(newfilename, 'w')
        f.write(emails)
        f.close()
        print("File written.")

EDIT: When run on http://en.wikipedia.org/wiki/Telephone_number, it produces the following output:

2678400
2678400
2678400
2678400
2678400
2678400
2678400
2678400
2678400
8790468
9664261
555-1212
555-9225
555-1212
869-1234
555-5555
555-1212
867-5309
867-5309
867-5309
(267) 867-5309
(212) 736-5000
243-3460
2977743
1000000
2048000
2048000
8790468
9070412
9664261
9664261
9664261

2 Answers

1

I would not advise combining the two regexes. It's possible, but it will make for code which is harder to understand and maintain down the road.

(Also, leaving the regexes separate will let you handle emails and phone numbers differently down the line, which you're likely to want to do.)
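
For what it's worth, here is a minimal sketch of what keeping the two patterns separate might look like. Both patterns are copied unchanged from the question; the names EMAIL_RE, PHONE_RE and extract_contacts are my own and purely illustrative, and the sample text is made up, with one item per line.

```python
import re

# Patterns copied unchanged from the question; the names are illustrative.
EMAIL_RE = re.compile(r'(\b[\w.]+@+[\w.]+.+[\w.]\b)')
PHONE_RE = re.compile(
    r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}'
    r'|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}'
    r'|\d{3}[-\.\s]??\d{4})'
)


def extract_contacts(text):
    """Run each pattern on its own and return (emails, phones)."""
    emails = EMAIL_RE.findall(text)
    phones = PHONE_RE.findall(text)
    return emails, phones


# Made-up sample input, one item per line.
sample = "alice@example.com\n555-1212\n(212) 736-5000\n"
emails, phones = extract_contacts(sample)
print("\n".join(emails))
print("\n".join(phones))
```

Because the two result lists never get mixed, each one can be written to its own file or cleaned up differently later on.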

0

For one, I would simplify your regex:

(?:\(?\b\d{3}\)?[-.\s]*)?\d{3}[-.\s]*\d{4}\b

will match the same correct numbers as before and have fewer false hits.
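
A quick way to check this on a few inputs. The first three sample strings appear in the output posted in the question; the last one is a made-up near miss with one digit too many. PHONE_RE is just an illustrative name.

```python
import re

# Simplified phone pattern from this answer; PHONE_RE is an illustrative name.
PHONE_RE = re.compile(r'(?:\(?\b\d{3}\)?[-.\s]*)?\d{3}[-.\s]*\d{4}\b')

# The first three strings come from the question's output;
# the last one is a made-up near miss with an extra trailing digit.
samples = ["555-1212", "(267) 867-5309", "2678400", "555-12125"]

# The pattern has no capturing groups, so findall returns whole matches.
matches = {s: PHONE_RE.findall(s) for s in samples}
for s in samples:
    print("%-18r -> %r" % (s, matches[s]))
```

The trailing \b is what rejects 555-12125, which the original regex would have matched as 555-1212. Bare 7-digit runs such as 2678400 still match, so this reduces false hits rather than eliminating them.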

Second, your e-mail regex will miss a lot of valid e-mail addresses and have many false positives, too (it would match aaaa@@@@aaaa, for example). While you can never match e-mail addresses with 100% reliability using a regex, the following one is better:

\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\b

(Use the case-insensitive option when compiling it.)
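
As a small sketch, compiling it with the flag could look like this. EMAIL_RE and the sample text are made up for illustration; re.IGNORECASE is the long form of re.I.

```python
import re

# E-mail pattern from this answer, compiled case-insensitively.
# EMAIL_RE is an illustrative name.
EMAIL_RE = re.compile(r'\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\b',
                      re.IGNORECASE)

# Made-up sample text, including the false positive mentioned above.
text = "Write to Alice.Smith@Example.COM or aaaa@@@@aaaa."

# No capturing groups, so findall returns the full matched addresses.
found = EMAIL_RE.findall(text)
print(found)
```

Without the flag, the character classes only cover upper-case letters, so a mixed-case address like the one in the sample would not be found.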

To restrict yourself to a few TLDs, you can use

\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+(?:asia|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[A-Z]{2})\b
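
A rough illustration of the effect, using made-up addresses and an illustrative name TLD_EMAIL_RE. The pattern is the one above, only split across two string literals for readability.

```python
import re

# TLD-restricted pattern from this answer; TLD_EMAIL_RE is an illustrative name.
TLD_EMAIL_RE = re.compile(
    r'\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+'
    r'(?:asia|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[A-Z]{2})\b',
    re.IGNORECASE,
)

# Made-up addresses: a listed TLD, a two-letter country code, an unlisted TLD.
samples = ["bob@example.org", "carol@example.co.uk", "dave@example.xyz"]

accepted = [s for s in samples if TLD_EMAIL_RE.search(s)]
print(accepted)
```

The final [A-Z]{2} alternative keeps two-letter country-code TLDs working, while anything longer that is not in the list is rejected.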

10 Comments

Thanks for the modified regex. How do you specify the case-insensitive option when compiling?
And do you happen to know of a simple way to allow only certain TLDs for the email address?
re.compile("regex", re.I), and why would you want to limit your regex to TLDs?
Cool, I was just thinking to help verify the emails even more.
Not a good idea. You'll have to send an email to a potential address anyway to verify - no regex and no parser can find out if an address actually exists.