Welcome to the wonderful world of regular expressions, where the tiniest of changes can result in totally unexpected outcomes.
To start, let's analyze the regex pattern you have:
r'\b([a-zA-Z])([\w.-_+]+)@([\w.-]+)([a-zA-Z])\b'
\b is a correct choice, since you want items which are their own word. Note that \b also matches at the very beginning or end of a string, as long as the character next to it is a word character (a letter, digit or underscore).
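([a-zA-Z]) is your first capture group. Keep it as written: the shorter-looking ([A-z]) is not equivalent, because the range A-z also covers the characters [ \ ] ^ _ and ` that sit between Z and a in ASCII.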
([\w.-_+]+) is your second capture group. It will capture:
\w any word character (letters, digits and underscore)
. inside [] is a literal period, not "any" character; here, though, it ends up as the start of a range
- will not capture the dash character; between two characters inside [] it forms a range, here .-_
_ would capture underscore characters, but in this case it's being used as the end of the range .-_, which covers ./0-9:;<=>?@A-Z[\]^_ (note that this includes @)
+ inside [] is a literal plus character; only the + after the closing bracket means "1 or more" of the preceding group or class.
... I'll stop here, since the rest is more or less similar...
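One way to check what each character does inside the brackets is to test it directly. The snippet below is a minimal sketch; the variable names and sample characters are just for illustration.

```python
import re

# '-' between two characters inside [...] forms a range; '.' and '+' are literal there.
dot_to_underscore = re.compile(r'[.-_]')  # code points 0x2E ('.') through 0x5F ('_')

for ch in ['.', '_', '@', 'A', '7', '-', '+', 'a']:
    print(repr(ch), bool(dot_to_underscore.fullmatch(ch)))

# The full class from the question adds \w and a literal '+', but still no '-'.
question_class = re.compile(r'[\w.-_+]')
print(bool(question_class.fullmatch('+')), bool(question_class.fullmatch('-')))

# [A-z] is a range too, and it is wider than [a-zA-Z].
print(bool(re.fullmatch(r'[A-z]', '^')), bool(re.fullmatch(r'[a-zA-Z]', '^')))
```

The loop reports True for '.', '_', '@', 'A' and '7', and False for '-', '+' and 'a', so the range silently lets @ into the part before the at symbol while leaving the dash out.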
You'll want to replace your regex with the following:
r'\b([a-zA-Z0-9\-\+]+@[a-zA-Z\-\+]+\.[a-zA-Z]{3})\b'
- There is only one capture group, since we want entire email addresses.
- Email addresses (here) are allowed to contain:
  - Before the at symbol:
    [a-zA-Z0-9\-\+]+ all alphanumeric characters as well as '-' and '+' characters (as denoted by the escaped characters \- and \+)
  - Following the at symbol, a domain name:
    [a-zA-Z\-\+]+ with alpha characters and the same escaped characters
  - Followed by a domain extension of exactly three letters:
    \.[a-zA-Z]{3} Ex: .org
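To get a feel for what a pattern of this shape accepts and rejects, it can help to run it over a few sample strings. This is only a sketch: the letter classes are spelled out as a-zA-Z, and the helper name find_email and the sample addresses are made up for illustration.

```python
import re

# Pattern of the shape described above, with explicit a-zA-Z letter ranges.
EMAIL = re.compile(r'\b([a-zA-Z0-9\-\+]+@[a-zA-Z\-\+]+\.[a-zA-Z]{3})\b')

def find_email(text):
    """Return the first address-like substring in text, or None."""
    match = EMAIL.search(text)
    return match.group(1) if match else None

# Hypothetical sample inputs
samples = [
    'contact jane-doe+news@example.org today',
    'john.smith@example.org',   # '.' is not allowed before the @ here
    'user@mail.example.com',    # subdomains are not covered
    'user@example.info',        # extension is longer than three letters
]
for text in samples:
    print(text, '->', find_email(text))
```

The first sample comes back whole, the second only as smith@example.org, and the last two give None, which shows how deliberately narrow this definition of an email address is.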
Next, you can refactor your code to the following:
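import re

# Compile once; the single group captures the whole address
pattern = re.compile(r'\b([a-zA-Z0-9\-\+]+@[a-zA-Z\-\+]+\.[a-zA-Z]{3})\b')
match = pattern.search(s)  # s is the string you are searching
if match:
    email = match.group()
else:
    email = None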
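If the string may hold more than one address, the same idea works with findall instead of search. This is a sketch under that assumption; all_emails and the sample text are hypothetical, and the letter classes are again written as a-zA-Z.

```python
import re

# Same shape of pattern as above, with explicit a-zA-Z letter ranges.
EMAIL = re.compile(r'\b([a-zA-Z0-9\-\+]+@[a-zA-Z\-\+]+\.[a-zA-Z]{3})\b')

def all_emails(text):
    """Return every address-like substring, in order of appearance."""
    # With exactly one capture group, findall returns a list of that group's text.
    return EMAIL.findall(text)

sample = 'Write to alice@example.org or bob-smith@example.com.'  # hypothetical input
print(all_emails(sample))
```

An empty list plays the role that None plays in the single-match version.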
'-' has special meaning inside [] in a regexp; it's used to specify a range of characters, like a-z. What do you think .-_ matches?