Regular Expressions using Python

Question

I am writing a function that uses a regular expression to parse email addresses. I think the pattern is correct, but I can't work out why example 2, '[email protected]', is not detected correctly while example 1 works.

import re

def parse_email(s):
    try:
        pattern = re.compile(r'\b([a-zA-Z])([\w.-_+]+)@([\w.-]+)([a-zA-Z])\b')
        matches = pattern.finditer(s)
        for match in matches:
            print(match.group(0))
            return (match.group(1)+match.group(2), match.group(3)+match.group(4))
    except AttributeError:
        #print('here')
        raise ValueError


print(parse_email('[email protected]'))
print(parse_email('[email protected]'))

Results:

[email protected]
('JKRowling', 'Huge-Books.org')

[email protected]
('much', 'gmail.com')

- has a special meaning inside [] in a regexp: it specifies a range of characters, like a-z. What do you think .-_ matches?
– Barmar, commented Sep 20, 2021 at 17:25
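
One way to answer that question is to ask the regex engine directly. The sketch below is only an illustration (the names range_class and matched are made up here); it lists every ASCII character that the class [.-_] accepts.

```python
import re

# [.-_] is a range from '.' (0x2E) to '_' (0x5F), not three separate characters.
range_class = re.compile(r'[.-_]')

# Collect every ASCII character that the class accepts.
matched = ''.join(chr(c) for c in range(128) if range_class.fullmatch(chr(c)))

print(matched)         # digits, uppercase letters and some punctuation
print('-' in matched)  # the hyphen (0x2D) sits just below the range
print('@' in matched)  # the at sign (0x40) falls inside the range
```

So the class accepts '@' and the uppercase letters, but not the hyphen it was presumably meant to allow.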

Brad Solomon · Accepted Answer · 2021-09-20 17:41:39Z

1

From re docs:

Ranges of characters can be indicated by giving two characters and separating them by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digits numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. If - is escaped (e.g. [a\-z]) or if it’s placed as the first or last character (e.g. [-a] or [a-]), it will match a literal '-'. [emphasis added]

It looks like you are trying to match a literal -, so place it as the first character of the character class, e.g. [-xxx]:

pattern = re.compile(r'\b([a-zA-Z])([-\w._+]+)@([-\w.]+)([a-zA-Z])\b')

Test:

>>> import re
>>> pat = r"\b([a-zA-Z])([-\w._+]+)@([-\w.]+)([a-zA-Z])\b"
>>> old_pattern = re.compile(r'\b([a-zA-Z])([\w.-_+]+)@([\w.-]+)([a-zA-Z])\b')
>>> new_pattern = re.compile(r'\b([a-zA-Z])([-\w._+]+)@([-\w.]+)([a-zA-Z])\b')
>>> old_pattern.search('[email protected]')
<re.Match object; span=(21, 35), match='[email protected]'>
>>> new_pattern.search('[email protected]')
<re.Match object; span=(0, 35), match='[email protected]'>
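
Because the addresses in the session above are obscured on this page, here is a minimal sketch of the same comparison using a made-up address, 'first-last@example.org' (an assumption, not the asker's input). It also tries the escaped form \- that the docs mention.

```python
import re

# Hypothetical test address; the real examples are obscured in this page.
addr = 'first-last@example.org'

# Original pattern: '.-_' is read as a range, so a literal '-' is not allowed.
old_pattern = re.compile(r'\b([a-zA-Z])([\w.-_+]+)@([\w.-]+)([a-zA-Z])\b')
# Fix 1: put the hyphen first in the class.
new_pattern = re.compile(r'\b([a-zA-Z])([-\w._+]+)@([-\w.]+)([a-zA-Z])\b')
# Fix 2: escape the hyphen where it stands.
escaped_pattern = re.compile(r'\b([a-zA-Z])([\w.\-_+]+)@([\w.\-]+)([a-zA-Z])\b')

old_match = old_pattern.search(addr)
new_match = new_pattern.search(addr)
escaped_match = escaped_pattern.search(addr)

print(old_match.group(0))      # starts after the hyphen
print(new_match.group(0))      # the whole address
print(escaped_match.group(0))  # the whole address
```

With the original pattern the match can only begin after the last hyphen in the local part, which is the same symptom as in the question.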


1 Comment

CodingLife Over a year ago

Thank you! I tried to put it at the front of the group and it worked!

Yaakov Bressler · Accepted Answer · 2021-09-20 17:51:05Z

0

Welcome to the wonderful world of regular expressions, where the tiniest of changes can result in totally unexpected outcomes.

To start, let's analyze the regex pattern you have:

r'\b([a-zA-Z])([\w.-_+]+)@([\w.-]+)([a-zA-Z])\b'

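\b is a reasonable choice, since you want each address to stand as its own word. Note that \b also matches at the very beginning or end of a string when the adjacent character is a word character.
([a-zA-Z]) is your first capture group. It matches exactly one ASCII letter. It should not be shortened to ([A-z]), because that range also covers the six characters between Z and a: [, \, ], ^, _ and `.
([\w.-_+]+) is your second capture group. Inside a character class most metacharacters lose their special meaning, so it captures:
- \w any word character: letters, digits and the underscore. It is partly redundant with the range described below, which already covers digits, uppercase letters and the underscore, but it is still needed for lowercase letters
- . is a literal period here, not "any" character
- - will not capture the dash character; between . and _ it defines a range of characters
- _ is the end of that range, so .-_ covers everything from . (0x2E) to _ (0x5F): digits, uppercase letters and punctuation such as / : ; @ [ ], but not - (0x2D)
- + inside the brackets is a literal plus character; only the + after the closing ] means "1 or more"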
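
The points in the list above can be checked with a few one-liners. This is only a sketch, and the variable names are made up for illustration.

```python
import re

# Inside [...] the period is a literal: it does not match an arbitrary character.
dot_matches_letter = re.fullmatch(r'[.]', 'a') is not None
dot_matches_dot = re.fullmatch(r'[.]', '.') is not None

# Inside [...] the plus sign is a literal too.
plus_is_literal = re.fullmatch(r'[+]', '+') is not None

# [A-z] also accepts the six punctuation characters between 'Z' and 'a'.
punctuation = '[\\]^_`'
loose = re.findall(r'[A-z]', punctuation)
strict = re.findall(r'[a-zA-Z]', punctuation)

# \b matches at the start and end of a string next to a word character.
boundary_at_edges = re.search(r'\bcat\b', 'cat') is not None

print(dot_matches_letter, dot_matches_dot, plus_is_literal)
print(loose, strict)
print(boundary_at_edges)
```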

... I'll stop here, since the rest is more or less similar...

You'll want to replace your regex with the following:

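r'\b([A-Za-z0-9\-\+]+@[A-Za-z\-\+]+\.[A-Za-z]{3})\b'

There is only one capture group, since we want entire email addresses.
Email addresses (here) are allowed to contain:
- Before the at symbol: [A-Za-z0-9\-\+]+ all alphanumeric characters as well as '-' and '+' characters (as denoted by the escaped characters \- and \+)
- Following the at symbol, a domain name [A-Za-z\-\+]+ with alpha characters and the same escaped characters
- Followed by a domain extension \.[A-Za-z]{3} of exactly three letters. Ex: .org

This pattern is deliberately strict: it rejects dots and underscores before the at symbol, digits and subdomains in the domain name, and extensions that are not exactly three letters long.

Next, you can refactor your code to the following:

import re

pattern = re.compile(r'\b([A-Za-z0-9\-\+]+@[A-Za-z\-\+]+\.[A-Za-z]{3})\b')

def parse_email(s):
    # search() returns a match object, or None when nothing matches
    match = pattern.search(s)
    if match:
        return match.group()
    return None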
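
If the (local part, domain) tuple and the ValueError from the question are still wanted, a minimal sketch could look like the one below. It reuses the corrected four-group pattern from the other answer. The name EMAIL and the test address are assumptions made for illustration. Note that search() and finditer() never raise AttributeError when nothing matches, so the original try/except could not fire; the function simply fell through and returned None.

```python
import re

# Four-group pattern with the literal '-' placed first in each class.
EMAIL = re.compile(r'\b([a-zA-Z])([-\w._+]+)@([-\w.]+)([a-zA-Z])\b')

def parse_email(s):
    """Return (local_part, domain) for the first address in s, or raise ValueError."""
    match = EMAIL.search(s)
    if match is None:
        # search() signals "no match" with None, so test for it explicitly
        raise ValueError(f'no email address found in {s!r}')
    return (match.group(1) + match.group(2), match.group(3) + match.group(4))

# Hypothetical address, used only for illustration.
result = parse_email('first-last@example.org')
print(result)
```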

edited Sep 20, 2021 at 17:51

answered Sep 20, 2021 at 17:45


2 Comments

CodingLife Over a year ago

Thank you for carefully analyzing my code. I wonder why you think \w is redundant?

Yaakov Bressler Over a year ago

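\w is equivalent to [a-zA-Z0-9_] for ASCII text, so it overlaps with the digits, uppercase letters and underscore that the .-_ range already covers @CodingLife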
