2

I would like to do a simple email validation for a list import of email addresses into a database. I just want to make sure that there is content before the @ sign, an @ sign, content after the @ sign, and 2+ characters after the '.'. Here is a sample df:

import pandas as pd
import re

errors= {}

data= {'First Name': ['Sally', 'Bob', 'Sue', 'Tom', 'Will'],
     'Last Name': ['William', '', 'Wright', 'Smith','Thomas'],
     'Email Address': ['[email protected]','[email protected]','[email protected]','[email protected]','']}
df=pd.DataFrame(data)

This is the expression I was using to check for valid emails:

regex = re.compile(r'([A-Za-z0-9]+[.-_])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Z|a-z]{2,})+')
def isValid(email):
    if re.fullmatch(regex, email):
      pass
    else:
      return("Invalid email")

This regex is working fine but I am not sure how to easily loop through my entire df email address column. I have tried:

for col in df['Email Address'].columns:
   for i in df['Email Address'].index:
      if df.loc[i,col] = 'Invalid email'
           errors={'row':i, 'column':col, 'message': 'this is not a valid email address'

I want to write the invalid emails to a dictionary titled errors. With the above code I get an invalid error.

3 Answers

2

According to your description, I'd probably do

df["Email Address"].str.match(r"^.+@.+\..{2,}$")

str.match returns a boolean Series that is True wherever the regex matches the string.

The regex is

  • the start of the string ^
  • content before the @ sign .+
  • an @ sign @
  • content after the @ sign .+
  • a dot \.
  • and 2+ characters after the '.' .{2,}
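
A minimal sketch of how this pattern could be applied to the whole column at once. The sample addresses are made up for illustration (they are not the OP's data), and the `fillna("")` call is an assumption so that missing values count as invalid rather than propagating as NaN.

```python
import pandas as pd

# Hypothetical sample data; these addresses are placeholders.
df = pd.DataFrame({
    "Email Address": [
        "sally@example.com",
        "bob@example",       # no dot after the @
        "sue@example.c",     # only one character after the dot
        "tom@example.org",
        "",                  # empty value
    ],
})

pattern = r"^.+@.+\..{2,}$"

# Boolean Series: True where the pattern matches, False otherwise.
is_valid = df["Email Address"].fillna("").str.match(pattern)

# Index labels of the rows that failed the check.
invalid_rows = df.index[~is_valid].tolist()
print(invalid_rows)
```

The boolean mask can then be used to filter the frame (`df[~is_valid]`) or to build whatever error report the import needs.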

2 Comments

This description is an oversimplification. It will match something like @@@.@@, which is not a valid email address.
@onyambu yes, you're absolutely right. The crux is str.match though, and I think OP would be better off using some ready-made solution for a common task like e-mail recognition, anyway. (If it is crucial to the task, that is, and the question didn't sound like a precise matcher is needed).
1

The beautiful thing about Pandas dataframes is that you almost never have to loop through them--and avoiding loops will increase your speed significantly.

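df['Email Address'].str.contains(regex) will return a boolean Series indicating whether each value in the Email Address column contains a match for the regex. Note that str.contains searches anywhere in the string; str.fullmatch requires the whole string to match, like re.fullmatch in the question.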

Check out this chapter on vectorized string operations for more.
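
As a rough sketch of the vectorized approach, the mask can also be turned into the list of error records the question asks for. The sample addresses are hypothetical, and the pattern is a tidied variant of the question's regex (an assumption on my part: non-capturing groups, the hyphen moved to the end of the character class so it is literal, and the stray `|` dropped).

```python
import pandas as pd

# Hypothetical sample data; these addresses are placeholders.
df = pd.DataFrame({
    "Email Address": [
        "sally@example.com",
        "bob@@example.com",
        "sue@example",
        "tom.smith@example.co.uk",
        "",
    ],
})

# Tidied variant of the question's pattern (assumption, see note above).
pattern = r"(?:[A-Za-z0-9]+[._-])*[A-Za-z0-9]+@[A-Za-z0-9-]+(?:\.[A-Za-z]{2,})+"

# fullmatch requires the entire string to match; na=False would treat
# missing values (NaN) as invalid if the column had any.
is_valid = df["Email Address"].str.fullmatch(pattern, na=False)

# One dict per invalid row, collected in a list so nothing is overwritten.
errors = [
    {"row": int(i), "column": "Email Address",
     "message": "this is not a valid email address"}
    for i in df.index[~is_valid]
]
print(errors)
```

A list of dicts is used here instead of a single `errors` dict because a single dict would only ever hold the last invalid row.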


0

You can iterate through rows using .iterrows() on a dataframe. row contains a series and you can access your column the same way you would a dictionary.

for i, row in df.iterrows():
    # isValid (from the question) returns "Invalid email" on failure and None otherwise
    if isValid(row['Email Address']) == "Invalid email":
        print("Invalid email")
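
If the loop is kept, one possible way to record the failures instead of printing them is sketched below. The helper `is_valid` is a hypothetical replacement for the question's `isValid` that returns a plain boolean, the pattern is a tidied variant of the question's regex, and the sample addresses are made up.

```python
import re
import pandas as pd

# Tidied variant of the question's pattern (assumption: non-capturing groups,
# literal hyphen at the end of the character class).
regex = re.compile(
    r"(?:[A-Za-z0-9]+[._-])*[A-Za-z0-9]+@[A-Za-z0-9-]+(?:\.[A-Za-z]{2,})+"
)

def is_valid(email):
    """Hypothetical helper: True only if the whole string matches the pattern."""
    return isinstance(email, str) and regex.fullmatch(email) is not None

# Hypothetical sample data; these addresses are placeholders.
df = pd.DataFrame({
    "First Name": ["Sally", "Bob", "Sue", "Tom", "Will"],
    "Email Address": ["sally@example.com", "bob@@example.com", "sue@example",
                      "tom.smith@example.co.uk", ""],
})

errors = []  # a list, so each invalid row gets its own entry
for i, row in df.iterrows():
    if not is_valid(row["Email Address"]):
        errors.append({
            "row": i,
            "column": "Email Address",
            "message": "this is not a valid email address",
        })

print(errors)
```

Returning a boolean from the helper keeps the `if not ...` test straightforward, which is harder to get right when the function returns either a message string or None.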

1 Comment

The function defined in the question.
