I am new to pandas and am having a hard time figuring out the best way to solve the problem below:

I have a DataFrame with one column called Email, like the one below:

Email
[email protected]
[email protected]
NAN
[email protected]
[email protected]

I separated the strings on '@' to create a Domain column and want to use that column to assign keywords in a new column. For instance, if the domain contains the word 'yahoo', call it 'Yahoo Account' in the new column. If it does not contain the word 'yahoo', assign it the value 'Other Domain', and if it is NaN, call it 'Unknown'.

The new column called Affiliation would look like:

Affiliation 
Yahoo Account 
Other Domain 
Unknown 
Yahoo Account 
Other Domain

There are over 2,000 different domains, so I am looking for a way that does not require listing and mapping every unique domain as either "Yahoo Account" or "Other Domain."

I have looked into a few options. One is np.where, but it assigns 'Other Domain' to the NaN values:

df['Affiliation'] = np.where(df['Domain']=='yahoo', 'Yahoo Account', 'Other Domain')
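
For reference, `==` tests whole-string equality, so a domain such as 'yahoo.com' would not equal 'yahoo'. A minimal sketch of the three-way rule described above, using substring matching with `Series.str.contains` and `np.select`. It assumes the Domain column holds strings like 'yahoo.com' and that missing values are real NaN; the sample data is made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up sample data; the Domain column is assumed to already exist.
df = pd.DataFrame(
    {"Domain": ["yahoo.com", "gmail.com", np.nan, "yahoo.co.uk", "outlook.com"]}
)

# Substring test; na=False makes missing domains evaluate to False here.
is_yahoo = df["Domain"].str.contains("yahoo", case=False, na=False)
is_missing = df["Domain"].isna()

# Conditions are checked in order; rows matching neither get the default.
df["Affiliation"] = np.select(
    [is_missing, is_yahoo],
    ["Unknown", "Yahoo Account"],
    default="Other Domain",
)
print(df)
```

With this approach no other domain has to be listed: anything that is neither missing nor a 'yahoo' match falls through to the default.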

I have also started to look at using replace, but I don't think this is the best way because of the number of unique domains that would need to be added to other_affiliations. See below:

yahoo_affiliations = r'(yahoo\S*)'
other_affiliations = r'(gmail\S*)|(hotmail\S*)|(outlook\S*)'

# Create a new column called Affiliation from Domains
df['Affiliation'] = df['Domain']

# Fill NaN with Unknown
df['Affiliation'] = df['Affiliation'].fillna('Unknown')

replacements = {
    'Affiliation': {
        yahoo_affiliations: 'Yahoo Account',
        other_affiliations: 'Other Domain',
    }
}

df.replace(replacements, regex=True, inplace=True)
  • You have managed to separate the strings? Can you be more specific about what the issue is? Have you read the Pandas docs? Commented Dec 16, 2019 at 23:16
  • Yes I have looked through the docs and on StackOverflow - I just tried to clarify a bit more. Thanks for asking for more information! Commented Dec 16, 2019 at 23:37
  • Is the answer by oppressionslayer all right? You don't need to assign the intermediate Series to your DataFrame, by the way; I think he did it just for the sake of clarity. Commented Dec 16, 2019 at 23:40

1 Answer

You can extract the domain and map it to a label like this:

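# Map each known domain to its label
email_map = {
    'yahoo.com': 'Yahoo Account',
    'gmail.com': 'Other Domain',
    'gmail.com.it': 'Other Domain',
}
# Capture everything after the '@'; rows without '@' become NaN
dfa['domain'] = dfa['Email'].str.extract(r'.*?@(.*)')
# Domains missing from email_map (and NaN) end up as 'Unknown'
dfa['Affiliation'] = dfa['domain'].map(email_map).fillna('Unknown')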

output:

              Email        domain    Affiliation
0     [email protected]     yahoo.com  Yahoo Account
1     [email protected]     gmail.com   Other Domain
2               NAN           NaN        Unknown
3     [email protected]     yahoo.com  Yahoo Account
4  [email protected]  gmail.com.it   Other Domain
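
Note that with the mapping above, any domain not listed in email_map is also labelled 'Unknown'. If listing every domain is impractical, one possible variant keeps only the special cases in the dictionary and gives every other non-missing domain a default label. This is a sketch, not a definitive solution: the name `special` and the sample addresses are made up, and the dictionary lookup is an exact match, so 'yahoo.co.uk' would need its own entry.

```python
import pandas as pd

# Made-up addresses for illustration only.
dfa = pd.DataFrame(
    {"Email": ["a@yahoo.com", "b@gmail.com", "NAN", "c@yahoo.com", "d@gmail.com.it"]}
)

# Only the domains that need a special label are listed (hypothetical name).
special = {"yahoo.com": "Yahoo Account"}

# expand=False returns a Series; rows without '@' become NaN.
dfa["domain"] = dfa["Email"].str.extract(r"@(.*)", expand=False)

mapped = dfa["domain"].map(special)
# Unlisted domains fall back to 'Other Domain'; missing domains become 'Unknown'.
dfa["Affiliation"] = mapped.fillna("Other Domain").where(
    dfa["domain"].notna(), "Unknown"
)
print(dfa)
```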

1 Comment

Thanks oppressionslayer! I was hoping to avoid creating something like email_map because there are over 2,000 unique domains I would have to assign to 'Other Domain'. I realized that wasn't obvious so edited my question after seeing your response!
