1

I have an issue while trying to concat and use set on multiple columns.

This is an example df:

df = pd.DataFrame({'customer id':[1,2,3,4,5],
                   'email1':['[email protected]',np.nan,'[email protected]',np.nan, np.nan],
                   'email2':['[email protected]'   ,np.nan,'[email protected]','[email protected]', np.nan],
                   'email3':['[email protected]',np.nan,'[email protected]','[email protected]', '[email protected]']})  

df:

   customer id          email1          email2          email3
0            1  [email protected]  [email protected]  [email protected]
1            2             NaN             NaN             NaN
2            3  [email protected]   [email protected]   [email protected]
3            4             NaN   [email protected]   [email protected]
4            5             NaN             NaN   [email protected]

I would like to create a new column with unique values from all columns (email1, email2 & email3) so the created columns will have a set of unique emails per customer, some emails have different cases (upper, lower .. etc)

This is what I did so far:

df['ALL_EMAILS'] = df[['email1','email2','email3']].apply(lambda x: ', '.join(x[x.notnull()]), axis = 1)

This took about 3 minutes on a df of > 500K customers!

then I created a function to handle the output and get the unique values if the cell is not null:

def checkemail(x):
    if x:
        #to_lower
        lower_x = x.lower()
        y= lower_x.split(',')
        return set(y)

then applies it to the column:

df['ALL_EMAILS'] = df['ALL_EMAILS'].apply(checkemail)

but I got wrong output under ALL_EMAILS column!

   ALL_EMAILS
0  { [email protected], [email protected],  [email protected]}  
1                                               None  
2                   { [email protected], [email protected]}  
3                    { [email protected], [email protected]}  
4                                    {[email protected]}  
1
  • 1
    You join with ', ', but split with ',' (no space). Hence, you get extra spaces before emails. Commented Sep 14, 2022 at 16:28

2 Answers 2

1

Lets filter the email like columns then stack to convert to series then transform into lowercase and aggregate with set on level=0

email = df.filter(like='email')
df['all_emails'] = email.stack().str.lower().groupby(level=0).agg(set)

   customer id          email1          email2          email3                        all_emails
0            1  [email protected]  [email protected]  [email protected]  {[email protected], [email protected]}
1            2             NaN             NaN             NaN                               NaN
2            3  [email protected]   [email protected]   [email protected]   {[email protected], [email protected]}
3            4             NaN   [email protected]   [email protected]                   {[email protected]}
4            5             NaN             NaN   [email protected]                   {[email protected]}
Sign up to request clarification or add additional context in comments.

Comments

1

Try work on the values directly instead of joining them then split again:

df['ALL_EMAILS'] = df.filter(like='email').apply(lambda x: set(x.dropna().str.lower()) or None, axis=1)

Output:

   customer id          email1          email2          email3                        ALL_EMAILS
0            1  [email protected]  [email protected]  [email protected]  {[email protected], [email protected]}
1            2             NaN             NaN             NaN                              None
2            3  [email protected]   [email protected]   [email protected]   {[email protected], [email protected]}
3            4             NaN   [email protected]   [email protected]                   {[email protected]}
4            5             NaN             NaN   [email protected]                   {[email protected]}

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.