Pandas: Add a new column in a data frame based on a value in another data frame

Question

I have two data frames one with a userId, and gender and another data frame that has online activities of these users.

First Data Frame (df1)

userId, gender
001, F
002, M
003, F
004, M
005, M
006, M

Second data frame (df2)

userId, itemClicked, ItemBought, date
001, 123182, 123212, 02/02/2016
003, 234256, 123182, 05/02/2016
005, 986834, 234256, 04/19/2016
004, 787663, 787663, 05/12/2016
020, 465738, 465738, 03/20/2016
004, 787223, 787663, 07/12/2016

I want to add gender column to the second data frame by looking up the first data frame based on the userId. df2 might have multiple rows per user since its a click data where same user may have click multiple items.

This is very easy to do in MySql but I am trying to figure out to do it in pandas.

for index, row in df2.iterrows():
    user_id = row['userId']
    if user_id in df1['userId']:
        t = df1.loc[df1['userId'] == user_id]
        pdb.set_trace()

Is this the pandas way to so such a task?

jezrael · Accepted Answer · 2016-11-15 06:44:57Z

11

print (df1)
   userId gender
0       1      F
1       2      M
2       3      F
3       4      M
4       5      M
5       6      M

print (df2)
   userId  itemClicked  ItemBought        date
0       1       123182      123212  02/02/2016
1       3       234256      123182  05/02/2016
2       5       986834      234256  04/19/2016
3       4       787663      787663  05/12/2016
4      20       465738      465738  03/20/2016
5       4       787223      787663  07/12/2016

You can use map:

df2['gender'] = df2.userId.map(df1.set_index('userId')['gender'].to_dict())

print (df2)
   userId  itemClicked  ItemBought        date gender
0       1       123182      123212  02/02/2016      F
1       3       234256      123182  05/02/2016      F
2       5       986834      234256  04/19/2016      M
3       4       787663      787663  05/12/2016      M
4      20       465738      465738  03/20/2016    NaN
5       4       787223      787663  07/12/2016      M

Another solution with merge and left join, parameter on can be omit if only column gender is same in both DataFrames:

df = pd.merge(df2, df1, how='left')

print (df)
   userId  itemClicked  ItemBought        date gender
0       1       123182      123212  02/02/2016      F
1       3       234256      123182  05/02/2016      F
2       5       986834      234256  04/19/2016      M
3       4       787663      787663  05/12/2016      M
4      20       465738      465738  03/20/2016    NaN
5       4       787223      787663  07/12/2016      M

Timings:

#len(df2) = 600k
df2 = pd.concat([df2]*100000).reset_index(drop=True)

def f(df1,df2):
    df2['gender'] = df2.userId.map(df1.set_index('userId')['gender'].to_dict())
    return df2


In [43]: %timeit f(df1,df2)
10 loops, best of 3: 34.2 ms per loop

In [44]: %timeit (pd.merge(df2, df1, how='left'))
10 loops, best of 3: 102 ms per loop

edited Nov 15, 2016 at 6:44

answered Nov 15, 2016 at 6:19

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

7 Comments

add-semi-colons Over a year ago

Is this a good approach when dealing with large data frames.10million records?

add-semi-colons Over a year ago

df2 might have same user more than once since its a click / transaction data. Merge might be problematic right.

jezrael Over a year ago

I add timings, map solution is the fastest.

add-semi-colons Over a year ago

Looks like merge is not possible because second data frame doesn't have the gender column. is it possible to do something like this df2['gender'] = NULL, and then do the merge

jezrael Over a year ago

I think there is no problem, because it is same as (pd.merge(df2, df1, how='left', on='userId')) - it is left join on column userId

|

Gaarv · Accepted Answer · 2016-11-15 06:25:52Z

1

If user_id is the index you can just use :

df2.join(df1)

answered Nov 15, 2016 at 6:25

Gaarv

8249 silver badges15 bronze badges

1 Comment

add-semi-colons Over a year ago

second data frame have multiple records per same user so its hard to create an index for that data frame.

Chandan · Accepted Answer · 2016-11-15 06:54:36Z

0

You can try this:

for index, row in df1.iterrows():
   for ind,r in df2.iterrows():
      if r['userId']==row['userId']:
         df2.set_value(ind,'Gender',row['gender'])
         break

answered Nov 15, 2016 at 6:54

Chandan

7727 silver badges12 bronze badges

Collectives™ on Stack Overflow

Pandas: Add a new column in a data frame based on a value in another data frame

3 Answers 3

7 Comments

1 Comment

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

7 Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Related