I have a simple database consisting of two tables (say, Items and Users): Users has a User_ID column, and Items has an Item_ID column plus a column that is a foreign key referencing a User_ID. For instance:
Items                                 Users
Item_ID   Value_A   Its_User_ID ...   User_ID   Name ...
1         35        1                 1         Alice
2         991       1                 2         John
3         20        2
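For reference, the toy tables above can be reproduced as DataFrames like so (same column names as in my schema):

import pandas as pd

# Toy versions of the two tables shown above
items = pd.DataFrame({
    'Item_ID':     [1, 2, 3],
    'Value_A':     [35, 991, 20],
    'Its_User_ID': [1, 1, 2],
})
users = pd.DataFrame({
    'User_ID': [1, 2],
    'Name':    ['Alice', 'John'],
})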
Imagine I want to denormalize this database, i.e. copy the Name column from table Users into table Items, for performance reasons when querying the data. My current solution is the following:
items['User_Name'] = pd.Series([users.loc[users['User_ID'] == x, 'Name'].iloc[0]
                                for x in items['Its_User_ID']])
That is, I'm adding the column as a Pandas Series built from a list comprehension, which uses .loc[] to retrieve the name of the user with a given ID, and .iloc[0] to take the first element of the selection (which is the only one, because user IDs are unique).
But this solution is really slow for large sets of items. I did the following tests:
- For 1000 items and ~200K users: 20 seconds.
- For ~400K items and ~200K users: 2.5 hours (this is the real data size).
Because this approach works one column at a time, the total execution time is multiplied by the number of columns I denormalize this way, so it quickly becomes too expensive. I haven't tried filling the new Series row by row with an explicit for loop (see the sketch below for what I mean), but I expect it would be even more costly. Are there other approaches that I'm overlooking? Is there a solution that takes a few minutes instead of a few hours?
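For concreteness, the row-by-row variant I have in mind (untested, just a sketch using the items and users DataFrames from above) would be something like:

# Fill the new column one row at a time with .at;
# each iteration repeats the same .loc lookup as the comprehension.
items['User_Name'] = None
for i, x in items['Its_User_ID'].items():
    items.at[i, 'User_Name'] = users.loc[users['User_ID'] == x, 'Name'].iloc[0]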