Fastest way to replace multiple values of a pandas dataframe with values from another dataframe

Question

I am trying to replace multiple values in a pandas dataframe with values from another dataframe.

Suppose I have 10,000 rows of customer_id in my dataframe df1, and I want to replace some of these customer_id values with 3,000 values from df2.

For the sake of illustration, let's generate the dataframes (below).

Say these 10 rows in df1 represent 10,000 rows, and the 3 rows from df2 represent 3,000 values.

import numpy as np
import pandas as pd
np.random.seed(42)

# Create df1 with unique values
arr1 = np.arange(100,200,10)
np.random.shuffle(arr1)
df1 = pd.DataFrame(data=arr1, 
                   columns=['customer_id'])

# Create df2 for new unique_values
df2 = pd.DataFrame(data = [1800, 1100, 1500],
                   index = [180, 110, 150], # this is customer_id column on df1
                   columns = ['customer_id_new'])

I want to replace 180 with 1800, 110 with 1100, and 150 with 1500.

I know we can do the following:

# Replace multiple values
replace_values = {180: 1800, 110: 1100, 150: 1500}
df1_replaced = df1.replace({'customer_id': replace_values})

And it works fine if I only have a few lines...

But if I have thousands of lines that I need to replace, how could I do this without typing out what values I want to change one at a time?

EDIT: To clarify, I don't need to use replace. Anything that replaces those values in df1 with the values from df2 in the fastest, most efficient way is fine.

brentertainer · Accepted Answer · 2019-07-20 03:12:25Z

3

df1['customer_id'] = df1['customer_id'].replace(df2['customer_id_new'])

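Alternatively, you can do it in place. Note that recent pandas versions warn about calling an inplace method on a selected column like this, and with Copy-on-Write enabled df1 is left unchanged, so the assignment form above is the safer choice.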

df1['customer_id'].replace(df2['customer_id_new'], inplace=True)
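As a minimal sketch of why this works: when replace receives a Series, pandas treats it like a dict, using the Series index as the values to look for and the Series values as the replacements. The data below is hard-coded from the question's example rather than generated, and the names mapping, replaced and expected are introduced only for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"customer_id": [180, 110, 150, 100, 170, 120, 190, 140, 130, 160]})
df2 = pd.DataFrame({"customer_id_new": [1800, 1100, 1500]}, index=[180, 110, 150])

# The Series index holds the old ids; its values hold the new ids.
mapping = df2["customer_id_new"]
replaced = df1["customer_id"].replace(mapping)

# The same replacement written out as an explicit dict, for comparison.
expected = df1["customer_id"].replace({180: 1800, 110: 1100, 150: 1500})

print(replaced.equals(expected))
print(replaced.tolist())
```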

answered Jul 20, 2019 at 3:12


1 Comment

rshardja Over a year ago

Thank you! What does "inplace=True" do? The documentation isn't really clear: "If True, in place. Note: this will modify any other views on this object (e.g. a column from a DataFrame). Returns the caller if this is True."
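To illustrate the question in this comment, here is a small sketch on a standalone Series (the names s, out and mapping are made up for the example). Without inplace=True, replace returns a new object and leaves the original untouched; with it, the original object itself is modified.

```python
import pandas as pd

mapping = {180: 1800, 110: 1100}

# Default behaviour: a new Series is returned, the original is unchanged.
s = pd.Series([180, 110, 100])
out = s.replace(mapping)
print(s.tolist())    # original values
print(out.tolist())  # replaced values

# inplace=True: the Series itself is modified, so no reassignment is needed.
s.replace(mapping, inplace=True)
print(s.tolist())
```

When the Series is a column pulled out of a DataFrame, the assignment form is usually easier to reason about than inplace=True.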

Scott Boston · Accepted Answer · 2019-07-20 03:09:21Z

2

You can try this, using map with a pd.Series:

df1['customer_id'] = df1['customer_id'].map(df2.squeeze()).fillna(df1['customer_id'])

or

df1['customer_id'] = df1['customer_id'].map(df2['customer_id_new']).fillna(df1['customer_id'])

Output:

   customer_id
0       1800.0
1       1100.0
2       1500.0
3        100.0
4        170.0
5        120.0
6        190.0
7        140.0
8        130.0
9        160.0
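The values print as floats because map returns NaN for ids that are missing from df2, which forces a float dtype before fillna runs. If integer ids are wanted, one option (an addition here, not part of the original answer) is to cast back afterwards. This sketch hard-codes the question's data and assumes fillna leaves no NaN behind; mapped and filled are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({"customer_id": [180, 110, 150, 100, 170, 120, 190, 140, 130, 160]})
df2 = pd.DataFrame({"customer_id_new": [1800, 1100, 1500]}, index=[180, 110, 150])

mapped = df1["customer_id"].map(df2["customer_id_new"])  # NaN where there is no match
filled = mapped.fillna(df1["customer_id"])               # still float64 at this point
df1["customer_id"] = filled.astype("int64")              # safe because no NaN remains

print(df1["customer_id"].dtype)
print(df1["customer_id"].tolist())
```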

edited Jul 20, 2019 at 3:09

answered Jul 20, 2019 at 2:57


sacuL · Accepted Answer · 2019-07-20 03:06:51Z

1

Going with your original method using replace, you can simplify it with to_dict to create your mapping dictionary without having to do it manually:

df1["customer_id"] = df1["customer_id"].replace(df2["customer_id_new"].to_dict())

>>> df1
   customer_id
0         1800
1         1100
2         1500
3          100
4          170
5          120
6          190
7          140
8          130
9          160
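Because the question is about speed at roughly 10,000 rows and 3,000 replacements, a rough timing sketch like the one below can help compare the replace and map approaches on your own machine. The sizes, the "old id times 10" mapping, and the names big1, big2, with_replace and with_map are assumptions made for illustration; no timings are quoted because they depend on the hardware and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# 10,000 unique ids, 3,000 of which get a new value (old id * 10).
ids = rng.permutation(np.arange(10_000, 20_000, dtype=np.int64))
big1 = pd.DataFrame({"customer_id": ids})
old = ids[:3_000]
big2 = pd.DataFrame({"customer_id_new": old * 10}, index=old)


def with_replace():
    # replace with a dict built from df2, as in the answer above
    return big1["customer_id"].replace(big2["customer_id_new"].to_dict())


def with_map():
    # map against the df2 Series, then restore unmatched ids and the int dtype
    col = big1["customer_id"]
    return col.map(big2["customer_id_new"]).fillna(col).astype("int64")


# Both approaches should agree before their timings are compared.
assert with_replace().equals(with_map())

for fn in (with_replace, with_map):
    seconds = timeit.timeit(fn, number=3)
    print(f"{fn.__name__}: {seconds / 3:.4f} s per run")
```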

answered Jul 20, 2019 at 3:06


Deepak Yadav · Accepted Answer · 2019-07-20 04:26:08Z

In my opinion, besides trying the useful answers above, you could parallelise the work on your dataframe if you have a multi-core processor.

For example:

import numpy as np
import pandas as pd
import seaborn as sns
from multiprocessing import Pool

num_partitions = 10  # number of partitions to split the dataframe into
num_cores = 4        # number of cores on your machine

iris = sns.load_dataset('iris')  # example dataframe (load_dataset already returns a DataFrame)

def parallelize_dataframe(df, func):
    # Split the dataframe, apply func to each chunk in a worker process,
    # then concatenate the results back into one dataframe.
    df_split = np.array_split(df, num_partitions)
    with Pool(num_cores) as pool:
        df = pd.concat(pool.map(func, df_split))
    return df

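In place of the 'func' parameter, you can pass a function that applies your replace method to one chunk; it has to be defined at module level so that it can be pickled and sent to the worker processes. Please let me know if it helps. In case of any error, do comment.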

Thanks!
