Replace values in pandas dataframe column with different replacement dict based on condition

Question

I have a dataframe where I want to replace values in a column, but the dict describing the replacement is based on values in another column. A sample dataframe would look like this:

   Map me strings        date
0       1   test1  2020-01-01
1       2   test2  2020-02-10
2       3   test3  2020-01-01
3       4   test2  2020-03-15

I have a dictionary that looks like this:

map_dict = {'2020-01-01': {1: 4, 2: 3, 3: 1, 4: 2},
            '2020-02-10': {1: 3, 2: 4, 3: 1, 4: 2},
            '2020-03-15': {1: 3, 2: 2, 3: 1, 4: 4}}

Where I want the mapping logic to be different based on the date.

In this example, the expected output would be:

   Map me strings        date
0       4   test1  2020-01-01
1       4   test2  2020-02-10
2       1   test3  2020-01-01
3       4   test2  2020-03-15

I have a massive dataframe (100M+ rows) so I really want to avoid any looping solutions if at all possible.

I have tried to think of a way to use either map or replace, but have been unsuccessful.

What about looping on the dates, and assigning using df.loc?
– tmrlvi, Commented Nov 19, 2020 at 9:30
Yeah, that was my original attempt, but it took a very long time, so I was trying to avoid any loop-based solutions.
– Fredrik Nilsson, Commented Nov 19, 2020 at 9:31

jezrael · Accepted Answer · 2020-11-19 09:51:54Z

7

Use DataFrame.join with a MultiIndex Series created by the DataFrame constructor and DataFrame.stack:

df = df.join(pd.DataFrame(map_dict).stack().rename('new'), on=['Map me','date'])
print (df)
   Map me strings        date  new
0       1   test1  2020-01-01    4
1       2   test2  2020-02-10    4
2       3   test3  2020-01-01    1
3       4   test2  2020-03-15    4
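
If the goal is to overwrite Map me rather than add a new column, the joined values can be written back afterwards. The following is a minimal sketch using the sample data from the question. The names lookup, joined and result are illustrative, and the fillna step (keeping the original value when a (value, date) pair has no mapping) is an assumption about the desired behavior, not something the question specifies:

```python
import pandas as pd

df = pd.DataFrame({
    "Map me": [1, 2, 3, 4],
    "strings": ["test1", "test2", "test3", "test2"],
    "date": ["2020-01-01", "2020-02-10", "2020-01-01", "2020-03-15"],
})
map_dict = {
    "2020-01-01": {1: 4, 2: 3, 3: 1, 4: 2},
    "2020-02-10": {1: 3, 2: 4, 3: 1, 4: 2},
    "2020-03-15": {1: 3, 2: 2, 3: 1, 4: 4},
}

# Columns of the intermediate frame are dates and its index holds the values
# to map, so stack() yields a Series indexed by (value, date).
lookup = pd.DataFrame(map_dict).stack().rename("new")

# Look up each row's (Map me, date) pair in that Series.
joined = df.join(lookup, on=["Map me", "date"])

# Assumption: rows without a mapping keep their original value.
result = df.copy()
result["Map me"] = joined["new"].fillna(df["Map me"]).astype("int64")
print(result)
```

The lookup table has one row per (value, date) pair, so it stays small even when df itself is very large.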

edited Nov 19, 2020 at 9:51

answered Nov 19, 2020 at 9:44

jezrael


3 Comments

oskros Over a year ago

This is not a good solution in terms of performance. My solution runs in 850 µs (mean of 1000 loops), while yours runs in 3.17 ms (mean of 1000 loops).

oskros Over a year ago

Yes, I didn't try it with a huge dataset.

Cainã Max Couto da Silva Over a year ago

Indeed, after expanding the dataframe with df = pd.concat([df for _ in range(100000)]), this solution takes 47.6 ms ± 1.56 ms, while @oskros's solution takes 3.01 s ± 102 ms per loop. Awesome solutions!

oskros · Accepted Answer · 2020-11-19 09:33:47Z

1

Try something like this maybe?

df['mapped'] = df.apply(lambda x: map_dict[x['date']][x['Map me']], axis=1)
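
A variation on the same idea, offered only as a sketch: iterating over the two columns with zip avoids constructing a row Series for every row, which is usually where apply with axis=1 spends much of its time. It is still a Python-level loop, so it does not address the scaling concern for very large frames. The sample data is taken from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Map me": [1, 2, 3, 4],
    "strings": ["test1", "test2", "test3", "test2"],
    "date": ["2020-01-01", "2020-02-10", "2020-01-01", "2020-03-15"],
})
map_dict = {
    "2020-01-01": {1: 4, 2: 3, 3: 1, 4: 2},
    "2020-02-10": {1: 3, 2: 4, 3: 1, 4: 2},
    "2020-03-15": {1: 3, 2: 2, 3: 1, 4: 4},
}

# Walk the two columns in parallel and look each pair up in the nested dict.
# Like the apply version, this raises KeyError if a date or value is missing.
df["mapped"] = [map_dict[d][v] for d, v in zip(df["date"], df["Map me"])]
print(df)
```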

answered Nov 19, 2020 at 9:33

oskros


2 Comments

Fredrik Nilsson Over a year ago

Is this fundamentally different from just looping with loc? Isn't apply basically a for loop?

oskros Over a year ago

Yes, basically; it's just a cleaner syntax. If you want it to run faster than this, you probably need to look into using cython or numba. You can try following the guide here: Pandas optimization guide

Mapotofu · Accepted Answer · 2020-11-19 09:47:31Z

1

Try np.where, which is often faster than the equivalent pandas operations:

import numpy as np
df["Mapped"] = ""
for key, mapping in map_dict.items():
    # Fill only the rows for this date that are still empty.
    todo = (df["date"] == key) & (df["Mapped"] == "")
    df["Mapped"] = np.where(todo, df["Map me"].map(mapping), df["Mapped"])

Result:

   Map me strings        date Mapped
0       1   test1  2020-01-01      4
1       2   test2  2020-02-10      4
2       3   test3  2020-01-01      1
3       4   test2  2020-03-15      4
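
The loop above touches the whole column once per date. A related sketch, closer to the df.loc suggestion in the comments on the question, maps only the rows that belong to each date. The number of iterations equals the number of distinct dates, not the number of rows. The name mapped is illustrative, and the sketch assumes every value has an entry in the dict for its date:

```python
import pandas as pd

df = pd.DataFrame({
    "Map me": [1, 2, 3, 4],
    "strings": ["test1", "test2", "test3", "test2"],
    "date": ["2020-01-01", "2020-02-10", "2020-01-01", "2020-03-15"],
})
map_dict = {
    "2020-01-01": {1: 4, 2: 3, 3: 1, 4: 2},
    "2020-02-10": {1: 3, 2: 4, 3: 1, 4: 2},
    "2020-03-15": {1: 3, 2: 2, 3: 1, 4: 4},
}

# Write into a copy so already-mapped values are never mapped a second time.
mapped = df["Map me"].copy()
for date, mapping in map_dict.items():
    mask = df["date"] == date
    # Series.map does the per-date dict lookup; only rows for this date change.
    mapped.loc[mask] = df.loc[mask, "Map me"].map(mapping)

df["Mapped"] = mapped
print(df)
```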

answered Nov 19, 2020 at 9:47

Mapotofu


Hagalín Ásgrímur Guðmundsson · Accepted Answer · 2020-11-19 09:47:27Z

A more pandas-like way to do this would be to convert the map_dict to a DataFrame and join it to your sample frame. For example:

# Create the original dataframe
>>> df = pd.DataFrame([(1, 'test1', '2020-01-01'), (2, 'test2', '2020-02-10'), (3, 'test3', '2020-01-01'), (4, 'test2', '2020-03-15')], columns=['Map me', 'strings', 'date'])
>>> df
   Map me strings        date
0       1   test1  2020-01-01
1       2   test2  2020-02-10
2       3   test3  2020-01-01
3       4   test2  2020-03-15

# Convert the map dict to a dataframe
>>> map_df = pd.DataFrame([(k, j, l) for k, v in map_dict.items() for j,l in v.items()], columns=['date', 'Map me', 'Map to'])
>>> map_df
          date  Map me  Map to
0   2020-01-01       1       4
1   2020-01-01       2       3
2   2020-01-01       3       1
3   2020-01-01       4       2
4   2020-02-10       1       3
5   2020-02-10       2       4
6   2020-02-10       3       1
7   2020-02-10       4       2
8   2020-03-15       1       3
9   2020-03-15       2       2
10  2020-03-15       3       1
11  2020-03-15       4       4

# Perform the join
>>> mapped_df = pd.merge(df, map_df, left_on=['date', 'Map me'], right_on=['date', 'Map me'])
>>> mapped_df
   Map me strings        date  Map to
0       1   test1  2020-01-01       4
1       2   test2  2020-02-10       4
2       3   test3  2020-01-01       1
3       4   test2  2020-03-15       4
>>>
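
One thing to be aware of: pd.merge defaults to an inner join, so any row whose (date, Map me) pair has no entry in map_df is dropped from the result. The sketch below is a variation rather than the code above. It assumes you would prefer to keep such rows, and it uses how="left" plus validate="many_to_one" so that pandas raises an error if map_df ever contains duplicate keys:

```python
import pandas as pd

df = pd.DataFrame({
    "Map me": [1, 2, 3, 4],
    "strings": ["test1", "test2", "test3", "test2"],
    "date": ["2020-01-01", "2020-02-10", "2020-01-01", "2020-03-15"],
})
map_dict = {
    "2020-01-01": {1: 4, 2: 3, 3: 1, 4: 2},
    "2020-02-10": {1: 3, 2: 4, 3: 1, 4: 2},
    "2020-03-15": {1: 3, 2: 2, 3: 1, 4: 4},
}

# Flatten the nested dict into one row per (date, old value, new value).
map_df = pd.DataFrame(
    [(date, old, new) for date, mapping in map_dict.items() for old, new in mapping.items()],
    columns=["date", "Map me", "Map to"],
)

# A left join keeps every row of df in its original order; validate checks
# that each (date, Map me) key appears at most once in map_df.
mapped_df = df.merge(map_df, on=["date", "Map me"], how="left", validate="many_to_one")
print(mapped_df)
```

With a left join, unmatched rows would show NaN in Map to instead of disappearing, which makes gaps in the mapping easier to spot.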
