6

I have a dataframe where I want to replace values in a column, but the dict describing the replacement is based on values in another column. A sample dataframe would look like this:

   Map me strings        date
0       1   test1  2020-01-01
1       2   test2  2020-02-10
2       3   test3  2020-01-01
3       4   test2  2020-03-15

I have a dictionary that looks like this:

map_dict = {'2020-01-01': {1: 4, 2: 3, 3: 1, 4: 2},
            '2020-02-10': {1: 3, 2: 4, 3: 1, 4: 2},
            '2020-03-15': {1: 3, 2: 2, 3: 1, 4: 4}}

The mapping to apply should differ depending on the date.

In this example, the expected output would be:

   Map me strings        date
0       4   test1  2020-01-01
1       4   test2  2020-02-10
2       1   test3  2020-01-01
3       4   test2  2020-03-15

I have a massive dataframe (100M+ rows) so I really want to avoid any looping solutions if at all possible.

I have tried to think of a way to use either map or replace, but have been unsuccessful.

5
  • How many keys do you have in map_dict? Commented Nov 19, 2020 at 9:27
  • There are about 800 dates Commented Nov 19, 2020 at 9:29
  • What about looping on the dates, and assigning using df.loc? Commented Nov 19, 2020 at 9:30
  • Yea that was my original attempt but it took a very very long time so I was trying to avoid any loop based solutions Commented Nov 19, 2020 at 9:31
  • Can you add the code of your original attempt? Commented Nov 19, 2020 at 9:33

4 Answers

7

Use DataFrame.join with a MultiIndex Series created by the DataFrame constructor and DataFrame.stack:

df = df.join(pd.DataFrame(map_dict).stack().rename('new'), on=['Map me','date'])
print(df)
   Map me strings        date  new
0       1   test1  2020-01-01    4
1       2   test2  2020-02-10    4
2       3   test3  2020-01-01    1
3       4   test2  2020-03-15    4
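
If the goal is to overwrite the original column rather than add a new one, the join above could be extended roughly as follows. This is only a sketch built on the sample data from the question: the names `lookup` and `joined` are made up for illustration, and keeping the original value where no mapping exists is an assumption about the desired behaviour, not something the question specifies.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Map me": [1, 2, 3, 4],
        "strings": ["test1", "test2", "test3", "test2"],
        "date": ["2020-01-01", "2020-02-10", "2020-01-01", "2020-03-15"],
    }
)
map_dict = {
    "2020-01-01": {1: 4, 2: 3, 3: 1, 4: 2},
    "2020-02-10": {1: 3, 2: 4, 3: 1, 4: 2},
    "2020-03-15": {1: 3, 2: 2, 3: 1, 4: 4},
}

# pd.DataFrame(map_dict) has the values to map as rows and the dates as
# columns; stack() turns it into a Series indexed by (value, date).
lookup = pd.DataFrame(map_dict).stack().rename("new")

# join() is a left join by default, so the row order of df is preserved.
joined = df.join(lookup, on=["Map me", "date"])

# Assumption: rows without a matching (value, date) pair keep their
# original value instead of becoming NaN.
df["Map me"] = joined["new"].fillna(df["Map me"]).astype(df["Map me"].dtype)
print(df)
```

The fillna/astype step only matters when some pairs are missing from map_dict; with complete mappings the joined column can be assigned directly.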

3 Comments

This is not a good solution in terms of performance. My solution runs in 850 µs (mean of 1000 loops), while yours runs in 3.17 ms (mean of 1000 loops).
Yes, I didn't try it with a huge dataset.
Indeed, after expanding the dataframe with df = pd.concat([df for _ in range(100000)]), this solution takes 47.6 ms ± 1.56 ms per loop, while @oskros's solution takes 3.01 s ± 102 ms per loop. Awesome solutions!
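
For anyone who wants to repeat this kind of comparison, a rough timing harness might look like the sketch below. It is not the code the commenters ran: the helper names `with_join` and `with_apply`, the use of timeit, and the smaller repeat factor (1,000 copies instead of 100,000, to keep it quick) are all assumptions, and the absolute numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

map_dict = {
    "2020-01-01": {1: 4, 2: 3, 3: 1, 4: 2},
    "2020-02-10": {1: 3, 2: 4, 3: 1, 4: 2},
    "2020-03-15": {1: 3, 2: 2, 3: 1, 4: 4},
}
small = pd.DataFrame(
    {
        "Map me": [1, 2, 3, 4],
        "strings": ["test1", "test2", "test3", "test2"],
        "date": ["2020-01-01", "2020-02-10", "2020-01-01", "2020-03-15"],
    }
)

# Repeat the sample rows to get a larger frame (4,000 rows here).
big = pd.concat([small] * 1_000, ignore_index=True)

# Build the (value, date) lookup once, outside the timed function.
lookup = pd.DataFrame(map_dict).stack().rename("new")


def with_join(frame):
    """Join-based mapping, as in the answer above."""
    return frame.join(lookup, on=["Map me", "date"])["new"]


def with_apply(frame):
    """Row-wise mapping with apply, as in the apply-based answer."""
    return frame.apply(lambda row: map_dict[row["date"]][row["Map me"]], axis=1)


runs = 3
join_seconds = timeit.timeit(lambda: with_join(big), number=runs) / runs
apply_seconds = timeit.timeit(lambda: with_apply(big), number=runs) / runs
print(f"join:  {join_seconds:.4f} s per run")
print(f"apply: {apply_seconds:.4f} s per run")
```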
1

Try something like this maybe?

df['mapped'] = df.apply(lambda x: map_dict[x['date']][x['Map me']], axis=1)

2 Comments

Is this fundamentally different than just looping with loc? Isn't apply basically a for loop?
Yes, basically, it's just a cleaner syntax. If you want it to run faster than this, you probably need to look into using cython or numba - you can try following the guide here: Pandas optimization guide
1

Try with np.where, which normally has better performance than pandas:

import numpy as np

df["Mapped"] = ""
for key, mapping in map_dict.items():
    # Fill only the rows for this date that have not been mapped yet
    df["Mapped"] = np.where((df["date"] == key) & (df["Mapped"] == ""), df["Map me"].map(mapping), df["Mapped"])

Result:

   Map me strings        date Mapped
0       1   test1  2020-01-01      4
1       2   test2  2020-02-10      4
2       3   test3  2020-01-01      1
3       4   test2  2020-03-15      4
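
A variant that avoids the Python-level loop over dates is to put all the mappings into one 2-D NumPy array and index it once. The sketch below is a different technique from the np.where loop above, not a rewrite of it, and it assumes that every date in map_dict maps the same set of values (as in the sample data). The names `table`, `date_codes` and `value_codes` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Map me": [1, 2, 3, 4],
        "strings": ["test1", "test2", "test3", "test2"],
        "date": ["2020-01-01", "2020-02-10", "2020-01-01", "2020-03-15"],
    }
)
map_dict = {
    "2020-01-01": {1: 4, 2: 3, 3: 1, 4: 2},
    "2020-02-10": {1: 3, 2: 4, 3: 1, 4: 2},
    "2020-03-15": {1: 3, 2: 2, 3: 1, 4: 4},
}

dates = list(map_dict)
values = sorted({value for mapping in map_dict.values() for value in mapping})

# table[i, j] holds the mapped value for dates[i] and values[j].
# Assumption: every date maps every value, otherwise this raises KeyError.
table = np.array([[map_dict[d][v] for v in values] for d in dates])

# Translate each row's date and value into positions in the table.
date_codes = pd.Index(dates).get_indexer(df["date"])
value_codes = pd.Index(values).get_indexer(df["Map me"])

# get_indexer returns -1 for unknown labels; fail loudly rather than
# silently reading the last row or column of the table.
if (date_codes < 0).any() or (value_codes < 0).any():
    raise KeyError("some rows have a date or value missing from map_dict")

# One fancy-indexing operation maps every row at once.
df["Mapped"] = table[date_codes, value_codes]
print(df)
```

With about 800 dates and a handful of values the table stays tiny, so the cost is dominated by the two get_indexer calls and the final indexing step.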


0

A more pandas-like way to do this would be to convert the map_dict to a DataFrame and join it to your sample frame. For example:

# Create the original dataframe
>>> df = pd.DataFrame([(1, 'test1', '2020-01-01'), (2, 'test2', '2020-02-10'), (3, 'test3', '2020-01-01'), (4, 'test2', '2020-03-15')], columns=['Map me', 'strings', 'date'])
>>> df
   Map me strings        date
0       1   test1  2020-01-01
1       2   test2  2020-02-10
2       3   test3  2020-01-01
3       4   test2  2020-03-15

# Convert the map dict to a dataframe
>>> map_df = pd.DataFrame([(k, j, l) for k, v in map_dict.items() for j,l in v.items()], columns=['date', 'Map me', 'Map to'])
>>> map_df
          date  Map me  Map to
0   2020-01-01       1       4
1   2020-01-01       2       3
2   2020-01-01       3       1
3   2020-01-01       4       2
4   2020-02-10       1       3
5   2020-02-10       2       4
6   2020-02-10       3       1
7   2020-02-10       4       2
8   2020-03-15       1       3
9   2020-03-15       2       2
10  2020-03-15       3       1
11  2020-03-15       4       4

# Perform the join
>>> mapped_df = pd.merge(df, map_df, on=['date', 'Map me'])
>>> mapped_df
   Map me strings        date  Map to
0       1   test1  2020-01-01       4
1       2   test2  2020-02-10       4
2       3   test3  2020-01-01       1
3       4   test2  2020-03-15       4
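
On a very large frame it may be worth guarding the merge against two silent failure modes: rows dropped because no mapping exists for them, and rows duplicated because the mapping table repeats a key. A possible sketch, using the how, validate and indicator parameters of pd.merge; whether a left join is the right choice for the real data is an assumption, and the `unmatched` name is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    [(1, "test1", "2020-01-01"), (2, "test2", "2020-02-10"),
     (3, "test3", "2020-01-01"), (4, "test2", "2020-03-15")],
    columns=["Map me", "strings", "date"],
)
map_dict = {
    "2020-01-01": {1: 4, 2: 3, 3: 1, 4: 2},
    "2020-02-10": {1: 3, 2: 4, 3: 1, 4: 2},
    "2020-03-15": {1: 3, 2: 2, 3: 1, 4: 4},
}
map_df = pd.DataFrame(
    [(date, old, new) for date, mapping in map_dict.items() for old, new in mapping.items()],
    columns=["date", "Map me", "Map to"],
)

merged = pd.merge(
    df,
    map_df,
    on=["date", "Map me"],
    how="left",               # keep every row of df, in its original order
    validate="many_to_one",   # raise if map_df repeats a (date, Map me) pair
    indicator=True,           # adds a _merge column: "both" or "left_only"
)

# Rows flagged "left_only" had no entry in map_df and carry NaN in "Map to".
unmatched = merged["_merge"] == "left_only"
print(merged.drop(columns="_merge"))
print("rows without a mapping:", int(unmatched.sum()))
```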