Faster way to use linkage dataframes with other dataframes - Python

Question

I have two dataframes similar to the ones below:

import pandas as pd

num1 = ["1111 2222", "3333", "4444 5555 6666", "7777 8888", "9999"]
num2 = ["A1", "A2", "A3", "A4", "A5"] 
linkage = pd.DataFrame({"num1":num1, "num2":num2})
num1 = ["2222", "3333", "5555", "8888", "9999"]
num2 = ['none', 'none', 'none', 'none', 'none']
df = pd.DataFrame({"num1":num1, "num2":num2})

Linkage:

num1            num2 
1111 2222       A1 
3333            A2 
4444 5555 6666  A3 
7777 8888       A4 
9999            A5

df:

num1   num2
2222   none
3333   none
5555   none
8888   none
9999   none

I want to fill "num2" in the second dataframe with the "num2" value from the linkage dataframe whenever the second dataframe's "num1" value is one of the space-separated "num1" values in a linkage row. The code I currently have is:

df.num2 = [linkage.num2[i] for y in df.num1 for i, x in enumerate(linkage.num1) if y in x]

Which yields what I want:

num1   num2
2222   A1
3333   A2
5555   A3
8888   A4
9999   A5

But the code gets noticeably slower as the dataframes grow:

CPU times: total: 516 ms Wall time: 519 ms

Is there a better method of using linkage dataframes?
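A side note on the comprehension in the question: `y in x` on two strings is a substring test, so a partial value can match a longer one. The sketch below uses the question's linkage data plus a hypothetical lookup value "222" (not in the original data) to show the difference between substring matching and matching on whole whitespace-separated tokens.

```python
import pandas as pd

linkage = pd.DataFrame({
    "num1": ["1111 2222", "3333", "4444 5555 6666", "7777 8888", "9999"],
    "num2": ["A1", "A2", "A3", "A4", "A5"],
})

lookup = "222"  # hypothetical value, chosen because it is NOT a listed number

# Substring test: "222" is found inside "1111 2222", so A1 is returned.
substring_hits = [code for ids, code in zip(linkage["num1"], linkage["num2"])
                  if lookup in ids]

# Token test: split on whitespace first, so only whole values can match.
token_hits = [code for ids, code in zip(linkage["num1"], linkage["num2"])
              if lookup in ids.split()]

print(substring_hits)  # ['A1']
print(token_hits)      # []
```

If every value has the same length, as in the sample data, the two tests agree; the difference only matters when one value can be a fragment of another.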

mozway · Accepted Answer · 2022-07-20 13:28:46Z

1

Split the strings and explode them, then use the result to map the data:

mapper = (linkage.assign(num1=linkage['num1'].str.split())
                 .explode('num1')
                 .set_index('num1')['num2']
          )

df['num2'] = df['num1'].map(mapper)

output:

   num1 num2
0  2222   A1
1  3333   A2
2  5555   A3
3  8888   A4
4  9999   A5

intermediate mapper:

num1
1111    A1
2222    A1
3333    A2
4444    A3
5555    A3
6666    A3
7777    A4
8888    A4
9999    A5
Name: num2, dtype: object

answered Jul 20, 2022 at 13:28


4 Comments

Michael S. Over a year ago

When I use this, I receive the error "Reindexing only valid with uniquely valued Index objects". I checked the mapper and it seems that I have some duplicates. Is there a best practice for deduplicating the mapper?

mozway Over a year ago

@Michael you should probably get rid of the duplicates; otherwise, how would you know which value to choose? Unless you want to create two or more rows on multiple matches, in which case you should merge. Please provide an example of the expected behavior and I can update the answer.
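As a rough illustration of the merge option mentioned in this comment, the sketch below uses hypothetical data in which "2222" appears in two linkage rows. It is one possible reading of the suggestion, not code from the answer: the exploded linkage table is joined with a left merge, so a value with several matches produces several rows.

```python
import pandas as pd

# Hypothetical data: "2222" is listed in two different linkage rows.
linkage = pd.DataFrame({
    "num1": ["1111 2222", "2222 3333"],
    "num2": ["A1", "A2"],
})
df = pd.DataFrame({"num1": ["2222", "3333"]})

# One row per individual number, same split/explode step as in the answer.
long_linkage = (linkage.assign(num1=linkage["num1"].str.split())
                       .explode("num1"))

# Left merge keeps every df row and repeats it once per matching linkage row.
merged = df.merge(long_linkage, on="num1", how="left")
print(merged)
```

Here "2222" yields two rows (A1 and A2) and "3333" yields one (A2), so the result has three rows instead of two.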

Michael S. Over a year ago

I know what it is. Due to my data, sometimes (not often) there are two numbers in "num1" that are the same. Adding .drop_duplicates after exploding and before setting the index fixed the problem for me. (There were only 4 duplicates so - knowing my data - I trust the results).
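A minimal sketch of the fix described in this comment, on hypothetical data with repeated numbers. The `subset="num1"` and `keep="first"` arguments are assumptions added here: the comment only says `.drop_duplicates` was added, and a plain call would remove fully identical rows but still leave a number that maps to two different "num2" values.

```python
import pandas as pd

# Hypothetical data: "2222" is repeated within a row and across rows.
linkage = pd.DataFrame({
    "num1": ["1111 2222 2222", "2222 3333"],
    "num2": ["A1", "A2"],
})
df = pd.DataFrame({"num1": ["2222", "3333"]})

mapper = (linkage.assign(num1=linkage["num1"].str.split())
                 .explode("num1")
                 # Assumption: keep the first "num2" seen for each number.
                 .drop_duplicates(subset="num1", keep="first")
                 .set_index("num1")["num2"])

df["num2"] = df["num1"].map(mapper)
print(df)
```

With a unique index, `map` no longer raises the reindexing error; which duplicate to keep is a decision about the data, not something pandas can decide.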

Michael S. Over a year ago

CPU times: total: 15.6 ms Wall time: 12.6 ms. This is my preferred method because I can use the mapper for other things. Thank you.

ko3 · Accepted Answer · 2022-07-20 13:37:15Z

1

You can use pd.Series.str.extract with a capture group to keep only the matching value in each linkage row, then use pd.merge to join the corresponding num2 values from the linkage dataframe:

pd.merge(df.drop(columns="num2"),
         linkage.assign(num1=linkage["num1"].str.extract(f'({"|".join(df["num1"].unique())})')),
         on=["num1"], how="left")

Output:

    num1    num2
0   2222    A1
1   3333    A2
2   5555    A3
3   8888    A4
4   9999    A5
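A slightly more defensive variant of the same idea, offered as a sketch rather than as the answer's own code. It assumes the values are made of word characters (digits here), escapes them with `re.escape` so they are matched literally, and adds `\b` word boundaries so only whole tokens match. Note that `str.extract` returns only the first match in each linkage row, so a row containing two looked-up values would keep just one of them.

```python
import re
import pandas as pd

linkage = pd.DataFrame({
    "num1": ["1111 2222", "3333", "4444 5555 6666", "7777 8888", "9999"],
    "num2": ["A1", "A2", "A3", "A4", "A5"],
})
df = pd.DataFrame({"num1": ["2222", "3333", "5555", "8888", "9999"]})

# Build one alternation pattern from the lookup values, escaped and bounded.
alternatives = "|".join(re.escape(v) for v in df["num1"].unique())
pattern = r"\b(" + alternatives + r")\b"

# expand=False returns a Series holding the first matching value per row.
extracted = linkage.assign(num1=linkage["num1"].str.extract(pattern, expand=False))

result = df[["num1"]].merge(extracted, on="num1", how="left")
print(result)
```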

edited Jul 20, 2022 at 13:37

answered Jul 20, 2022 at 13:31


1 Comment

Michael S. Over a year ago

CPU times: total: 62.5 ms Wall time: 40.9 ms. This worked much better than mine. Thank you for the answer.
