
How can I speed up a nested for loop in pandas for "non-numeric" columns? This loop is way too slow:

for i in range(len(df1[column_A])):
    for j in range(len(df2[column_A])):
        if df1[column_A][i] == df2[column_A][j]:
            df1[column_B][i] = df2[column_B][j]

Is there another way to do this with pandas itself or with another library?

UPDATE:

The main goal is:

input:

df1: 
     name  rpm    power
0   John   1500   high+
1   Mary   1400   high-
2  Sally   300    low-
3    Doe   700    medium-
4   July   1000   medium+


df2: 
     name  age
0  Peter   77
1  Sally   44
2  Micky   22
3  Sally   34
4  July    50
5   Bob    20


Required output (df2 after the update):
     name  age  rpm    power
0  Peter   77   0      NA
1  Sally   44   300    low-
2  Micky   22   0      NA
3  Sally   34   300    low-
4   July   50   1000   medium+
5    Bob   20   0      NA
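The update described above can be sketched as a left merge of df1 onto df2, rebuilding the sample frames from the question (the rpm=0 fill for unmatched names is taken from the desired output table):

```python
import pandas as pd

df1 = pd.DataFrame({
    'name': ['John', 'Mary', 'Sally', 'Doe', 'July'],
    'rpm': [1500, 1400, 300, 700, 1000],
    'power': ['high+', 'high-', 'low-', 'medium-', 'medium+'],
})
df2 = pd.DataFrame({
    'name': ['Peter', 'Sally', 'Micky', 'Sally', 'July', 'Bob'],
    'age': [77, 44, 22, 34, 50, 20],
})

# A left merge keeps every row of df2; names missing from df1 get NaN.
out = df2.merge(df1, on='name', how='left')

# Match the desired output: unmatched rpm becomes 0, power stays NA.
out['rpm'] = out['rpm'].fillna(0).astype(int)
print(out)
```

This produces the required table: both Sally rows pick up rpm 300 / power low-, July gets 1000 / medium+, and Peter, Micky, and Bob fall back to 0 / NA.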

I also opened a question in the official pandas GitHub repo: https://github.com/pandas-dev/pandas/issues/59824

  • dear @Steven, it's not working like I want. Please check ibb.co/GCTP6f6 Commented Sep 17, 2024 at 4:24

1 Answer


The nested loop you provided results in O(n²) complexity, making it slow for larger datasets. It also compares every row of df1 against every row of df2, which is unnecessary. Instead, you can use pd.merge:

import pandas as pd

# Merge file1 and file2 on column_A (assuming both are pandas DataFrames)
merged_df = pd.merge(file1, file2, on='column_A')

# Take file1's column_B updated with the matched values from file2
# (pandas suffixes the overlapping column from the right frame as column_B_y)
file1_column_B = merged_df['column_B_y']


The pd.merge() function merges two DataFrames (file1 and file2) on a common column (column_A). By default, pd.merge() performs an inner merge, meaning only rows with matching values in column_A are included in the resulting DataFrame.
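The inner-versus-left distinction matters for the question's desired output, since unmatched rows must be kept. A toy illustration (data invented here, column names from the answer):

```python
import pandas as pd

file1 = pd.DataFrame({'column_A': ['a', 'b'], 'column_B': [None, None]})
file2 = pd.DataFrame({'column_A': ['a', 'c'], 'column_B': [1, 3]})

# Inner merge (the default): only keys present in BOTH frames survive.
inner = pd.merge(file1, file2, on='column_A')

# Left merge: every row of file1 is kept; missing matches become NaN.
left = pd.merge(file1, file2, on='column_A', how='left')

print(len(inner), len(left))  # 1 2
```

Here only 'a' appears in both frames, so the inner merge drops 'b' while the left merge keeps it with a NaN value.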

The time complexity of pd.merge is roughly O(n + m) in the best case, where n is the number of rows in the left DataFrame (file1) and m is the number of rows in the right DataFrame (file2). However, in the worst case (e.g., when there are many duplicate values in the merge column), the time complexity can degrade to O(n × m), because every matching pair of rows produces an output row.
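The duplicate-key blowup is easy to demonstrate with a toy example (invented data): merging two 3-row frames that share a single key value produces 3 × 3 = 9 output rows.

```python
import pandas as pd

# Every row in both frames carries the same key 'k'.
left = pd.DataFrame({'column_A': ['k'] * 3, 'l': range(3)})
right = pd.DataFrame({'column_A': ['k'] * 3, 'r': range(3)})

# Each of the 3 left rows matches each of the 3 right rows.
merged = pd.merge(left, right, on='column_A')
print(len(merged))  # 9
```

This is why heavily duplicated merge keys can make pd.merge behave quadratically.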

You can also use NumPy's argsort + searchsorted:

import numpy as np


# Sort file2 by the key column once, then binary-search each file1 key.
# (`order` instead of `sorted`, which would shadow the builtin; note this
# assumes every value in file1['column_A'] actually occurs in file2.)
order = np.argsort(file2['column_A'].to_numpy())

match = np.searchsorted(file2['column_A'].to_numpy()[order], file1['column_A'].to_numpy())

file1_column_B = file2['column_B'].to_numpy()[order[match]]

Time complexity of this approach: argsort is O(n log n), searchsorted is O(m log n), for a total of O(n log n + m log n).
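One caveat: np.searchsorted returns an insertion index even for keys that are absent from the right frame, which silently maps unmatched keys to a wrong (or out-of-range) row. A safer sketch that detects misses (toy data and the -1 sentinel are assumptions, not from the original post):

```python
import numpy as np
import pandas as pd

df_left = pd.DataFrame({'key': ['b', 'x', 'a']})   # 'x' has no match
df_right = pd.DataFrame({'key': ['a', 'b', 'c'], 'val': [1, 2, 3]})

order = np.argsort(df_right['key'].to_numpy())
sorted_keys = df_right['key'].to_numpy()[order]

pos = np.searchsorted(sorted_keys, df_left['key'].to_numpy())
# searchsorted can return len(sorted_keys) for keys past the end, so clip
# before indexing.
pos = np.clip(pos, 0, len(sorted_keys) - 1)

# A hit only counts when the key at the found position actually matches.
hit = sorted_keys[pos] == df_left['key'].to_numpy()

vals = df_right['val'].to_numpy()[order]
result = np.where(hit, vals[pos], -1)
print(result)  # [ 2 -1  1]
```

Without the `hit` mask, the unmatched key 'x' would quietly receive the value belonging to 'c'.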


5 Comments

Use merged_df = pd.merge(df2, df1, on='name', how='left') instead; the rest of the code is the same. I also explained it here in a notebook
I apologize for the mistake; I've made it public now, after accidentally making it secret.
Can you adapt it to the example in my updated question? I'm a little confused about my example. Thank you
Simply use merged = pd.merge(df2, df1, on='name', how='left'). I don't want to ask, but why did you raise a GitHub issue for this?
It's in GitHub's "general question about pandas" section. I think it will help someone in the future...
