
How can I speed up a nested for loop in pandas for "non-numeric" columns? This loop is way too slow:

for i in range(len(df1[column_A])):
    for j in range(len(df2[column_A])):
        if df1[column_A][i] == df2[column_A][j]:
            df1[column_B][i] = df2[column_B][j]

Is there another way to do this with pandas itself or with another library?

UPDATE:

The main goal is:

input:

df1: 
     name  rpm    power
0   John   1500   high+
1   Mary   1400   high-
2  Sally   300    low-
3    Doe   700    medium-
4   July   1000   medium+


df2: 
     name  age
0  Peter   77
1  Sally   44
2  Micky   22
3  Sally   34
4  July    50
5   Bob    20


Required output (df2 after the update):
     name  age  rpm    power
0  Peter   77   0      NA
1  Sally   44   300    low-
2  Micky   22   0      NA
3  Sally   34   300    low-
4   July   50   1000   medium+
5    Bob   20   0      NA
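The update described above can be sketched as a left merge of df1 onto df2, rebuilding the sample frames from the question (the rpm=0 fill for unmatched names is taken from the desired output table):

```python
import pandas as pd

df1 = pd.DataFrame({
    'name': ['John', 'Mary', 'Sally', 'Doe', 'July'],
    'rpm': [1500, 1400, 300, 700, 1000],
    'power': ['high+', 'high-', 'low-', 'medium-', 'medium+'],
})
df2 = pd.DataFrame({
    'name': ['Peter', 'Sally', 'Micky', 'Sally', 'July', 'Bob'],
    'age': [77, 44, 22, 34, 50, 20],
})

# A left merge keeps every row of df2; names missing from df1 get NaN.
out = df2.merge(df1, on='name', how='left')

# Match the desired output: unmatched rpm becomes 0, power stays NA.
out['rpm'] = out['rpm'].fillna(0).astype(int)
print(out)
```

This produces the required table: both Sally rows pick up rpm 300 / power low-, July gets 1000 / medium+, and Peter, Micky, and Bob fall back to 0 / NA.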

I also opened a question in the official pandas GitHub repo: https://github.com/pandas-dev/pandas/issues/59824

  • dear @Steven, it's not working like I want. Please check ibb.co/GCTP6f6 Commented Sep 17, 2024 at 4:24

1 Answer


The nested loop you provided results in O(n²) complexity, making it slow for larger datasets. It also compares every row of df1 against every row of df2, which is unnecessary. Instead, you can use pd.merge:

import pandas as pd

# Merge file1 and file2 on column_A (assuming both are pandas DataFrames)
merged_df = pd.merge(file1, file2, on='column_A')

# Take file1's column_B updated with the matched values from file2
# (pandas suffixes the overlapping column from the right frame as column_B_y)
file1_column_B = merged_df['column_B_y']


The pd.merge() function merges two DataFrames (file1 and file2) on a common column (column_A). By default, pd.merge() performs an inner merge, meaning only rows with matching values in column_A are included in the resulting DataFrame.
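The inner-versus-left distinction matters for the question's desired output, since unmatched rows must be kept. A toy illustration (data invented here, column names from the answer):

```python
import pandas as pd

file1 = pd.DataFrame({'column_A': ['a', 'b'], 'column_B': [None, None]})
file2 = pd.DataFrame({'column_A': ['a', 'c'], 'column_B': [1, 3]})

# Inner merge (the default): only keys present in BOTH frames survive.
inner = pd.merge(file1, file2, on='column_A')

# Left merge: every row of file1 is kept; missing matches become NaN.
left = pd.merge(file1, file2, on='column_A', how='left')

print(len(inner), len(left))  # 1 2
```

Here only 'a' appears in both frames, so the inner merge drops 'b' while the left merge keeps it with a NaN value.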

The time complexity of pd.merge is roughly O(n + m) in the best case, where n is the number of rows in the left DataFrame (file1) and m is the number of rows in the right DataFrame (file2). However, in the worst case (e.g., when there are many duplicate values in the merge column), the time complexity can degrade to O(n × m), because every matching pair of rows produces an output row.
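The duplicate-key blowup is easy to demonstrate with a toy example (invented data): merging two 3-row frames that share a single key value produces 3 × 3 = 9 output rows.

```python
import pandas as pd

# Every row in both frames carries the same key 'k'.
left = pd.DataFrame({'column_A': ['k'] * 3, 'l': range(3)})
right = pd.DataFrame({'column_A': ['k'] * 3, 'r': range(3)})

# Each of the 3 left rows matches each of the 3 right rows.
merged = pd.merge(left, right, on='column_A')
print(len(merged))  # 9
```

This is why heavily duplicated merge keys can make pd.merge behave quadratically.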

You can also use NumPy's argsort + searchsorted:

import numpy as np


# Sort file2 by the key column once, then binary-search each file1 key.
# (`order` instead of `sorted`, which would shadow the builtin; note this
# assumes every value in file1['column_A'] actually occurs in file2.)
order = np.argsort(file2['column_A'].to_numpy())

match = np.searchsorted(file2['column_A'].to_numpy()[order], file1['column_A'].to_numpy())

file1_column_B = file2['column_B'].to_numpy()[order[match]]

Time complexity of this approach: argsort is O(n log n), searchsorted is O(m log n), for a total of O(n log n + m log n).
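One caveat: np.searchsorted returns an insertion index even for keys that are absent from the right frame, which silently maps unmatched keys to a wrong (or out-of-range) row. A safer sketch that detects misses (toy data and the -1 sentinel are assumptions, not from the original post):

```python
import numpy as np
import pandas as pd

df_left = pd.DataFrame({'key': ['b', 'x', 'a']})   # 'x' has no match
df_right = pd.DataFrame({'key': ['a', 'b', 'c'], 'val': [1, 2, 3]})

order = np.argsort(df_right['key'].to_numpy())
sorted_keys = df_right['key'].to_numpy()[order]

pos = np.searchsorted(sorted_keys, df_left['key'].to_numpy())
# searchsorted can return len(sorted_keys) for keys past the end, so clip
# before indexing.
pos = np.clip(pos, 0, len(sorted_keys) - 1)

# A hit only counts when the key at the found position actually matches.
hit = sorted_keys[pos] == df_left['key'].to_numpy()

vals = df_right['val'].to_numpy()[order]
result = np.where(hit, vals[pos], -1)
print(result)  # [ 2 -1  1]
```

Without the `hit` mask, the unmatched key 'x' would quietly receive the value belonging to 'c'.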


5 Comments

Use merged_df = pd.merge(df2, df1, on='name', how='left') instead; the rest of the code is the same. I also explained it here in a notebook
I apologize for the mistake; I've made it public now, after accidentally making it secret.
Can you adapt it to the example in my updated question? I'm a little confused about my example. Thank you
Simply use merged = pd.merge(df2, df1, on='name', how='left'). I don't want to ask, but why did you raise a GitHub issue for this?
It's in GitHub's "general question about pandas" section. I think it will help someone in the future...
