
I have two dataframes. My main dataframe is dffinal:

        date  id  och  och1  och2  och3  cch1  LCH  L#
0  3/27/2020   1 -2.1     3     3     1     5  NaN NaN
1   4/9/2020   2  2.0     1     2     1     3  NaN NaN

My second dataframe is df2:

        date  och  cch  och1  och2  och3  cch1
0  5/30/2012 -0.7 -0.7     3    -1     1    56
1  9/16/2013  0.9 -1.0     6     4     3     7
2  9/26/2013  2.5  5.4     2     3     2     4
3  8/26/2016  0.1 -0.7     4     3     5    10

I have this loop:

for i in dffinal.index:    
    df3=df2.copy()
    
    df3 = df3[df3['och1'] >dffinal['och1'].iloc[i]]
    df3 = df3[df3['och2'] >dffinal['och2'].iloc[i]]
    df3 = df3[df3['och3'] >dffinal['och3'].iloc[i]]    
    
    df3 = df3[df3['cch1'] >dffinal['cch1'].iloc[i]]     
    
    dffinal['LCH'][i] =df3["och"].mean()
    dffinal['L#'][i] =len(df3.index)

As is clear from my code, the values of LCH and L# are obtained from df2 (via df3) based on the above conditions.

This code works very well, but it is very slow. I found out that I can improve efficiency with pandas vectorization. However, I could not figure out how to do it for my case.

This is my desired result:

        date  id  och  och1  och2  och3  cch1       LCH   L#
0  3/27/2020   1 -2.1     3     3     1     5  0.900000  1.0
1   4/9/2020   2  2.0     1     2     1     3  1.166667  3.0

I would greatly appreciate it if you could help me increase the efficiency of my code.

Correct answer

I personally use @shadowtalker's answer (the easy method), simply because I can understand how it works.

The most efficient answer is fast but complex.

Comments

  • It helps a lot if you can post the data in CSV or JSON format, so that people can easily load it and test out their answers. Fixed width is less ideal. Commented Jun 30, 2021 at 15:17
  • Also - how is dffinal defined? Commented Jun 30, 2021 at 15:21
  • @shadowtalker sorry, I was trying to write the question according to this guide: stackoverflow.com/a/20159305/15542251. Not sure if I understand you correctly; dffinal is simply my first dataframe. Commented Jun 30, 2021 at 15:24
  • See "include a minimal data frame" for how to include a data frame with your code. Make it easy for others to help you. Commented Jun 30, 2021 at 15:28
  • That is only one support item. Please continue to use "on topic" and "how to ask" from the intro tour. Commented Jun 30, 2021 at 15:41

4 Answers


It may be very difficult to avoid iteration with the logic you have in place to select a subset of rows in df2 for a given dffinal row, but you should be able to speed up the iterative method (hopefully by a lot) using the code below.

Note: if you're repeatedly accessing the row of the dataframe you're iterating through, use .iterrows so you can grab values much more simply (and quickly).

for i, row in dffinal.iterrows():
    # select the 'och' values of df2 rows that beat this row on every threshold
    och_array = df2.loc[(df2['och1'] > row['och1']) &
                        (df2['och2'] > row['och2']) &
                        (df2['och3'] > row['och3']) &
                        (df2['cch1'] > row['cch1']), 'och'].values
    dffinal.at[i, 'LCH'] = och_array.mean()
    dffinal.at[i, 'L#'] = len(och_array)

This avoids lookups in dffinal and avoids creating a new copy of the dataframe on every iteration. I can't test this without a data sample, but I think it will work.


5 Comments

Thank you, apparently it is not possible to use vectorization in my case. I tried your code, but I get this error: "None of [Index(['LCH'], dtype='object')] are in the [columns]". This is the code I used: for i,row in dffinal.iterrows(): df_stats = df2.loc[(df2['och1'] >row['och1']) & (df2['och2'] >row['och2']) & (df2['och3'] >row['och3']) & (df2['cch1'] >row['cch1']),['LCH']].mean(); dffinal.at[i,'LCH'] = df_stats['LCH']
Note that itertuples should be even faster than iterrows, and might be more "dtype-safe" as well.
@shadowtalker could you please show me an example? I have never tried itertuples before
@BogdanTitomir I edited the code, I didn't read carefully enough how you calculated LCH and L#, I think it should work now
Thank you, your code works perfectly and it significantly improved the performance of my code. I chose @shadowtalker answer as correct simply because it is slightly faster. Unfortunately, I have to choose only one answer as correct

This answer is based on https://stackoverflow.com/a/68197271/2954547, except that it uses itertuples instead of iterrows. itertuples is generally safer than iterrows, because it preserves dtypes correctly. See the "Notes" section of the DataFrame.iterrows documentation.

It is also self-contained, in that it can be executed top-to-bottom without having to copy/paste data.

Note that I iterate over df1.itertuples and not df_final.itertuples. Never mutate something that you are iterating over, and never iterate over something that you are mutating. Modifying a DataFrame in-place is a form of mutation.

import io

import pandas as pd


data1_txt = """
     date  id  och  och1  och2  och3  cch1  LCH  L#
3/27/2020   1 -2.1     3     3     1     5  NaN NaN
 4/9/2020   2  2.0     1     2     1     3  NaN NaN
"""

data2_txt = """
     date  och  cch  och1  och2  och3  cch1
5/30/2012 -0.7 -0.7     3    -1     1    56
9/16/2013  0.9 -1.0     6     4     3     7
9/26/2013  2.5  5.4     2     3     2     4
8/26/2016  0.1 -0.7     4     3     5    10
"""

df1 = pd.read_fwf(io.StringIO(data1_txt), index_col='id')
df2 = pd.read_fwf(io.StringIO(data2_txt))

df_final = df1.copy()

for row in df1.itertuples():
    row_mask = (
        (df2['och1'] > row.och1) &
        (df2['och2'] > row.och2) &
        (df2['och3'] > row.och3) &
        (df2['cch1'] > row.cch1)
    )
    och_vals = df2.loc[row_mask, 'och']
    i = row.Index
    df_final.at[i, 'LCH'] = och_vals.mean()
    df_final.at[i, 'L#'] = len(och_vals)

print(df_final)

The output is

         date  och  och1  och2  och3  cch1  LCH  L#       LCH   L#
id                                                                
1   3/27/2020 -2.1     3     3     1     5  NaN NaN  0.900000  1.0
2    4/9/2020  2.0     1     2     1     3  NaN NaN  1.166667  3.0

6 Comments

I haven't really spent much time answering questions here until this week, but will definitely steal that StringIO technique for getting print statement output from questions into my notebooks. Also thanks for showing this example. My suspicion regarding optimization is that iterrows vs itertuples will not make a big difference because the vast majority of the computation is in selecting the correct rows from df2, but it's cool to see this implementation!
Good point. The main benefit of itertuples is not speed but dtype-safety. See the "Notes" section of pandas.pydata.org/pandas-docs/stable/reference/api/…
How can I add a new line df_final.at[i, 'SumPosNeg']= which needs to be equal to sum of all positive och values divided by the sum of all negative och values or sum(och>0)/sum(och<0)
I.e. the new column results should be equal to 2.0/(-2.1)=-0.95. I got 2.0 and 2.1 from the final results dataframe.
@BogdanTitomir it might be worthwhile to spend some time understanding mine and Clay's answers, so that you can add your own extensions or modifications as needed.
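The SumPosNeg follow-up asked in the comments can be sketched as follows. Here och_vals stands in for the filtered 'och' Series computed inside the loop of this answer, and the guard against division by zero is my addition:

```python
import pandas as pd

# stand-in for the filtered 'och' values computed inside the loop
och_vals = pd.Series([2.0, -2.1])

pos = och_vals[och_vals > 0].sum()   # sum of positive och values -> 2.0
neg = och_vals[och_vals < 0].sum()   # sum of negative och values -> -2.1
# inside the loop this would become: df_final.at[i, 'SumPosNeg'] = pos / neg
sum_pos_neg = pos / neg if neg != 0 else float('nan')  # 2.0 / -2.1 ≈ -0.95
```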

The only way I can think of using pandas methods without loops is a cross join after resetting the index, then comparing with df.all(1):

cols = ['och1','och2','och3','cch1']
u = df2.reset_index().assign(k=1).merge(
    dffinal.reset_index().assign(k=1), on='k', suffixes=('','_y'))
# newer versions of pandas include how='cross' for this

dffinal['NewLCH'] = (u[u[cols].gt(u[[f"{i}_y" for i in cols]].to_numpy()).all(1)]
                     .groupby("index_y")['och'].mean())

print(dffinal)

        date  id  och  och1  och2  och3  cch1  LCH  L#    NewLCH
0  3/27/2020   1 -2.1     3     3     1     5  NaN NaN  0.900000
1   4/9/2020   2  2.0     1     2     1     3  NaN NaN  1.166667
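As the code comment notes, pandas 1.2+ provides merge(how='cross'), which replaces the k=1 trick. A self-contained sketch of the same computation, assuming that pandas version:

```python
import io

import pandas as pd

dffinal = pd.read_fwf(io.StringIO("""
     date  id  och  och1  och2  och3  cch1
3/27/2020   1 -2.1     3     3     1     5
 4/9/2020   2  2.0     1     2     1     3
"""))

df2 = pd.read_fwf(io.StringIO("""
     date  och  cch  och1  och2  och3  cch1
5/30/2012 -0.7 -0.7     3    -1     1    56
9/16/2013  0.9 -1.0     6     4     3     7
9/26/2013  2.5  5.4     2     3     2     4
8/26/2016  0.1 -0.7     4     3     5    10
"""))

cols = ['och1', 'och2', 'och3', 'cch1']

# pair every df2 row with every dffinal row (pandas >= 1.2)
u = df2.reset_index().merge(dffinal.reset_index(),
                            how='cross', suffixes=('', '_y'))
# keep pairs where the df2 side beats every threshold of the dffinal side
mask = u[cols].gt(u[[f"{c}_y" for c in cols]].to_numpy()).all(1)
dffinal['NewLCH'] = u[mask].groupby('index_y')['och'].mean()
```

The memory caveat from the comments still applies: the cross join materializes len(df2) × len(dffinal) rows.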

3 Comments

I tried your code, but I get an error that says "Unable to allocate 39.9 GiB for an array with shape (5358055840,) and data type int64", which I understand means this code requires 40 GiB of RAM. Or did I do something wrong?
@BogdanTitomir yes, I would not recommend using a cross join for big dataframes (it takes a lot of space); maybe select only the relevant columns (cols and och) before the reset_index and try again
I think this method is too complicated for my brain :) but anyway thank you!

Here is one way to approach your problem

def fast(A, B):
    # A: rows of dffinal, B: rows of df2; column 0 is 'och',
    # the remaining columns are the thresholds to compare against
    for a in A:
        m = (B[:, 1:] > a[1:]).all(1)  # df2 rows beating every threshold
        yield B[m, 0].mean(), m.sum()  # (LCH, L#) for this dffinal row

c = ['och', 'och1', 'och2', 'och3', 'cch1']
df1[['LCH', 'L#']] = list(fast(df1[c].to_numpy(), df2[c].to_numpy()))

        date  id  och  och1  och2  och3  cch1       LCH  L#
0  3/27/2020   1 -2.1     3     3     1     5  0.900000   1
1   4/9/2020   2  2.0     1     2     1     3  1.166667   3
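The generator above compares one dffinal row at a time. The same comparison can also be fully broadcast into a single boolean mask of shape (rows of A, rows of B), at the memory cost that made the cross-join answer fail at scale. A sketch on the sample data, using the same column layout ('och' first, thresholds after):

```python
import numpy as np

# column 0 is 'och', columns 1..4 are och1, och2, och3, cch1
A = np.array([[-2.1, 3, 3, 1, 5],     # dffinal rows
              [ 2.0, 1, 2, 1, 3]])
B = np.array([[-0.7, 3, -1, 1, 56],   # df2 rows
              [ 0.9, 6,  4, 3,  7],
              [ 2.5, 2,  3, 2,  4],
              [ 0.1, 4,  3, 5, 10]])

# m[i, j] is True when df2 row j beats dffinal row i on every threshold
m = (B[None, :, 1:] > A[:, None, 1:]).all(axis=2)

# per dffinal row: mean of matching 'och' values and the match count
och = np.where(m, B[:, 0], np.nan)
lch = np.nanmean(och, axis=1)  # -> [0.9, 1.1666...]
l_count = m.sum(axis=1)        # -> [1, 3]
```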

7 Comments

Thank you very much, the code gets the job done and is much faster than my original code, but surprisingly a little slower than the other answers. The code looked very complex and I expected it to be faster than the other answers. But maybe I did something wrong and corrupted the code :)
@BogdanTitomir Not sure about the problem, but in my tests I found this to be 5-6x faster. By the way, what are the shapes of dataframes df1 and df2?
Test sample df1 has 12,000 rows. df2 has 450,000
Thank you, it turns out to be the fastest method. I chose this as the correct answer, but probably I will go with other methods simply because this method feels too complicated to me.
Glad it worked for you. A few more optimizations are possible, but they would only make the code more complex.
