
I have two dataframes. My main dataframe is dffinal:

        date  id  och  och1  och2  och3  cch1  LCH  L#
0  3/27/2020   1 -2.1     3     3     1     5  NaN NaN
1   4/9/2020   2  2.0     1     2     1     3  NaN NaN

My second dataframe is df2:

        date  och  cch  och1  och2  och3  cch1
0  5/30/2012 -0.7 -0.7     3    -1     1    56
1  9/16/2013  0.9 -1.0     6     4     3     7
2  9/26/2013  2.5  5.4     2     3     2     4
3  8/26/2016  0.1 -0.7     4     3     5    10

I have this loop:

for i in dffinal.index:    
    df3=df2.copy()
    
    df3 = df3[df3['och1'] >dffinal['och1'].iloc[i]]
    df3 = df3[df3['och2'] >dffinal['och2'].iloc[i]]
    df3 = df3[df3['och3'] >dffinal['och3'].iloc[i]]    
    
    df3 = df3[df3['cch1'] >dffinal['cch1'].iloc[i]]     
    
    dffinal['LCH'][i] =df3["och"].mean()
    dffinal['L#'][i] =len(df3.index)

As is clear from my code, the values of LCH and L# are obtained from df2 (via df3) based on the above conditions.

This code works very well, but it is very slow. I found out that I can improve efficiency with pandas vectorization. However, I could not figure out how to do it for my case.

This is my desired result:

        date  id  och  och1  och2  och3  cch1       LCH   L#
0  3/27/2020   1 -2.1     3     3     1     5  0.900000  1.0
1   4/9/2020   2  2.0     1     2     1     3  1.166667  3.0

I would greatly appreciate it if you could help me increase the efficiency of my code.

Correct answer

I personally use @shadowtalker's answer (the easy method), simply because I can understand how it works.

The most efficient answer is fast but complex.

Comments

  • It helps a lot if you can post the data in CSV or JSON format, so that people can easily load it and test out their answers. Fixed width is less ideal. Commented Jun 30, 2021 at 15:17
  • Also - how is dffinal defined? Commented Jun 30, 2021 at 15:21
  • @shadowtalker sorry, I was trying to write the question according to this guide: stackoverflow.com/a/20159305/15542251. Not sure if I understand you correctly; dffinal is simply my first dataframe. Commented Jun 30, 2021 at 15:24
  • See "include a minimal data frame" for how to include a data frame with your code. Make it easy for others to help you. Commented Jun 30, 2021 at 15:28
  • That is only one support item. Please continue to use "on topic" and "how to ask" from the intro tour. Commented Jun 30, 2021 at 15:41

4 Answers


It may be very difficult to avoid iteration with the logic you have in place to select a subset of rows in df2 for a given dffinal row, but you should be able to speed up the iterative method (hopefully by a lot) using the code below.

Note: if you're repeatedly accessing the row of the dataframe you're iterating through, use .iterrows so you can grab values much more simply (and quickly).

for i, row in dffinal.iterrows():
    # select the 'och' values of df2 rows that beat this row on every threshold
    och_array = df2.loc[(df2['och1'] > row['och1']) &
                        (df2['och2'] > row['och2']) &
                        (df2['och3'] > row['och3']) &
                        (df2['cch1'] > row['cch1']), 'och'].values
    dffinal.at[i, 'LCH'] = och_array.mean()
    dffinal.at[i, 'L#'] = len(och_array)

This avoids lookups in dffinal and avoids creating a new copy of the dataframe on every iteration. I can't test this without a data sample, but I think it will work.


5 Comments

Thank you, apparently it is not possible to use vectorization in my case. I tried your code, but I get this error: "None of [Index(['LCH'], dtype='object')] are in the [columns]". This is the code I used: for i,row in dffinal.iterrows(): df_stats = df2.loc[(df2['och1'] >row['och1']) & (df2['och2'] >row['och2']) & (df2['och3'] >row['och3']) & (df2['cch1'] >row['cch1']),['LCH']].mean(); dffinal.at[i,'LCH'] = df_stats['LCH']
Note that itertuples should be even faster than iterrows, and might be more "dtype-safe" as well.
@shadowtalker could you please show me an example? I have never tried itertuples before
@BogdanTitomir I edited the code, I didn't read carefully enough how you calculated LCH and L#, I think it should work now
Thank you, your code works perfectly and it significantly improved the performance of my code. I chose @shadowtalker answer as correct simply because it is slightly faster. Unfortunately, I have to choose only one answer as correct

This answer is based on https://stackoverflow.com/a/68197271/2954547, except that it uses itertuples instead of iterrows. itertuples is generally safer than iterrows, because it preserves dtypes correctly. See the "Notes" section of the DataFrame.iterrows documentation.

It is also self-contained, in that it can be executed top-to-bottom without having to copy/paste data.

Note that I iterate over df1.itertuples and not df_final.itertuples. Never mutate something that you are iterating over, and never iterate over something that you are mutating. Modifying a DataFrame in-place is a form of mutation.

import io

import pandas as pd


data1_txt = """
     date  id  och  och1  och2  och3  cch1  LCH  L#
3/27/2020   1 -2.1     3     3     1     5  NaN NaN
 4/9/2020   2  2.0     1     2     1     3  NaN NaN
"""

data2_txt = """
     date  och  cch  och1  och2  och3  cch1
5/30/2012 -0.7 -0.7     3    -1     1    56
9/16/2013  0.9 -1.0     6     4     3     7
9/26/2013  2.5  5.4     2     3     2     4
8/26/2016  0.1 -0.7     4     3     5    10
"""

df1 = pd.read_fwf(io.StringIO(data1_txt), index_col='id')
df2 = pd.read_fwf(io.StringIO(data2_txt))

df_final = df1.copy()

for row in df1.itertuples():
    row_mask = (
        (df2['och1'] > row.och1) &
        (df2['och2'] > row.och2) &
        (df2['och3'] > row.och3) &
        (df2['cch1'] > row.cch1)
    )
    och_vals = df2.loc[row_mask, 'och']
    i = row.Index
    df_final.at[i, 'LCH'] = och_vals.mean()
    df_final.at[i, 'L#'] = len(och_vals)

print(df_final)

The output is

         date  och  och1  och2  och3  cch1  LCH  L#       LCH   L#
id                                                                
1   3/27/2020 -2.1     3     3     1     5  NaN NaN  0.900000  1.0
2    4/9/2020  2.0     1     2     1     3  NaN NaN  1.166667  3.0

6 Comments

I haven't really spent much time answering questions here until this week, but will definitely steal that StringIO technique for getting print statement output from questions into my notebooks. Also thanks for showing this example. My suspicion regarding optimization is that iterrows vs itertuples will not make a big difference because the vast majority of the computation is in selecting the correct rows from df2, but it's cool to see this implementation!
Good point. The main benefit of itertuples is not speed but dtype-safety. See the "Notes" section of pandas.pydata.org/pandas-docs/stable/reference/api/…
How can I add a new line df_final.at[i, 'SumPosNeg']= which needs to be equal to sum of all positive och values divided by the sum of all negative och values or sum(och>0)/sum(och<0)
I.e. the new column results should be equal to 2.0/(-2.1)=-0.95. I got 2.0 and 2.1 from the final results dataframe.
@BogdanTitomir it might be worthwhile to spend some time understanding mine and Clay's answers, so that you can add your own extensions or modifications as needed.
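The SumPosNeg follow-up asked in the comments can be sketched as follows. Here och_vals stands in for the filtered 'och' Series computed inside the loop of this answer, and the guard against division by zero is my addition:

```python
import pandas as pd

# stand-in for the filtered 'och' values computed inside the loop
och_vals = pd.Series([2.0, -2.1])

pos = och_vals[och_vals > 0].sum()   # sum of positive och values -> 2.0
neg = och_vals[och_vals < 0].sum()   # sum of negative och values -> -2.1
# inside the loop this would become: df_final.at[i, 'SumPosNeg'] = pos / neg
sum_pos_neg = pos / neg if neg != 0 else float('nan')  # 2.0 / -2.1 ≈ -0.95
```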

The only way I can think of using pandas methods without loops is a cross join after resetting the index, then comparing with df.all(1):

cols = ['och1','och2','och3','cch1']
u = df2.reset_index().assign(k=1).merge(
    dffinal.reset_index().assign(k=1), on='k', suffixes=('','_y'))
# newer versions of pandas include how='cross' for this

dffinal['NewLCH'] = (u[u[cols].gt(u[[f"{i}_y" for i in cols]].to_numpy()).all(1)]
                     .groupby("index_y")['och'].mean())

print(dffinal)

        date  id  och  och1  och2  och3  cch1  LCH  L#    NewLCH
0  3/27/2020   1 -2.1     3     3     1     5  NaN NaN  0.900000
1   4/9/2020   2  2.0     1     2     1     3  NaN NaN  1.166667
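As the code comment notes, pandas 1.2+ provides merge(how='cross'), which replaces the k=1 trick. A self-contained sketch of the same computation, assuming that pandas version:

```python
import io

import pandas as pd

dffinal = pd.read_fwf(io.StringIO("""
     date  id  och  och1  och2  och3  cch1
3/27/2020   1 -2.1     3     3     1     5
 4/9/2020   2  2.0     1     2     1     3
"""))

df2 = pd.read_fwf(io.StringIO("""
     date  och  cch  och1  och2  och3  cch1
5/30/2012 -0.7 -0.7     3    -1     1    56
9/16/2013  0.9 -1.0     6     4     3     7
9/26/2013  2.5  5.4     2     3     2     4
8/26/2016  0.1 -0.7     4     3     5    10
"""))

cols = ['och1', 'och2', 'och3', 'cch1']

# pair every df2 row with every dffinal row (pandas >= 1.2)
u = df2.reset_index().merge(dffinal.reset_index(),
                            how='cross', suffixes=('', '_y'))
# keep pairs where the df2 side beats every threshold of the dffinal side
mask = u[cols].gt(u[[f"{c}_y" for c in cols]].to_numpy()).all(1)
dffinal['NewLCH'] = u[mask].groupby('index_y')['och'].mean()
```

The memory caveat from the comments still applies: the cross join materializes len(df2) × len(dffinal) rows.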

3 Comments

I tried your code, but I get an error that says "Unable to allocate 39.9 GiB for an array with shape (5358055840,) and data type int64", which I understand means this code requires 40 GiB of RAM. Or did I do something wrong?
@BogdanTitomir yes, I would not recommend using a cross join for big dataframes (it takes a lot of space); maybe select only the relevant columns (cols and och) before the reset_index and try again
I think this method is too complicated for my brain :) but anyway thank you!

Here is one way to approach your problem

def fast(A, B):
    # A: rows of dffinal, B: rows of df2; column 0 is 'och',
    # the remaining columns are the thresholds to compare against
    for a in A:
        m = (B[:, 1:] > a[1:]).all(1)  # df2 rows beating every threshold
        yield B[m, 0].mean(), m.sum()  # (LCH, L#) for this dffinal row

c = ['och', 'och1', 'och2', 'och3', 'cch1']
df1[['LCH', 'L#']] = list(fast(df1[c].to_numpy(), df2[c].to_numpy()))

        date  id  och  och1  och2  och3  cch1       LCH  L#
0  3/27/2020   1 -2.1     3     3     1     5  0.900000   1
1   4/9/2020   2  2.0     1     2     1     3  1.166667   3
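The generator above compares one dffinal row at a time. The same comparison can also be fully broadcast into a single boolean mask of shape (rows of A, rows of B), at the memory cost that made the cross-join answer fail at scale. A sketch on the sample data, using the same column layout ('och' first, thresholds after):

```python
import numpy as np

# column 0 is 'och', columns 1..4 are och1, och2, och3, cch1
A = np.array([[-2.1, 3, 3, 1, 5],     # dffinal rows
              [ 2.0, 1, 2, 1, 3]])
B = np.array([[-0.7, 3, -1, 1, 56],   # df2 rows
              [ 0.9, 6,  4, 3,  7],
              [ 2.5, 2,  3, 2,  4],
              [ 0.1, 4,  3, 5, 10]])

# m[i, j] is True when df2 row j beats dffinal row i on every threshold
m = (B[None, :, 1:] > A[:, None, 1:]).all(axis=2)

# per dffinal row: mean of matching 'och' values and the match count
och = np.where(m, B[:, 0], np.nan)
lch = np.nanmean(och, axis=1)  # -> [0.9, 1.1666...]
l_count = m.sum(axis=1)        # -> [1, 3]
```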

7 Comments

Thank you very much, the code gets the job done and is much faster than my original code, but surprisingly a little slower than the other answers. The code looked very complex and I expected it to be faster than the other answers. But maybe I did something wrong and corrupted the code :)
@BogdanTitomir Not sure about the problem, but in my tests I found this to be 5-6x faster. By the way, what are the shapes of dataframes df1 and df2?
Test sample df1 has 12,000 rows. df2 has 450,000
Thank you, it turns out to be the fastest method. I chose this as the correct answer, but probably I will go with other methods simply because this method feels too complicated to me.
Glad it worked for you. A few more optimizations are possible, but they would only make the code more complex.
