1

I have two dataframes, df and ct

df: UNIT, START, END, CHECK are relevant columns. CHECK values are all defaulted to 'n'. My code tests them and if they pass changes this value to 'y'

df.shape = 59155, 5
UNIT DTHours START END CHECK
0 A7000 0.1097 43100.0321 43382.1417 n
1 A7000 0.0125 43105.6621 43382.6736 n
2 A7000 0.2042 43200.1214 43388.3465 n
3 A7000 0.1521 43389.2979 43854.4500 n
4 A7002 0.1455 44198.5049 44199.0009 n

ct: UNIT, START, END are unique combinations, but each UNIT can have multiple entries (these are effectively start/end dates). On average each UNIT row repeats ~5 times

ct.shape = 6219, 3
UNIT START END
0 A7000 43101 43156
1 A7000 43101 43173
2 A7000 43172 43616
3 A7000 43172 43712
4 A7002 43172 44196

I am testing if the values df['START'] and df['END '] are in between any of the instances of that UNIT in ct. However the code is taking exceptionally long. It has iterated through ~12000 rows in 12 hours. What am I doing wrong?

Pseudocode:

 for loop iterating through all rows of df
      create list of all cf[START] values for current df[UNIT]
      create list of all cf[END] values for current df[UNIT]

      test if the df[CHECK] value is 'n' viz default value
           for loop iterating over length of cf[START] (len = cf[END]), index
                test if (df[START]>= cf[START][index]) AND (df[END]>= cf[END][index])
                     change df[CHECK] to 'y'
                     break for loop iterating over cf[START] for this row in df

Actual code:

for index, row in df.iterrows():
    display.clear_output(wait=True)
    display.display(plt.gcf())
    
    df_unit = row['UNIT']
    ct_start = ct.loc[ct['UNIT'] == df_unit]['START'].values
    ct_end = ct.loc[ct['UNIT'] == df_unit]['END'].values
    
    if (row['UNIT'] != 'y'):
        for ct_index in range(len(ct_start)):
            if ((row['START'] >= ct_start[ct_index]) & (row['END'] <= 
ct_end[ct_index])):
                row['CHECK'] == 'y'
                aaa += 1
                break
        
    plt.scatter(index,aaa, c='r')

Expected Output for the dataframes above

UNIT DTHours START END CHECK
0 A7000 0.1097 43100.0321 43382.1417 n
1 A7000 0.0125 43105.6621 43382.6736 n
2 A7000 0.2042 43200.1214 43388.3465 y
3 A7000 0.1521 43389.2979 43854.4500 n
4 A7002 0.1455 44198.5049 44199.0009 n
1
  • Please include your expected output for the given sample dataframes. Commented Jun 22, 2021 at 14:46

2 Answers 2

1

As a rule of thumb: 90% of the time if you are using a loop on a dataframe, you are doing something wrong

In your case:

  • rather than doing for ... if, it might be a good idea to filter the dataframe: df[df["CHECK"] == "y"]
  • you are constantly filtering the ct dataframe, for EACH row. You are better off grouping this dataframe once and transforming that into a dictionary that you can access by unit:
    • units = dict(list(ct.groupby("UNIT")))
    • for index in units[df_unit]: do_something()

This should already considerably improve the performance

Sign up to request clarification or add additional context in comments.

Comments

0

We can approach the problem using the following steps

  • Reset the index of df and merge it with ct on UNIT
  • Query the resulting merged dataframe to find the indices where the corresponding values in START and END satisfies the condition of inclusion
  • Using boolean indexing with loc update the values in CHECK column to y using the indices obtained in previous step
s  = df.reset_index().merge(ct, on='UNIT', suffixes=['', '_r'])
ix = s.query("START >= START_r and END <= END_r")['index']
df.loc[df.index.isin(ix), 'CHECK'] = 'y'

    UNIT  DTHours       START         END CHECK
0  A7000   0.1097  43100.0321  43382.1417     n
1  A7000   0.0125  43105.6621  43382.6736     n
2  A7000   0.2042  43200.1214  43388.3465     y
3  A7000   0.1521  43389.2979  43854.4500     n
4  A7002   0.1455  44198.5049  44199.0009     n

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.