Optimizing python dataframe iteration loop

Question

I have two dataframes, df and ct

df: UNIT, START, END, CHECK are relevant columns. CHECK values are all defaulted to 'n'. My code tests them and if they pass changes this value to 'y'

df.shape = 59155, 5

	UNIT	DTHours	START	END	CHECK
0	A7000	0.1097	43100.0321	43382.1417	n
1	A7000	0.0125	43105.6621	43382.6736	n
2	A7000	0.2042	43200.1214	43388.3465	n
3	A7000	0.1521	43389.2979	43854.4500	n
4	A7002	0.1455	44198.5049	44199.0009	n

ct: UNIT, START, END are unique combinations, but each UNIT can have multiple entries (these are effectively start/end dates). On average each UNIT row repeats ~5 times

ct.shape = 6219, 3

	UNIT	START	END
0	A7000	43101	43156
1	A7000	43101	43173
2	A7000	43172	43616
3	A7000	43172	43712
4	A7002	43172	44196

I am testing if the values df['START'] and df['END '] are in between any of the instances of that UNIT in ct. However the code is taking exceptionally long. It has iterated through ~12000 rows in 12 hours. What am I doing wrong?

Pseudocode:

 for loop iterating through all rows of df
      create list of all cf[START] values for current df[UNIT]
      create list of all cf[END] values for current df[UNIT]

      test if the df[CHECK] value is 'n' viz default value
           for loop iterating over length of cf[START] (len = cf[END]), index
                test if (df[START]>= cf[START][index]) AND (df[END]>= cf[END][index])
                     change df[CHECK] to 'y'
                     break for loop iterating over cf[START] for this row in df

Actual code:

for index, row in df.iterrows():
    display.clear_output(wait=True)
    display.display(plt.gcf())
    
    df_unit = row['UNIT']
    ct_start = ct.loc[ct['UNIT'] == df_unit]['START'].values
    ct_end = ct.loc[ct['UNIT'] == df_unit]['END'].values
    
    if (row['UNIT'] != 'y'):
        for ct_index in range(len(ct_start)):
            if ((row['START'] >= ct_start[ct_index]) & (row['END'] <= 
ct_end[ct_index])):
                row['CHECK'] == 'y'
                aaa += 1
                break
        
    plt.scatter(index,aaa, c='r')

Expected Output for the dataframes above

	UNIT	DTHours	START	END	CHECK
0	A7000	0.1097	43100.0321	43382.1417	n
1	A7000	0.0125	43105.6621	43382.6736	n
2	A7000	0.2042	43200.1214	43388.3465	y
3	A7000	0.1521	43389.2979	43854.4500	n
4	A7002	0.1455	44198.5049	44199.0009	n

Please include your expected output for the given sample dataframes. — Shubham Sharma
– Shubham Sharma, Commented Jun 22, 2021 at 14:46

qmeeus · Accepted Answer · 2021-06-22 15:00:54Z

1

As a rule of thumb: 90% of the time if you are using a loop on a dataframe, you are doing something wrong

In your case:

rather than doing for ... if, it might be a good idea to filter the dataframe: df[df["CHECK"] == "y"]
you are constantly filtering the ct dataframe, for EACH row. You are better off grouping this dataframe once and transforming that into a dictionary that you can access by unit:
- units = dict(list(ct.groupby("UNIT")))
- for index in units[df_unit]: do_something()

This should already considerably improve the performance

answered Jun 22, 2021 at 15:00

qmeeus

2,4612 gold badges15 silver badges22 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Shubham Sharma · Accepted Answer · 2021-06-22 15:11:02Z

We can approach the problem using the following steps

Reset the index of df and merge it with ct on UNIT
Query the resulting merged dataframe to find the indices where the corresponding values in START and END satisfies the condition of inclusion
Using boolean indexing with loc update the values in CHECK column to y using the indices obtained in previous step

s  = df.reset_index().merge(ct, on='UNIT', suffixes=['', '_r'])
ix = s.query("START >= START_r and END <= END_r")['index']
df.loc[df.index.isin(ix), 'CHECK'] = 'y'

    UNIT  DTHours       START         END CHECK
0  A7000   0.1097  43100.0321  43382.1417     n
1  A7000   0.0125  43105.6621  43382.6736     n
2  A7000   0.2042  43200.1214  43388.3465     y
3  A7000   0.1521  43389.2979  43854.4500     n
4  A7002   0.1455  44198.5049  44199.0009     n

Collectives™ on Stack Overflow

Optimizing python dataframe iteration loop

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related