I have two dataframes, df and ct
df: UNIT, START, END, CHECK are relevant columns. CHECK values are all defaulted to 'n'. My code tests them and if they pass changes this value to 'y'
df.shape = 59155, 5
| UNIT | DTHours | START | END | CHECK | |
|---|---|---|---|---|---|
| 0 | A7000 | 0.1097 | 43100.0321 | 43382.1417 | n |
| 1 | A7000 | 0.0125 | 43105.6621 | 43382.6736 | n |
| 2 | A7000 | 0.2042 | 43200.1214 | 43388.3465 | n |
| 3 | A7000 | 0.1521 | 43389.2979 | 43854.4500 | n |
| 4 | A7002 | 0.1455 | 44198.5049 | 44199.0009 | n |
ct: UNIT, START, END are unique combinations, but each UNIT can have multiple entries (these are effectively start/end dates). On average each UNIT row repeats ~5 times
ct.shape = 6219, 3
| UNIT | START | END | |
|---|---|---|---|
| 0 | A7000 | 43101 | 43156 |
| 1 | A7000 | 43101 | 43173 |
| 2 | A7000 | 43172 | 43616 |
| 3 | A7000 | 43172 | 43712 |
| 4 | A7002 | 43172 | 44196 |
I am testing if the values df['START'] and df['END '] are in between any of the instances of that UNIT in ct. However the code is taking exceptionally long. It has iterated through ~12000 rows in 12 hours. What am I doing wrong?
Pseudocode:
for loop iterating through all rows of df
create list of all cf[START] values for current df[UNIT]
create list of all cf[END] values for current df[UNIT]
test if the df[CHECK] value is 'n' viz default value
for loop iterating over length of cf[START] (len = cf[END]), index
test if (df[START]>= cf[START][index]) AND (df[END]>= cf[END][index])
change df[CHECK] to 'y'
break for loop iterating over cf[START] for this row in df
Actual code:
for index, row in df.iterrows():
display.clear_output(wait=True)
display.display(plt.gcf())
df_unit = row['UNIT']
ct_start = ct.loc[ct['UNIT'] == df_unit]['START'].values
ct_end = ct.loc[ct['UNIT'] == df_unit]['END'].values
if (row['UNIT'] != 'y'):
for ct_index in range(len(ct_start)):
if ((row['START'] >= ct_start[ct_index]) & (row['END'] <=
ct_end[ct_index])):
row['CHECK'] == 'y'
aaa += 1
break
plt.scatter(index,aaa, c='r')
Expected Output for the dataframes above
| UNIT | DTHours | START | END | CHECK | |
|---|---|---|---|---|---|
| 0 | A7000 | 0.1097 | 43100.0321 | 43382.1417 | n |
| 1 | A7000 | 0.0125 | 43105.6621 | 43382.6736 | n |
| 2 | A7000 | 0.2042 | 43200.1214 | 43388.3465 | y |
| 3 | A7000 | 0.1521 | 43389.2979 | 43854.4500 | n |
| 4 | A7002 | 0.1455 | 44198.5049 | 44199.0009 | n |