Pandas Dataframe inefficient for loop through columns

Question

I have Precipitation Data (1800 rows and 15k columns) for each cell and date.

                          486335  486336  486337
2019-07-03 13:35:54.445       0       2      22
2019-07-04 13:35:54.445       0       1       1
2019-07-05 13:35:54.445      16       8      22
2019-07-06 13:35:54.445       0       0       0
2019-07-07 13:35:54.445       0      11       0

I want to find dates where a specific amount of rain (>15mm) was reached and count the days after this event were less rain (<1,1mm) occurred. Together with the amount of rain, start and end period, cell and other information stored in a new DataFrame.

I write a for loop that does this job but it took several days to finish ;(. I am a beginner of python so maybe there are some tips for other methods.

from datetime import datetime, timedelta, date
import datetime
import pandas as pd

#Existing Data
index_dates =  pd.date_range(pd.datetime.today(), periods=10).tolist()
df = pd.DataFrame({'486335':[0,0,16,0,0,0,2,1,8,2],'486336':[2,1,8,0,11,16,0,1,6,8],'486337':[22,1,22,0,0,0,5,3,6,1]},index=index_dates)
columns = df.columns 
counter_columns = 0

iteration = -1 #Iterations Steps
counter = 10 #10 precipitation values per column
duration = 0 #days with no or less than pp_max_1 rain 
count = False

index_list = df.index #Index for updating df / Integear
period_range = 0  #Amount of days after Event without much rain Integear
period_amount = 0 #Amount of PP in dry days except event Integear
event_amount = 0.0  #Amount of heavy rainfall on the event date Float
pp = 0 #actual precipitation
pp_sum = 0.0 #mm
pp_min = 15.0 #mm min pp for start to count dry days until duration_min_after
pp_max_1 = 0.11 #max pp for 1 day while counting dry days
dry_days = 0 #dry days after event

for x in df:
    for y in df[x]:
        iteration = iteration + 1
        if iteration == counter:
            iteration = 0
            counter_columns = counter_columns + 1
            print("column :",counter_columns, "finished")
        if y >= pp_min and count == False:
            duration = duration + 1
            count = True
            start_period = index_list[iteration]
            event_amount = y
            index = iteration
            pp_sum = pp_sum + y
        elif y >= pp_min and count == True or y >= pp_max_1 and count == True:
            end_period = index_list[iteration]
            dry_periods = dry_periods.append({"start_period":start_period ,"end_period":end_period,"period_range":duration,"period_amount":pp_sum ,"event_amount":event_amount, "cell":columns[counter_columns]},ignore_index=True).sort_values('period_range',ascending=False)
            duration = 0
            count = False
            pp_sum = 0
        elif pp <= pp_max_1 and count == True:
            duration = duration + 1
            pp_sum = pp_sum + y
        else:
            continue
print(dry_periods)

The output looks like this

start_period              end_period period_range  \
0  2019-07-05 13:15:05.545 2019-07-09 13:15:05.545            4   
1  2019-07-05 13:15:05.545 2019-07-09 13:15:05.545            4   
2  2019-07-05 13:15:36.569 2019-07-09 13:15:36.569            4   
3  2019-07-05 13:15:36.569 2019-07-09 13:15:36.569            4   
4  2019-07-05 13:16:16.372 2019-07-09 13:16:16.372            4   
5  2019-07-05 13:16:16.372 2019-07-09 13:16:16.372            4   


    period_amount event_amount    cell  
0            16.0           16  486335  
1            22.0           22  486337  
2            16.0           16  486335  
3            22.0           22  486337  
4            16.0           16  486335  
5            22.0           22  486337

@Hanggy asked: what is inside the columns ? ( can't comment because of reputation) — Serge Ballesta
– Serge Ballesta, Commented Jul 3, 2019 at 11:53
In all it doesn't look that bad. I only can imagine, that probably the index access consumes time. Could you try how the performance changes if you replace "for y in df[x]:" by "for period, y in df[x].items():" and just set "start_period=period" everywhere you currently do the index access (end_period likewise)? I hope that performs better. This way, you could also get rid of your "iteration"-variable and the technical code associated with it. — jottbe
– jottbe, Commented Jul 3, 2019 at 12:42
Ah one more thing, I think you can get rid of "if iteration == counter:" as well. I would rather add the code executed here in the outer loop (behind the inner loop). Probably that doesn't save a lot of runtime, but it makes the code easier to understand and easier to maintain, because you don't need to know how many iterations the code performs before you enter the loop (I think you can get rid of the "counter" variable this way as well). — jottbe
– jottbe, Commented Jul 3, 2019 at 12:50
loops in python are the most memory inefficient solutions, always try to built a vectorized solution of the dataframe, in your dataframe, the phenomen of rain >15 and rain <1.6 will occure more often right? — PV8
– PV8, Commented Jul 3, 2019 at 13:09
complete vectarization is not possible, but serge ballesta gives you a good approach — PV8
– PV8, Commented Jul 4, 2019 at 12:18

Serge Ballesta · Accepted Answer · 2019-07-03 13:24:50Z

3

You could avoid iteration on rows, because it does not scale well for large dataframes.

This is a different approach, unsure whether it will be more efficient for your complete dataframe:

periods=[]
for cell in df.columns:
    sub = pd.DataFrame({'amount': df[cell].values}, index=df.index)
    sub['flag'] = pd.cut(sub['amount'], [0.11, 15, np.inf],
                         labels=[0, 1]).astype(np.float)
    sub.loc[sub.flag>0, 'flag']=sub.loc[sub.flag>0, 'flag'].cumsum()
    sub.flag.ffill(inplace=True)
    x = sub[sub.flag>0].reset_index().groupby('flag').agg(
        {'index':['min', 'max'], 'amount': 'sum'})
    x.columns = ['start', 'end', 'amount']
    x['period_range'] = (x.end - x.start).dt.days + 1
    x['cell'] = cell
    x.reindex(columns=['start', 'end', 'period_range', 'cell'])
    periods.append(x)

resul = pd.concat(periods).reset_index(drop=True)

answered Jul 3, 2019 at 13:24

Serge Ballesta

150k13 gold badges137 silver badges267 bronze badges

Sign up to request clarification or add additional context in comments.

9 Comments

jottbe Over a year ago

Nice! do you actually need the ffill above? wouldn't it give the same result, if you skip the loc[sub.flag>0 instead and also sum over the zeros?

jottbe Over a year ago

The period length is the length measured from the beginning of the first period to the end of the last period, right?

Serge Ballesta Over a year ago

@jottbe: the problem is that any value between 0.11 and 15 will interrupt a current dry period without starting a new group.

Serge Ballesta Over a year ago

The period lenght is the number of days between the start of an event and the last day of that event + 1.

jottbe Over a year ago

Nice solution. I haven't come across pd.cut so far. I'm sure it will make my life simpler. But how can you handle the case when several events occur in the same column? or does it already do that?

|

jottbe · Accepted Answer · 2019-07-03 13:46:48Z

Because I don't have your whole data set I can't really say what consumes the time, but I guess it is because of the index accesses, when you fetch the periods and the sort operations you perform in the loop. Maybe you'd like to try the following code. It should be logically equivalent to your code, except for some changes:

duration = 0 #days with no or less than pp_max_1 rain 
count = False

index_list = df.index #Index for updating df / Integear
period_range = 0  #Amount of days after Event without much rain Integear
period_amount = 0 #Amount of PP in dry days except event Integear
event_amount = 0.0  #Amount of heavy rainfall on the event date Float
pp = 0 #actual precipitation
pp_sum = 0.0 #mm
pp_min = 15.0 #mm min pp for start to count dry days until duration_min_after
pp_max_1 = 0.11 #max pp for 1 day while counting dry days
dry_days = 0 #dry days after event
dry_periods= list()

for counter_columns, column in enumerate(df.columns, 1):
    for period, y in df[column].items():
        if not count and y >= pp_min:
            duration += 1
            count = True
            start_period = period
            event_amount = y
            pp_sum += y
        elif count and (y >= pp_min or y >= pp_max_1):
            end_period = period
            dry_periods.append({
                    "start_period":  start_period ,
                    "end_period":    end_period,
                    "period_range":  duration,
                    "period_amount": pp_sum ,
                    "event_amount":  event_amount, 
                    "cell":          column})
            duration = 0
            count =    False
            pp_sum =   0
        elif count and pp <= pp_max_1:
            duration += 1
            pp_sum   += y
    print("column :",counter_columns, "finished")

dry_periods.sort(key=lambda record: record['period_range'])
print(dry_periods)

The changes are:

removed the index_list[iteration] accesses, which I think could consume some time
removed the whole iteration counter logic, because the logic associated with it can be placed outside the inner loop, that way the inner loop gets smaller, though it probably doesn't really increase the performance that much
the comparison count == True is not necessary, you can just write count instead in the if clause
changed the incrementation and sum logic from var = var + num to var += num (that's probably a matter of taste, you can also skip this if you like, it won't have such a big influence on the performance)
then I put the sort logic of you dry_periods outside the loop because it seems to me that your loop logic does not rely on the set to be sorted --> maybe this is even the biggest influence in performance

Btw. Because I didn't know how dry_periods is exactly defined, I just used it as a list. Please also have a look at the condition

elif count and (y >= pp_min or y >= pp_max_1):

above. It looks suspicious to me, but it is just the rewritten condition from your program. If it is ok, probably you can remove one of the comparisons, because I guess pp_min < pp_max_1, right?

Collectives™ on Stack Overflow

Pandas Dataframe inefficient for loop through columns

2 Answers 2

9 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

9 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related