
Please consider the following DataFrame df:

timestamp    id      condition
             1234    A
             2323    B
             3843    B
             1234    C
             8574    A
             9483    A

Based on the value in the condition column, I have to define a new column in this DataFrame which counts how many ids are in that condition. However, note that since the DataFrame is ordered by the timestamp column, there can be multiple entries for the same id, so a simple .cumsum() is not a viable option.

I have come up with the following code, which works properly but is extremely slow:

# I start by defining empty arrays
ids_with_condition_a = np.empty(0)
ids_with_condition_b = np.empty(0)
ids_with_condition_c = np.empty(0)

# Initializing the new column
df['count'] = 0

# Using a for loop to do the task, but this is sooo slow!
for r in range(0, df.shape[0]):
    if df.condition[r] == 'A':
        ids_with_condition_a = np.append(ids_with_condition_a, df.id[r])
    elif df.condition[r] == 'B':
        ids_with_condition_b = np.append(ids_with_condition_b, df.id[r])
        ids_with_condition_a = np.setdiff1d(ids_with_condition_a, ids_with_condition_b)
    elif df.condition[r] == 'C':
        ids_with_condition_c = np.append(ids_with_condition_c, df.id[r])

    df.loc[r, 'count'] = ids_with_condition_a.size

Keeping these NumPy arrays is very useful to me because they give the list of the ids in a particular condition. I would also like to be able to put these arrays dynamically into a corresponding cell of the df DataFrame.

Are you able to come up with a better solution in terms of performance?

    What is expected output? Commented Jul 1, 2018 at 10:48

1 Answer


You need to use groupby on the 'condition' column and cumcount to count how many ids are in each condition up to the current row (which seems to be what your code does):

df['count'] = df.groupby('condition').cumcount()+1 # +1 is to start at 1 not 0

with your input sample, you get:

     id condition  count
0  1234         A      1
1  2323         B      1
2  3843         B      2
3  1234         C      1
4  8574         A      2
5  9483         A      3

which is much faster than using a for loop.
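As a quick sanity check (a minimal sketch on synthetic data; the names here are illustrative), you can confirm that cumcount matches an explicit loop:

```python
import numpy as np
import pandas as pd

# synthetic data: random ids and conditions, in arrival order
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'id': rng.integers(1000, 9999, size=1_000),
    'condition': rng.choice(list('ABC'), size=1_000),
})

# vectorised running count of rows per condition
df['count'] = df.groupby('condition').cumcount() + 1

# reference implementation: an explicit Python loop
seen = {}
expected = []
for cond in df['condition']:
    seen[cond] = seen.get(cond, 0) + 1
    expected.append(seen[cond])

assert df['count'].tolist() == expected
```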

And if you want just the rows with condition A, for example, you can use a mask: print(df[df['condition'] == 'A']) shows only the rows whose condition equals A. So, to get an array:

arr_A = df.loc[df['condition'] == 'A','id'].values
print(arr_A)
# [1234 8574 9483]

EDIT: to create two columns per condition, you can do, for example for condition A:

# put 1 in the column where the condition is met, NaN elsewhere
df['nb_cond_A'] = np.where(df['condition'] == 'A', 1, np.nan)
# then use cumsum to increment the number, ffill to fill the same number down
# where the condition is not met, and fillna(0) for the remaining missing values
df['nb_cond_A'] = df['nb_cond_A'].cumsum().ffill().fillna(0).astype(int)
# for the partial list, first create the full array
arr_A = df.loc[df['condition'] == 'A','id'].values
# create the column with apply (other approaches exist, but this is one way)
df['partial_arr_A'] = df['nb_cond_A'].apply(lambda x: arr_A[:x])

the output looks like this:

     id condition  nb_cond_A       partial_arr_A
0  1234         A          1              [1234]
1  2323         B          1              [1234]
2  3843         B          1              [1234]
3  1234         C          1              [1234]
4  8574         A          2        [1234, 8574]
5  9483         A          3  [1234, 8574, 9483]

Then the same thing for B and C. Maybe a loop for cond in set(df['condition']) would be practical for generalisation.
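Such a loop could be sketched as follows (hypothetical column names nb_cond_X / partial_arr_X, following the pattern above; np.nan is used for non-matching rows so that cumsum skips them):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1234, 2323, 3843, 1234, 8574, 9483],
    'condition': ['A', 'B', 'B', 'C', 'A', 'A'],
})

for cond in df['condition'].unique():
    # 1 where the condition is met, NaN elsewhere
    flag = pd.Series(np.where(df['condition'] == cond, 1, np.nan), index=df.index)
    # running count, carried forward over the non-matching rows
    df[f'nb_cond_{cond}'] = flag.cumsum().ffill().fillna(0).astype(int)
    # full array of ids for this condition, then the prefix seen so far
    arr = df.loc[df['condition'] == cond, 'id'].to_numpy()
    df[f'partial_arr_{cond}'] = df[f'nb_cond_{cond}'].apply(lambda n, a=arr: list(a[:n]))

print(df[['id', 'condition', 'nb_cond_A', 'partial_arr_A']])
```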

EDIT 2: one idea to do what you explained in the comments, though I'm not sure it improves the performance:

# array of unique conditions
arr_cond = df.condition.unique()
# use apply to create, row-wise, the list of ids for each condition
df[arr_cond] = (df.apply(lambda row: (df.loc[:row.name].drop_duplicates('id', keep='last')
                                          .groupby('condition').id.apply(list)), axis=1)
                  .applymap(lambda x: [] if not isinstance(x, list) else x))

Some explanations: for each row, select the DataFrame up to this row (loc[:row.name]), drop the duplicated ids keeping the last occurrence (drop_duplicates('id', keep='last'); in your example this means that once we reach row 3, row 0 is dropped, as the id 1234 appears twice), then the data is grouped by condition (groupby('condition')) and the ids for each condition are put into a list (id.apply(list)). The part starting with applymap fills the missing entries with an empty list (you can't use fillna([]); it's not possible).
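As a small illustration of the drop_duplicates step, on the sample up to row 3 (keep='last' is the keyword form of the positional 'last' above):

```python
import pandas as pd

# the sample up to row 3: id 1234 appears twice (conditions A and C)
sub = pd.DataFrame({
    'id': [1234, 2323, 3843, 1234],
    'condition': ['A', 'B', 'B', 'C'],
})

# keep only the last occurrence of each id: row 0 (1234 / A) is dropped
dedup = sub.drop_duplicates(subset='id', keep='last')

# grouping the remaining ids by condition gives the per-condition lists
ids_by_cond = dedup.groupby('condition')['id'].apply(list).to_dict()
print(ids_by_cond)  # {'B': [2323, 3843], 'C': [1234]}
```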

For the length of each condition's list, you can do:

for cond in arr_cond:
    df['len_{}'.format(cond)] = df[cond].str.len().fillna(0).astype(int)

The result looks like this:

     id condition             A             B       C  len_A  len_B  len_C
0  1234         A        [1234]            []      []      1      0      0
1  2323         B        [1234]        [2323]      []      1      1      0
2  3843         B        [1234]  [2323, 3843]      []      1      2      0
3  1234         C            []  [2323, 3843]  [1234]      0      2      1
4  8574         A        [8574]  [2323, 3843]  [1234]      1      2      1
5  9483         A  [8574, 9483]  [2323, 3843]  [1234]      2      2      1

Comments

Hi, I should admit that your solution is pretty elegant, and thank you for improving my knowledge. I would like to understand whether it is possible to store arr_A in each row (in a dedicated column) in order to get, for each timestamp, the list of ids meeting a particular condition. To be honest, I'm more interested in the size of this array: I need to keep track of the number of ids which change from one condition to another per timestamp
@espogian the thing about the timestamp is not really clear to me, as in your example there is none (besides the name of the column). Are you saying that the example you give is just for one timestamp, and that you have other timestamps with several other rows of ids and conditions?
Each row has its own timestamp, e.g. one row per second. The purpose is to know, for example at row 3, how many ids meet condition A, how many condition B, etc
@espogian so to be sure, you would like 2 columns per condition: one with the number of rows meeting this condition up to the current row (even if this row has a different condition), and a column with the list of ids meeting this condition up to this row?
The nicest thing which I'm still unable to do would be to have one column per condition with the list of ids meeting that condition so far, and one column per condition with the number of ids meeting that condition so far
