please consider the following DataFrame df:
timestamp id condition
1234 A
2323 B
3843 B
1234 C
8574 A
9483 A
Basing on the condition contained in the column condition I have to define a new column in this data frame which counts how many ids are in that condition. However, please note that since the DataFrame is ordered by the timestamp column, one could have multiple entries of the same id and then a simple .cumsum() is not a viable option.
I have come out with the following code, which is working properly but is extremely slow:
#I start defining empty arrays
ids_with_condition_a = np.empty(0)
ids_with_condition_b = np.empty(0)
ids_with_condition_c = np.empty(0)
#Initializing new column
df['count'] = 0
#Using a for loop to do the task, but this is sooo slow!
for r in range(0, df.shape[0]):
if df.condition[r] == 'A':
ids_with_condition_a = np.append(ids_with_condition_a, df.id[r])
elif df.condition[r] == 'B':
ids_with_condition_b = np.append(ids_with_condition_b, df.id[r])
ids_with_condition_a = np.setdiff1d(ids_with_condition_a, ids_with_condition_b)
elifif df.condition[r] == 'C':
ids_with_condition_c = np.append(ids_with_condition_c, df.id[r])
df.count[r] = ids_with_condition_a.size
Keeping these Numpy arrays is very useful to me because it gives the list of the ids in a particular condition. I would also be able to put dinamically these arrays in a corresponding cell in the df DataFrame.
Are you able to come out with a better solution in terms of performance?