
I currently have a dataframe with the following structure.

(screenshot: current dataframe)

I saw this post here, in which the second answer says that using NumPy arrays for looping over a huge dataframe is the fastest approach.

This is my requirement:

  1. Loop through the unique dates.
  2. Within each unique date in the dataframe, loop through the unique sessions.
  3. Once I'm inside a unique session within a unique date, I need to do some operations.

Currently I'm using a for loop, but it's unbearably slow. Can anyone suggest how to use NumPy arrays to meet my requirements, as suggested in the post linked above?

EDIT:

I'm elaborating my requirement here:

  1. Loop through the unique dates, which would give me the following dataframe: (screenshot: unique days)
  2. Within each unique date, loop through the unique sessionId's, which would give me something like this: (screenshot: unique sessions)
  3. Once within a unique sessionId within a unique date, find the timestamp difference between the last element and the first element. This time difference is added to a list, one entry per unique session.
  4. Outside the 2nd loop, take the average of the list created in the above step.
  5. The value from step 4 is added to another list.

The aim is to find the average time difference between the last and first message of each session, per day.
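For clarity, here is roughly what my current loop-based code looks like (the column names ChatDate, sessionId and timestamp are placeholders for the actual columns in my dataframe):

import pandas as pd

# rough sketch of the current slow approach: nested Python loops
# over unique dates and unique sessions within each date
daily_averages = []
for date in df['ChatDate'].unique():
    day_df = df[df['ChatDate'] == date]
    session_diffs = []
    for session in day_df['sessionId'].unique():
        session_df = day_df[day_df['sessionId'] == session]
        # timestamp difference between the last and first message of the session (step 3)
        diff = session_df['timestamp'].iloc[-1] - session_df['timestamp'].iloc[0]
        session_diffs.append(diff)
    # average session duration for this date (steps 4 and 5)
    daily_averages.append(pd.Series(session_diffs).mean())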

  • Your possibilities depend on the operations you need to do in step 3. NumPy arrays would help if you are doing some matrix operations; otherwise it should not matter whether you are using NumPy arrays or some other kind of variables. You should specify/bring out more details of your problem. Loops in Python can be really slow, and I have sometimes used a combination of Python and Fortran (f2py) or Python and Java to speed up the program. Commented Sep 24, 2018 at 11:35
  • df.groupby(["ChatDate", "sessionId"]).apply(lambda x: some_operations(x))? Commented Sep 24, 2018 at 11:35
  • We can help you better if you give us at least a contrived example of what you actually want to do with the data, including desired output for the example input you gave. Commented Sep 24, 2018 at 11:46
  • You can also try using multiprocessing docs.python.org/2/library/multiprocessing.html Commented Sep 24, 2018 at 11:49
  • @msi_gerva, I have updated the requirements as per your suggestion. Can you please check now and let me know if you have a solution? I have tried itertuples() instead of the for loop as well, but even that is very slow. Commented Sep 24, 2018 at 12:52

1 Answer


Use groupby:

grouped = df.groupby(['ChatDate", "sessionId"])
timediff = grouped.timestamp.last() - grouped.timestamp.first() # or max-min
timediff.mean() # this is your step 4
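
If you need the average per day (your step 5) rather than one overall average, you can group the per-session differences by the date level of the resulting index. A minimal sketch, assuming the columns are named ChatDate, sessionId and timestamp and that timestamp is a datetime column:

# per-session duration: last timestamp minus first timestamp within each (ChatDate, sessionId) group
grouped = df.groupby(["ChatDate", "sessionId"])
timediff = grouped["timestamp"].last() - grouped["timestamp"].first()

# average session duration per day: one value per ChatDate
per_day_avg = timediff.groupby(level="ChatDate").mean()
print(per_day_avg)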