
I currently have a dataframe with the following structure.

(screenshot: current dataframe)

I saw this post here, in which the second answer says that using NumPy arrays for looping over a huge dataframe is the fastest approach.

This is my requirement:

  1. Loop through the unique dates.
  2. Within each unique date in the dataframe, loop through the unique sessions.
  3. Once I'm inside a unique session within a unique date, I need to do some operations.

Currently I'm using a for loop, but it's unbearably slow. Can anyone suggest how to use NumPy arrays to meet my requirements, as suggested in the post linked above?

EDIT:

I'm elaborating my requirement here:

  1. Loop through the unique dates, which would give me the following dataframe: (screenshot: unique days)
  2. Within each unique date, loop through the unique sessionId's, which would give me something like this: (screenshot: unique sessions)
  3. Once within a unique sessionId within a unique date, find the timestamp difference between the last element and the first element. This time difference is added to a list, one entry per unique session.
  4. Outside the 2nd loop, take the average of the list created in the above step.
  5. The value from step 4 is added to another list.

The aim is to find the average time difference between the last and first message of each session, per day.
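For clarity, here is roughly what my current loop-based code looks like (the column names ChatDate, sessionId and timestamp are placeholders for the actual columns in my dataframe):

import pandas as pd

# rough sketch of the current slow approach: nested Python loops
# over unique dates and unique sessions within each date
daily_averages = []
for date in df['ChatDate'].unique():
    day_df = df[df['ChatDate'] == date]
    session_diffs = []
    for session in day_df['sessionId'].unique():
        session_df = day_df[day_df['sessionId'] == session]
        # timestamp difference between the last and first message of the session (step 3)
        diff = session_df['timestamp'].iloc[-1] - session_df['timestamp'].iloc[0]
        session_diffs.append(diff)
    # average session duration for this date (steps 4 and 5)
    daily_averages.append(pd.Series(session_diffs).mean())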

  • Your possibilities depend on the operations you need to do in step 3. NumPy arrays would help if you are doing some matrix operations; otherwise it should not matter whether you are using NumPy arrays or some other kind of variables. You should specify/bring out more details of your problem. Loops in Python can be really slow, and I have sometimes used a combination of Python and Fortran (f2py) or Python and Java to speed up the program. Commented Sep 24, 2018 at 11:35
  • df.groupby(["ChatDate", "sessionId"]).apply(lambda x: some_operations(x))? Commented Sep 24, 2018 at 11:35
  • We can help you better if you give us at least a contrived example of what you actually want to do with the data, including desired output for the example input you gave. Commented Sep 24, 2018 at 11:46
  • You can also try using multiprocessing docs.python.org/2/library/multiprocessing.html Commented Sep 24, 2018 at 11:49
  • @msi_gerva, I have updated the requirements as per your suggestion. Can you please check now and let me know if you have a solution? I have tried itertuples() instead of the for loop as well, but even that is very slow. Commented Sep 24, 2018 at 12:52

1 Answer


Use groupby:

grouped = df.groupby(['ChatDate", "sessionId"])
timediff = grouped.timestamp.last() - grouped.timestamp.first() # or max-min
timediff.mean() # this is your step 4
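
If you need the average per day (your step 5) rather than one overall average, you can group the per-session differences by the date level of the resulting index. A minimal sketch, assuming the columns are named ChatDate, sessionId and timestamp and that timestamp is a datetime column:

# per-session duration: last timestamp minus first timestamp within each (ChatDate, sessionId) group
grouped = df.groupby(["ChatDate", "sessionId"])
timediff = grouped["timestamp"].last() - grouped["timestamp"].first()

# average session duration per day: one value per ChatDate
per_day_avg = timediff.groupby(level="ChatDate").mean()
print(per_day_avg)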