
I have the following minimal code, which is too slow. For the 1000 rows I need, it takes about 2 minutes; I need it to run much faster.

import numpy as np
import pandas as pd
import time

df = pd.DataFrame(np.random.randint(0,1000,size=(1000, 4)), columns=list('ABCD'))
start_algorithm = time.time()
myunique = df['D'].unique()
for i in myunique:
    itemp = df[df['D'] == i]
    for j in myunique:
        jtemp = df[df['D'] == j]

I know that numpy can make it run much faster, but keep in mind that I want to keep a part of the original dataframe (or numpy array) for specific values of column 'D'. How can I improve its performance?

  • Try always to provide a Minimal, Complete, and Verifiable example when asking questions. In case of pandas questions please provide sample input and output data sets (5-7 rows in CSV/dict/JSON/Python code format as text, so one could use it when coding an answer for you). This will help to avoid situations like: your code isn't working for me or it doesn't work with my data, etc. Commented Jun 12, 2016 at 9:24
  • It's not clear what you want to do ... Commented Jun 12, 2016 at 9:29
  • What's the difference between itemp and jtemp? Again, as MaxU said, a representative sample of input data and the expected output, with an explanation of how it was achieved, would help a lot. Commented Jun 12, 2016 at 9:38
  • It's still not clear what you are trying to do! You have nested loops which are not connected in any way - why do you need them? Are you 100% sure that you need loops at all? If I run your code I get the same row from df two times - in itemp and in jtemp. So it's hardly possible to help you without a clear understanding of what you are after. Commented Jun 12, 2016 at 9:40
  • Then it's easy to answer your question - if you want to speed up your code, get rid of the loops. It's a general answer to your general question... ;) Commented Jun 12, 2016 at 9:48

2 Answers


Avoid computing the sub-DataFrame df[df['D'] == i] more than once. The original code computes this len(myunique)**2 times. Instead you can compute this once for each i (that is, len(myunique) times in total), store the results, and then pair them together later. For example,

    groups = [grp for di, grp in df.groupby('D')]
    for itemp, jtemp in IT.product(groups, repeat=2):
        pass

import numpy as np
import pandas as pd
import itertools as IT

df = pd.DataFrame(np.random.randint(0,1000,size=(1000, 4)), columns=list('ABCD'))

def using_orig():
    myunique = df['D'].unique()
    for i in myunique:
        itemp = df[df['D'] == i]
        for j in myunique:
            jtemp = df[df['D'] == j]

def using_groupby():
    groups = [grp for di, grp in df.groupby('D')]
    for itemp, jtemp in IT.product(groups, repeat=2):
        pass

In [28]: %timeit using_groupby()
10 loops, best of 3: 63.8 ms per loop
In [31]: %timeit using_orig()
1 loop, best of 3: 2min 22s per loop

Regarding the comment:

I can easily replace itemp and jtemp with a=1 or print "Hello" so ignore that

The answer above addresses how to compute itemp and jtemp more efficiently. If itemp and jtemp are not central to your real calculation, then we would need to better understand what you really want to compute in order to suggest (if possible) a way to compute it faster.
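For instance, suppose the real goal were some pairwise statistic between groups (a hypothetical stand-in, since the actual computation isn't stated in the question; the column-'A' mean difference below is purely illustrative). The precomputed groups can then be reused directly, so each sub-DataFrame is still built only once:

```python
import itertools as IT

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 1000, size=(1000, 4)),
                  columns=list('ABCD'))

# Compute each sub-DataFrame once, keyed by its 'D' value.
groups = {d: grp for d, grp in df.groupby('D')}

# Hypothetical pairwise computation: difference of the column-'A'
# means of each pair of groups (a stand-in for the real work).
result = {
    (i, j): groups[i]['A'].mean() - groups[j]['A'].mean()
    for i, j in IT.product(groups, repeat=2)
}
```

Whatever replaces the body of the loop, the key point is the same: the expensive `df[df['D'] == i]` selection happens once per unique value, not once per pair.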


1 Comment

My comment about itemp and jtemp was meant to emphasize that (as I thought at the time) the problem was with unique. As I can now see from the amazing answer of unutbu, I was clearly wrong and I apologize to the fellow members of stackoverflow for somewhat misguiding them. Your answer works fine and I thank you all for your time and contribution.

Here's a vectorized approach to form the groups based on unique elements from "D" column -

# Sort the dataframe by column 'D' using its argsort indices
df_sorted = df.iloc[df['D'].argsort()]

# In the sorted dataframe's 'D' column, find the cut indices
# (places where the value changes, indicating a change of group),
# then cut the dataframe at those indices with np.split for the final groups.
cut_idx = np.where(np.diff(df_sorted['D'])>0)[0]+1
df_split = np.split(df_sorted,cut_idx)

Sample testing

1] Form a sample dataframe with random elements:

>>> df = pd.DataFrame(np.random.randint(0,100,size=(5, 4)), columns=list('ABCD'))
>>> df
    A   B   C   D
0  68  68  90  39
1  53  99  20  85
2  64  76  21  19
3  90  91  32  36
4  24   9  89  19

2] Run the original code and print the results:

>>> myunique = df['D'].unique()
>>> for i in myunique:
...     itemp = df[df['D'] == i]
...     print(itemp)
... 
    A   B   C   D
0  68  68  90  39
    A   B   C   D
1  53  99  20  85
    A   B   C   D
2  64  76  21  19
4  24   9  89  19
    A   B   C   D
3  90  91  32  36

3] Run the proposed code and print the results:

>>> df_sorted = df.iloc[df['D'].argsort()]
>>> cut_idx = np.where(np.diff(df_sorted['D'])>0)[0]+1
>>> df_split = np.split(df_sorted,cut_idx)
>>> for split in df_split:
...     print(split)
... 
    A   B   C   D
2  64  76  21  19
4  24   9  89  19
    A   B   C   D
3  90  91  32  36
    A   B   C   D
0  68  68  90  39
    A   B   C   D
1  53  99  20  85
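The split-based grouping can also be cross-checked against pandas' own groupby (a sanity-check sketch, assuming groupby yields its groups in sorted key order, which it does by default). Using a stable argsort and splitting the positional index, an ndarray, rather than the DataFrame itself keeps row order within each group identical to the original:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(5, 4)), columns=list('ABCD'))

# Same idea as above, but split the positional index (a plain ndarray)
# instead of the DataFrame; a stable sort preserves within-group row order.
order = df['D'].to_numpy().argsort(kind='stable')
d_sorted = df['D'].to_numpy()[order]
cut_idx = np.where(np.diff(d_sorted) > 0)[0] + 1
parts = [df.iloc[ix] for ix in np.split(order, cut_idx)]

# Cross-check: groupby also yields groups in sorted key order,
# preserving the original row order within each group.
gb_parts = [grp for _, grp in df.groupby('D')]
assert len(parts) == len(gb_parts)
for a, b in zip(parts, gb_parts):
    assert a.equals(b)
```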
