Pandas: How to avoid nested for loop

Question

I have some code that compares actual data to target data, where the actual data lives in one DataFrame and the target in another. I need to look up the target, bring it into the df with the actual data, and then compare the two. In the simplified example below, I have a set of products and a set of locations all with unique targets.

I'm using a nested for loop to pull this off: looping through the products and then the locations. The problem is that my real-life data is larger on all dimensions, and it takes an inordinate amount of time to loop through everything.

I've looked at various SO articles and none (that I can find!) seem to be related to pandas and/or relevant for my problem. Does anyone have a good idea on how to vectorize this code?

import pandas as pd
import numpy as np
import time

employee_list = ['Joe', 'Bernie', 'Elizabeth', 'Kamala', 'Cory', 'Pete', 
                'Amy', 'Andrew', 'Beto', 'Jay', 'Kristen', 'Julian', 
                'Mike', 'John', 'Tulsi', 'Tim', 'Eric', 'Seth', 'Howard',
                'Bill']
location_list = ['Denver', 'Boulder', 'Phoenix', 'Reno', 'Portland',
                'Eugene', 'San Francisco']
product_list = ['Product1', 'Product2', 'Product3', 'Product4', 'Product5']

tgt_data = {'Location' : location_list, 
            'Product1' : [600, 200, 750, 225, 450, 175, 900],
            'Product2' : [300, 100, 350, 125, 200, 90, 450],
            'Product3' : [700, 250, 950, 275, 600, 225, 1200],
            'Product4' : [200, 100, 250, 75, 150, 75, 300],
            'Product5' : [900, 300, 1000, 400, 600, 275, 1300]}
tgt_df = pd.DataFrame(data = tgt_data)

employee_data = {'Employee' : employee_list,
                'Location' : ['Boulder', 'Denver', 'Portland', 'Denver',
                            'San Francisco', 'Phoenix', 'San Francisco',
                            'Eugene', 'San Francisco', 'Reno', 'Denver',
                            'Phoenix', 'Denver', 'Portland', 'Reno', 
                            'Boulder', 'San Francisco', 'Phoenix', 
                            'San Francisco', 'Phoenix'],
                'Product1' : np.random.randint(1, 1000, 20),
                'Product2' : np.random.randint(1, 700, 20),
                'Product3' : np.random.randint(1, 1500, 20),
                'Product4' : np.random.randint(1, 500, 20),
                'Product5' : np.random.randint(1, 1500, 20)}
emp_df = pd.DataFrame(data = employee_data)


start = time.time()
for p in product_list:
    for l in location_list:
        emp_df.loc[emp_df['Location'] == l, p + '_tgt'] = (
            tgt_df.loc[tgt_df['Location']==l, p].values)
    emp_df[p + '_pct'] = emp_df[p] / emp_df[p + '_tgt']

print(emp_df)
end = time.time()
print(end - start)

How large is your real data? Does it even fit in memory? The solution depends on your needs.
– mr_mo, Commented May 12, 2019 at 13:17
Describe exactly what you're trying to do with your for loop.
– Erfan, Commented May 12, 2019 at 13:25
Love the MCVE, and on a pandas question at that! *cries happy tears*
– Paritosh Singh, Commented May 12, 2019 at 13:41

Paritosh Singh · Accepted Answer · 2019-05-12 13:40:33Z

If the target dataframe is guaranteed to have unique locations, you can use a join to make this process really quick.

import pandas as pd
import numpy as np
import time

employee_list = ['Joe', 'Bernie', 'Elizabeth', 'Kamala', 'Cory', 'Pete', 
                'Amy', 'Andrew', 'Beto', 'Jay', 'Kristen', 'Julian', 
                'Mike', 'John', 'Tulsi', 'Tim', 'Eric', 'Seth', 'Howard',
                'Bill']
location_list = ['Denver', 'Boulder', 'Phoenix', 'Reno', 'Portland',
                'Eugene', 'San Francisco']
product_list = ['Product1', 'Product2', 'Product3', 'Product4', 'Product5']

tgt_data = {'Location' : location_list, 
            'Product1' : [600, 200, 750, 225, 450, 175, 900],
            'Product2' : [300, 100, 350, 125, 200, 90, 450],
            'Product3' : [700, 250, 950, 275, 600, 225, 1200],
            'Product4' : [200, 100, 250, 75, 150, 75, 300],
            'Product5' : [900, 300, 1000, 400, 600, 275, 1300]}
tgt_df = pd.DataFrame(data = tgt_data)

employee_data = {'Employee' : employee_list,
                'Location' : ['Boulder', 'Denver', 'Portland', 'Denver',
                            'San Francisco', 'Phoenix', 'San Francisco',
                            'Eugene', 'San Francisco', 'Reno', 'Denver',
                            'Phoenix', 'Denver', 'Portland', 'Reno', 
                            'Boulder', 'San Francisco', 'Phoenix', 
                            'San Francisco', 'Phoenix'],
                'Product1' : np.random.randint(1, 1000, 20),
                'Product2' : np.random.randint(1, 700, 20),
                'Product3' : np.random.randint(1, 1500, 20),
                'Product4' : np.random.randint(1, 500, 20),
                'Product5' : np.random.randint(1, 1500, 20)}
emp_df = pd.DataFrame(data = employee_data)

With the setup done, we can now use our join.

product_tgt_cols = [product+'_tgt' for product in product_list]
print(product_tgt_cols) #['Product1_tgt', 'Product2_tgt', 'Product3_tgt', 'Product4_tgt', 'Product5_tgt']
product_pct_cols = [product+'_pct' for product in product_list]
print(product_pct_cols) #['Product1_pct', 'Product2_pct', 'Product3_pct', 'Product4_pct', 'Product5_pct']

start = time.time()
#join on location to get _tgt columns
emp_df = emp_df.join(tgt_df.set_index('Location'), on='Location', rsuffix='_tgt')
#divide the entire product arrays using numpy, store in temp
temp = emp_df[product_list].values/emp_df[product_tgt_cols].values
#create a new temp df for the _pct results, and assign back to emp_df
emp_df = emp_df.assign(**pd.DataFrame(temp, columns = product_pct_cols))
print(emp_df)

end = time.time()
print("with join: ",end - start)

This worked great. Cut the time in my real df from about 13 seconds to 0.2 seconds!
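
The join answer above depends on each location appearing only once in the target frame. As a minimal sketch of one way to enforce that assumption, the lookup can also be written with `merge` and its `validate` argument, which raises `pandas.errors.MergeError` when the right-hand keys are not unique. The small frames and their values below are made up for illustration and are not the data from the question.

```python
import pandas as pd

# Small stand-in frames; the values are invented for illustration.
tgt_df = pd.DataFrame({
    'Location': ['Denver', 'Boulder'],
    'Product1': [600, 200],
    'Product2': [300, 100],
})
emp_df = pd.DataFrame({
    'Employee': ['Joe', 'Bernie', 'Kamala'],
    'Location': ['Boulder', 'Denver', 'Denver'],
    'Product1': [100, 300, 900],
    'Product2': [50, 150, 600],
})
product_list = ['Product1', 'Product2']
tgt_cols = [p + '_tgt' for p in product_list]
pct_cols = [p + '_pct' for p in product_list]

# validate='many_to_one' raises pandas.errors.MergeError if a Location
# appears more than once in tgt_df, instead of silently duplicating rows.
merged = emp_df.merge(tgt_df, on='Location', how='left',
                      suffixes=('', '_tgt'), validate='many_to_one')

# Divide all actual columns by their matching target columns in one step.
pct = merged[product_list].to_numpy() / merged[tgt_cols].to_numpy()
merged = merged.join(pd.DataFrame(pct, columns=pct_cols, index=merged.index))
print(merged)
```

With `how='left'`, employees whose location is missing from the target frame keep their row and get NaN targets, which makes gaps in the target data visible rather than dropping rows.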

Quang Hoang · Answer · 2019-05-12 13:36:43Z

Your dataframes are in "wide format". I find "long format" easier to manipulate.

# turn emp_df into long
# indexed by "Employee", "Location", and "Product"
emp_df = (emp_df.set_index(['Employee', 'Location'])
                .stack().to_frame())
emp_df.head()

                                      0
Employee    Location        
Joe         Boulder     Product1    238
                        Product2    135
                        Product3    873
                        Product4    153
                        Product5    373

# turn tgt_df into a long series
# indexed by "Location" and "Product"

tgt_df = tgt_df.set_index('Location').stack()
tgt_df.head()


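# set target for employees by locations:
# every employee group is mapped to the full target series; assigning the
# result back aligns on the (Employee, Location, Product) index, so each
# row keeps only the target for its own location and product
emp_df['target'] = (emp_df.groupby('Employee')[0]
                          .apply(lambda x: tgt_df))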

# percentage
emp_df['pct'] = emp_df[0]/emp_df['target']

# you can get the wide format back by
# emp_df = emp_df.unstack(level=2)
# which will give you a  dataframe with 
# multi-level index and multi-level column
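
The same long-format idea can be sketched with `melt` and `merge` in place of the `stack` and `groupby`/`apply` steps above. This is a swapped-in variant, not the answer's own code. The column names `actual`, `target` and `pct` and the small frames are assumptions made for illustration, and the final `pivot` assumes that employee names are unique.

```python
import pandas as pd

# Small stand-in frames; the values are invented for illustration.
tgt_df = pd.DataFrame({
    'Location': ['Denver', 'Boulder'],
    'Product1': [600, 200],
    'Product2': [300, 100],
})
emp_df = pd.DataFrame({
    'Employee': ['Joe', 'Bernie', 'Kamala'],
    'Location': ['Boulder', 'Denver', 'Denver'],
    'Product1': [100, 300, 900],
    'Product2': [50, 150, 600],
})

# Wide -> long: one row per (Employee, Location, Product).
emp_long = emp_df.melt(id_vars=['Employee', 'Location'],
                       var_name='Product', value_name='actual')
# Wide -> long: one row per (Location, Product).
tgt_long = tgt_df.melt(id_vars='Location',
                       var_name='Product', value_name='target')

# Look up each row's target by its (Location, Product) pair.
long_df = emp_long.merge(tgt_long, on=['Location', 'Product'], how='left')
long_df['pct'] = long_df['actual'] / long_df['target']

# Back to wide: one row per employee, columns keyed by (measure, Product).
wide_df = long_df.pivot(index='Employee', columns='Product',
                        values=['actual', 'target', 'pct'])
print(wide_df)
```

Keeping the data long until the last step means adding a product or a location needs no change to the code, since both are ordinary column values rather than column names.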
