0

I have some code that compares actual data to target data, where the actual data lives in one DataFrame and the targets in another. I need to look up each target, bring it into the DataFrame with the actual data, and then compare the two. In the simplified example below, I have a set of products and a set of locations, and every product/location pair has its own target.

I'm using a nested for loop to pull this off: looping through the products and then the locations. The problem is that my real-life data is larger on all dimensions, and it takes an inordinate amount of time to loop through everything.

I've looked at various SO articles and none (that I can find!) seem to be related to pandas and/or relevant for my problem. Does anyone have a good idea on how to vectorize this code?

import pandas as pd
import numpy as np
import time

employee_list = ['Joe', 'Bernie', 'Elizabeth', 'Kamala', 'Cory', 'Pete', 
                'Amy', 'Andrew', 'Beto', 'Jay', 'Kristen', 'Julian', 
                'Mike', 'John', 'Tulsi', 'Tim', 'Eric', 'Seth', 'Howard',
                'Bill']
location_list = ['Denver', 'Boulder', 'Phoenix', 'Reno', 'Portland',
                'Eugene', 'San Francisco']
product_list = ['Product1', 'Product2', 'Product3', 'Product4', 'Product5']

tgt_data = {'Location' : location_list, 
            'Product1' : [600, 200, 750, 225, 450, 175, 900],
            'Product2' : [300, 100, 350, 125, 200, 90, 450],
            'Product3' : [700, 250, 950, 275, 600, 225, 1200],
            'Product4' : [200, 100, 250, 75, 150, 75, 300],
            'Product5' : [900, 300, 1000, 400, 600, 275, 1300]}
tgt_df = pd.DataFrame(data = tgt_data)

employee_data = {'Employee' : employee_list,
                'Location' : ['Boulder', 'Denver', 'Portland', 'Denver',
                            'San Francisco', 'Phoenix', 'San Francisco',
                            'Eugene', 'San Francisco', 'Reno', 'Denver',
                            'Phoenix', 'Denver', 'Portland', 'Reno', 
                            'Boulder', 'San Francisco', 'Phoenix', 
                            'San Francisco', 'Phoenix'],
                'Product1' : np.random.randint(1, 1000, 20),
                'Product2' : np.random.randint(1, 700, 20),
                'Product3' : np.random.randint(1, 1500, 20),
                'Product4' : np.random.randint(1, 500, 20),
                'Product5' : np.random.randint(1, 1500, 20)}
emp_df = pd.DataFrame(data = employee_data)


start = time.time()
for p in product_list:
    for l in location_list:
        emp_df.loc[emp_df['Location'] == l, p + '_tgt'] = (
            tgt_df.loc[tgt_df['Location']==l, p].values)
    emp_df[p + '_pct'] = emp_df[p] / emp_df[p + '_tgt']

print(emp_df)
end = time.time()
print(end - start)
  • How large is your real data? Does it even fit in memory? The solution depends on the need. Commented May 12, 2019 at 13:17
  • Describe exactly what you're trying to do with your for loop. Commented May 12, 2019 at 13:25
  • Love the MCVE, and on a pandas question at that! cries happy tears Commented May 12, 2019 at 13:41

2 Answers

1

If the target dataframe is guaranteed to have unique locations, you can use a join to make this process really quick.
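To see why the uniqueness condition matters, here is a minimal sketch on hypothetical toy frames (the names `targets` and `actuals` and their values are made up for illustration, not taken from the question). It uses `merge` rather than `join` because `merge` accepts a `validate` argument that turns a duplicated target key into an error instead of silently duplicated rows.

```python
import pandas as pd

# Hypothetical toy frames, not the data from the question.
# 'Denver' appears twice on the target side on purpose.
targets = pd.DataFrame({'Location': ['Denver', 'Denver', 'Reno'],
                        'Product1': [600, 650, 225]})
actuals = pd.DataFrame({'Employee': ['Joe', 'Amy'],
                        'Location': ['Denver', 'Reno'],
                        'Product1': [300, 450]})

# A duplicated key on the target side multiplies the matching rows.
joined = actuals.merge(targets, on='Location', suffixes=('', '_tgt'))
rows_after_join = len(joined)  # more rows than actuals has

# validate='many_to_one' asks pandas to check that the right-hand keys
# are unique and to raise MergeError if they are not.
try:
    actuals.merge(targets, on='Location', suffixes=('', '_tgt'),
                  validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True

# A cheap up-front check of the same assumption.
targets_are_unique = targets['Location'].is_unique
```

If `targets_are_unique` is false for the real data, the targets probably need deduplicating or aggregating before any join.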

import pandas as pd
import numpy as np
import time

employee_list = ['Joe', 'Bernie', 'Elizabeth', 'Kamala', 'Cory', 'Pete', 
                'Amy', 'Andrew', 'Beto', 'Jay', 'Kristen', 'Julian', 
                'Mike', 'John', 'Tulsi', 'Tim', 'Eric', 'Seth', 'Howard',
                'Bill']
location_list = ['Denver', 'Boulder', 'Phoenix', 'Reno', 'Portland',
                'Eugene', 'San Francisco']
product_list = ['Product1', 'Product2', 'Product3', 'Product4', 'Product5']

tgt_data = {'Location' : location_list, 
            'Product1' : [600, 200, 750, 225, 450, 175, 900],
            'Product2' : [300, 100, 350, 125, 200, 90, 450],
            'Product3' : [700, 250, 950, 275, 600, 225, 1200],
            'Product4' : [200, 100, 250, 75, 150, 75, 300],
            'Product5' : [900, 300, 1000, 400, 600, 275, 1300]}
tgt_df = pd.DataFrame(data = tgt_data)

employee_data = {'Employee' : employee_list,
                'Location' : ['Boulder', 'Denver', 'Portland', 'Denver',
                            'San Francisco', 'Phoenix', 'San Francisco',
                            'Eugene', 'San Francisco', 'Reno', 'Denver',
                            'Phoenix', 'Denver', 'Portland', 'Reno', 
                            'Boulder', 'San Francisco', 'Phoenix', 
                            'San Francisco', 'Phoenix'],
                'Product1' : np.random.randint(1, 1000, 20),
                'Product2' : np.random.randint(1, 700, 20),
                'Product3' : np.random.randint(1, 1500, 20),
                'Product4' : np.random.randint(1, 500, 20),
                'Product5' : np.random.randint(1, 1500, 20)}
emp_df = pd.DataFrame(data = employee_data)

With the setup done, we can now use our join.

product_tgt_cols = [product+'_tgt' for product in product_list]
print(product_tgt_cols) #['Product1_tgt', 'Product2_tgt', 'Product3_tgt', 'Product4_tgt', 'Product5_tgt']
product_pct_cols = [product+'_pct' for product in product_list]
print(product_pct_cols) #['Product1_pct', 'Product2_pct', 'Product3_pct', 'Product4_pct', 'Product5_pct']

start = time.time()
#join on location to get _tgt columns
emp_df = emp_df.join(tgt_df.set_index('Location'), on='Location', rsuffix='_tgt')
#divide the entire product arrays using numpy, store in temp
temp = emp_df[product_list].values/emp_df[product_tgt_cols].values
#create a new temp df for the _pct results, and assign back to emp_df
emp_df = emp_df.assign(**pd.DataFrame(temp, columns = product_pct_cols))
print(emp_df)

end = time.time()
print("with join: ",end - start)
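Before swapping the loop out on real data, it may be worth confirming that both approaches produce the same percentages. The sketch below does that on a small seeded toy dataset; the frame names, the seed, and the use of `.iloc[0]` to pull a scalar target are assumptions made for illustration, so the loop here is a slight variant of the one in the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
products = ['Product1', 'Product2']
tgt = pd.DataFrame({'Location': ['Denver', 'Reno', 'Eugene'],
                    'Product1': [600, 225, 175],
                    'Product2': [300, 125, 90]})
emp = pd.DataFrame({'Employee': ['Joe', 'Amy', 'Pete', 'Tim'],
                    'Location': ['Reno', 'Denver', 'Reno', 'Eugene'],
                    'Product1': rng.integers(1, 1000, 4),
                    'Product2': rng.integers(1, 700, 4)})

# Loop version: look up one scalar target per product/location pair.
looped = emp.copy()
for p in products:
    for loc in tgt['Location']:
        target = tgt.loc[tgt['Location'] == loc, p].iloc[0]
        looped.loc[looped['Location'] == loc, p + '_tgt'] = target
    looped[p + '_pct'] = looped[p] / looped[p + '_tgt']

# Join version: one lookup for all targets, one division for all products.
tgt_cols = [p + '_tgt' for p in products]
pct_cols = [p + '_pct' for p in products]
joined = emp.join(tgt.set_index('Location'), on='Location', rsuffix='_tgt')
pct = pd.DataFrame(joined[products].values / joined[tgt_cols].values,
                   columns=pct_cols, index=joined.index)
joined = joined.join(pct)

# Column order differs between the two results, so compare by name.
same = np.allclose(looped[pct_cols].to_numpy(), joined[pct_cols].to_numpy())
```

Passing `index=joined.index` when building `pct` keeps the rows aligned even if the employee frame does not have a default integer index.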

1 Comment

This worked great. Cut the time in my real df from about 13 seconds to 0.2 seconds!
0

Your dataframes are in "wide format". I find "long format" easier to manipulate.

# turn emp_df into long
# indexed by "Employee", "Location", and "Product"
emp_df = (emp_df.set_index(['Employee', 'Location'])
                .stack().to_frame())
emp_df.head()

                                      0
Employee    Location        
Joe         Boulder     Product1    238
                        Product2    135
                        Product3    873
                        Product4    153
                        Product5    373

# turn tgt_df into a long series
# indexed by "Location" and "Product"

tgt_df = tgt_df.set_index('Location').stack()
tgt_df.head()


# set target for employees by locations:
emp_df['target'] = (emp_df.groupby('Employee')[0]
                          .apply(lambda x: tgt_df))

# percentage
emp_df['pct'] = emp_df[0]/emp_df['target']

# you can get the wide format back by
# emp_df = emp_df.unstack(level=2)
# which will give you a  dataframe with 
# multi-level index and multi-level column
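The same long-format idea can also be sketched with `melt` and `merge`, which keeps everything in ordinary columns instead of a multi-level index. This is an alternative to the `stack`/`groupby` route above, not a restatement of it; the toy frames and the column names `actual`, `target`, and `pct` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy frames in the same wide layout as the question.
emp = pd.DataFrame({'Employee': ['Joe', 'Amy'],
                    'Location': ['Boulder', 'Denver'],
                    'Product1': [100, 300],
                    'Product2': [25, 150]})
tgt = pd.DataFrame({'Location': ['Denver', 'Boulder'],
                    'Product1': [600, 200],
                    'Product2': [300, 100]})

# Wide to long: one row per employee/product and per location/product.
emp_long = emp.melt(id_vars=['Employee', 'Location'],
                    var_name='Product', value_name='actual')
tgt_long = tgt.melt(id_vars='Location',
                    var_name='Product', value_name='target')

# Attach the matching target to every actual value, then divide.
long_df = emp_long.merge(tgt_long, on=['Location', 'Product'], how='left')
long_df['pct'] = long_df['actual'] / long_df['target']

# Optional: back to one row per employee, one column per product.
wide_pct = long_df.pivot(index='Employee', columns='Product', values='pct')
```

With `how='left'`, an employee whose location has no target keeps its row and gets `NaN` for `target` and `pct`, which makes missing targets easy to spot.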
