I have some code that compares actual data to target data, where the actual data lives in one DataFrame and the target in another. I need to look up the target, bring it into the df with the actual data, and then compare the two. In the simplified example below, I have a set of products and a set of locations all with unique targets.
I'm using a nested for loop to pull this off: looping through the products and then the locations. The problem is that my real life data is larger on all dimensions, and it takes up an inordinate amount of time to loop through everything.
I've looked at various SO articles and none (that I can find!) seem to be related to pandas and/or relevant for my problem. Does anyone have a good idea on how to vectorize this code?
import pandas as pd
import numpy as np
import time
employee_list = ['Joe', 'Bernie', 'Elizabeth', 'Kamala', 'Cory', 'Pete',
'Amy', 'Andrew', 'Beto', 'Jay', 'Kristen', 'Julian',
'Mike', 'John', 'Tulsi', 'Tim', 'Eric', 'Seth', 'Howard',
'Bill']
location_list = ['Denver', 'Boulder', 'Phoenix', 'Reno', 'Portland',
'Eugene', 'San Francisco']
product_list = ['Product1', 'Product2', 'Product3', 'Product4', 'Product5']
tgt_data = {'Location' : location_list,
'Product1' : [600, 200, 750, 225, 450, 175, 900],
'Product2' : [300, 100, 350, 125, 200, 90, 450],
'Product3' : [700, 250, 950, 275, 600, 225, 1200],
'Product4' : [200, 100, 250, 75, 150, 75, 300],
'Product5' : [900, 300, 1000, 400, 600, 275, 1300]}
tgt_df = pd.DataFrame(data = tgt_data)
employee_data = {'Employee' : employee_list,
'Location' : ['Boulder', 'Denver', 'Portland', 'Denver',
'San Francisco', 'Phoenix', 'San Francisco',
'Eugene', 'San Francisco', 'Reno', 'Denver',
'Phoenix', 'Denver', 'Portland', 'Reno',
'Boulder', 'San Francisco', 'Phoenix',
'San Francisco', 'Phoenix'],
'Product1' : np.random.randint(1, 1000, 20),
'Product2' : np.random.randint(1, 700, 20),
'Product3' : np.random.randint(1, 1500, 20),
'Product4' : np.random.randint(1, 500, 20),
'Product5' : np.random.randint(1, 1500, 20)}
emp_df = pd.DataFrame(data = employee_data)
start = time.time()
for p in product_list:
for l in location_list:
emp_df.loc[emp_df['Location'] == l, p + '_tgt'] = (
tgt_df.loc[tgt_df['Location']==l, p].values)
emp_df[p + '_pct'] = emp_df[p] / emp_df[p + '_tgt']
print(emp_df)
end = time.time()
print(end - start)