The task is basically this:
I am given the following CSV file, which contains many duplicate email addresses:
Display Name,First Name,Last Name,Phone Number,Email Address,Login Date,Registration Date
John,John,Doe,99999999,[email protected],4/20/2015 21:56,4/20/2015 21:56
John,John,DOE,99999999,[email protected],3/31/2015 14:05,3/31/2015 14:05
I need to remove duplicates based on email address, with the following conditions:
- The row with the latest login date must be selected.
- The oldest registration date among the duplicate rows must be used.
I used Python/pandas to do this.
How do I optimize the for loop in this pandas script using groupby? I tried hard but I'm still banging my head against it.
import pandas as pd
df = pd.read_csv('pra.csv')
# first sort the data by Login Date since we always need the latest Login date first
# we're making a copy so as to keep the original data intact, while still being able to sort by datetime
df['Login Date Copy'] = pd.to_datetime(df['Login Date'])
df['Registration Date Copy'] = pd.to_datetime(df['Registration Date'])
# this way latest login date appears first for each duplicate pair
df = df.sort_values(by='Login Date Copy', ascending=False)
output_df = pd.DataFrame()
# get rows for each email address and replace registration date with the oldest one
# this can probably be optimized using groupby
for email in df['Email Address'].unique():
    subdf = df.loc[df['Email Address'] == email].copy()  # copy to avoid SettingWithCopyWarning
    oldest_date = subdf['Registration Date Copy'].min()
    # recover the original Registration Date string for the oldest parsed date,
    # looking it up within subdf so we never pick up another email's row
    oldest_reg_date = subdf.loc[subdf['Registration Date Copy'] == oldest_date, 'Registration Date'].values[0]
    subdf['Registration Date'] = oldest_reg_date
    output_df = pd.concat([output_df, subdf])  # DataFrame.append was removed in pandas 2.x
# drop working columns
output_df.drop(['Login Date Copy', 'Registration Date Copy'], axis=1, inplace=True)
# finally, get only the first of the duplicates and output the result
output_df = output_df.drop_duplicates(subset='Email Address', keep='first')
output_df.to_csv('~/output.csv', index=False)