1

I have two dataframes, a snippet looks like this:

year1 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
    'TEMP':[12, 13, 14, 15, 15, 18],
    'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '08/01/20']}
df1 = pd.DataFrame(year1)

year2 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
    'TEMP':[15, 15, 15, 15, 14, 14],
    'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '10/01/20']}
df2 = pd.DataFrame(year2)

The dataframes are NOT indexed on date (index is some other column). I want to merge the dataframes with rows where Date Values match in these two dataframes and add a new column based on date match:

df_FINAL['AVG_TEMP'] = (df1['TEMP'] + df2['TEMP']) / 2

So the resultant DataFrame should be like:

   DAY  TEMP      DATE    AVG_TEMP
0  MON    15  01/01/20     13.5
1  MON    15  02/01/20     14.0
2  MON    15  03/01/20     14.5
3  TUE    15  06/01/20     15.0
4  TUE    14  07/01/20     14.5

How to achieve this?

1
  • Please make sure to tag pandas questions pandas so they get seen and answered faster. Also, related question Pandas Merging 101 Commented Sep 6, 2020 at 5:48

4 Answers 4

2

You can use pd.merge on the DATE and DAY columns since the same dates will have the same day. Take the average of the TEMP_x and TEMP_y columns created from the merge and call it AVG_TEMP, then drop the TEMP_x and TEMP_y columns.

import pandas as pd

year1 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
    'TEMP':[12, 13, 14, 15, 15, 18],
    'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '08/01/20']}
df1 = pd.DataFrame(year1)

year2 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
    'TEMP':[15, 15, 15, 15, 14, 14],
    'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '10/01/20']}
df2 = pd.DataFrame(year2)

df_result = df1.merge(df2, on=["DATE","DAY"])
df_result['AVG_TEMP'] = (df_result['TEMP_x'] + df_result['TEMP_y']) / 2
df_result = df_result.drop(columns=['TEMP_x','TEMP_y'])

Output:

>>> df_result
   DAY      DATE  AVG_TEMP
0  MON  01/01/20      13.5
1  MON  02/01/20      14.0
2  MON  03/01/20      14.5
3  TUE  06/01/20      15.0
4  TUE  07/01/20      14.5
Sign up to request clarification or add additional context in comments.

Comments

0

Call pd.merge() on your two columns using inner join (value must be present in both df to be present in result) to create an intermediate df. Then create a new column that calculates the average

df3 = df1.merge(df2,on=['DATE','DAY'],how='inner')
df3['AVG_TEMP'] = (df3.TEMP_x + df3.TEMP_y)/2
df3.drop(['TEMP_x','TEMP_y'],inplace=True,axis=1)

Comments

0

You can use merge command and accomplish all this using lambda function. I have also provided you a few alternate options so you know they are available for you.

import pandas as pd
year1 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
    'TEMP':[12, 13, 14, 15, 15, 18],
    'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '08/01/20']}
df1 = pd.DataFrame(year1)

year2 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
    'TEMP':[15, 15, 15, 15, 14, 14],
    'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '10/01/20']}
df2 = pd.DataFrame(year2)

#merge on inner join based on your example
#you can either use rename or suffix. here i am using suffix
#first suffix is stripped, second is _y which will be later dropped
#kept .rename command in case you want to try that option

answer for your problem starts here:

df_FINAL = (pd.merge(df2, df1, on = "DATE",how='inner',suffixes=('', '_y'))        
        #.rename(columns={'DAY_x':'DAY','TEMP_x':'TEMP'})
        .assign(AVG_TEMP = lambda x: (x['TEMP'] + x['TEMP_y'])/2))

#drop the _y columns as you don't need them
df_FINAL.drop(list(df_FINAL.filter(regex='_y$')), axis=1, inplace=True)

print(df_FINAL)

An alternate way to do this is to combine all these into a single command as follows:

#merge on inner join based on your example
#first suffix is stripped, second is _y which will be later dropped
#after the processing, filter out the column with _y

df_FINAL = (pd.merge(df2, df1, on = "DATE",how='inner',suffixes=('', '_y'))        
        .assign(AVG_TEMP = lambda x: (x['TEMP'] + x['TEMP_y'])/2)
        .filter(regex='^(?!.*_y)'))

The final result looks like this:

   DAY  TEMP      DATE  AVG_TEMP
0  MON    15  01/01/20      13.5
1  MON    15  02/01/20      14.0
2  MON    15  03/01/20      14.5
3  TUE    15  06/01/20      15.0
4  TUE    14  07/01/20      14.5

Comments

0

Using pd.concat() and df.groupby

df3 = pd.concat([df2, df1])
df3['AVG_TEMP'] = df3.groupby('DATE', as_index=False)['TEMP'].apply(lambda x: x.mean() if len(x) > 1 else None)
df3 = df3.groupby('DATE', as_index=False).first().dropna()

print(df3)

Output:

       DATE  DAY  TEMP  AVG_TEMP
0  01/01/20  MON    15      13.5
1  02/01/20  MON    15      14.0
2  03/01/20  MON    15      14.5
3  06/01/20  TUE    15      15.0
4  07/01/20  TUE    14      14.5

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.