
I have a Python program that does the following:

  • reads in a .csv
  • creates a dataframe with values from specific columns of the csv
  • converts the timestamp from unix timestamp
  • groups the data by hour and then finds the average of certain columns within that hour.

code:

import pandas as pd

# read the csv and keep only the columns of interest
df = pd.read_csv(files, parse_dates=True)
df2 = df[['timestamp', 'avg_hr', 'avg_rr', 'emfit_sleep_summary_id']].copy()

# convert the unix timestamp to datetime and use it as the index
df2['timestamp'] = pd.to_datetime(df2['timestamp'].astype(int), unit='s')
df2 = df2.set_index('timestamp')

# per-hour means
df3 = df2.groupby(df2.index.map(lambda t: t.hour))['avg_hr'].mean()
df4 = df2.groupby(df2.index.map(lambda t: t.hour))['avg_rr'].mean()

print(df3)
print(df4)

sample output:

       timestamp         avg_hr  avg_rr    emfit_sleep_summary_id
0 2015-01-28 08:14:50     101     6.4                      78
1 2015-01-28 08:14:52      98     6.4                      78
2 2015-01-28 00:25:00      60     0.0                      78 
3 2015-01-28 00:25:02      63     0.0                      78
4 2015-01-28 07:24:06      79    11.6                      78
5 2015-01-28 07:24:08      79    11.6                      78
0    99.5
7    61.5
8    78.5
Name: avg_hr, dtype: float64
0     0.000
7    11.725
8     6.400
Name: avg_rr, dtype: float64

I'm now trying to combine df3 and df4 into df2 so the result will look something like this:

       timestamp         avg_hr  avg_rr    emfit_sleep_summary_id   AVG_HR    AVG_RR
0 2015-01-28 08:14:50     101     6.4                      78        99.5       6.4 
1 2015-01-28 08:14:52      98     6.4                      78        99.5       6.4
2 2015-01-28 00:25:00      60     0.0                      78        61.5       0.0
3 2015-01-28 00:25:02      63     0.0                      78        61.5       0.0
4 2015-01-28 07:24:06      79    11.6                      78        78.5       11.6
5 2015-01-28 07:24:08      79    11.6                      78        78.5       11.6

I tried doing the following

df2['AVG_HR'] = df2.groupby(df2.index.map(lambda t: t.hour))['avg_hr'].mean()

But when I ran it, the entire column came back as NaN.
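For context, the all-NaN column falls out of index alignment: `groupby(...).mean()` returns a Series indexed by the hour number, while df2 is indexed by timestamp, so nothing lines up when the Series is assigned back. A minimal reproduction with two made-up rows:

```python
import pandas as pd

# Two made-up rows in hour 8, indexed by timestamp as in the question.
idx = pd.to_datetime([1422432890, 1422432892], unit='s')
df2 = pd.DataFrame({'avg_hr': [101, 98]}, index=idx)

# The grouped mean is indexed by the *hour number*, not the timestamp...
hourly = df2.groupby(df2.index.hour)['avg_hr'].mean()
print(hourly.index.tolist())  # [8]

# ...so assigning it back aligns on nothing and every row becomes NaN.
print(df2.assign(AVG_HR=hourly)['AVG_HR'].isna().all())  # True
```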

EDIT: I'd also like to know how to reduce the data to a single row for each hour, instead of having two per hour.

       timestamp         avg_hr  avg_rr    emfit_sleep_summary_id   AVG_HR    AVG_RR
0 2015-01-28 08:14:50     101     6.4                      78        99.5       6.4 
1 2015-01-28 00:25:00      60     0.0                      78        61.5       0.0
2 2015-01-28 07:24:06      79    11.6                      78        78.5       11.6
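A minimal sketch of this reduction, using made-up rows shaped like the data above (the standard transform/first pattern, labeled as an illustration rather than the accepted approach):

```python
import pandas as pd

# Made-up rows: two per hour, indexed by timestamp.
idx = pd.to_datetime([1422432890, 1422432892, 1422404700, 1422404702], unit='s')
df2 = pd.DataFrame({'avg_hr': [101, 98, 60, 63],
                    'avg_rr': [6.4, 6.4, 0.0, 0.0]}, index=idx)

# Broadcast the per-hour means onto every row...
hours = df2.index.hour
df2['AVG_HR'] = df2.groupby(hours)['avg_hr'].transform('mean')
df2['AVG_RR'] = df2.groupby(hours)['avg_rr'].transform('mean')

# ...then keep only the first row of each hour; the AVG_* columns
# already carry the means, so nothing is lost by dropping the rest.
one_per_hour = df2.reset_index().groupby(hours).first()
print(one_per_hour)
```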
  • I think what you want is this: df2['AVG_HR'] = df2.groupby(df2.index.map(lambda t: t.hour))['avg_hr'].transform('mean') — can you confirm? Commented Apr 9, 2015 at 14:55
  • Also you don't need a lambda to groupby the hour this should work: df3 = df2.groupby(df2.index.hour)['avg_hr'].mean() Commented Apr 9, 2015 at 14:56
  • @EdChum that worked, if you could post that as the answer, I'll accept it. Also, wondering, is there any way to reduce the rows? instead of having 2 of each timestamp, can I have just one? Commented Apr 9, 2015 at 15:07
  • So you want to reduce df2 to a single row per hour? In which case are you wanting the average of the aggregated columns or the sum? df2.groupby(df2.index.hour).mean().reset_index() should squeeze the df to an hourly one, also you could resample Commented Apr 9, 2015 at 15:12
  • yes, instead of 2 timesstamps, I need one per hour. I want the average to remain as is. Please see the edit. Commented Apr 9, 2015 at 15:20

1 Answer

To add an aggregated column from a groupby, use transform; this will return a Series aligned with the original df:

df2['AVG_HR'] = df2.groupby(df2.index.map(lambda t: t.hour))['avg_hr'].transform('mean')

Also, it's unnecessary to use a lambda to group by the hour: if the index is a DatetimeIndex, the datetime attributes can be accessed directly, so the above simplifies to:

df2['AVG_HR'] = df2.groupby(df2.index.hour)['avg_hr'].transform('mean')
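A runnable sketch of the transform alignment, using made-up rows shaped like the question's data:

```python
import pandas as pd

# Made-up rows: two timestamps in hour 8, two in hour 0.
idx = pd.to_datetime([1422432890, 1422432892, 1422404700, 1422404702], unit='s')
df2 = pd.DataFrame({'avg_hr': [101, 98, 60, 63],
                    'avg_rr': [6.4, 6.4, 0.0, 0.0]}, index=idx)

# transform('mean') returns a Series aligned with df2's original index,
# so the per-hour mean is broadcast back onto every row of that hour.
df2['AVG_HR'] = df2.groupby(df2.index.hour)['avg_hr'].transform('mean')
df2['AVG_RR'] = df2.groupby(df2.index.hour)['avg_rr'].transform('mean')
print(df2)
```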

If you want to resample by hour you could just groupby the hour and then call reset_index:

In [17]:

df.groupby(df.index.hour).mean().reset_index()
Out[17]:
   index  avg_hr  avg_rr  emfit_sleep_summary_id
0      0    61.5     0.0                      78
1      7    79.0    11.6                      78
2      8    99.5     6.4                      78
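If the datetime labels should survive the aggregation (rather than being reduced to bare hour numbers as above), resampling by hour is one alternative — a sketch with made-up rows:

```python
import pandas as pd

# Made-up rows spanning two hours, indexed by timestamp.
idx = pd.to_datetime([1422432890, 1422432892, 1422404700, 1422404702], unit='s')
df2 = pd.DataFrame({'avg_hr': [101, 98, 60, 63]}, index=idx)

# resample keeps a DatetimeIndex; empty hours come back as NaN rows,
# so drop them to get one row per populated hour.
hourly = df2.resample('h').mean().dropna()
print(hourly)
```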

6 Comments

I suppose there is no way to make the datetime stay? instead of 0, 7, 8
Not using groupby; the alternative would be to call drop_duplicates on your df after you've added the average columns, but that doesn't average the other duplicate columns
I did this df2 = df2.drop_duplicates(subset='AVG_HR',take_last=True) and that worked :)
But that won't average out say 'avg_hr' though if that's what you want then that's fine, I would've posted that but thought you wanted to average all values, you can upvote too ;-)
I got the avg_hr using df2['AVG_HR'] = df2.groupby(df2.index.hour)['avg_hr'].transform('mean') this would add an extra column, I used that column to filter the hours. same with avg_rr
