How to calculate mean of specific rows in python dataframe?

Question

I have a dataframe with 11 000k rows. There are multiple columns but I am interested only in 2 of them: Name and Value. One name can repeat itself multiple times among rows. I want to calculate the average value for each name and create a new dataframe with the average value for each name. I don't really know how to walk through rows and how to calculate the average. Any help will be highly appreciated. Thank you!

Name     DataType    TimeStamp           Value Quality
Food   Float       2019-01-01 13:00:00  105.75      122
Food    Float      2019-01-01 17:30:00  11.8110352  122
Food   Float       2019-01-01 17:45:00  12.7932892      122
Water       Float      2019-01-01 14:01:00  16446.875   122
Water       Float      2019-01-01 14:00:00  146.875     122

RangeIndex: 11140487 entries, 0 to 11140486
Data columns (total 6 columns):
Name              object
Value        object

This is what I have and I know it is really noob ish but I am having a difficult time walking through rows.

for i in range(0, len(df):
if((df.iloc[i]['DataType']!='Undefined')):
    print df.loc[df['Name'] == df.iloc[i]['Name'], df.iloc[i]['Value']].mean()

I could answer I only you have posted a copyable data sample instead of an image, so that I could test my code. Hint: use groupby and mean. — Serge Ballesta
– Serge Ballesta, Commented Feb 21, 2022 at 13:35

Terranigmus · Accepted Answer · 2022-02-21 13:51:03Z

3

You don't need to walk through the rows, you can just take all of the fields that match your criteria

d = {'col1': [1,2,1,2,1,2], 'col2': [3, 4,5,6,7,8]}
df = pd.DataFrame(data=d)

#iterate over all unique entries in col1
for entry in  df["col1"].unique():
    # get all the col2 values where col1 is the current iter of col1 entries
    meanofcurrententry=df[df["col1"]==entry]["col2"].mean()
    print(meanofcurrententry)

This is not a full solution, but I think it helps more to understand the necessary logic. You still need to wrap it up into your own dataframe, however it hopefuly helps to understand how to use the indexing

edited Feb 21, 2022 at 13:51

answered Feb 21, 2022 at 13:46

Terranigmus

417 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

JANO · Accepted Answer · 2022-02-21 13:50:11Z

2

It sounds like the groupby() functionality is what you want. You define the column where your groups are and then you can take the mean() of each group. An example from the documentation:

df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
                   'B': [np.nan, 2, 3, 4, 5],
                   'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])

df.groupby('A').mean()

Output:

     B         C
A
1  3.0  1.333333
2  4.0  1.500000

In your case it would be something like this:

df.groupby('TagName')['Samples_value'].mean()

Edit: So, I applied the code to your provided input dataframe and following is the output:

TagName
Steam        1.081447e+06
Utilities    3.536931e+05
Name: Sample_value, dtype: float64

Is this what you are looking for?

edited Feb 21, 2022 at 13:50

answered Feb 21, 2022 at 13:34

JANO

3,0962 gold badges17 silver badges37 bronze badges

4 Comments

AGc Over a year ago

yes, thank you for your fast reply. but the problem is Samples_value is object type, not numerical. I tried to covert it to numerical and I am getting an weird error ("(alueError: Unable to parse string "Sheet break Start" at position 972918)

JANO Over a year ago

Is the error occurring, when you perform this code: df['Sample_value'] = df['Sample_value'].astype('float64') or is it occuring when you perform the groupby()? Also does the column Samples_value contain anything other than floats?

AGc Over a year ago

after performing your line you've just given me i get this error: ValueError: could not convert string to float.

JANO Over a year ago

Could be that there are some strange values in your columns. With df[~df['Sample_value'].map(np.isreal)] you can filter out the rows where the value is no float. Then you can for example delete these specific rows.

Serge Ballesta · Accepted Answer · 2022-02-21 13:57:11Z

1

You should avoid as much as possible to iterate rows in a dataframe, because it is very unefficient...

groupby is the way to go when you want to apply the same processing to various groups of rows identified by their values in one or more columns. Here what you want is (*):

df.groupby('TagName')['Sample_value'].mean().reset_index()

it gives as expected:

     TagName  Sample_value
0      Steam  1.081447e+06
1  Utilities  3.536931e+05

Details on the magic words:

groupby: identifies the column(s) used to group the rows (same values)
['Sample_values']: restrict the groupby object to the column of interest
mean(): computes the mean per group
reset_index(): by default the grouping columns go into the index, which is fine for the mean operation. reset_index make them back normal columns

answered Feb 21, 2022 at 13:57

Serge Ballesta

150k13 gold badges137 silver badges267 bronze badges

2 Comments

AGc Over a year ago

thank you for your fast reply. There is a little problem, seems like Samples_Value is object type. I tried to convert it to numeric like this: df['Samples_Value'] = pd.to_numeric(df['Samples_Value'], downcast='Float') BUT I still get an error (alueError: Unable to parse string "Sheet break Start" at position 972918)

AGc Over a year ago

df['Samples_Value'] = pd.to_numeric(df['Samples_Value'],errors='ignore'), still doesn't work. Tells me there are no numerical values to aggregate.

Collectives™ on Stack Overflow

How to calculate mean of specific rows in python dataframe?

3 Answers 3

Comments

4 Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

4 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related