I am a complete beginner in Python. Long story short, I want to group by one column, apply one function to one column and another function to another column, and plot the results (the first column on the x-axis, the second column on the y-axis).

I have a pandas DataFrame df that contains many columns. Two of them are tour_id and tour_distance.

tour_id    tour_distance    
      A               10
      A               10
      A               10
      A               10
      B               20
      B               20
      C               40
      C               40
      C               40
      C               40
      C               40
      :                :
      :                :

Since I assume that the longer tour_distance is, the more rows each tour_id has, I want to plot tour_distance against the row count of each tour_id group as a bar chart.

Question 1: what's the simplest solution for this groupby and plot problem?

Question 2: how can I improve my failed attempt?

My attempt: I thought it would be easier to make a new data frame like this.

tour_id    tour_distance  row_counts
      A               10           4
      B               20           2
      C               40           5
      :                :           :

This way I can use matplotlib like this:

import matplotlib.pyplot as plt
x = df.tour_distance
y = df.row_counts
plt.bar(x,y)

However, I can't make this data frame.

df_tour_distance = df.groupby('tour_id').tour_distance.head(1)
df_tour_distance = pd.DataFrame(df_tour_distance)
df_size = df.groupby('tour_id').tour_distance.size()
df_size = pd.DataFrame(df_size)
df = pd.merge(df_size, df_tour_distance, on='tour_id')

>>> KeyError: 'tour_id'

This also failed:

g = df.groupby('tour_id')
result = g.agg({'Count':lambda x:x.size(), 
            'tour_distance_grouped':lambda x:x.head(1)})
result

>>> KeyError: 'Count'
    Please check your spelling ;-) Commented Jul 20, 2018 at 16:38

2 Answers

The problem in your code is that once you group by tour_id, it becomes the index of the result instead of a regular column. You have to pass as_index=False to groupby, or call reset_index() on the result, in order to use it as a column. Also, you do not need to build separate Series and then merge them back together.
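To make the index point concrete, here is a minimal sketch built on the sample rows from the question. In the first attempt, size() puts tour_id in the index, while head(1) returns only the tour_distance values under the original row index, so no tour_id column is left for merge to match on. The names sizes, sizes_df, first_df and merged are just illustrative, and first() is swapped in for head(1) here because it returns one row per group keyed by tour_id.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    'tour_id': ['A'] * 4 + ['B'] * 2 + ['C'] * 5,
    'tour_distance': [10] * 4 + [20] * 2 + [40] * 5,
})

# After groupby, the group key is the index of the result, not a column.
sizes = df.groupby('tour_id')['tour_distance'].size()
print(sizes.index.name)  # tour_id

# Option 1: move the index back into a regular column.
sizes_df = sizes.reset_index(name='row_counts')

# Option 2: tell groupby not to use the key as the index at all.
first_df = df.groupby('tour_id', as_index=False)['tour_distance'].first()

# Both frames now have a real 'tour_id' column, so the merge works.
merged = pd.merge(first_df, sizes_df, on='tour_id')
print(merged)
```

This keeps the two-step shape of the original attempt only to show where it went wrong; a single groupby is enough in practice.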

You need:

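import matplotlib.pyplot as plt
# Count rows per (tour_id, tour_distance) pair; reset_index turns the group keys back into columns
g = df.groupby(['tour_id', 'tour_distance']).size().reset_index(name='count')
plt.bar(g['tour_id'], g['count'])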

Output: a bar chart with one bar per tour_id (image not included).
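The question asked for tour_distance on the x-axis rather than tour_id. A sketch of that variant follows, assuming the same grouped frame and the sample rows from the question. The bar width of 5, the axis labels and the output file name tour_counts.png are arbitrary choices for illustration, and the Agg backend is selected only so the script runs without a display (drop that line in a notebook).

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; assumption: no display available
import matplotlib.pyplot as plt
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    'tour_id': ['A'] * 4 + ['B'] * 2 + ['C'] * 5,
    'tour_distance': [10] * 4 + [20] * 2 + [40] * 5,
})

# One row per (tour_id, tour_distance) pair with its row count.
g = df.groupby(['tour_id', 'tour_distance']).size().reset_index(name='count')

fig, ax = plt.subplots()
# Numeric x values: the default bar width of 0.8 looks thin at this scale,
# so a wider (arbitrary) width is used.
ax.bar(g['tour_distance'], g['count'], width=5)
ax.set_xlabel('tour_distance')
ax.set_ylabel('rows per tour_id')
fig.savefig('tour_counts.png')  # hypothetical output file name
```

If several tour_id values share the same tour_distance, their bars are drawn on top of each other at that x position, so a scatter plot (ax.scatter) may be the better fit for real data.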


This could be implemented somewhat more simply:

import pandas as pd

tour_id = ['A']*4+['B']*2+['C']*5
tour_distance = [10]*4+[20]*2+[40]*5

df = pd.DataFrame({'tour_id': tour_id, 'tour_distance': tour_distance})
df = df.set_index('tour_id')

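df2 = pd.DataFrame()
# First tour_distance of each group; the result stays indexed by tour_id
df2['tour_distance'] = df.groupby('tour_id')['tour_distance'].head(1)
# Number of rows in each group, aligned to df2 on the tour_id index
df2['row_counts'] = df.groupby('tour_id')['tour_distance'].count()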
print(df2)

Result:

         tour_distance  row_counts
tour_id                           
A                   10           4
B                   20           2
C                   40           5
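For what it's worth, the same table can probably be built with a single groupby using named aggregation, assuming pandas 0.25 or newer. This is also close to what the second attempt in the question was aiming for: in the dict form of agg the keys must be existing column names, which is why 'Count' raised a KeyError, and size is an attribute of a Series, not a method, so x.size() would fail as well. The name summary is just illustrative.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    'tour_id': ['A'] * 4 + ['B'] * 2 + ['C'] * 5,
    'tour_distance': [10] * 4 + [20] * 2 + [40] * 5,
})

# Named aggregation: output_column=(input_column, function).
summary = df.groupby('tour_id').agg(
    tour_distance=('tour_distance', 'first'),  # first value in each group
    row_counts=('tour_distance', 'size'),      # number of rows in each group
)
print(summary)
```

Taking the first value assumes tour_distance is constant within each tour_id, as in the sample data.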

1 Comment

You have to use groupby twice. :((
