
I have the following DataFrame:

   name  value  count  total_count
0     A      0      1           20
1     A      1      2           20
2     A      2      2           20
3     A      3      2           20
4     A      4      3           20
5     A      5      3           20
6     A      6      2           20
7     A      7      2           20
8     A      8      2           20
9     A      9      1           20
----------------------------------
10    B      0     10           75
11    B      5     30           75
12    B      6     20           75
13    B      8     10           75
14    B      9      5           75

I would like to pivot the data, grouping rows by the name column, then creating columns from the value and count columns aggregated into bins.

Explanation: there are 10 possible values, in the range 0-9, but not all of them are present in each group. In the example above, group B is missing the values 1, 2, 3, 4 and 7. I would like to create a histogram with 5 bins, ignore the missing values, and calculate the percentage of count for each bin, so the result will look like this:

  name       0-1  2-3  4-5       6-7       8-9
0    A  0.150000  0.2  0.3  0.200000  0.150000
1    B  0.133333  0.0  0.4  0.266667  0.200000

For example, for bin 0-1 of group A, the calculation is the sum of count for the values 0 and 1 (1+2), divided by the total_count of group A:

  name       0-1
0    A       (1+2)/20 = 0.15
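That cell's arithmetic can be checked in plain Python (the variable names here are just for illustration):

```python
# Group A, bin 0-1: counts 1 (for value 0) and 2 (for value 1), total_count 20
count_value_0 = 1
count_value_1 = 2
total_count_a = 20

share_0_1 = (count_value_0 + count_value_1) / total_count_a
print(share_0_1)  # 0.15
```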

I was looking into the hist method and this StackOverflow question, but I am still struggling to figure out the right approach.

2 Answers


Use pd.cut to bin your feature, then use df.groupby() followed by .unstack() to get the DataFrame you are looking for. During the groupby you can apply any aggregation function (.sum(), .count(), etc.) to get the results you need. The code below shows an example.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    data={'name': ['Group A', 'Group B'] * 5,
          'number': np.arange(0, 10),
          'value': np.arange(30, 40)})
# include_lowest=True keeps number 0 from falling outside the first bin
df['number_bin'] = pd.cut(df['number'], bins=np.arange(0, 10), include_lowest=True)
# Option 1: sums
df.groupby(['number_bin', 'name'])['value'].sum().unstack(0)
# Option 2: counts
df.groupby(['number_bin', 'name'])['value'].count().unstack(0)

The null values in the original data will not affect the result.
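Applied to the question's data, the same pd.cut + groupby + unstack pattern produces the requested percentage table. A sketch, rebuilding the question's DataFrame inline; the bin edges and labels are my own choice:

```python
import pandas as pd

# Rebuild the question's DataFrame
df = pd.DataFrame({
    'name': ['A'] * 10 + ['B'] * 5,
    'value': list(range(10)) + [0, 5, 6, 8, 9],
    'count': [1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 10, 30, 20, 10, 5],
    'total_count': [20] * 10 + [75] * 5,
})

# Five bins over the 0-9 range, labelled like the desired output
bins = [-0.5, 1.5, 3.5, 5.5, 7.5, 9.5]
labels = ['0-1', '2-3', '4-5', '6-7', '8-9']
df['bin'] = pd.cut(df['value'], bins=bins, labels=labels)

# Sum the counts per (name, bin), then divide each row by its group's total
totals = df.groupby('name')['total_count'].first()
out = (df.groupby(['name', 'bin'], observed=False)['count'].sum()
         .unstack('bin')
         .div(totals, axis=0))
print(out)
```

observed=False keeps empty bins (like 2-3 for group B) in the result as zeros instead of dropping them.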




To get the exact result, you could try this:

bins = range(10)
# total count per group, used as the denominator
res = df.groupby('name')['count'].sum()
# nine bins: (-0.001, 1] covers values 0 and 1, the rest cover one value each
intervals = pd.cut(df.value, bins=bins, include_lowest=True)
# per-bin share of each group's total count
df1 = (df.groupby([intervals, "name"])['count'].sum() / res).unstack(0)

df1.columns = df1.columns.astype(str)  # convert the interval columns to strings
df1.columns = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']  # rename the cols
cols = ['a', 'b', 'd', 'f', 'h']

# add each column to its right-hand neighbour, then keep every other column:
# 'a' is already 0-1, b+c gives 2-3, d+e gives 4-5, f+g gives 6-7, h+i gives 8-9
df1 = df1.add(df1.iloc[:, 1:].shift(-1, axis=1), fill_value=0)[cols]
print(df1)

You can manually rename the cols later.

# Output:
             a    b    d         f     h
name
A     0.150000  0.2  0.3  0.200000  0.15
B     0.133333  NaN  0.4  0.266667  0.20

You can replace the NaN values using df1.fillna(0.0).
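If you prefer, the pairing/shift step can be skipped by cutting directly into the five target bins. A variant sketch (the bin edges and the single-group sample data are my own, for brevity):

```python
import pandas as pd

# Hypothetical sample: group B from the question only
df = pd.DataFrame({'name': ['B'] * 5,
                   'value': [0, 5, 6, 8, 9],
                   'count': [10, 30, 20, 10, 5]})

# Cut straight into five bins: 0-1, 2-3, 4-5, 6-7, 8-9
intervals = pd.cut(df['value'], bins=[0, 1, 3, 5, 7, 9], include_lowest=True)
res = df.groupby('name')['count'].sum()
df1 = (df.groupby([intervals, 'name'], observed=False)['count'].sum()
         .div(res, level='name')   # broadcast the per-group totals
         .unstack(0))
df1.columns = ['0-1', '2-3', '4-5', '6-7', '8-9']
print(df1.fillna(0.0))
```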

2 Comments

Boom! that's what I was looking for
@ShlomiSchwartz Happy to help. :)
