
I have the following DataFrame:

   name  value  count  total_count
0     A      0      1           20
1     A      1      2           20
2     A      2      2           20
3     A      3      2           20
4     A      4      3           20
5     A      5      3           20
6     A      6      2           20
7     A      7      2           20
8     A      8      2           20
9     A      9      1           20
----------------------------------
10    B      0     10           75
11    B      5     30           75
12    B      6     20           75
13    B      8     10           75
14    B      9      5           75

I would like to pivot the data, grouping rows by the name column, then creating columns from the value and count columns aggregated into bins.

Explanation: there are 10 possible values, in the range 0-9, but not all of them are present in each group. In the example above, group B is missing the values 1, 2, 3, 4 and 7. I would like to create a histogram with 5 bins, ignore the missing values, and calculate the percentage of count for each bin, so the result will look like this:

  name       0-1  2-3  4-5       6-7       8-9
0    A  0.150000  0.2  0.3  0.200000  0.150000
1    B  0.133333  0.0  0.4  0.266667  0.200000

For example, for bin 0-1 of group A, the calculation is the sum of count for the values 0 and 1 (1+2), divided by the total_count of group A:

  name       0-1
0    A       (1+2)/20 = 0.15
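That cell's arithmetic can be checked in plain Python (the variable names here are just for illustration):

```python
# Group A, bin 0-1: counts 1 (for value 0) and 2 (for value 1), total_count 20
count_value_0 = 1
count_value_1 = 2
total_count_a = 20

share_0_1 = (count_value_0 + count_value_1) / total_count_a
print(share_0_1)  # 0.15
```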

I was looking into the hist method and this StackOverflow question, but I am still struggling to figure out the right approach.

2 Answers


Use pd.cut to bin your feature, then use df.groupby() followed by .unstack() to get the DataFrame you are looking for. During the groupby you can apply any aggregation function (.sum(), .count(), etc.) to get the results you need. The code below shows an example.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    data={'name': ['Group A', 'Group B'] * 5,
          'number': np.arange(0, 10),
          'value': np.arange(30, 40)})
# include_lowest=True keeps number 0 from falling outside the first bin
df['number_bin'] = pd.cut(df['number'], bins=np.arange(0, 10), include_lowest=True)
# Option 1: sums
df.groupby(['number_bin', 'name'])['value'].sum().unstack(0)
# Option 2: counts
df.groupby(['number_bin', 'name'])['value'].count().unstack(0)

The null values in the original data will not affect the result.
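Applied to the question's data, the same pd.cut + groupby + unstack pattern produces the requested percentage table. A sketch, rebuilding the question's DataFrame inline; the bin edges and labels are my own choice:

```python
import pandas as pd

# Rebuild the question's DataFrame
df = pd.DataFrame({
    'name': ['A'] * 10 + ['B'] * 5,
    'value': list(range(10)) + [0, 5, 6, 8, 9],
    'count': [1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 10, 30, 20, 10, 5],
    'total_count': [20] * 10 + [75] * 5,
})

# Five bins over the 0-9 range, labelled like the desired output
bins = [-0.5, 1.5, 3.5, 5.5, 7.5, 9.5]
labels = ['0-1', '2-3', '4-5', '6-7', '8-9']
df['bin'] = pd.cut(df['value'], bins=bins, labels=labels)

# Sum the counts per (name, bin), then divide each row by its group's total
totals = df.groupby('name')['total_count'].first()
out = (df.groupby(['name', 'bin'], observed=False)['count'].sum()
         .unstack('bin')
         .div(totals, axis=0))
print(out)
```

observed=False keeps empty bins (like 2-3 for group B) in the result as zeros instead of dropping them.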




To get the exact result, you could try this:

bins = range(10)
# total count per group, used as the denominator
res = df.groupby('name')['count'].sum()
# nine bins: (-0.001, 1] covers values 0 and 1, the rest cover one value each
intervals = pd.cut(df.value, bins=bins, include_lowest=True)
# per-bin share of each group's total count
df1 = (df.groupby([intervals, "name"])['count'].sum() / res).unstack(0)

df1.columns = df1.columns.astype(str)  # convert the interval columns to strings
df1.columns = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']  # rename the cols
cols = ['a', 'b', 'd', 'f', 'h']

# add each column to its right-hand neighbour, then keep every other column:
# 'a' is already 0-1, b+c gives 2-3, d+e gives 4-5, f+g gives 6-7, h+i gives 8-9
df1 = df1.add(df1.iloc[:, 1:].shift(-1, axis=1), fill_value=0)[cols]
print(df1)

You can manually rename the cols later.

# Output:
             a    b    d         f     h
name
A     0.150000  0.2  0.3  0.200000  0.15
B     0.133333  NaN  0.4  0.266667  0.20

You can replace the NaN values using df1.fillna(0.0).
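If you prefer, the pairing/shift step can be skipped by cutting directly into the five target bins. A variant sketch (the bin edges and the single-group sample data are my own, for brevity):

```python
import pandas as pd

# Hypothetical sample: group B from the question only
df = pd.DataFrame({'name': ['B'] * 5,
                   'value': [0, 5, 6, 8, 9],
                   'count': [10, 30, 20, 10, 5]})

# Cut straight into five bins: 0-1, 2-3, 4-5, 6-7, 8-9
intervals = pd.cut(df['value'], bins=[0, 1, 3, 5, 7, 9], include_lowest=True)
res = df.groupby('name')['count'].sum()
df1 = (df.groupby([intervals, 'name'], observed=False)['count'].sum()
         .div(res, level='name')   # broadcast the per-group totals
         .unstack(0))
df1.columns = ['0-1', '2-3', '4-5', '6-7', '8-9']
print(df1.fillna(0.0))
```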

2 Comments

Boom! that's what I was looking for
@ShlomiSchwartz Happy to help. :)
