Filter python pandas dataframe by grouping multiple columns

Question

This is a little hard to explain, so please bear with me.

Assume I have a table like the one below.

How can I create a new dataframe that meets the following criteria?

It has 5 rows. Each row covers a range of values from column A: say the first row covers values in the range (200, 311), the second row values in the range (312, 370), and so on.
It has 3 columns. Each column covers a range of values from column B: say the first column covers values in the range (1, 16), the second column values in the range (17, 50), and so on.
The value of each cell is the sum of the values from column C whose A and B values fall in that cell's row and column ranges.

Example:

Could anyone illustrate how to do this? The numbers are arbitrary, so you don't need to follow my example.

Thanks a lot!

My solution was to predefine the row criteria and the column criteria in two lists, then run nested loops to fill each cell of the new dataframe. It works and is not that slow, but since this is a pandas dataframe, I am wondering whether there is a way to do this without any loops.
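For reference, the loop-based approach described above might look roughly like the sketch below. The asker's actual code is not shown, so the sample data, the range lists `row_ranges` and `col_ranges`, and all variable names are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical sample data; the real table is not shown in the question.
df = pd.DataFrame({
    'A': [210, 300, 320, 365, 400],
    'B': [5, 20, 10, 40, 16],
    'C': [1, 2, 3, 4, 5],
})

# Assumed inclusive (low, high) ranges for the rows (column A)
# and for the columns (column B).
row_ranges = [(200, 311), (312, 370), (371, 450)]
col_ranges = [(1, 16), (17, 50)]

# Start with an all-zero frame labelled by the ranges.
result = pd.DataFrame(0,
                      index=[str(r) for r in row_ranges],
                      columns=[str(c) for c in col_ranges])

# Nested loops: one pass per (row range, column range) cell.
for a_lo, a_hi in row_ranges:
    for b_lo, b_hi in col_ranges:
        mask = df['A'].between(a_lo, a_hi) & df['B'].between(b_lo, b_hi)
        result.loc[str((a_lo, a_hi)), str((b_lo, b_hi))] = df.loc[mask, 'C'].sum()

print(result)
```

Each cell needs its own boolean mask over the whole frame, which is the per-cell work the question hopes to avoid.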

Thanks again!

root · Accepted Answer · 2017-03-27 19:48:14Z

3

You can use pd.cut to bin the values into your ranges, and then pass the resulting bins to pivot_table to get the sums:

import numpy as np
import pandas as pd

# Set up example data.
np.random.seed([3, 1415])
n = 100
df = pd.DataFrame({
    'A': np.random.randint(200, 601, size=n),
    'B': np.random.randint(1, 101, size=n),
    'C': np.random.randint(25, size=n)
    })

# Use cut to get the ranges.
a_bins = pd.cut(df['A'], bins=[200, 311, 370, 450, 550, 600], include_lowest=True)
b_bins = pd.cut(df['B'], bins=[1, 16, 67, 100], include_lowest=True)

# Pivot to get the sums.
df2 = df.pivot_table(index=a_bins, columns=b_bins, values='C', aggfunc='sum', fill_value=0)

The resulting output:

B           [1, 16]  (16, 67]  (67, 100]
A                                       
[200, 311]       82       118        153
(311, 370]       68        56         45
(370, 450]       41       129         40
(450, 550]       32       121         57
(550, 600]        0       112         47
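To see what each cell contains, here is a small hand-checkable sketch of the same cut-then-pivot approach. The six-row table and the shortened bin lists are made up for illustration, and `observed=False` is passed explicitly (an addition, not part of the answer above) so that the result should not depend on the default of the installed pandas version.

```python
import pandas as pd

# Made-up data small enough to verify by hand.
small = pd.DataFrame({
    'A': [200, 250, 311, 312, 370, 360],
    'B': [1, 16, 17, 5, 67, 16],
    'C': [10, 20, 30, 40, 50, 60],
})

# Two row ranges on A and two column ranges on B.
a_bins = pd.cut(small['A'], bins=[200, 311, 370], include_lowest=True)
b_bins = pd.cut(small['B'], bins=[1, 16, 67], include_lowest=True)

# Sum C within each (A range, B range) cell.
table = small.pivot_table(index=a_bins, columns=b_bins, values='C',
                          aggfunc='sum', fill_value=0, observed=False)
print(table)
```

For instance, the top-left cell sums the C values of the rows (A=200, B=1) and (A=250, B=16), giving 10 + 20 = 30.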

answered Mar 27, 2017 at 19:48


3 Comments

Windtalker Over a year ago

Thanks. What if one of my intervals is a single number rather than a range, say c = 333? How do I define that in the bins?

root Over a year ago

Assuming you only have integer values, you can define a bin of length 1, e.g. in pd.cut use bins=[...,332, 333, ...]; if it's your first bucket, omit include_lowest=True so the lower edge isn't included. This gives you (332, 333] as a bucket, which includes 333 but not 332. This won't work if you have floats though, as 332.8 is also included in (332, 333], so another method would be necessary in that case.
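A minimal sketch of the unit-width bin described in this comment; the surrounding edges 311 and 370 and the sample values are chosen for illustration only.

```python
import pandas as pd

# Illustrative edges: the middle bin (332, 333] isolates the integer 333.
edges = [311, 332, 333, 370]

ints = pd.Series([332, 333, 334])
int_binned = pd.cut(ints, bins=edges)
# Position of the bin each value fell into: 0, 1, 2.
int_codes = int_binned.cat.codes.tolist()

# The float caveat: 332.8 lands in the same bin as 333.0.
floats = pd.Series([332.8, 333.0])
float_binned = pd.cut(floats, bins=edges)
float_codes = float_binned.cat.codes.tolist()

print(int_codes, float_codes)
```

With integer data only 333 reaches the middle bin, while with floats any value above 332 and up to 333 shares it.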

Windtalker Over a year ago

Oh, my bad. Such a simple question... Thanks again!

Community · Accepted Answer · 2017-05-23 11:54:10Z

1

I really like @root's solution! Here is a slightly modified one-liner version that uses pd.crosstab:

In [102]: pd.crosstab(
     ...:     pd.cut(df['A'], bins=[200, 311, 370, 450, 550, 600], include_lowest=True),
     ...:     pd.cut(df['B'], bins=[1, 16, 67, 100], include_lowest=True),
     ...:     df['C'],
     ...:     aggfunc='sum'
     ...: )
     ...:
Out[102]:
B           [1, 16]  (16, 67]  (67, 100]
A
[200, 311]       31       157        117
(311, 370]       23        90         38
(370, 450]      110       168         60
(450, 550]       37       117        115
(550, 600]       35        19         49
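As a quick sanity check, the sketch below runs both approaches on the same made-up data and compares the results. The data and bin edges are assumptions chosen so that every cell is populated, and `observed=False` is an addition for the pivot_table call that neither answer uses.

```python
import pandas as pd

# Made-up data in which every (A range, B range) cell has at least one row.
small = pd.DataFrame({
    'A': [200, 250, 311, 312, 370, 360],
    'B': [1, 16, 17, 5, 67, 16],
    'C': [10, 20, 30, 40, 50, 60],
})

a_bins = pd.cut(small['A'], bins=[200, 311, 370], include_lowest=True)
b_bins = pd.cut(small['B'], bins=[1, 16, 67], include_lowest=True)

# crosstab takes the row keys, the column keys, and the values to aggregate.
via_crosstab = pd.crosstab(a_bins, b_bins, values=small['C'], aggfunc='sum')

# The pivot_table form from the other answer, on the same bins.
via_pivot = small.pivot_table(index=a_bins, columns=b_bins, values='C',
                              aggfunc='sum', observed=False)

same = bool((via_crosstab.to_numpy() == via_pivot.to_numpy()).all())
print(same)
```

Cells with no matching rows are deliberately avoided here, since how empty cells are displayed can differ between the two calls and between pandas versions.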

edited May 23, 2017 at 11:54


answered Mar 27, 2017 at 20:46

MaxU - stand with Ukraine


2 Comments

Windtalker Over a year ago

Thanks. What if one of my intervals is a single number rather than a range, say c = 333? How do I define that in the bins?

MaxU - stand with Ukraine Over a year ago

@Windtalker, use np.arange or np.linspace for generating bins
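The comment is terse; one reading is that bin edges can be generated rather than typed out, and that a step of 1 yields unit-width bins for integer data. The sketch below follows that reading, with start, stop, and step values that are purely illustrative.

```python
import numpy as np
import pandas as pd

# Evenly spaced edges, generated two ways.
edges_arange = np.arange(200, 601, 100)      # 200, 300, 400, 500, 600
edges_linspace = np.linspace(200, 600, 5)    # same edges, as floats

a = pd.Series([200, 333, 450, 600])
binned = pd.cut(a, bins=edges_arange, include_lowest=True)
codes = binned.cat.codes.tolist()            # bin position of each value

# Step 1 gives unit-width bins: (330, 331], (331, 332], (332, 333], ...
unit_edges = np.arange(330, 336)
unit_code = pd.cut(pd.Series([333]), bins=unit_edges).cat.codes.tolist()

print(codes, unit_code)
```

Here 333 falls into the third unit-width bin, (332, 333], which matches the single-value bucket discussed in the comments on the other answer.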
