This is a little hard to explain so bear with me please.

Assume I have a table with columns A, B, and C, like the one below:

[Image: the input table]

How can I create a new dataframe that meets the criteria below?

  1. It has 5 rows. Each row corresponds to a range of values in Column A: say the first row covers values between (200, 311), the second row between (312, 370), etc.

  2. It has 3 columns. Each column corresponds to a range of values in Column B: say the first column covers values between (1, 16), the second column between (17, 50), etc.

  3. The value of each cell is the sum of the Column C values whose Column A and Column B values fall in that cell's row range and column range.

Example:

[Image: the desired output table]

Could anyone illustrate how to do this? The numbers are arbitrary, so you don't need to follow my example.

Thanks a lot!


My current solution predefines the row criteria and column criteria in two lists, then runs nested loops to fill each cell of the new dataframe. It works and isn't that slow, but since this is a pandas dataframe, I suspect there is a way to do it without any loops.

Thanks again!
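For reference, the loop-based approach described above might look roughly like the sketch below. The data, the inclusive (low, high) ranges, and the label format are all assumptions made for illustration, not the asker's actual code.

```python
import pandas as pd

# Hypothetical data; only the column names A, B, C come from the question.
df = pd.DataFrame({
    'A': [200, 250, 320, 400, 560, 600],
    'B': [1, 20, 16, 70, 50, 100],
    'C': [5, 7, 11, 13, 17, 19],
})

# Assumed inclusive (low, high) ranges for the rows (A) and columns (B).
row_ranges = [(200, 311), (312, 370), (371, 450), (451, 550), (551, 600)]
col_ranges = [(1, 16), (17, 67), (68, 100)]

row_labels = [f'{lo}-{hi}' for lo, hi in row_ranges]
col_labels = [f'{lo}-{hi}' for lo, hi in col_ranges]
result = pd.DataFrame(0, index=row_labels, columns=col_labels)

# Nested loops: one boolean mask per cell, then sum C over the matching rows.
for r_lo, r_hi in row_ranges:
    for c_lo, c_hi in col_ranges:
        mask = df['A'].between(r_lo, r_hi) & df['B'].between(c_lo, c_hi)
        result.loc[f'{r_lo}-{r_hi}', f'{c_lo}-{c_hi}'] = df.loc[mask, 'C'].sum()

print(result)
```

This scans the whole frame once per cell, which is the cost a vectorized approach avoids.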

2 Answers

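You can use pd.cut to assign each value to a range, and then pass the resulting bins to pivot_table to get the sums. Note that pd.cut produces right-closed intervals by default, so for integer data the bin (311, 370] covers the same values as your (312, 370):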

import numpy as np
import pandas as pd

# Setup example data.
np.random.seed([3, 1415])
n = 100
df = pd.DataFrame({
    'A': np.random.randint(200, 601, size=n),
    'B': np.random.randint(1, 101, size=n),
    'C': np.random.randint(25, size=n)
    })

# Use cut to get the ranges.
a_bins = pd.cut(df['A'], bins=[200, 311, 370, 450, 550, 600], include_lowest=True)
b_bins = pd.cut(df['B'], bins=[1, 16, 67, 100], include_lowest=True)

# Pivot to get the sums.
df2 = df.pivot_table(index=a_bins, columns=b_bins, values='C', aggfunc='sum', fill_value=0)

The resulting output:

B           [1, 16]  (16, 67]  (67, 100]
A                                       
[200, 311]       82       118        153
(311, 370]       68        56         45
(370, 450]       41       129         40
(450, 550]       32       121         57
(550, 600]        0       112         47

3 Comments

Thanks. What if one of my intervals is just a single number rather than a range, say c = 333? How do I define that in the bins?
Assuming you only have integer values, you can define a bin of length 1, e.g. in pd.cut use bins=[...,332, 333, ...], and if it's your first bucket omit the include_lowest=True so the lower value isn't included. This would give you (332, 333] as a bucket, which would include 333 but not 332. This won't work if you have floats though, as 332.8 is included in (332, 333], so another method would be necessary in that case.
Oh, my bad. Such a simple question ...Thanks again!
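The single-value bucket discussed in these comments can be checked with a small sketch. The series values and bin edges here are made up for illustration; only the (332, 333] bucket comes from the comment.

```python
import pandas as pd

# Illustrative integer data around the value of interest.
a = pd.Series([331, 332, 333, 334, 335])

# Edges 332 and 333 create the right-closed bucket (332, 333].
edges = [330, 332, 333, 335]
binned = pd.cut(a, bins=edges)

# Category codes: 0 -> (330, 332], 1 -> (332, 333], 2 -> (333, 335]
print(binned.cat.codes.tolist())

# With floats the same bucket also captures non-integer values such as 332.8.
float_binned = pd.cut(pd.Series([332.8]), bins=edges)
print(float_binned.cat.codes.tolist())
```

For integer data only 333 lands in the middle bucket, while the float 332.8 falls into it as well, which is the limitation the comment points out.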

I really like @root's solution! Here is a slightly modified one-liner version that uses the pd.crosstab function:

In [102]: pd.crosstab(
     ...:     pd.cut(df['A'], bins=[200, 311, 370, 450, 550, 600], include_lowest=True),
     ...:     pd.cut(df['B'], bins=[1, 16, 67, 100], include_lowest=True),
     ...:     df['C'],
     ...:     aggfunc='sum'
     ...: )
     ...:
Out[102]:
B           [1, 16]  (16, 67]  (67, 100]
A
[200, 311]       31       157        117
(311, 370]       23        90         38
(370, 450]      110       168         60
(450, 550]       37       117        115
(550, 600]       35        19         49

2 Comments

Thanks. What if one of my intervals is just a single number rather than a range, say c = 333? How do I define that in the bins?
@Windtalker, use np.arange or np.linspace for generating bins
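One possible reading of this suggestion is to generate the bin edges programmatically instead of typing them out. The sketch below uses made-up data and a step of 100; with integer data, a step of 1 in np.arange would give one bucket per integer.

```python
import numpy as np
import pandas as pd

# Illustrative data; the values are assumptions, not from the question.
a = pd.Series([200, 275, 300, 450, 600])

# Fixed step: edges at 200, 300, 400, 500, 600 (the stop value is exclusive).
edges_step = np.arange(200, 601, 100)

# Fixed number of edges: 5 evenly spaced points from 200 to 600 inclusive.
edges_even = np.linspace(200, 600, 5)

binned = pd.cut(a, bins=edges_step, include_lowest=True)
print(binned.value_counts(sort=False))
```

np.arange suits a known bin width, while np.linspace suits a known number of bins over a fixed span.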