Pandas Data Frame Summary Table

Question

How can I build a summary of a data frame in Pandas by stacking the results of individual operations?

For example, I used the following code:

df=pd.DataFrame(wb)

# Get list with headers
header1 = list(df)
count=df.count()

NaNs=df.isnull().sum()
sum=df.sum(0)
mean=df.mean()
median=df.median()
min= df.min()
max= df.max()
standardeviation= df.std()
nints=df.dtypes

But I can only print them as individual results. I get something like this for each calculation:

Unnamed: 0                  60
region                      50
IV_bins                     60
N                           60
meanbase                    60
cash                        60
dtype: int64

Finally, I tried creating an empty list, summarytable=[], at the beginning and then calling summarytable.append(count) and so on for each calculation, but couldn't get it right. What I am looking for is a table like this, which I believe also requires transposing the results:

          A    B 
Count     100  98
NANs      5    7
Mean      10   12.5
Median    14   8
...
Nints     95   96
NStr      5    2

One last thing to take into account: for some calculations, like sum(), it doesn't make sense to include string columns, so those columns are missing when I print the results. This is the result of print(sum) (notice that region doesn't appear):

Unnamed: 0                                                               1830
IV_bins                     [0,2.31e+06](2.31e+06,5.7e+06](5.7e+06,1.07e+0...
N                                                                     3680163
meanbase                                                              3.46248
cash                                                              9.00091e+09

sum=df.sum(0), min= df.min(), max= df.max() - you just destroyed three useful built-in functions.
– DYZ, Commented Feb 19, 2018 at 23:24
You show us a lot of outputs but not the code that produces them. Please include it. Also, what exactly is your question?
– DYZ, Commented Feb 19, 2018 at 23:25
What do you mean I destroyed them? Those outputs are from a simple print(count) and print(sum). What I am looking for is a summary table of all these functions, as in the example output I posted.
– AntonioAgAl, Commented Feb 19, 2018 at 23:48
sum=df.sum(0) makes the built-in function sum() unavailable (same with the other two functions).
– DYZ, Commented Feb 19, 2018 at 23:58
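To illustrate the point about built-ins, here is a minimal sketch of what the assignment does. The tiny DataFrame and the name col_sum are made up for the example; they are not from the question.

```python
import builtins

import pandas as pd

# Hypothetical one-column frame, only for demonstration.
df = pd.DataFrame({"N": [1, 2, 3]})

# After this line the name `sum` refers to a Series, not the built-in function.
sum = df.sum(0)
shadowed = not callable(sum)  # a Series cannot be called like a function

# The original built-in is still reachable through the builtins module.
total = builtins.sum([1, 2, 3])

# A non-clashing name avoids the problem altogether.
col_sum = df.sum(0)
```

Using names such as col_sum, col_min and col_max keeps sum(), min() and max() usable in the rest of the script.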
Have you tried df.describe() on your data? It will give you a statistical summary of all numeric columns in your data frame.
– KRKirov, Commented Feb 20, 2018 at 0:08
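For reference, a small sketch of the describe() suggestion. The sample data below is invented; include="all" is a standard describe() argument that adds non-numeric columns to the summary.

```python
import numpy as np
import pandas as pd

# Hypothetical data loosely modelled on the column names in the question.
df = pd.DataFrame({
    "region": ["north", "south", "south", "east"],
    "N": [10, 20, 30, 40],
    "cash": [1.5, np.nan, 2.5, 4.0],
})

# Numeric columns only: count, mean, std, min, quartiles, max.
numeric_summary = df.describe()

# All columns: adds unique, top and freq rows for non-numeric columns.
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

The statistics come back as rows and the columns stay as columns, so no transpose is needed for this layout.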

Brad Solomon · Accepted Answer · 2018-02-20 02:03:31Z

2

Seems like you may get use out of DataFrame.agg(), with which you can essentially build a customized .describe() output. Here's an example to get you started:

import pandas as pd
import numpy as np

df = pd.DataFrame({ 'object': ['a', 'b', 'c'],
                    'numeric': [1, 2, 3],
                    'numeric2': [1.1, 2.5, 50.],
                    'categorical': pd.Categorical(['d','e','f'])
                  })


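def nullcounts(ser):
    # Number of missing values in a single column (Series).
    return ser.isnull().sum()


def custom_describe(frame, func=[nullcounts, 'sum', 'mean', 'median', 'max'],
                    numeric_only=True, **kwargs):
    # Optionally keep only numeric columns, since 'sum', 'mean', etc.
    # are not meaningful for string or categorical columns.
    if numeric_only:
        frame = frame.select_dtypes(include=np.number)
    # Each entry in func becomes one row of the result.
    return frame.agg(func, **kwargs)


custom_describe(df)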

            numeric   numeric2
nullcounts      0.0   0.000000
sum             6.0  53.600000
mean            2.0  17.866667
median          2.0   2.500000
max             3.0  50.000000
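To get closer to the table in the question (Count, NaNs and so on as rows, including the string columns), one option is to collect each per-column result in a dict, build a DataFrame from it, and transpose. This is a rough sketch rather than part of the approach above; the sample data and the NStr row logic are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the asker's workbook.
df = pd.DataFrame({
    "region": ["north", None, "south", "east"],
    "N": [10, 20, 30, 40],
    "cash": [1.5, np.nan, 2.5, 4.0],
})

# Statistics such as sum and mean are computed on numeric columns only.
numeric = df.select_dtypes(include=np.number)

summary = pd.DataFrame({
    "Count": df.count(),
    "NaNs": df.isnull().sum(),
    "Sum": numeric.sum(),
    "Mean": numeric.mean(),
    "Median": numeric.median(),
    "Min": numeric.min(),
    "Max": numeric.max(),
    "Std": numeric.std(),
    # Per column, how many values are Python strings.
    "NStr": df.apply(lambda s: s.map(lambda v: isinstance(v, str)).sum()),
    "Dtype": df.dtypes.astype(str),
})

# Restore the original column order, then put the statistics in rows.
summary = summary.reindex(df.columns).T

print(summary)
```

Columns without a numeric result (region here) show NaN in the Sum, Mean and similar rows instead of disappearing from the output.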


2 Comments

Michael Over a year ago

If you want to use the quantile function instead of the median for the 99% percentile, how can you pass the q argument in the function?

Brad Solomon Over a year ago

@Michael Follow what's done for nullcounts() here: def quantile(ser): return ser.quantile(0.99). Then replace 'median' with quantile in the func list.
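For the 99th percentile asked about here, one possible way to fix q is a small factory that returns a named wrapper. The helper name make_quantile is hypothetical; Series.quantile(q) is the standard pandas call, and agg() labels each row with the function's __name__.

```python
import pandas as pd


def make_quantile(q):
    """Return a function that computes the q-th quantile of a Series."""
    def quantile(ser):
        return ser.quantile(q)
    # The function name becomes the row label in the agg() output.
    quantile.__name__ = f"q{int(q * 100)}"
    return quantile


df = pd.DataFrame({"numeric": [1, 2, 3], "numeric2": [1.1, 2.5, 50.0]})

result = df.agg(["median", make_quantile(0.99)])
print(result)
```

The same wrapper can be dropped into the func list of custom_describe() in place of 'median'.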

firefly · Accepted Answer · 2019-01-14 12:16:05Z

1

There is a library that does exactly that: pandas-summary. For each column, it gives you the count, min, max, std, mean, variance, count of all values, count of unique values, missing values, column type, and much more.

