0

I have a data frame (data) that has multiple data types (outliers were previously removed and marked with the string: Outlier). I am looking to summarize this data into a new data frame (analysis), but am running into issues when it comes to data types.

The issue that I am facing is some of the columns are descriptors (categories, names, countries, etc) and are not pulled in the numerical lists (mean, med, sd). This creates a mismatch in the number of rows in the lists (len(title) = 64, len(mean) = 61).

I'd like for the data frame to match up to all 64, with those descriptors being marked as 'NaN' for numerical fields such as mean (because I know you cant take the mean of ['Blue','Red','Yellow'])

Sample Data:

ORG|PROGRAM|YEAR|INDUSTRY|Responses|# of Questions|New Zone|Q1|Q2
USA|MO|2018|PRD - LF|64|44|High|4.75806451612903|4.70967741935484
CAN|ALB|2017|FS - B|247|43|Medium|4.61382113821138|4.66803278688525
UK|IRE|2018|RES - U|236|46|Low|4.13617021276596|4.30932203389831

enter image description here

Code:

title = list(data.keys())
n = list(data.count())
mean = list(data.mean())
med = list(data.median())
sd = list(data.std())

analysis = pd.DataFrame({'Mean':mean,'Median':med,'SD':sd})
print(analysis)

Current Output:

enter image description here

Desired Output: Additional rows should be shown with the NaN value if no numerical values exist (i.e. if it's a category or country). This would increase the amount of rows to 64 rather than 61, and allow for the additional columns to be added (Title, count, etc)

Insert

6
  • Please add sample data (e.g. by copying the output from data.head(), so we have a MCVE (stackoverflow.com/help/mcve) Commented Jan 23, 2019 at 18:55
  • Why are you converting everything to a list anyway? If you don't do that, your objects are pandas Series, which have an index. Then if you create a dataframe from these Series (instead of lists) you can pass in the index you want to use explictly (analysis = pd.DataFrame(list-of-the-series, index=the_index_you_want) Commented Jan 23, 2019 at 18:59
  • Sample data added @Jondiedoop Commented Jan 23, 2019 at 19:08
  • @Lagerbaer I am open to other ways of doing this (I am relatively new to python). From the sample data above, I am looking to get summaries (mean, median , etc) of all of the numerical values (year, responses, q1, q2). It's also worth noting that theres a mixture of strings and floats in q1 and q2 as some values are marked as 'Outlier.' Commented Jan 23, 2019 at 19:11
  • Thanks for putting in the effort. It would be much more helpful if you could copy the data as text (mainly so that people who would like to help can copy and test) Commented Jan 23, 2019 at 19:11

2 Answers 2

1

If I understand correctly, you could combine using agg with the DataFrame-constructor where you add which columns you want to add. Finally, transpose to switch rows and columns:

pd.DataFrame(df.agg(['mean', 'median', 'std']), columns=df.columns).transpose()
# Output
#                    mean       median         std
#ORG                     NaN          NaN         NaN
#PROGRAM                 NaN          NaN         NaN
#YEAR            2017.666667  2018.000000    0.577350
#INDUSTRY                NaN          NaN         NaN
#Responses        182.333333   236.000000  102.627157
## of Questions    44.333333    44.000000    1.527525
#New Zone                NaN          NaN         NaN
#Q1                 4.502685     4.613821    0.325502
#Q2                 4.562344     4.668033    0.220111

EDIT You can also write custom function with which to aggregate the columns, e.g.

def more_than_4(col):
'''Count nr of entries larger than 4'''
    return (col > 4).sum()
df.agg(['mean', more_than_4])

#Output
#                        YEAR   Responses    ...           Q1        Q2
#mean         2017.666667  182.333333    ...     4.502685  4.562344
#more_than_4     3.000000    3.000000    ...     3.000000  3.000000
Sign up to request clarification or add additional context in comments.

6 Comments

This setup works for me, but what if I am looking to add different indicators other than mean, median, or std (or custom indicators such as count the number of entries > 4)? Is there a library that defines which indicators can be added
The sky is the limit - I just added an example to show how you can use custom functions to the answer. Does that fully answer your question?
Hmm, this may be a seperate question but whenever I implement the 'more than 4' function, it outputs everything as > 4 (which in the sample data, was true). But if I change it to 'more than 4.5' does it change to 2 for Q2 for you?
Yes, it does for me in fact
Whenever I try this, it shows all of the values (3) as >4. And when I change it to <4, it still results in all values (3). But when I change it to == 4, then it adjusts correctly... Thoughts on why this would happen?
|
1

This is actually quite simple.

import pandas as pd
df = pd.read_csv('database.txt', sep='|') # Read data.
print(df.describe(include='all').T) # Just force .describe() to show all columns then transpose.

Should output:

               count unique       top freq     mean       std      min      25%      50%      75%      max
ORG                3      3        UK    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
PROGRAM            3      3       IRE    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
YEAR               3    NaN       NaN  NaN  2017.67   0.57735     2017   2017.5     2018     2018     2018
INDUSTRY           3      3  PRD - LF    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
Responses          3    NaN       NaN  NaN  182.333   102.627       64      150      236    241.5      247
# of Questions     3    NaN       NaN  NaN  44.3333   1.52753       43     43.5       44       45       46
New Zone           3      3       Low    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
Q1                 3    NaN       NaN  NaN  4.50269  0.325502  4.13617    4.375  4.61382  4.68594  4.75806
Q2                 3    NaN       NaN  NaN  4.56234  0.220111  4.30932  4.48868  4.66803  4.68886  4.70968

Median should be the 50% column.

EDIT:

If there's a risk of columns being inserted with strings and messing up the values/dtypes, just coerce them to numeric.

import pandas as pd
df = pd.read_csv('database.txt', sep='|')
main_cols = ['Q1', 'Q2']
df[main_cols] = df[main_cols].apply(pd.to_numeric, errors='coerce')
print(df.describe(include='all').T)

Tested with the following data (note the BLAHHHHHHH value in last row):

ORG|PROGRAM|YEAR|INDUSTRY|Responses|# of Questions|New Zone|Q1|Q2
USA|MO|2018|PRD - LF|64|44|High|4.75806451612903|4.70967741935484
CAN|ALB|2017|FS - B|247|43|Medium|4.61382113821138|4.66803278688525
UK|IRE|2018|RES - U|236|46|Low|BLAHHHHHHH|4.30932203389831

Output is:

               count unique     top freq     mean       std      min      25%      50%      75%      max
ORG                3      3     CAN    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
PROGRAM            3      3     IRE    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
YEAR               3    NaN     NaN  NaN  2017.67   0.57735     2017   2017.5     2018     2018     2018
INDUSTRY           3      3  FS - B    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
Responses          3    NaN     NaN  NaN  182.333   102.627       64      150      236    241.5      247
# of Questions     3    NaN     NaN  NaN  44.3333   1.52753       43     43.5       44       45       46
New Zone           3      3     Low    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
Q1                 2    NaN     NaN  NaN  4.68594  0.101995  4.61382  4.64988  4.68594    4.722  4.75806
Q2                 3    NaN     NaN  NaN  4.56234  0.220111  4.30932  4.48868  4.66803  4.68886  4.70968

5 Comments

Thanks Jerome - this is where I actually started, but my issue is that whenever a string is added to a column, the describe function treats the entire column like a string (ex. one of the questions within Q1 was replaced with teh string 'Outlier', so the 'unique' would be 3 and mean/std/etc would be NaN)
@RCarmody See additional answer.
Oh interesting! I have never heard of the coercing function before.. Is that line pretty much telling the code, given a list of columns convert these into floats?
@RCarmody Pretty much, yes. And all that cannot be converted will be changed to NaN. Fun part is you can extend describe to do more stuff that is not here.
Quick question on that... Could I create a conditional statement to add to the describe dataframe? I am looking to create a new column displaying the total number of data points are greater than 'x'. So for example, if x=4.65, the new column would show Q1 = 1 & Q2 = 2 (since only 4.758 > x for Q1 while 4.709 and 4.66 > x for Q2).

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.