2

I have dataset consists of categorical and numerical columns. For instance: salary dataset

columns: ['job', 'country_origin', 'age', 'salary', 'degree','marital_status']

four categorical columns and two numerical columns and I want to use three aggregate functions:

cat_col = ['job', 'country_origin','degree','marital_status']
num_col = [ 'age', 'salary']
aggregate_function = ['avg','max','sum']

Currently, I have my Python code that using raw query, while my objective is to get the group-by query results from all combinations from lists above:

my query: "SELECT cat_col[0], aggregate_function[0](num_col[0]) from DB where marital_status = 'married' groub by cat_col[0]"

So queries are:

q1 = select job, avg(age) from DB where marietal_status='married' groub by job
q2 = select job, avg(salary) from DB where marietal_status='married' groub by job

etc

I used for loop to get the result from all combinations.

My problem is, I want to change that query to Pandas query. I've spent a couple of hours but could not solve it.

Pandas has a different way to querying data.

Sample dataframe:

df2 = pd.DataFrame(np.array([['programmer', 'US', 28,4000, 'master','unmarried'], 
                             ['data scientist', 'UK', 30,5000, 'PhD','unmarried'],
                             ['manager', 'US', 48,9000, 'master','married']]),
                   columns=[['job', 'country_origin', 'age', 'salary', 'degree','marital_status']])
2
  • Could you include a sample of the Dataframe? Commented Oct 1, 2019 at 18:35
  • @DanielMesejo I put just a sample of dataframe. The real data is on my Postgre database. While using raw query is fine for me, but I need to change it using Pandas to make my operation easier. Thanks Commented Oct 1, 2019 at 18:49

2 Answers 2

2

First import the libaries

import pandas as pd

Build the sample dataframe

df = pd.DataFrame( { 
    "job" : ["programmer","data scientist","manager"] , 
    "country_origin" : ["US","UK","US"],
    "age": [28,30,48],
    "salary": [4000,5000,9000],
    "degree": ["master","PhD","master"],
    "marital_status": ["unmarried","unmarried","married"]} )

apply the where clause, save as a new dataframe (not necessary, but easier to read), you can of course use the filtered df inside the groupby

married=df[df['marital_status']=='married']

q1 = select job, avg(age) from DB where marietal_status='married' group by job

married.groupby('job').agg( {"age":"mean"} )
or
df[df['marital_status']=='married'].groupby('job').agg( {"age":"mean"} )

         age
job
manager   48

q2 = select job, avg(salary) from DB where marietal_status='married' group by job

married.groupby('job').agg( {"salary":"mean"} )

         salary
job
manager    9000

You can flatten the table by resetting the index

df[df['marital_status']=='married'].groupby('job').agg( {"age":"mean"} ).reset_index()

       job  age
0  manager   48

output the two stats together:

df[df['marital_status']=='married'].groupby('job').agg( {"age":"mean","salary":"mean"} ).reset_index()

       job  age  salary
0  manager   48    9000
Sign up to request clarification or add additional context in comments.

Comments

1

After you create your dataframe (df), the following command builds your desired table.

df.groupby(['job', 'country_origin','degree'])[['age', 'salary']].agg([np.mean,max,sum])

Here is a complete example:

import numpy as np
import pandas as pd
df=pd.DataFrame()
df['job']=['tech','coder','admin','admin','admin','tech']
df['country_origin']=['japan','japan','US','US','India','India']
df['degree']=['cert','bs','bs','ms','bs','cert']
df['age']=[22,23,30,35,40,28]
df['salary']=[30,50,60,90,65,40]
df.groupby(['job', 'country_origin','degree'])[['age', 'salary']].agg([np.mean,max,sum])

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.