
I want to select rows from a dataframe based on different values of a certain column variable and make histograms.

import numpy as np
import pandas as pd
import csv
import matplotlib.pyplot as plt

df_train=pd.read_csv(r'C:\users\visha\downloads\1994_census\adult.data')
df_train.columns = ["age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel"]

df_train.dropna(how='any')
df_train.loc[(df_train!=0).any(axis=1)]
#df_train.incomelevel = pd.to_numeric(df_train.incomelevel, errors='coerce').fillna(0).astype('Int64')
df_train.drop(columns='fnlwgt', inplace = True)

#df_test=pd.read_csv(r'C:\users\visha\downloads\1994_census\adult.test')

#df_train.boxplot(column = 'age', by = 'incomelevel', grid = False)

df_train.loc[df_train['incomelevel'] == '<=50K']
#df_train.loc[df_train['incomelevel'] == '>50K']

Output:

Empty DataFrame
Columns: [age, workclass, fnlwgt, education, educationnum, maritalstatus, occupation, relationship, race, sex, capitalgain, capitalloss, hoursperweek, nativecountry, incomelevel]
Index: []

As the code shows, I'm trying to select the rows whose income level is '<=50K'. The 'incomelevel' column has the object dtype. But when I print the result, I only get the column names, and the DataFrame is reported as empty. When I run the line as-is in a Jupyter notebook, without the print function, it displays the column headers with no rows beneath them.

  • Are you filtering specifically for '<=50k' as a string, or for values below 50,000? Some data that replicates your data frame would go a long way. Commented Jun 5, 2020 at 15:39
  • Can you please post a sample of your df? Commented Jun 5, 2020 at 15:39
  • @Datanovice It's a string. It's like a feature that spans over thousands of rows. Commented Jun 5, 2020 at 16:06
  • @NYCCoder archive.ics.uci.edu/ml/datasets/Census+Income here is the dataset Commented Jun 5, 2020 at 16:13
  • Ok, check the answer below. Commented Jun 5, 2020 at 16:27

1 Answer


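Read the CSV with skipinitialspace=True. The fields in adult.data are separated by a comma followed by a space, so without that option each string value keeps a leading space (' <=50K'), and the comparison with '<=50K' never matches. The file also has no header row, hence header=None. With both options set, the filter works: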

df = pd.read_csv('adult.data', header=None, skipinitialspace=True)
df.columns = ["age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel"]
df = df[df['incomelevel']=='<=50K']
print(df.head())

  age         workclass  fnlwgt  education  educationnum       maritalstatus  ...     sex capitalgain capitalloss hoursperweek  nativecountry  incomelevel
0   39         State-gov   77516  Bachelors            13       Never-married  ...    Male        2174           0           40  United-States        <=50K
1   50  Self-emp-not-inc   83311  Bachelors            13  Married-civ-spouse  ...    Male           0           0           13  United-States        <=50K
2   38           Private  215646    HS-grad             9            Divorced  ...    Male           0           0           40  United-States        <=50K
3   53           Private  234721       11th             7  Married-civ-spouse  ...    Male           0           0           40  United-States        <=50K
4   28           Private  338409  Bachelors            13  Married-civ-spouse  ...  Female           0           0           40           Cuba        <=50K
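
If the file has already been loaded without skipinitialspace=True, the same problem can be checked and repaired after the fact. The following is only a minimal sketch: it uses a made-up three-row sample in the same ", "-separated layout instead of the real adult.data, and the reduced column list is an assumption for illustration.

```python
import io
import pandas as pd

# Made-up sample rows in the same ", "-separated layout as adult.data
# (illustrative values only, with a reduced set of columns).
sample = io.StringIO(
    "39, State-gov, <=50K\n"
    "52, Self-emp-inc, >50K\n"
    "28, Private, <=50K\n"
)
cols = ["age", "workclass", "incomelevel"]

# Default parsing keeps the space that follows each comma.
raw = pd.read_csv(sample, header=None, names=cols)

# Inspecting the distinct values makes the leading space visible.
print(raw["incomelevel"].unique().tolist())

# The exact match finds no rows because of that leading space.
print(len(raw[raw["incomelevel"] == "<=50K"]))

# Strip surrounding whitespace from the text columns after loading.
clean = raw.copy()
for col in ["workclass", "incomelevel"]:
    clean[col] = clean[col].str.strip()

# The same filters now select the expected rows.
low = clean[clean["incomelevel"] == "<=50K"]
high = clean[clean["incomelevel"] == ">50K"]
print(len(low), len(high))
```

From there, each subset can be plotted separately, for example with low['age'].hist() and high['age'].hist(), which should give the per-income-level histograms the question asks about (this needs matplotlib installed).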