Selecting rows based on certain column values returns empty dataframe

Question

I want to select rows from a dataframe based on different values of a certain column variable and make histograms.

import numpy as np
import pandas as pd
import csv
import matplotlib.pyplot as plt

df_train=pd.read_csv(r'C:\users\visha\downloads\1994_census\adult.data')
df_train.columns = ["age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel"]

df_train.dropna(how='any')
df_train.loc[(df_train!=0).any(axis=1)]
#df_train.incomelevel = pd.to_numeric(df_train.incomelevel, errors='coerce').fillna(0).astype('Int64')
df_train.drop(columns='fnlwgt', inplace = True)

#df_test=pd.read_csv(r'C:\users\visha\downloads\1994_census\adult.test')

#df_train.boxplot(column = 'age', by = 'incomelevel', grid = False)

df_train.loc[df_train['incomelevel'] == '<=50K']
#df_train.loc[df_train['incomelevel'] == '>50K']

Output:

Empty DataFrame
Columns: [age, workclass, fnlwgt, education, educationnum, maritalstatus, occupation, relationship, race, sex, capitalgain, capitalloss, hoursperweek, nativecountry, incomelevel]
Index: []

As the code above shows, I'm trying to select the rows whose income level is '<=50K'. The 'incomelevel' column has the object dtype. When I print the result, I get only the column names and a note that the dataframe is empty. When I run the line as-is in a Jupyter notebook, without print, it displays the column headers with no rows beneath them.

Are you filtering specifically for '<=50k' as a string, or for values below 50,000? Some data that replicates your dataframe would go a long way.
– Umar.H, Commented Jun 5, 2020 at 15:39
@Datanovice It's a string. It's a feature that spans thousands of rows.
– Vishal Pallikonda, Commented Jun 5, 2020 at 16:06
@NYCCoder Here is the dataset: archive.ics.uci.edu/ml/datasets/Census+Income
– Vishal Pallikonda, Commented Jun 5, 2020 at 16:13

NYC Coder · Accepted Answer · 2020-06-05 16:30:47Z

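Read the CSV with skipinitialspace=True. The values in the file have a leading space, so the column actually holds ' <=50K' rather than '<=50K', and the equality test matches nothing. With the option set, the filter works: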

import pandas as pd

# adult.data has no header row, and its values carry a leading space
df = pd.read_csv('adult.data', header=None, skipinitialspace=True)
df.columns = ["age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel"]
# With the leading spaces stripped on read, the comparison matches
df = df[df['incomelevel']=='<=50K']
print(df.head())

  age         workclass  fnlwgt  education  educationnum       maritalstatus  ...     sex capitalgain capitalloss hoursperweek  nativecountry  incomelevel
0   39         State-gov   77516  Bachelors            13       Never-married  ...    Male        2174           0           40  United-States        <=50K
1   50  Self-emp-not-inc   83311  Bachelors            13  Married-civ-spouse  ...    Male           0           0           13  United-States        <=50K
2   38           Private  215646    HS-grad             9            Divorced  ...    Male           0           0           40  United-States        <=50K
3   53           Private  234721       11th             7  Married-civ-spouse  ...    Male           0           0           40  United-States        <=50K
4   28           Private  338409  Bachelors            13  Married-civ-spouse  ...  Female           0           0           40           Cuba        <=50K
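
To see why the original filter came back empty, here is a minimal sketch that reproduces the problem without the real file. The two-row sample string and the three-column layout are made up for illustration; only the ", " separator style is taken from adult.data. It also shows an alternative fix, stripping the column after loading, which may be handy if the dataframe has already been read.

```python
import io
import pandas as pd

# Hypothetical two-row sample using the same ", " separators as adult.data.
sample = "39, State-gov, <=50K\n52, Private, >50K\n"
cols = ["age", "workclass", "incomelevel"]

# Default parsing keeps the space after each comma as part of the value,
# so the column holds ' <=50K' and the comparison finds nothing.
raw = pd.read_csv(io.StringIO(sample), header=None, names=cols)
raw_matches = raw[raw["incomelevel"] == "<=50K"]
print(raw["incomelevel"].tolist())
print(len(raw_matches))

# Fix 1: drop the space after each delimiter while reading.
clean = pd.read_csv(io.StringIO(sample), header=None, names=cols,
                    skipinitialspace=True)
clean_matches = clean[clean["incomelevel"] == "<=50K"]

# Fix 2: strip surrounding whitespace from an already loaded column.
stripped = raw.assign(incomelevel=raw["incomelevel"].str.strip())
stripped_matches = stripped[stripped["incomelevel"] == "<=50K"]
print(len(clean_matches), len(stripped_matches))
```

Printing the column with tolist() (or repr) is a quick way to spot stray whitespace, since the normal dataframe display hides it.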
