
I have a series of VERY dirty CSV files.

They look like this:

,"File Inputs",,,,,,,,,,,"Email Category",,"Contact Info Category",
RecCtr,Attom_ID,PeopleID,"First Name","Last Name",AddressFullStreet,City,State,Zip," ","Individual Level Match"," ","Email Address"," ",Phone,"Phone Type"
1,19536969,80209511,ANTHONY1,MACCA1,"123 Main RD","Anytown",MA,12345
2,169874349,80707224,ANTHONY2,MACCA2,"123 Main RD","Anytown",MA,12345
3,1057347,81837554,ANTHONY3,MACCA3,"123 Main RD","Anytown",MA,12345
4,36946575,81869227,ANTHONY3,MACCA4,"123 Main RD","Anytown",MA,12345,,YES,,,,1234567890,Mobile

As you can see above, there are 16 fields. Lines 1-3 are bad; line 4 is good.

I am using this piece of code in an attempt to read them.

import numpy as np
import pandas as pd

df = pd.read_csv(file, skiprows=2, dtype=str, header=None)

df.columns = ['RecCtr', 'Attom_ID', 'PeopleID', 'First_Name', 'Last_Name',
              'AddressFullStreet', 'City', 'State', 'Zip', 'blank1',
              'Individual_Level_Match', 'blank2', 'Email_Address', 'blank3',
              'Phone', 'Phone_Type']
df = df.replace({np.nan: None})

My problem is that I don't know how to tell pandas that there are 16 fields, and that any line which does NOT have 16 fields should be skipped.

It appears that the read_csv line treats lines 1-3 as good, and then line 4 becomes bad.

How do I specify how many columns there are, so that line 1 is skipped as bad, along with the others?

Thanks

Change 1:

headers = ['RecCtr', 'Attom_ID', 'PeopleID', 'First_Name', 'Last_Name', 'AddressFullStreet', 'City', 'State', 'Zip', 'blank1', 'Individual_Level_Match', 'blank2', 'Email_Address', 'blank3', 'Phone', 'Phone_Type']
df = pd.read_csv(file, skiprows=2, dtype=str, header=headers)

Response:

    raise ValueError("header must be integer or list of integers")
ValueError: header must be integer or list of integers

2 Answers


Unfortunately you can't skip rows that have too few values, only rows that have too many (error_bad_lines=False; in recent pandas versions this option is spelled on_bad_lines='skip').
With header=None, pandas takes the first non-skipped row as defining the correct number of columns, which then makes the 4th row bad (too many columns).
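A minimal sketch of that inference behaviour, using a tiny in-memory CSV (not your data) to show pandas rejecting a row that is longer than the first one:

```python
import io
import pandas as pd

# Two rows: the first has 3 fields, the second has 5.
data = "1,2,3\n1,2,3,4,5\n"

# With header=None, pandas infers 3 columns from the first row,
# so the longer second row raises a parser error.
try:
    pd.read_csv(io.StringIO(data), header=None)
except pd.errors.ParserError as exc:
    print("ParserError:", exc)
```

This is exactly what happens to your file: the short first data row fixes the column count at 9, and the 16-field row then fails.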

You can either read the column names from the file or pass the column names to read_csv(), e.g.

df = pd.read_csv(file, skiprows=1, dtype=str, header=0)

Or:

cols = ['RecCtr', 'Attom_ID', 'PeopleID', 'First_Name', 'Last_Name', ...]
df = pd.read_csv(file, skiprows=2, dtype=str, names=cols)

This fixes the number of columns at 16, so lines 1-4 all parse without error, and the missing trailing columns of lines 1-3 are filled with NaN.
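For example, reading a cut-down copy of your sample (one short row, one full row) with the full names list, the short row parses and its missing fields come back as NaN:

```python
import io
import pandas as pd

# Sample mirroring the file above: row 1 is short, row 4 has all 16 fields.
data = ""","File Inputs",,,,,,,,,,,"Email Category",,"Contact Info Category",
RecCtr,Attom_ID,PeopleID,"First Name","Last Name",AddressFullStreet,City,State,Zip," ","Individual Level Match"," ","Email Address"," ",Phone,"Phone Type"
1,19536969,80209511,ANTHONY1,MACCA1,"123 Main RD","Anytown",MA,12345
4,36946575,81869227,ANTHONY3,MACCA4,"123 Main RD","Anytown",MA,12345,,YES,,,,1234567890,Mobile
"""

cols = ['RecCtr', 'Attom_ID', 'PeopleID', 'First_Name', 'Last_Name',
        'AddressFullStreet', 'City', 'State', 'Zip', 'blank1',
        'Individual_Level_Match', 'blank2', 'Email_Address', 'blank3',
        'Phone', 'Phone_Type']

# names=cols fixes the column count at 16: the short row parses
# without error and its missing trailing fields become NaN.
df = pd.read_csv(io.StringIO(data), skiprows=2, dtype=str, names=cols)

print(df.shape)                      # (2, 16)
print(df['Phone_Type'].tolist())     # [nan, 'Mobile']
```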

If you know that the last column (or any other column) should have values then you can drop the rows with NaN in that column:

df.dropna(subset=['Phone_Type'])

Or:

df[df['Phone_Type'].notnull()]
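Both forms give the same result; a small sketch with a hypothetical two-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: only the second record has a Phone_Type.
df = pd.DataFrame({'RecCtr': ['1', '4'],
                   'Phone_Type': [np.nan, 'Mobile']})

# Both expressions keep only rows where Phone_Type is populated.
good = df.dropna(subset=['Phone_Type'])
assert good.equals(df[df['Phone_Type'].notnull()])

print(good['RecCtr'].tolist())  # ['4']
```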

2 Comments

I used your solution, but it didn't work. I changed it to names=headers and that worked.
Yes, my mistake, fixed.

If there are no column headers in your data and you want to add them, try it this way; it worked for me:

headers = ["col1", "col2", "col3", .....]
df = pd.read_csv("your filename.csv", names=headers)
df

