I have a series of VERY dirty CSV files.
They look like this:
,"File Inputs",,,,,,,,,,,"Email Category",,"Contact Info Category",
RecCtr,Attom_ID,PeopleID,"First Name","Last Name",AddressFullStreet,City,State,Zip," ","Individual Level Match"," ","Email Address"," ",Phone,"Phone Type"
1,19536969,80209511,ANTHONY1,MACCA1,"123 Main RD","Anytown",MA,12345
2,169874349,80707224,ANTHONY2,MACCA2,"123 Main RD","Anytown",MA,12345
3,1057347,81837554,ANTHONY3,MACCA3,"123 Main RD","Anytown",MA,12345
4,36946575,81869227,ANTHONY3,MACCA4,"123 Main RD","Anytown",MA,12345,,YES,,,,1234567890,Mobile
As you can see above, every row should have 16 elements. Data rows 1, 2, and 3 are bad (they stop after Zip, so they only have 9 elements); data row 4 is good.
I am using this piece of code in an attempt to read them.
df = pd.read_csv(file, skiprows=2, dtype=str, header=None)
df.columns = ['RecCtr', 'Attom_ID', 'PeopleID', 'First_Name', 'Last_Name', 'AddressFullStreet', 'City', 'State', 'Zip', 'blank1', 'Individual_Level_Match', 'blank2', 'Email_Address', 'blank3', 'Phone', 'Phone_Type']
# pd.np is not available in recent pandas versions; this swaps NaN for None without it
df = df.where(df.notna(), None)
My problem is that I don't know how to tell pandas that there are 16 elements per row, and that any row that does NOT have 16 elements should be skipped.
It appears that line 1 of my code (the read_csv call) treats rows 1-3 as good, and then row 4 becomes bad.
How do I specify how many columns there are, so that row 1 is skipped as bad along with the others?
Thanks
Change 1 (passing the column names via header):
headers = ['RecCtr', 'Attom_ID', 'PeopleID', 'First_Name', 'Last_Name', 'AddressFullStreet', 'City', 'State', 'Zip', 'blank1', 'Individual_Level_Match', 'blank2', 'Email_Address', 'blank3', 'Phone', 'Phone_Type']
df = pd.read_csv(file, skiprows=2, dtype=str, header=headers)
Response:
raise ValueError("header must be integer or list of integers")
ValueError: header must be integer or list of integers
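For reference, the ValueError arises because header expects row numbers; column names are normally passed through the names argument instead. Below is a rough sketch of two possible directions, run only against the sample rows above and not against the real files. The names SAMPLE, HEADERS and read_complete_rows are made up for illustration. The first part counts fields with the standard csv module and keeps only rows with exactly 16; the second part uses names= so that pandas accepts the short rows and pads them with NaN rather than skipping them.

```python
import csv
import io

import pandas as pd

HEADERS = ['RecCtr', 'Attom_ID', 'PeopleID', 'First_Name', 'Last_Name',
           'AddressFullStreet', 'City', 'State', 'Zip', 'blank1',
           'Individual_Level_Match', 'blank2', 'Email_Address', 'blank3',
           'Phone', 'Phone_Type']

# Stand-in for one file, copied from the sample in the question.
SAMPLE = ''',"File Inputs",,,,,,,,,,,"Email Category",,"Contact Info Category",
RecCtr,Attom_ID,PeopleID,"First Name","Last Name",AddressFullStreet,City,State,Zip," ","Individual Level Match"," ","Email Address"," ",Phone,"Phone Type"
1,19536969,80209511,ANTHONY1,MACCA1,"123 Main RD","Anytown",MA,12345
2,169874349,80707224,ANTHONY2,MACCA2,"123 Main RD","Anytown",MA,12345
3,1057347,81837554,ANTHONY3,MACCA3,"123 Main RD","Anytown",MA,12345
4,36946575,81869227,ANTHONY3,MACCA4,"123 Main RD","Anytown",MA,12345,,YES,,,,1234567890,Mobile
'''


def read_complete_rows(handle, names, skip=2):
    """Hypothetical helper: keep only rows with exactly len(names) fields."""
    reader = csv.reader(handle)
    for _ in range(skip):  # skip the two header lines
        next(reader, None)
    # Empty fields are stored as None; rows of the wrong length are dropped.
    rows = [[field or None for field in row]
            for row in reader if len(row) == len(names)]
    return pd.DataFrame(rows, columns=names)


# Option 1: drop every row that does not have 16 fields.
complete = read_complete_rows(io.StringIO(SAMPLE), HEADERS)

# Option 2: keep the short rows, padded with NaN, by naming all 16 columns.
padded = pd.read_csv(io.StringIO(SAMPLE), skiprows=2, header=None,
                     names=HEADERS, dtype=str)
```

With a real file, io.StringIO(SAMPLE) would be replaced by something like open(file, newline=''). Option 1 is the one that matches "skip anything that is not 16 elements"; option 2 only avoids the parser error, so the short rows would still need filtering afterwards.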