Pandas: handle missing column

Question

I'm using the following code to read a CSV file in chunks with pandas read_csv:

import pandas

headers = ["1", "2", "3", "4", "5"]
fields = ["1", "5"]

for chunk in pandas.read_csv(fileName, names=headers, header=0, usecols=fields, chunksize=chunkSize):
    ...  # process each chunk

Sometimes my CSV won't have column "5", and I want to be able to handle this case and specify some default values. Is there a way to read just the headers of my CSV file without reading the whole file, so I can handle this manually? Or maybe there is some other clever way to default the value for the missing column?

@cᴏʟᴅsᴘᴇᴇᴅ the thing is I need the value for column "5" for each row; however, sometimes the whole column "5" will be missing, so I have to fall back to default values. error_bad_lines=False will just ignore the row, no?
– Anton Belev, Commented Aug 2, 2017 at 14:54
Yes, you're right. Not sure about this one. I always believed pandas would fill NaNs by default.
– cs95, Commented Aug 2, 2017 at 14:57

EdChum · Accepted Answer · 2017-08-02 15:06:14Z


If you pass nrows=0, read_csv reads just the header row. You can then call intersection to find the column names common to your expected and actual columns, and so avoid any errors:

In[14]:
import io
import pandas as pd

t="""1,2,3,5,6
0,1,2,3,4"""
headers = ["1","2","3","4","5"]
fields = ["1", "5"]
cols = pd.read_csv(io.StringIO(t), nrows=0).columns
cols

Out[14]: Index(['1', '2', '3', '5', '6'], dtype='object')

Now that we have the column names, we can call intersection to find the valid columns, i.e. those present in both your expected and actual columns:

In[15]:
valid_cols = cols.intersection(headers)
valid_cols

Out[15]: Index(['1', '2', '3', '5'], dtype='object')

You can do the same with fields, and then pass the resulting valid columns to your current code to avoid any exceptions.
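As a rough illustration of that last step, here is a minimal sketch of one way to combine the header check with chunked reading and a default value. The sample data, the `defaults` mapping, and the chunk size are assumptions made up for the example, not part of the original question. The sketch also relies on the file's own header names rather than passing names=headers.

```python
import io

import pandas as pd

# Hypothetical sample data: this file has no column "5".
csv_text = """1,2,3,4
10,20,30,40
11,21,31,41
12,22,32,42"""

fields = ["1", "5"]
defaults = {"5": 0}  # assumed default value for each possibly-missing column

# Read only the header row to learn which columns actually exist.
actual_cols = pd.read_csv(io.StringIO(csv_text), nrows=0).columns

# Columns we want that are present, and those we must fill in ourselves.
present = list(actual_cols.intersection(fields))
missing = [c for c in fields if c not in present]

chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=present, chunksize=2):
    # Add each missing column, filled with its default value.
    for col in missing:
        chunk[col] = defaults[col]
    chunks.append(chunk[fields])  # restore the requested column order

result = pd.concat(chunks, ignore_index=True)
```

Collecting the chunks into `result` is only there to make the outcome easy to inspect; in a real chunked workflow you would more likely process each chunk inside the loop.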

Just to demonstrate that passing nrows=0 just reads the header row:

In[16]:
pd.read_csv(io.StringIO(t), nrows=0)

Out[16]: 
Empty DataFrame
Columns: [1, 2, 3, 5, 6]
Index: []

edited Aug 2, 2017 at 15:06

answered Aug 2, 2017 at 14:58

EdChum


2 Comments

Anton Belev Over a year ago

Yeah, I just found out about nrows, but I was about to test it with nrows=1. I didn't know the count starts from 0 (should've guessed). I will give it a try, thanks!

EdChum Over a year ago

Yeah, it's not obvious that you can do this. I will update the answer to prove it.
