pandas.read_csv() strange behavior for empty (default) values

Question

I have the following input file, trans.csv:

Date,Currenncy,Symbol,Type,Units,UnitPrice,Cost,Tax
2012-03-14,USD,AAPL,BUY,1000
2012-05-12,USD,SBUX,SELL,500

The fields UnitPrice, Cost and Tax are optional. If they are not specified I expect NaN in the DataFrame cell.

I read the csv file with:

t = pandas.read_csv('trans.csv', parse_dates=True, index_col=0)

and got the following result:

           Currenncy Symbol  Type  Units   UnitPrice       Cost       Tax
Date                                                                     
2012-03-14       USD   AAPL   BUY   1000  2012-05-12  012-05-12  12-05-12
2012-02-05       USD   SBUX  SELL    500         NaN        NaN       NaN

Why is there no NaN in the first row, and why is the date repeated? Is there a workaround to get NaN for the unspecified fields?

Added this as an issue on GitHub. The answer I posted should fix it for now (it catches when there is data in some of the columns)...
– Andy Hayden, commented Jan 9, 2013 at 15:26

Fredrick Brennan · Accepted Answer · 2013-01-09 15:10:57Z

3

Your CSV file is malformed. I get the same answer as you in Pandas 0.10, and while I admit that it is indeed very, very strange, you shouldn't be feeding it malformed data.

Date,Currenncy,Symbol,Type,Units,UnitPrice,Cost,Tax
2012-03-14,USD,AAPL,BUY,1000,,,
2012-05-12,USD,SBUX,SELL,500,,,

returns the expected result:

>>> import pandas as pd
>>> t = pd.read_csv('pandas_test', parse_dates=True, index_col=0)
>>> t
           Currenncy Symbol  Type  Units  UnitPrice  Cost  Tax
Date                                                          
2012-03-14       USD   AAPL   BUY   1000        NaN   NaN  NaN
2012-05-12       USD   SBUX  SELL    500        NaN   NaN  NaN
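
If the file cannot be fixed at the source, one option is to pad the short rows before parsing. The following is a minimal sketch using only the standard library. `pad_rows` is a hypothetical helper name, not part of pandas, and it assumes that the first line is the header and that no row is longer than the header.

```python
import csv
import io


def pad_rows(text):
    """Pad every row with empty fields so it is as wide as the header row.

    Hypothetical helper: assumes the first row of `text` is the header.
    """
    rows = list(csv.reader(io.StringIO(text)))
    width = len(rows[0])
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        if not row:
            continue  # skip blank lines
        # Append as many empty fields as the row is missing.
        writer.writerow(row + [""] * (width - len(row)))
    return out.getvalue()


raw = (
    "Date,Currenncy,Symbol,Type,Units,UnitPrice,Cost,Tax\n"
    "2012-03-14,USD,AAPL,BUY,1000\n"
    "2012-05-12,USD,SBUX,SELL,500\n"
)
fixed = pad_rows(raw)
print(fixed)
```

The padded text can then be handed to `pandas.read_csv` through `io.StringIO(fixed)` instead of a file name.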

edited Jan 9, 2013 at 15:10

answered Jan 9, 2013 at 14:54


7 Comments

ronnydw Over a year ago

That's one comma too many; now there are NaN Units, without an error message! Can't pandas make the extra commas optional? It looks cleaner.

Fredrick Brennan Over a year ago

"Looks cleaner". Why do you care what the original data looks like if you're parsing it? It doesn't matter that it looks cleaner if it's incorrect.

ronnydw Over a year ago

@hayden, indeed, feeding malformed data is part of life. Can't we expect pandas to handle this gracefully? It's not that malformed.

Andy Hayden Over a year ago

@rdw I think so. It looks like a bug.

Fredrick Brennan Over a year ago

"that malformed". Being malformed is binary: it's either malformed or it isn't.


Andy Hayden · Accepted Answer · 2013-01-09 15:27:48Z

2

Here's a method which can handle some more cases (when there is some data in UnitPrice, Cost, etc.).

In [1]: df = pd.read_csv('trans.csv', header=None)

In [2]: df.columns = df.iloc[0]  # .ix has been removed from pandas; .iloc selects the first row by position

In [3]: df[1:].set_index('Date')
Out[3]: 
           Currenncy Symbol  Type Units UnitPrice Cost  Tax
Date                                                       
2012-03-14       USD   AAPL   BUY  1000       NaN  NaN  NaN
2012-05-12       USD   SBUX  SELL   500       NaN  NaN  NaN

It's worth noting that the dtype of these columns will be object.
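
If numeric dtypes are needed afterwards, the columns can be converted explicitly. The following is a minimal sketch, assuming a reasonably recent pandas and using an in-memory copy of the question's file. The `numeric_cols` list is chosen here for illustration and is not part of the original answer.

```python
import io

import pandas as pd

raw = (
    "Date,Currenncy,Symbol,Type,Units,UnitPrice,Cost,Tax\n"
    "2012-03-14,USD,AAPL,BUY,1000\n"
    "2012-05-12,USD,SBUX,SELL,500\n"
)

# With header=None the column count comes from the first line (8 fields),
# and the shorter data rows are filled with NaN.
df = pd.read_csv(io.StringIO(raw), header=None)
df.columns = df.iloc[0]           # promote the first row to column labels
df = df[1:].set_index("Date")     # drop that row and index by date

# Illustrative choice of columns to convert from text to numbers.
numeric_cols = ["Units", "UnitPrice", "Cost", "Tax"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)

print(df.dtypes)
```

The Date index is still made of strings at this point; `pd.to_datetime(df.index)` could be used in the same way if dates are wanted.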

However, I think this should be caught by read_csv, so I posted it as an issue on GitHub.

edited Jan 9, 2013 at 15:27

answered Jan 9, 2013 at 15:15
