How to specify dtype when using pandas.read_csv to load data from csv files?

Question

I have some text files with the following format:

000423|东阿阿胶|     300|1|0.15000|            |
000425|徐工机械|     600|1|0.15000|            |
000503|海虹控股|     400|1|0.15000|            |
000522|白云山Ａ|        |2|       |    1982.080|
000527|美的电器|     900|1|0.15000|            |
000528|柳    工|     300|1|0.15000|            |

When I use read_csv to load them into a DataFrame, it doesn't infer the correct dtype for some columns. For example, the first column is parsed as int rather than unicode str, and the third column is parsed as unicode str rather than int because of one missing value. Is there a way to preset the dtypes of the DataFrame, as numpy.genfromtxt does?

Update: I used read_csv like this, which caused the problem:

data = pandas.read_csv(StringIO(etf_info), sep='|', skiprows=14, index_col=0, 
                       skip_footer=1, names=['ticker', 'name', 'vol', 'sign', 
                       'ratio', 'cash', 'price'], encoding='gbk')

In order to solve both the dtype and encoding problems, I need to use unicode() and numpy.genfromtxt first:

etf_info = unicode(urllib2.urlopen(etf_url).read(), 'gbk')
nd_data = np.genfromtxt(StringIO(etf_info), delimiter='|', 
                        skiprows=14, skip_footer=1, dtype=ETF_DTYPE)
data = pandas.DataFrame(nd_data, index=nd_data['ticker'],
                        columns=['name', 'vol', 'sign', 
                                 'ratio', 'cash', 'price'])

It would be nice if read_csv could add dtype and usecols settings. Sorry for my greed. ^_^

Indeed, some more work is needed on the file readers. See here: github.com/pydata/pandas/issues/926. Hopefully a magical developer will come out of the woodwork and help me out with this.
– Wes McKinney, Commented Mar 16, 2012 at 15:10

sapo_cosmico · Accepted Answer · 2018-03-01 17:00:17Z

5

Simply put: no, not yet. More work (read: more active developers) is needed in this particular area. If you could post how you're using read_csv, it might help. I suspect that the whitespace between the bars may be the problem.

EDIT: This answer is now obsolete. read_csv now covers this behavior natively.
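As a rough sketch of what that native support can look like on the question's data, assuming a reasonably recent pandas: the GBK-encoded bytes and the io.BytesIO wrapper below are stand-ins for the downloaded text and are not part of the original post.

```python
import io
import pandas as pd

# Two rows copied from the question, encoded as GBK to imitate the raw download.
raw = (
    "000423|东阿阿胶|     300|1|0.15000|            |\n"
    "000522|白云山Ａ|        |2|       |    1982.080|\n"
).encode("gbk")

cols = ["ticker", "name", "vol", "sign", "ratio", "cash", "price"]

data = pd.read_csv(
    io.BytesIO(raw),
    sep="|",
    names=cols,
    usecols=["ticker", "name"],  # load only these two columns
    dtype=str,                   # keep them as text so leading zeros survive
    encoding="gbk",              # decode the bytes while parsing
)
print(data)
```

Here usecols and encoding take over the jobs that unicode() and numpy.genfromtxt did in the question's workaround.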

edited Mar 1, 2018 at 17:00

sapo_cosmico


answered Mar 15, 2012 at 0:13

Wes McKinney



1 Comment

Deadwood Over a year ago

Thanks Wes. Just watched your PyCon video on Data analysis in Python with pandas from youtube. Great help!

Community · Accepted Answer · 2017-05-23 11:45:32Z

1

You can now use dtype in read_csv.
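A minimal sketch of per-column dtypes on the sample rows from the question, under a few assumptions: the pandas version in use supports the nullable "Int64" dtype, and skipinitialspace=True is my own choice for dealing with the padding between the bars (it is not mentioned in the thread).

```python
import io
import pandas as pd

# Three rows copied from the question; one has blank vol and ratio fields.
sample = (
    "000423|东阿阿胶|     300|1|0.15000|            |\n"
    "000522|白云山Ａ|        |2|       |    1982.080|\n"
    "000528|柳    工|     300|1|0.15000|            |\n"
)

cols = ["ticker", "name", "vol", "sign", "ratio", "cash", "price"]

data = pd.read_csv(
    io.StringIO(sample),
    sep="|",
    names=cols,
    skipinitialspace=True,  # skip padding after each bar; blank fields become missing
    dtype={
        "ticker": str,      # keeps "000423" instead of the integer 423
        "vol": "Int64",     # nullable integer, so a missing value does not force str or float
        "ratio": float,
    },
)
print(data.dtypes)
```

Columns left out of the dtype mapping are still inferred by pandas, so only the columns that were inferred wrongly need an entry.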

PS: Kudos to Wes McKinney for answering, it feels quite awkward to contradict the "past Wes".

edited May 23, 2017 at 11:45

CommunityBot


answered Jan 28, 2017 at 16:30

sapo_cosmico

