Specifying data type in Pandas csv reader

Question

I am just getting started with Pandas and I am reading in a CSV file using the read_csv() method. The difficulty I am having is preventing pandas from converting my telephone numbers to large numbers instead of keeping them as strings. I defined a converter which just left the numbers alone, but they were still converted to numbers. When I changed my converter to prepend a 'z' to the phone numbers, they stayed strings. Is there some way to keep them as strings without modifying the values of the fields?

Please show us your code

– Mike Pennington, Commented May 15, 2012 at 1:48
@Gardner: have you considered accepting an answer?

– tumultous_rooster, Commented Dec 14, 2015 at 2:58

zero323 · Accepted Answer · 2015-10-01 03:19:13Z

104

Since Pandas 0.11.0 you can use the dtype argument to explicitly specify the data type for each column:

d = pandas.read_csv('foo.csv', dtype={'BAR': 'S10'})
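As a minimal sketch of how this applies to the phone-number case in the question: the column names and sample data below are made up for illustration, and `str` is used in place of 'S10'. Passing a per-column dtype keeps the values as text, while the default inference turns them into integers.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for 'foo.csv'.
csv_text = "name,phone\nAlice,0015551234\nBob,0015559876\n"

# Without dtype, pandas infers an integer column and the leading zeros are lost.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype, the 'phone' column is kept as text exactly as written in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"phone": str})

print(inferred["phone"].tolist())
print(as_text["phone"].tolist())
```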

edited Oct 1, 2015 at 3:19

answered Aug 29, 2013 at 1:22

zero323


5 Comments

ReneSac Over a year ago

Note that this is not available (yet, hopefully) for some other input functions, like pandas.read_fwf()

zero323 Over a year ago

I revisited the topic, and support for dtype has already been added to pandas.read_fwf :)

Samyak Upadhyay Over a year ago

This method doesn't work for large datasets. Is there any other way to read a CSV and load only particular columns?
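One possible approach to the question above, sketched with made-up column names and data: `usecols` restricts parsing to the named columns, and `chunksize` makes `read_csv` return an iterator of smaller DataFrames. Both can be combined with `dtype`. Whether this helps with a particular large file is an assumption, not something the thread confirms.

```python
import io
import pandas as pd

# Hypothetical sample data; a real file path could be passed instead.
csv_text = (
    "id,phone,notes\n"
    "1,0015551234,first\n"
    "2,0015559876,second\n"
    "3,0015550000,third\n"
)

# usecols limits parsing to the named columns; dtype still applies to them.
subset = pd.read_csv(
    io.StringIO(csv_text), usecols=["id", "phone"], dtype={"phone": str}
)

# chunksize yields DataFrames of at most 2 rows instead of one large frame.
row_count = 0
reader = pd.read_csv(
    io.StringIO(csv_text), usecols=["phone"], dtype={"phone": str}, chunksize=2
)
for chunk in reader:
    row_count += len(chunk)

print(subset)
print(row_count)
```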

Itamar Katz Over a year ago

This doesn't work when the input is a BytesIO object; I get the error "EmptyDataError: No columns to parse from file". Is there any way to solve this?
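One possible cause of that error, offered as an assumption since the comment shows no code, is that the buffer's position is already at the end when it is handed to `read_csv`, so the parser sees no data. A sketch with made-up data that rewinds the buffer first:

```python
import io
import pandas as pd

# Build an in-memory bytes buffer; the content is hypothetical sample data.
buffer = io.BytesIO()
buffer.write(b"name,phone\nAlice,0015551234\n")

# After writing, the stream position is at the end of the buffer.
# Rewind it so read_csv starts parsing from the first byte.
buffer.seek(0)

df = pd.read_csv(buffer, dtype={"phone": str})
print(df)
```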

pasx Over a year ago

To convert to a string the documentation recommends using 'str'

lbolla · Accepted Answer · 2012-05-28 08:16:28Z

21

It looks like you can't prevent pandas from trying to convert numeric/boolean values in the CSV file. Take a look at the pandas source code for the IO parsers, in particular the functions _convert_to_ndarrays and _convert_types: https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py

You can always assign the type you want after you have read the file:

df.phone = df.phone.astype(str)

answered May 28, 2012 at 8:16

lbolla


4 Comments

nom-mon-ir Over a year ago

Thanks @lbolla, this helped with one of my bug fixes, where a float value was read as a string because another column was a string, which later caused issues in aggregation functions. I had to do df['col'] = df['col'].astype('float64')

hihell Over a year ago

Say I have a column of IDs (all ints) that I'd like to use as strings, but under some conditions pandas reads them as floats (1 -> 1.0, 2 -> 2.0). Unless I convert them back to int first, they become '1.0', '2.0', which is not desirable. That's why I just want pandas to read the column as strings.
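The situation described in this comment can be reproduced with a small sketch. The sample data is made up, and a missing value is assumed to be the condition that makes pandas choose a float column:

```python
import io
import pandas as pd

# Hypothetical data: the second row has no id.
csv_text = "id,name\n1,a\n,b\n2,c\n"

# The missing value makes pandas infer float64, so 1 becomes 1.0.
inferred = pd.read_csv(io.StringIO(csv_text))
late = inferred["id"].astype(str)

# Asking for strings at read time keeps the text as it appears in the file.
early = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

print(late.iloc[0])
print(early["id"].iloc[0])
```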

Natacha Over a year ago

This is not the answer. Your solution doesn't solve problems such as memory errors on big files.

Colin Anthony Over a year ago

This won't solve issues where leading zeros get lost.
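A short sketch of the leading-zero problem, using made-up data. Converting after the read produces strings, but the zeros were already dropped during inference. If every value is known to have the same width (an assumption that will not hold for all data), `Series.str.zfill` can pad them back:

```python
import io
import pandas as pd

# Hypothetical data: 10-digit phone numbers with leading zeros.
csv_text = "phone\n0015551234\n0015559876\n"

df = pd.read_csv(io.StringIO(csv_text))

# The column was inferred as integers, so astype(str) cannot recover the zeros.
converted = df["phone"].astype(str)

# Only valid if all values are known to be exactly 10 characters wide.
restored = converted.str.zfill(10)

print(converted.tolist())
print(restored.tolist())
```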

wjandrea · Accepted Answer · 2024-11-04 20:40:50Z

0

I had luck by reading the entire file in as string, then manually specifying datatypes later. In my situation, I had a column which had IDs that could contain strings like "08" which would be different from an ID of "8".

The first thing I tried was df = pd.read_csv(dtype={"ID": str}) but for some reason, this was still converting "08" to "8" (at least it was still a string, but it must have been interpreted as an integer first, which removed the leading 0).

The thing that worked for me was this: df = pd.read_csv(dtype=str) And then I could go through and manually assign other columns their datatypes as needed like @lbolla mentioned.
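A minimal sketch of this read-everything-as-string approach follows. The column names and data are made up, and `pd.to_numeric` is just one way to assign numeric types afterwards:

```python
import io
import pandas as pd

# Hypothetical data: "08" and "8" are meant to be different IDs.
csv_text = "ID,qty,price\n08,3,1.50\n8,10,2.25\n"

# dtype=str applies to every column, so no value is parsed as a number.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Convert only the columns that really are numeric.
df["qty"] = pd.to_numeric(df["qty"])
df["price"] = pd.to_numeric(df["price"])

print(df["ID"].tolist())
print(df.dtypes)
```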

For some reason, applying the data type across the entire document skipped the type inference step I suppose. Annoying this isn't the default behavior when specifying a specific column data type :(

edited Nov 4, 2024 at 20:40

wjandrea


answered Nov 4, 2024 at 19:16

Jimmy LeBaron


2 Comments

wjandrea Over a year ago

"applying the data type across the entire document skipped the type inference step I suppose" - Yeah, what else would it do? I'm not sure if I'm confused about what you're saying or if you're confused about how read_csv works.

wjandrea Over a year ago

You might be interested in DataFrame.infer_objects()
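For context, a small sketch of what `DataFrame.infer_objects()` does, using made-up data. It performs a soft conversion of object columns that already hold Python numbers, but it does not parse numeric-looking strings; for those, `pd.to_numeric` is one option.

```python
import pandas as pd

# Object column holding Python ints: infer_objects finds a numeric dtype.
numbers = pd.DataFrame({"a": [1, 2, 3]}, dtype=object)
soft_numbers = numbers.infer_objects()

# Object column holding numeric-looking strings: they stay text.
texts = pd.DataFrame({"a": ["1", "2", "3"]}, dtype=object)
soft_texts = texts.infer_objects()

# Explicit parsing is needed to turn the strings into numbers.
parsed = pd.to_numeric(texts["a"])

print(soft_numbers.dtypes)
print(soft_texts.dtypes)
print(parsed.dtype)
```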

Paras · Accepted Answer · 2024-11-05 05:32:48Z

0

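Use low_memory=False while reading the file so that pandas infers each column's dtype from the whole file at once instead of chunk by chunk. This avoids mixed-type columns, but it does not turn type inference off.

df = pd.read_csv('somefile.csv', low_memory=False)

Define dtypes while reading the file to force a column to be read as an object:

df = pd.read_csv('somefile.csv', dtype={'phone': object})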
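The two options can be compared on a small made-up sample. In this sketch, `low_memory=False` on its own still yields an integer column, while `dtype={'phone': object}` keeps the original text:

```python
import io
import pandas as pd

# Hypothetical sample data standing in for 'somefile.csv'.
csv_text = "name,phone\nAlice,0015551234\n"

# low_memory=False changes how the file is processed, not whether types are inferred.
default = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# An explicit object dtype keeps the phone values as Python strings.
forced = pd.read_csv(io.StringIO(csv_text), dtype={"phone": object})

print(default["phone"].dtype)
print(forced["phone"].iloc[0])
```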

Official Pandas Docs

answered Nov 5, 2024 at 5:32

Paras

