
from_csv picks up a '04' as one of the values and converts it to an int. How do I make sure that all columns are read in as strings? I want to avoid handling individual columns, as there are 114 columns and I do not want to go through the exercise of analyzing which columns are impacted.

  • CORRECTION: and converts it to an INT Commented Mar 7, 2017 at 15:49
  • Not really a duplicate. Loading from CSV is not the problem; the problem is when you use the from_csv method of DataFrame Commented Mar 7, 2017 at 15:53
  • @PankajSingh you can edit your question to include corrections... Commented Mar 7, 2017 at 15:55
  • pandas.pydata.org/pandas-docs/stable/generated/… dtype : Type name or dict of column -> type, default None Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} (Unsupported with engine=’python’). Use str or object to preserve and not interpret dtype. -- > dtype=str Commented Mar 7, 2017 at 16:07
  • You can just do df = pd.read_csv(your_filepath, dtype=str) Commented Mar 7, 2017 at 16:07

2 Answers


If you want all columns to be str then pass dtype=str to read_csv:

df = pd.read_csv(file_path, dtype=str)

This will preserve any leading zeroes.

Example:

In [54]:
import io
import pandas as pd

t = """a,b
001,230
01,003"""
df = pd.read_csv(io.StringIO(t), dtype=str)
df

Out[54]:
     a    b
0  001  230
1   01  003

Here the dtypes are listed as object, which is the correct dtype for str:

In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
a    2 non-null object
b    2 non-null object
dtypes: object(2)
memory usage: 112.0+ bytes



If you have only a limited number of columns to read as strings:

Instead of from_csv, use read_csv (here the documentation) and set

dtype={ 'your_column_name': str }
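As a sketch of that per-column approach (the column names code and qty are made up for illustration), only the named column is read as a string while the rest are still inferred:

```python
import io
import pandas as pd

# toy CSV: 'code' has leading zeroes we want to keep
t = "code,qty\n04,7\n010,3"

# read only 'code' as str; 'qty' is still parsed as an integer
df = pd.read_csv(io.StringIO(t), dtype={"code": str})
```

With 114 columns this means listing every affected name, which is exactly what a blanket dtype=str avoids.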

If all the data should be considered a string:

Edit: As pointed out in the comments, the suggested solution removes leading zeroes from the data. EdChum's answer handles this case as requested.

Just convert the data after reading it with df.astype(str). You can also convert a subset of columns (for which you will still need the names) by putting the names in a list and then doing df[list_of_column_names] = df[list_of_column_names].astype(str)
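A quick sketch of the pitfall mentioned in the edit above: converting after the read cannot restore digits the parser has already dropped.

```python
import io
import pandas as pd

t = "a\n04"

# parsed as int first: '04' becomes 4, so astype(str) gives '4'
converted_after = pd.read_csv(io.StringIO(t))["a"].astype(str).iloc[0]

# dtype=str keeps the original text '04'
read_as_str = pd.read_csv(io.StringIO(t), dtype=str)["a"].iloc[0]
```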

4 Comments

That is exactly what I wanted to avoid. I have 114 columns; the above suggestion would require setting the datatype for all 114.
See the updated answer for more options to avoid specifying all column names
Correct me if I'm wrong, but if converted after reading, then '04' that was parsed as 4 will become '4', which means information may be lost.
In that case leading zeroes are removed, yes.
