
from_csv picks up a '04' as one of the values and converts it to an int. How do I make sure that all columns are read in as strings? I want to avoid handling individual columns, as there are 114 columns and I do not want to go through the exercise of analyzing which columns are impacted.

  • CORRECTION: and converts it to an INT Commented Mar 7, 2017 at 15:49
  • Not really a duplicate. Loading from CSV is not the problem; the problem is when you use the from_csv method of DataFrame Commented Mar 7, 2017 at 15:53
  • @PankajSingh you can edit your question to include corrections... Commented Mar 7, 2017 at 15:55
  • pandas.pydata.org/pandas-docs/stable/generated/… dtype : Type name or dict of column -> type, default None Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} (Unsupported with engine=’python’). Use str or object to preserve and not interpret dtype. -- > dtype=str Commented Mar 7, 2017 at 16:07
  • You can just do df = pd.read_csv(your_filepath, dtype=str) Commented Mar 7, 2017 at 16:07

2 Answers


If you want all columns to be str then pass dtype=str to read_csv:

df = pd.read_csv(file_path, dtype=str)

This will preserve any leading zeroes.

Example:

In [54]:
import io
import pandas as pd

t = """a,b
001,230
01,003"""
df = pd.read_csv(io.StringIO(t), dtype=str)
df

Out[54]:
     a    b
0  001  230
1   01  003

Here the dtypes are listed as object, which is the correct dtype for str:

In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
a    2 non-null object
b    2 non-null object
dtypes: object(2)
memory usage: 112.0+ bytes



If you have only a limited number of columns to read as strings:

Instead of from_csv, use read_csv (here the documentation) and set

dtype={ 'your_column_name': str }
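As a sketch of that per-column approach (the column names code and qty are made up for illustration), only the named column is read as a string while the rest are still inferred:

```python
import io
import pandas as pd

# toy CSV: 'code' has leading zeroes we want to keep
t = "code,qty\n04,7\n010,3"

# read only 'code' as str; 'qty' is still parsed as an integer
df = pd.read_csv(io.StringIO(t), dtype={"code": str})
```

With 114 columns this means listing every affected name, which is exactly what a blanket dtype=str avoids.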

If all the data should be considered a string:

Edit: As pointed out in the comments, the suggested solution removes leading zeroes from the data. EdChum's answer handles this case as requested.

Just convert the data after reading it with df.astype(str). You can also convert a subset of columns (for which you will still need the names) by putting the names in a list and then doing df[list_of_column_names] = df[list_of_column_names].astype(str)
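A quick sketch of the pitfall mentioned in the edit above: converting after the read cannot restore digits the parser has already dropped.

```python
import io
import pandas as pd

t = "a\n04"

# parsed as int first: '04' becomes 4, so astype(str) gives '4'
converted_after = pd.read_csv(io.StringIO(t))["a"].astype(str).iloc[0]

# dtype=str keeps the original text '04'
read_as_str = pd.read_csv(io.StringIO(t), dtype=str)["a"].iloc[0]
```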

4 Comments

That is exactly what I wanted to avoid. I have 114 columns; the above suggestion would require setting the datatype for all 114.
See the updated answer for more options to avoid specifying all column names
Correct me if I'm wrong, but if converted after reading, then '04' that was parsed as 4 will become '4', which means information may be lost.
In that case leading zeroes are removed, yes.
