Pandas read columns from csv in a given data type with unknown column name

Question

I am trying to import a dataframe (df_model) from an Excel file. The first column in the Excel file contains the integers 1, 2, 3, 4, 5, and I want to read them as integers rather than floats. However, whenever I read the file with pandas, the values in the first column are converted to decimals such as 1.0, 2.0, 3.0, 4.0, 5.0. The values in the remaining columns stay the way I want. Here is the dataframe that pandas reads:

    Std S_Ultra S_Classic  ... SMV34_Ultra SMV34_Classic SMV34_Ultra for Flow
0    1.0      1A        1A  ...         1.0           1.0                  2.0
1    2.0      2A        2A  ...         2.0           2.0               2 SP=5
2    3.0      3A        3A  ...      2 SP=5        2 SP=5                  3.0
3    4.0      4A        4A  ...         3.0           3.0               3 SP=5
4    5.0      5A        5A  ...      3 SP=5        3 SP=5                  NaN
..   ...     ...       ...  ...         ...           ...                  ...
100  NaN     NaN       NaN  ...         NaN           NaN                  NaN

Is it possible to stop pandas from converting the first column to decimal values by default?

Himanshu Poddar · Accepted Answer · 2022-07-13 06:42:37Z

Yes, you can specify the column's type while reading with pandas read_csv:

import pandas as pd
df = pd.read_csv('filename.csv', dtype={'Std': 'Int32'})

'Int32' (capital I) is a nullable integer dtype, so pandas keeps the column as integers and shows the missing values as <NA>.
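To see the difference the nullable dtype makes, here is a minimal sketch. The inline CSV text and the names `csv_text`, `df_default`, and `df_nullable` are made up for illustration and stand in for the real file; 'Int64' is used here, but 'Int32' behaves the same way.

```python
import io

import pandas as pd

# Hypothetical sample data: the last row is empty, like the NaN rows in the question.
csv_text = "Std,S_Ultra\n1,1A\n2,2A\n,\n"

# Default inference: the missing value forces the column to float64.
df_default = pd.read_csv(io.StringIO(csv_text))

# Nullable integer dtype: values stay integers, the gap becomes <NA>.
df_nullable = pd.read_csv(io.StringIO(csv_text), dtype={"Std": "Int64"})

print(df_default["Std"].dtype)   # float64
print(df_nullable["Std"].dtype)  # Int64
print(df_nullable)
```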

EDIT: As discussed in the comments, the column names are not known beforehand; what is known is that the first (or nth) column will contain int, float, or string data.

While reading the data, we can specify the column number and the data type, and the column will be read in the data type you specify. We skip the header row, read it separately, and assign the headers afterwards.

Here 0 is the position of the first column. With header=None the column labels are integers, so the dtype key must be the integer 0, not the string '0'. The nullable 'Int64' is used because a plain 'int' fails when the column has missing values.

df = pd.read_csv(r'filename.csv', skiprows=1, dtype={0: 'Int64'}, header=None)
headers = pd.read_csv(r'filename.csv', nrows=0).columns
df.columns = headers

The above code will give you the expected output
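A variation on the same idea, sketched under the assumption that only the position of the column is known: read just the header row first, look up the name at that position, and pass it to dtype by name. The inline CSV and the names `csv_text` and `first_col` are illustrative only.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for filename.csv.
csv_text = "Std,S_Ultra\n1,1A\n2,2A\n,\n"

# Pass 1: nrows=0 reads only the header, so this is cheap even for large files.
first_col = pd.read_csv(io.StringIO(csv_text), nrows=0).columns[0]

# Pass 2: read the data, typing the first column by whatever name it has today.
df = pd.read_csv(io.StringIO(csv_text), dtype={first_col: "Int64"})

print(first_col)
print(df.dtypes)
```

This keeps working if someone renames the first column, since the name is never hard-coded. read_excel also accepts nrows and dtype, so the same pattern should carry over to the Excel file in the question, though that is not tested here.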

EDIT2: It is not possible to know which columns are integer, float, or string without making one pass over the CSV. You need that information beforehand if you don't want pandas to read an int column as object data type. And if you are making one pass to get that information anyway, why not simply convert the columns after reading? Either way, you must either do one pass over the data or already know which column numbers contain which data type.
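The "convert after reading" route could look roughly like the sketch below. The helper name `whole_floats_to_int` and the sample data are hypothetical; the rule it applies (a float column whose non-missing values are all whole numbers becomes nullable 'Int64') is one reasonable choice, not the only one.

```python
import io

import pandas as pd


def whole_floats_to_int(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where float columns holding only whole numbers become Int64."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if pd.api.types.is_float_dtype(col):
            present = col.dropna()
            # Skip all-missing columns and columns with real fractions.
            if not present.empty and (present % 1 == 0).all():
                out[name] = col.astype("Int64")
    return out


# Hypothetical sample: "Std" is whole numbers plus a gap, "Ratio" has fractions.
csv_text = "Std,S_Ultra,Ratio\n1,1A,0.5\n2,2A,1.5\n,,\n"
raw = pd.read_csv(io.StringIO(csv_text))
result = whole_floats_to_int(raw)

print(raw.dtypes)
print(result.dtypes)
```

Columns that mix numbers and text, such as values like "2 SP=5" in the question, are read as strings and are left untouched by this helper.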

Geisson · Accepted Answer · 2022-07-12 06:28:02Z

With the pandas read_excel() or read_csv() functions, you can pass the 'dtype' parameter to specify the type you want any column to have.

In your case, you can add the parameter like this:

# Nullable 'Int64' rather than int, because the column contains NaN rows
df_model = pd.read_excel('filename.xlsx', dtype={'Std': 'Int64'})

5 Comments

Muhammad Farzan Bashir Over a year ago

Hi Geisson, I understand this, but can you please let me know how to avoid this in future by using a general column reference in the modification you suggested? I mean, is there a way to just say that the 1st column of the data frame should be read as int, instead of saying 'Std'? Just to keep it general, because if someone changes the first column's name, your modification should still remain valid.

Himanshu Poddar Over a year ago

Hi @MuhammadFarzanBashir in that case you can read the data with header=None, and then specify the column number in dtype. Let me know if you need the code for this

Muhammad Farzan Bashir Over a year ago

Hi @HimanshuPoddar, buddy, can you please provide the code for when we want all columns containing only numeric values to show integer values (as your modification does for the Std column)? This is because my dataframe is a dynamic one, and I am not sure which columns will contain only numeric values in future, so I want something that can convert all numeric-only columns to int type.

Himanshu Poddar Over a year ago

Hi @MuhammadFarzanBashir can you try my updated answer and let me know if it works for you

Himanshu Poddar Over a year ago

@MuhammadFarzanBashir which columns should be read as numeric, and the same for other types, cannot be known without one pass over your data: you have to actually go through the data to find the data type of each column, and then read the dataframe again with the data types you figured out
