
The below code simply reads in an excel file, stores it as a df, and writes the df back into an excel file. When I open the output file in excel, the columns (dates, numbers) are not the same... some are text, some are numbers, etc.

import pandas as pd
df = pd.read_csv("test.csv", encoding="ISO-8859-1", dtype=object)


writer = pd.ExcelWriter('outputt.xlsx', engine='xlsxwriter')
df.to_excel(writer, index=False, sheet_name='Sheet1')  # drop the index
writer.save()

Is there a way to have the column types (as defined in the initial file) be preserved, or to revert to the datatypes from when the file was read in?

You are not reading an excel file, you are reading a csv file. CSV is just a plain text file; no information about data types is stored there. So in fact you are reading a csv, outputting it to an excel file type, and then opening it in excel. It sounds like pd.DataFrame.to_csv('filename.csv') might be worth trying. At least save it in the format you read it in and see how that works for you. Commented Mar 28, 2019 at 19:18

1 Answer


You are reading in a csv file, which is certainly not the same as an excel file. You can open a csv file with Excel on Windows, but the encoding differs when the file is saved. You can certainly format cells according to xlsxwriter's specifications.

However, it is important to note that xlsxwriter cannot format cells that already have a format applied, such as the header or index, or date and datetime objects. If you have multiple datatypes in a single column, that is also problematic, as pandas will default that column to object. A column of type "object" has its type inferred again on write, so it will be dynamically assigned as a "best guess". A sketch of column-level formatting follows below.
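
As a rough illustration of what that column-level formatting looks like when writing through pandas (the column letter, width, and format string below are assumptions, not taken from your file):

import pandas as pd

df = pd.read_csv("test.csv", encoding="ISO-8859-1")

with pd.ExcelWriter('outputt.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, index=False, sheet_name='Sheet1')
    workbook = writer.book               # underlying xlsxwriter Workbook
    worksheet = writer.sheets['Sheet1']  # underlying xlsxwriter Worksheet
    # Apply a two-decimal number format to column B (assumed to hold floats).
    # This sets the column default; it won't override cells pandas already formatted.
    two_decimals = workbook.add_format({'num_format': '0.00'})
    worksheet.set_column('B:B', 12, two_decimals)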

When you read your csv in, you should specify the format if you want it to be maintained. Right now you are letting pandas infer it dynamically (pandas guesses each column's type from the values it reads, so the result can vary with the data).

Change the line where you read the file in to include dtypes and they will be preserved in the output. I am going to assume your columns have the headers "ColumnA", "ColumnB", "ColumnC":

import pandas as pd
df = pd.read_csv("test.csv", encoding="ISO-8859-1", dtype={'ColumnA': int,
                                                           'ColumnB': float,
                                                           'ColumnC': str})
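
A quick way to confirm the types took hold (the dtypes shown assume the three example columns above):

print(df.dtypes)
# ColumnA      int64
# ColumnB    float64
# ColumnC     object
# dtype: object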

Let's use "ColumnC" as an example of a date column. I like to first read dates in as strings, then enforce the formatting I want. So you could add this:

df['ColumnC'] = pd.to_datetime(df['ColumnC']).dt.strftime('%m/%d/%Y')
# date would look like: 06/08/2016, but you can look at other strftime directives for different formats
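
Alternatively, you can keep the column as a real datetime and let the writer render it: pandas' ExcelWriter accepts a datetime_format argument with the xlsxwriter engine, which keeps the cell a true Excel date rather than text. A sketch, assuming ColumnC is the date column:

import pandas as pd

# parse_dates converts ColumnC to datetime64 on read
df = pd.read_csv("test.csv", encoding="ISO-8859-1", parse_dates=['ColumnC'])

# datetime_format controls how datetime values are rendered in the workbook
with pd.ExcelWriter('outputt.xlsx', engine='xlsxwriter',
                    datetime_format='mm/dd/yyyy') as writer:
    df.to_excel(writer, index=False, sheet_name='Sheet1')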

This will ensure specific types in the output. Further formatting can be applied, such as the number of decimals in a float or displaying values as percents, following the xlsxwriter documentation (the column-format sketch above shows the mechanism).

My advice if you have columns with multiple data types: Don't. This is unorganized and makes use cases much more complex for downstream applications. Spend more time organizing data on the front end so you have less headache on the back end.


2 Comments

Thank you. When assigning the column data types, how would I handle dates? I am currently formatting dates to match “Short Date” in Excel using %#m/dd/%Y, but when I open the output in Excel, the date format is “General”, not Short Date. This causes problems later down the road when running Excel macros downstream that use the date (let me know if that makes sense).
I updated the answer. I generally like to read dates in as str and then parse them myself, because that way I know what they will look like!
