python pandas incorrectly reading excel dates

Question

I have an excel file with dates formatted as such:

22.10.07 16:00
22.10.07 17:00
22.10.07 18:00
22.10.07 19:00

After using the parse method of pandas to read the data, the dates are read almost correctly:

In [55]: nts.data['Tid'][10000:10005]
Out[55]:
10000    2007-10-22 15:59:59.997905
10001    2007-10-22 16:59:59.997904
10002    2007-10-22 17:59:59.997904
10003    2007-10-22 18:59:59.997904

How can I either (a) get pandas to read these values correctly, or (b) fix them easily afterwards (e.g. with some kind of 'round' function for datetimes)?

Current fix: create a new time vector: time = [t0 + dt.timedelta(hours=i) for i in range(len(nts.data))]
– John, Commented Apr 8, 2013 at 8:53
Can you put the Excel file (at least the relevant records, maybe obfuscating or deleting columns) online?
– Raffaele, Commented Apr 8, 2013 at 9:12
Raffaele, here's an example dl.dropbox.com/u/18478297/dummydata.xlsx
– John, Commented Apr 8, 2013 at 14:38

Ben · Accepted Answer · 2014-06-06 16:48:43Z

I encountered the same issue and got around it by not parsing the dates using Pandas, but rather applying my own function (shown below) to the relevant column(s) of the dataframe:

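import datetime as dt
import pandas as pd

def ExcelDateToDateTime(xlDate):
    # Day zero of Excel's 1900 date system (valid for dates after Feb 1900)
    epoch = dt.datetime(1899, 12, 30)
    # Round the serial number (in days) to the nearest whole hour
    delta = dt.timedelta(hours=round(xlDate * 24))
    return epoch + delta

# DataFrame.from_csv has been removed from pandas; read_csv replaces it
df = pd.read_csv('path')

# 'Date' must hold the raw Excel serial numbers (floats), not parsed dates
df['Date'] = df['Date'].apply(ExcelDateToDateTime)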

Note: This will ignore any time granularity below the hour level, but that's all I need, and it looks from your example that this could be the case for you too.
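If sub-hour detail matters, the same idea can be generalized to round to any resolution. The following is a minimal sketch rather than part of the original answer: the name excel_serial_to_datetime and its resolution parameter are hypothetical, and it assumes the input is a raw serial number from Excel's 1900 date system.

```python
import datetime as dt

# Day zero of Excel's 1900 date system (assumption: serials after Feb 1900).
EXCEL_EPOCH = dt.datetime(1899, 12, 30)

def excel_serial_to_datetime(serial, resolution=dt.timedelta(seconds=1)):
    """Convert an Excel serial number to a datetime rounded to `resolution`."""
    elapsed = dt.timedelta(days=serial)
    # timedelta / timedelta yields a float count of resolution steps
    steps = round(elapsed / resolution)
    return EXCEL_EPOCH + steps * resolution

# Default: round to the nearest second
print(excel_serial_to_datetime(38961.666666666628))
# Coarser: round to the nearest hour, as in the function above
print(excel_serial_to_datetime(38961.7, dt.timedelta(hours=1)))
```

Rounding to the nearest second absorbs the tiny floating-point error while keeping minutes and seconds intact; passing a one-hour resolution behaves like the hour-level function above.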

Raffaele · Accepted Answer · 2013-04-08 21:25:02Z

Excel serializes datetimes in a ddddd.tttttt format, where the d part is an integer giving the offset in days from a reference day (in the default 1900 date system this is effectively Dec 30th, 1899 for any date after February 1900, because Excel counts a nonexistent Feb 29th, 1900), and the t part is a fraction between 0.0 and 1.0 giving the elapsed part of the day (for example 0.5 at 12:00, 0.75 at 18:00, and so on).
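As a rough illustration of that layout, the sketch below splits a serial number into its day offset and day fraction and recombines them. The helper names split_serial and to_datetime are made up for this example, and it assumes 1899-12-30 as day zero, which is the offset that matches Excel's 1900 date system for dates after February 1900.

```python
from datetime import datetime, timedelta

# Assumed day zero for Excel's 1900 date system (serials after Feb 1900).
EXCEL_EPOCH = datetime(1899, 12, 30)

def split_serial(serial):
    """Return (whole days, fraction of a day) for an Excel serial number."""
    days, fraction = divmod(serial, 1)
    return int(days), fraction

def to_datetime(serial):
    """Combine the day offset and the day fraction into a datetime."""
    days, fraction = split_serial(serial)
    return EXCEL_EPOCH + timedelta(days=days) + timedelta(days=fraction)

# A fraction of 0.5 is noon and 0.75 is 18:00
for serial in (38961.0, 38961.5, 38961.75):
    print(serial, split_serial(serial), to_datetime(serial))
```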

I asked you to upload a file with sample data. .xlsx files are really ZIP archives that contain your XML-serialized worksheets. These are the dates I extracted from the relevant column. Excerpt:

38961.666666666628
38961.708333333292
38961.749999999956

When you deserialize these values manually, you get the same datetimes as pandas. Unfortunately, Excel stores the time of day as a floating-point fraction, so some times cannot be represented exactly (16:00 is 2/3 of a day, for instance) and the stored value can fall slightly short of the intended time. You therefore have to round for display purposes. I'm not sure whether rounded data is needed for analysis, though.
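To get a feel for the size of the error, here is a small sketch that measures how far each value in the excerpt is from the nearest whole hour. The function name hour_error_seconds is hypothetical, and it assumes the intended times fall exactly on the hour, as they do in the question.

```python
def hour_error_seconds(serial):
    """Seconds by which a serial number misses the nearest whole hour."""
    hours = serial * 24
    return (hours - round(hours)) * 3600

# Values quoted in the excerpt above
serials = [38961.666666666628, 38961.708333333292, 38961.749999999956]
errors = [hour_error_seconds(s) for s in serials]

for s, e in zip(serials, errors):
    print(f"{s!r}: {e * 1e6:+.2f} microseconds from the hour")
```

For these three values the error should come out negative and far smaller than a millisecond, which is why a conversion that truncates shows times just before the hour instead of on it.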

This is the script I used to check that the deserialized datetimes really are the same as the ones pandas produces:

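from datetime import date, datetime, time, timedelta
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen

def deserialize(text):
  # Split "ddddd.tttttt" into the day count and the fraction of a day
  tokens = text.strip().split(".")
  date_tok = tokens[0]
  time_tok = tokens[1] if len(tokens) == 2 else "0"
  # Day zero is 1899-12-30 for serials after Feb 1900 (Excel treats 1900 as a leap year)
  d = date(1899, 12, 30) + timedelta(int(date_tok))
  t = time(*helper(float("0." + time_tok), (24, 60, 60, 1000000)))
  return datetime.combine(d, t)

def helper(factor, units):
  # Peel off hours, minutes, seconds and microseconds, truncating at each step
  result = list()
  for unit in units:
    value, factor = divmod(factor * unit, 1)
    result.append(int(value))
  return result

url = "https://gist.github.com/RaffaeleSgarro/877d7449bd19722b44cb/raw/" \
      "45d5f0b339d4abf3359fe673fcd2976374ed61b8/dates.txt"

for line in urlopen(url):
  # urlopen yields bytes in Python 3, so decode before parsing
  print(deserialize(line.decode("ascii")))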
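For the rounding itself, pandas can do both the conversion and the 'round' step the question asks about. This is a sketch under assumptions, not something taken from either answer: it assumes the serial numbers are available as floats, and it uses 1899-12-30 as the origin, which matches Excel's 1900 date system for dates after February 1900.

```python
import pandas as pd

# Serial numbers quoted in the excerpt above
serials = pd.Series([38961.666666666628, 38961.708333333292, 38961.749999999956])

# Interpret the floats as days since 1899-12-30 (assumed day zero for
# Excel's 1900 date system, valid for serials after Feb 1900).
raw = pd.to_datetime(serials, unit="D", origin="1899-12-30")

# Snap each timestamp to the nearest second to absorb the float error.
rounded = raw.dt.round("s")
print(rounded)
```

If the column has already been parsed to datetimes, as with nts.data['Tid'] in the question, the .dt.round("s") step on its own may be enough.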
