
I've been using

pd.read_csv('file.csv',parse_dates=['date_time']) 

to parse dates, and then build a DatetimeIndex to extract year, month, and day from the date_time column. When parsing succeeds, 'date_time' should come back as datetime64, but something in the column keeps leaving it as 'object', so I get a ValueError when I try to build the DatetimeIndex. The data is too big for me to find the offending values by eye. How should I handle this so I can, say, change the anomalies to missing and get the date_time column parsed? Thanks.
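The usual way to handle this is `pd.to_datetime` with `errors='coerce'`, which turns any unparseable value into `NaT` (missing) instead of raising, so the column ends up as datetime64 regardless. A minimal sketch, using made-up sample data with one malformed value:

```python
import pandas as pd

# hypothetical sample: one malformed value among valid timestamps
df = pd.DataFrame({'date_time': ['2015-09-03 17:09:54',
                                 '1-255-255 00:00:00',
                                 '2015-06-07 15:53:02']})

# errors='coerce' converts unparseable values to NaT instead of raising
df['date_time'] = pd.to_datetime(df['date_time'], errors='coerce')

print(df['date_time'].dtype)         # datetime64[ns]
print(df['date_time'].isna().sum())  # 1 (the bad value became NaT)
```

After the coercion, `df[df['date_time'].isna()]` shows exactly which rows failed to parse.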

Update:

I did what EdChum suggested, although fairly manually. Here is my guess as to why the data is malformed: one date was supposed to be 2016, but it shows 2161 instead. Does anyone know why Python wouldn't parse the datetime in this case? How can I identify all rows like this and delete them?
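A year like 2161 is actually within pandas' Timestamp range, so it parses without error; it just isn't the value you expect. One way to flag rows like this is to coerce the column and then filter on an implausible year (the 2100 cutoff below is an arbitrary choice for illustration):

```python
import pandas as pd

df = pd.DataFrame({'date_time': ['2015-09-03 17:09:54',
                                 '2161-09-24 17:38:35',   # typo'd year
                                 '2015-06-07 15:53:02']})

parsed = pd.to_datetime(df['date_time'], errors='coerce')

# flag rows that failed to parse or carry an implausible year
bad = parsed.isna() | (parsed.dt.year > 2100)
print(df[bad])    # rows to inspect or drop
df = df[~bad]     # keep only the clean rows
```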

  • Add sample data, in case there is something unique about it. Check syntax for parse_dates. Commented Jun 7, 2016 at 20:03
  • I tried the method suggested in the thread but got the error below: OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-255-255 00:00:00 Commented Jun 7, 2016 at 20:04
  • You can use the binary search approach to find the problematic record (take the 1st/2nd half, see which has problem, then split further). Commented Jun 7, 2016 at 20:04
  • I am quite new to Python; what is the binary search method? Commented Jun 7, 2016 at 20:06
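The binary-search idea from the comments can be sketched as follows: repeatedly parse half of the current window, and narrow toward whichever half raises. This is a hypothetical helper (it assumes a single bad value), not something from the thread:

```python
import pandas as pd

def find_bad_row(series):
    """Binary-search a Series for the position of a value pd.to_datetime rejects."""
    lo, hi = 0, len(series)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        try:
            pd.to_datetime(series.iloc[lo:mid], errors='raise')
            lo = mid   # first half parses cleanly; problem is in the second half
        except ValueError:
            hi = mid   # problem is in the first half
    return lo

s = pd.Series(['2015-09-03', '2015-09-24', '1-255-255', '2015-06-07'])
print(find_bad_row(s))   # 2, the position of the offending value
```

Each step halves the search window, so even a file with millions of rows needs only a few dozen parse attempts to locate the record.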

2 Answers


Try this:

import pandas as pd
df = pd.read_csv('test.csv.gz', compression='infer',
                 parse_dates=['date_time'], usecols=[0, 1, 3])
print(df.head())

#       id            date_time  posa_continent
#    0   0  2015-09-03 17:09:54               3
#    1   1  2015-09-24 17:38:35               3
#    2   2  2015-06-07 15:53:02               3
#    3   3  2015-09-14 14:49:10               3
#    4   4  2015-07-17 09:32:04               3

3 Comments

This is a neat way to load data directly from gz files, but it still doesn't quite solve my problem. Apologies, I cited date_time in the question (to stay general), but it is actually the srch_ci column (search check-in date) that had the issue. I played around with the data more and confirmed that one particular row (row id=312920) was causing all the trouble. So I deleted it and everything went fine. But thanks for the help!!!
@CWlearner, if it is one row, edit that row and add it back in.
Yeah, I think you are right. I am going to change its value to missing so it can be treated later on.
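Setting the one known-bad value to missing, as discussed above, can be done with `.loc` before parsing. A sketch on made-up data (row index 1 stands in for the real row id=312920; `srch_ci` is the column named in the comments):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'srch_ci': ['2015-09-03', '2161-09-24', '2015-06-07']})

# blank out the known-bad row instead of deleting it
df.loc[1, 'srch_ci'] = np.nan   # index 1 stands in for row id=312920

# NaN becomes NaT; errors='coerce' guards against any other stragglers
df['srch_ci'] = pd.to_datetime(df['srch_ci'], errors='coerce')
```

This keeps the row (and its other columns) in the frame, with only the date treated as missing.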

This will help you diagnose the problem. Please run this snippet and post the output of bad_rows.

import pandas as pd
import dateutil.parser

df = pd.read_csv('file.csv')
bad_rows = []
good_rows = []
for row, date in enumerate(df['date_time']):
    try:
        good_rows.append((row, dateutil.parser.parse(date)))
    except Exception as e:
        print(str(e))
        bad_rows.append((row, date))

5 Comments

I ran the code, but bad_rows seems to capture all the rows, while good_rows captures none.
edited. Please give at least a few examples of the date that could not be parsed and the corresponding Exception
@michael_j_ward, thanks for following up. I ran it again but got error again. It says "ValueError: I/O operation on closed file"
@Merlin it is the test file from kaggle expedia contest. here is the link to data: kaggle.com/c/expedia-hotel-recommendations/data
I will also try to add a few lines to show as example
