
I want to extract dates from a pandas dataframe column of URLs. Here is my code:

import dateutil.parser as dparser
import pandas as pd

df_results["URL"] = df_results["URL"].astype("str")  # String conversion
URLs = df_results["URL"].tolist()                    # List creation
for URL in URLs:                                     # Loop through list
    date = dparser.parse(URL, fuzzy=True)            # Parse date
    print(date)                                      # Print date

However, I receive a ValueError: Unknown string format:

ValueError                                Traceback (most recent call last)
<ipython-input-23-fd55da2e8e1e> in <module>()
     69 
     70 
---> 71 df_results = parse_URL(df_final) # parse 2
     72 
     73 print df_results.head()

<ipython-input-23-fd55da2e8e1e> in parse_URL(df_final)
     51     URLs = df_results["URL"].tolist()
     52     for URL in URLs:
---> 53         test = dparser.parse(URL,fuzzy=True)
     54         print test

C:\Python27\lib\site-packages\dateutil\parser.pyc in parse(timestr, parserinfo, **kwargs)
   1180         return parser(parserinfo).parse(timestr, **kwargs)
   1181     else:
-> 1182         return DEFAULTPARSER.parse(timestr, **kwargs)
   1183 
   1184 

C:\Python27\lib\site-packages\dateutil\parser.pyc in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    557 
    558         if res is None:
--> 559             raise ValueError("Unknown string format")
    560 
    561         if len(res) == 0:

ValueError: Unknown string format

I assume that the URLs are stored as some sort of hyperlink. However, df.info() shows an object dtype for URL.

Q1: How to convert a pandas column of URLs to raw string dtype?

Q2: How to extract dates from a pandas dataframe column of URLs and save them to a new column?

1 Answer

It seems you need to parse each URL's query string first, then call pd.to_datetime with errors='coerce', which returns NaT for values that cannot be parsed as datetimes:

from urllib.parse import urlsplit, parse_qs

import pandas as pd

df = pd.read_csv('data_sample.csv')

# Split each URL's query string into one column per parameter (first value only)
f = lambda x: pd.Series({k: v[0] for k, v in parse_qs(urlsplit(x).query).items()})
df_results = df['URL'].apply(f)
# Values that cannot be parsed as dates become NaT
df_results["checkinDate"] = pd.to_datetime(df_results["checkinDate"], errors='coerce')
df_results["checkoutDate"] = pd.to_datetime(df_results["checkoutDate"], errors='coerce')
print(df_results)
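
Since no sample data was posted, here is a minimal self-contained sketch of the same idea. The URLs are hypothetical, and the parameter names checkinDate and checkoutDate and the %Y-%m-%d format are assumptions, so adjust them to your data. It also shows that an object-dtype column already holds plain Python strings, so no special conversion should be needed for Q1:

```python
from urllib.parse import urlsplit, parse_qs

import pandas as pd

# Hypothetical sample URLs; the real data was not shown in the question.
df = pd.DataFrame({
    "URL": [
        "https://example.com/search?checkinDate=2017-03-01&checkoutDate=2017-03-05",
        "https://example.com/search?checkinDate=not-a-date&checkoutDate=2017-04-02",
        "https://example.com/search",  # no query string at all
    ]
})

# An object-dtype column of URLs already contains ordinary str values.
assert isinstance(df["URL"].iloc[0], str)


def query_params(url):
    """Return the first value of each query-string parameter as a dict."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


# One dict per URL becomes one row; parameters missing from a URL become NaN.
params = pd.DataFrame(df["URL"].map(query_params).tolist(), index=df.index)
# Make sure both expected columns exist even if no URL carried them.
params = params.reindex(columns=["checkinDate", "checkoutDate"])

for column in ("checkinDate", "checkoutDate"):
    # errors="coerce" turns unparseable or missing values into NaT.
    df[column] = pd.to_datetime(params[column], format="%Y-%m-%d", errors="coerce")

print(df[["checkinDate", "checkoutDate"]])
```

Building the frame from a list of dicts avoids creating one Series per row, which tends to be slower with apply on large columns.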

2 Comments

This leaves a column full of 'NaT'.
Can you add a data sample?
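
Without a data sample this is only a guess, but one possible cause of an all-NaT column is that the dates sit in the URL path rather than in query-string parameters. A rough sketch for that case follows. The URLs are hypothetical, and the year-month-day order and the separators in the pattern are assumptions:

```python
import pandas as pd

# Hypothetical URLs with the date in the path, not in the query string.
df = pd.DataFrame({
    "URL": [
        "https://example.com/news/2016/11/23/some-article",
        "https://example.com/archive/2017-01-05/report.html",
        "https://example.com/about",  # no date at all
    ]
})

# Assumed layout: 4-digit year, then month, then day, separated by "/", "_" or "-".
pattern = r"(?P<year>\d{4})[/_-](?P<month>\d{1,2})[/_-](?P<day>\d{1,2})"

# One column per named group; rows without a match are all NaN.
parts = df["URL"].str.extract(pattern)

# Rebuild an ISO-style string so a single explicit format can be used.
date_text = (
    parts["year"] + "-" + parts["month"].str.zfill(2) + "-" + parts["day"].str.zfill(2)
)

# Rows with no match, or with an impossible date, become NaT.
df["date"] = pd.to_datetime(date_text, format="%Y-%m-%d", errors="coerce")

print(df[["URL", "date"]])
```

Printing a few raw URLs next to the extracted parts is usually the quickest way to see whether the pattern matches what is actually in the column.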
