How can I clean date ranges in multiple formats using python pandas?

Question

I have a dataframe that contains some dates in mixed formats, as follows:

import pandas as pd

dates = ['Dec-03',
         '03/11/2003 - 05/04/2004',
         'Apr-04',
         '2004 - 2005',
         '01/02/2005 - 31/03/2005']

df = pd.DataFrame(dates, columns = ["date_range"])

The dates can come in three formats as shown in the example above: two years; a single month; two dates together.

I wish to find an efficient and "pythonic" way to create "start date" and "end date" columns in the dataframe with the following result:

    date_range                         start_dates  end_dates
0   Dec-03                             01/12/2003   31/12/2003
1   03/11/2003 - 05/04/2004            03/11/2003   05/04/2004
2   Apr-04                             01/04/2004   30/04/2004
3   2004 - 2005                        01/01/2004   31/12/2005
4   01/02/2005 - 31/03/2005            01/02/2005   31/03/2005

I have experimented with solutions involving df.iterrows and some if statements, but I was wondering if there is a more efficient method to solve this problem. The full dataset contains millions of rows so something that uses a vectorised function or similar would work well.

Roy2012 · Accepted Answer · 2020-06-06 15:26:06Z

5

I don't think there's a way to do this in one vectorized operation. What you can do, however, is slice the dataframe into several chunks, one per date range format. For each of these slices, you can calculate the start and end dates in a vectorized manner. Since the number of date formats is much smaller than the number of records, it should be pretty fast.
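
The slicing idea can be seen on its own with a minimal sketch that builds one boolean mask per format and checks that every row matches exactly one of them. The labels in `patterns` and the trailing `$` anchors are illustrative assumptions rather than anything from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {"date_range": ["Dec-03",
                    "03/11/2003 - 05/04/2004",
                    "Apr-04",
                    "2004 - 2005",
                    "01/02/2005 - 31/03/2005"]}
)

# One regular expression per known format (the labels are hypothetical names).
patterns = {
    "month": r"\w{3}-\d{2}$",
    "year_range": r"\d{4}\s*-\s*\d{4}$",
    "full_dates": r"\d{2}/\d{2}/\d{4}\s*-\s*\d{2}/\d{2}/\d{4}$",
}

# str.match tests the pattern from the start of every string in the column
# and returns a boolean Series, so each mask is computed in one pass.
masks = {name: df["date_range"].str.match(rx) for name, rx in patterns.items()}

# Number of rows that fall into each format.
counts = {name: int(mask.sum()) for name, mask in masks.items()}

# A row matching zero or several patterns would need separate handling.
matches_per_row = pd.concat(masks, axis=1).sum(axis=1)
print(counts)
print(df[matches_per_row != 1])
```

If the second `print` shows any rows, they are in a format that none of the patterns covers (or that more than one covers) and would otherwise be left unparsed.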

Here's an implementation:

import pandas as pd
from pandas.tseries.offsets import MonthEnd, YearEnd

# Start with empty datetime columns; each format below fills in its own rows.
df["start_time"] = pd.NaT
df["end_time"] = pd.NaT

# Format 1: a single month, e.g. "Dec-03".
mask = df.date_range.str.match(r"\w{3}-\d{2}")
df.loc[mask, "start_time"] = pd.to_datetime(df.loc[mask, "date_range"], format="%b-%y")
# The parsed date is the first of the month; MonthEnd(1) rolls it to the last day.
df.loc[mask, "end_time"] = df.loc[mask, "start_time"] + MonthEnd(1)

# Format 2: two years, e.g. "2004 - 2005".
mask = df.date_range.str.match(r"\d{4}\s*-\s*\d{4}")
parts = df.loc[mask, "date_range"].str.split("-", expand=True)
df.loc[mask, "start_time"] = pd.to_datetime(parts[0].str.strip(), format="%Y")
# The end year parses as 1 January; YearEnd(1) rolls it to 31 December.
df.loc[mask, "end_time"] = pd.to_datetime(parts[1].str.strip(), format="%Y") + YearEnd(1)

# Format 3: two full dates, e.g. "03/11/2003 - 05/04/2004".
mask = df.date_range.str.match(r"\d{2}/\d{2}/\d{4} - \d{2}/\d{2}/\d{4}")
parts = df.loc[mask, "date_range"].str.split("-", expand=True)
df.loc[mask, "start_time"] = pd.to_datetime(parts[0].str.strip(), format="%d/%m/%Y")
df.loc[mask, "end_time"] = pd.to_datetime(parts[1].str.strip(), format="%d/%m/%Y")

The result is:

                date_range start_time   end_time
0                   Dec-03 2003-12-01 2003-12-31
1  03/11/2003 - 05/04/2004 2003-11-03 2004-04-05
2                   Apr-04 2004-04-01 2004-04-30
3              2004 - 2005 2004-01-01 2005-12-31
4  01/02/2005 - 31/03/2005 2005-02-01 2005-03-31
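
As a possible variation, and not part of the code above, the same three formats could be handled with `str.extract`, which returns NaN for rows that do not match a pattern. Each format is then parsed over the whole column and the results are combined with `fillna`. The function name `parse_date_ranges` is hypothetical, and the sketch assumes the three formats from the question are the only ones expected.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd, YearEnd


def parse_date_ranges(ranges: pd.Series) -> pd.DataFrame:
    """Return start/end timestamps; rows in an unknown format stay NaT."""
    # Rows that do not match a pattern come back as NaN and parse to NaT.
    month = pd.to_datetime(
        ranges.str.extract(r"^(\w{3}-\d{2})$")[0], format="%b-%y"
    )
    years = ranges.str.extract(r"^(\d{4})\s*-\s*(\d{4})$")
    dates = ranges.str.extract(
        r"^(\d{2}/\d{2}/\d{4})\s*-\s*(\d{2}/\d{2}/\d{4})$"
    )

    # Take the first format that produced a value for each row.
    start = (
        month
        .fillna(pd.to_datetime(years[0], format="%Y"))
        .fillna(pd.to_datetime(dates[0], format="%d/%m/%Y"))
    )
    # Month and year starts are rolled forward to the end of their period.
    end = (
        (month + MonthEnd(1))
        .fillna(pd.to_datetime(years[1], format="%Y") + YearEnd(1))
        .fillna(pd.to_datetime(dates[1], format="%d/%m/%Y"))
    )
    return pd.DataFrame({"start_time": start, "end_time": end})


df = pd.DataFrame(
    {"date_range": ["Dec-03",
                    "03/11/2003 - 05/04/2004",
                    "Apr-04",
                    "2004 - 2005",
                    "01/02/2005 - 31/03/2005"]}
)
result = parse_date_ranges(df["date_range"])
print(df.join(result))
```

This version parses the full column once per format rather than only the matching slice, so it may be slower on very large inputs. In exchange, it does not fail when a format has no matching rows, and any row left as NaT can be found afterwards with `result["start_time"].isna()`.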

2 Comments

Python on Toast Over a year ago

Brilliant answer, using str.match on the whole column vs individual rows was not something I had thought of. This sped up my code by over 100x.

Roy2012 Over a year ago

Great to hear!
