Extracting and parsing dates in a pandas dataframe

Question

I am trying to convert a messy notebook with dates into a sorted date series in pandas.

0           03/25/93 Total time of visit (in minutes):\n
1                         6/18/85 Primary Care Doctor:\n
2      sshe plans to move as of 7/8/71 In-Home Servic...
3                  7 on 9/27/75 Audit C Score Current:\n
4      2/6/96 sleep studyPain Treatment Pain Level (N...
5                      .Per 7/06/79 Movement D/O note:\n
6      4, 5/18/78 Patient's thoughts about current su...
7      10/24/89 CPT Code: 90801 - Psychiatric Diagnos...
8                           3/7/86 SOS-10 Total Score:\n
9               (4/10/71)Score-1Audit C Score Current:\n
10     (5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC...
11                         4/09/75 SOS-10 Total Score:\n
12     8/01/98 Communication with referring physician...
13     1/26/72 Communication with referring physician...
14     5/24/1990 CPT Code: 90792: With medical servic...
15     1/25/2011 CPT Code: 90792: With medical servic...

I have multiple dates formats such as 04/20/2009; 04/20/09; 4/20/09; 4/3/09. And I would like to convert all these into mm/dd/yyyy to a new column.

So far I have done

df2['date']= df2['text'].str.extractall(r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,})')

Also, I do not how to extract all lines with only mm/yy or yyyy format date without interfering with the code above. Bear in mind that with the absence of day or month I would consider 1st and January as default values.

cs95 · Accepted Answer · 2020-07-28 23:10:44Z

1

You can use pd.Series.str.extract with a regex, and then apply pd.to_datetime:

pd.to_datetime(df['Text'].str.extract(
    r'(?P<Date>\d+(?:\/\d+){2})', expand=False), errors='coerce')

0    1993-03-25
1    1985-06-18
2    1971-07-08
3    1975-09-27
4    1996-02-06
5    1979-07-06
6    1978-05-18
7    1989-10-24
8    1986-03-07
9    1971-04-10
10   1985-05-11
11   1975-04-09
12   1998-08-01
13   1972-01-26
14   1990-05-24
15   2011-01-25
Name: Date, dtype: datetime64[ns]

str.extract returns a series of strings that look like this:

array(['03/25/93', '6/18/85', '7/8/71', '9/27/75', '2/6/96', '7/06/79',
       '5/18/78', '10/24/89', '3/7/86', '4/10/71', '5/11/85', '4/09/75',
       '8/01/98', '1/26/72', '5/24/1990', '1/25/2011'], dtype=object)

Regex Details

(?P<Date>\d+(?:\/\d+){2})

(?P<Date>....) - named capturing group
\d+ 1 or more digits
(?:\/\d+){2} - non-capturing group repeating twice, where
- \/ - escaped forward slash
- {2} - repeater (two times)

Regex for Missing Days

To handle optional days, a slightly modified regex is required:

(?P<Date>(?:\d+\/)?\d+/\d+)

Details

(?P<Date>....) - named capturing group
(?:\d+\/)? - nested group (non-capturing) where \d+\/ is optional.

\d+ 1 or more digits
\/ escaped forward slash

The rest is the same. Substitute this regex in place of the current one. pd.to_datetime will handle missing days.

edited Jul 28, 2020 at 23:10

answered Sep 4, 2017 at 21:48

cs95

406k106 gold badges744 silver badges797 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Sn3akyP3t3 Over a year ago

I found this answer to be useful. I do want to add a note that the capture group is absolutely necessary, but a named capture group doesn't provide value. This would be an equivalent: df['Date'] = df.Text.str.extract(r'(\d+(?:\/\d+){2})', expand=False).apply(pd.to_datetime)

cs95 Over a year ago

@Sn3akyP3t3 Improved the answer so you don't have to call apply() anymore (it's slow).

Collectives™ on Stack Overflow

Extracting and parsing dates in a pandas dataframe

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related