1

I am trying to convert a messy notebook with dates into a sorted date series in pandas.

0           03/25/93 Total time of visit (in minutes):\n
1                         6/18/85 Primary Care Doctor:\n
2      sshe plans to move as of 7/8/71 In-Home Servic...
3                  7 on 9/27/75 Audit C Score Current:\n
4      2/6/96 sleep studyPain Treatment Pain Level (N...
5                      .Per 7/06/79 Movement D/O note:\n
6      4, 5/18/78 Patient's thoughts about current su...
7      10/24/89 CPT Code: 90801 - Psychiatric Diagnos...
8                           3/7/86 SOS-10 Total Score:\n
9               (4/10/71)Score-1Audit C Score Current:\n
10     (5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC...
11                         4/09/75 SOS-10 Total Score:\n
12     8/01/98 Communication with referring physician...
13     1/26/72 Communication with referring physician...
14     5/24/1990 CPT Code: 90792: With medical servic...
15     1/25/2011 CPT Code: 90792: With medical servic...

I have multiple dates formats such as 04/20/2009; 04/20/09; 4/20/09; 4/3/09. And I would like to convert all these into mm/dd/yyyy to a new column.

So far I have done

df2['date']= df2['text'].str.extractall(r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,})')

Also, I do not how to extract all lines with only mm/yy or yyyy format date without interfering with the code above. Bear in mind that with the absence of day or month I would consider 1st and January as default values.

1 Answer 1

1

You can use pd.Series.str.extract with a regex, and then apply pd.to_datetime:

pd.to_datetime(df['Text'].str.extract(
    r'(?P<Date>\d+(?:\/\d+){2})', expand=False), errors='coerce')

0    1993-03-25
1    1985-06-18
2    1971-07-08
3    1975-09-27
4    1996-02-06
5    1979-07-06
6    1978-05-18
7    1989-10-24
8    1986-03-07
9    1971-04-10
10   1985-05-11
11   1975-04-09
12   1998-08-01
13   1972-01-26
14   1990-05-24
15   2011-01-25
Name: Date, dtype: datetime64[ns]

str.extract returns a series of strings that look like this:

array(['03/25/93', '6/18/85', '7/8/71', '9/27/75', '2/6/96', '7/06/79',
       '5/18/78', '10/24/89', '3/7/86', '4/10/71', '5/11/85', '4/09/75',
       '8/01/98', '1/26/72', '5/24/1990', '1/25/2011'], dtype=object)

Regex Details

(?P<Date>\d+(?:\/\d+){2})
  • (?P<Date>....) - named capturing group
  • \d+ 1 or more digits
  • (?:\/\d+){2} - non-capturing group repeating twice, where
    • \/ - escaped forward slash
    • {2} - repeater (two times)

Regex for Missing Days

To handle optional days, a slightly modified regex is required:

(?P<Date>(?:\d+\/)?\d+/\d+)

Details

  • (?P<Date>....) - named capturing group
  • (?:\d+\/)? - nested group (non-capturing) where \d+\/ is optional.
  • \d+ 1 or more digits
  • \/ escaped forward slash

The rest is the same. Substitute this regex in place of the current one. pd.to_datetime will handle missing days.

Sign up to request clarification or add additional context in comments.

2 Comments

I found this answer to be useful. I do want to add a note that the capture group is absolutely necessary, but a named capture group doesn't provide value. This would be an equivalent: df['Date'] = df.Text.str.extract(r'(\d+(?:\/\d+){2})', expand=False).apply(pd.to_datetime)
@Sn3akyP3t3 Improved the answer so you don't have to call apply() anymore (it's slow).

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.