Given a pandas Series like

import pandas as pd
string_1 = 'for urgent evaluation/treatment till first visit with Dr. Toney Winkler IN EIGHT WEEKS on 24 Jan 2001.'
string_2 = '03/25/93 Total time of visit (in minutes):'
string_3 = 'April 11, 1990 CPT Code: 90791: No medical services'
df = pd.Series([string_1,string_2,string_3])

each of the following statements successfully extracts the date from exactly one row:

print(df.str.extract(r'((?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2,4}))').dropna())
   0           month day year
1  03/25/93    03    25  93

print(df.str.extract(r'(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sept|Oct|Nov|Dec)[a-z\.]*)[\s\.\-\,](?P<day>\d{2})[\-\,\s]*(?P<year>\d{4})').dropna())
   month day  year
2  April  11  1990

print(df.str.extract(r'((?P<day>\d{2})\s(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sept|Oct|Nov|Dec)[a-z\.]*)[\s\.\-\,]*(?P<year>\d{4}))').dropna())
   0            day  month  year
0  24 Jan 2001  24   Jan    2001

How can these statements be combined into a single data frame like this:

     day   month  year
0    24    Jan    2001
1    25    03     93 
2    11    April  1990 

The result must keep the original indices.

2 Answers

You can use the PyPI regex module (install it with pip install regex) and join the patterns with | (OR) inside a branch reset group, (?|...):

import regex
import pandas as pd

string_1 = 'for urgent evaluation/treatment till first visit with Dr. Toney Winkler IN EIGHT WEEKS on 24 Jan 2001.'
string_2 = '03/25/93 Total time of visit (in minutes):'
string_3 = 'April 11, 1990 CPT Code: 90791: No medical services'
df = pd.Series([string_1,string_2,string_3])

pat1 = r'(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2,4})'
pat2 = r'(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sept|Oct|Nov|Dec)[a-z.]*)[\s.,-](?P<day>\d{2})[-,\s]*(?P<year>\d{4})'
pat3 = r'(?P<day>\d{2})\s(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sept|Oct|Nov|Dec)[a-z.]*)[\s.,-]*(?P<year>\d{4})'

rx = regex.compile(r"(?|{}|{}|{})".format(pat1,pat2,pat3))

empty_val = pd.Series(["","",""], index=['month','day','year'])

def extract_regex(seq):
    m = rx.search(seq)
    if m:
        # Look the groups up by name so the result does not depend on dict ordering
        return pd.Series([m.group(k) for k in ('month', 'day', 'year')], index=['month','day','year'])
    else:
        return empty_val

df2 = df.apply(extract_regex)

Output:

>>> df2
   month day  year
0    Jan  24  2001
1     03  25    93
2  April  11  1990
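
If installing regex is not an option, roughly the same idea can be sketched with the standard-library re module. The catch is that re refuses to compile a pattern that defines the same group name twice, so this sketch assumes each alternative gets its own numeric suffix (month1, month2, ...) and strips it afterwards. The names MONTHS, PATTERNS, COMBINED and extract_date_parts are made up for illustration and are not part of the answer above; the function returns a plain dict rather than a pandas object.

```python
import re

# Shared month-name fragment, copied from the patterns in the answer above.
MONTHS = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sept|Oct|Nov|Dec)[a-z.]*'

# One pattern per date layout. Group names carry a numeric suffix because
# the standard-library re module rejects duplicate group names.
PATTERNS = [
    r'(?P<month1>\d{1,2})[/-](?P<day1>\d{1,2})[/-](?P<year1>\d{2,4})',
    r'(?P<month2>' + MONTHS + r')[\s.,-](?P<day2>\d{2})[-,\s]*(?P<year2>\d{4})',
    r'(?P<day3>\d{2})\s(?P<month3>' + MONTHS + r')[\s.,-]*(?P<year3>\d{4})',
]
COMBINED = re.compile('|'.join(PATTERNS))


def extract_date_parts(text):
    """Return a dict with 'month', 'day' and 'year', or None if no date is found."""
    m = COMBINED.search(text)
    if m is None:
        return None
    # Only the groups of the alternative that matched are non-None;
    # drop the one-character suffix to recover the plain names.
    return {name[:-1]: value
            for name, value in m.groupdict().items()
            if value is not None}
```

With pandas, the dicts could then be turned into columns, for example via df.apply(extract_date_parts).apply(pd.Series), though that step is left out here to keep the sketch dependency-free.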

# Alternative using only the standard-library re module
import re
import pandas as pd

string_1 = 'for urgent evaluation/treatment till first visit with Dr. Toney Winkler IN EIGHT WEEKS on 24 Jan 2001.'
string_2 = '03/25/93 Total time of visit (in minutes):'
string_3 = 'April 11, 1990 CPT Code: 90791: No medical services'
df = pd.DataFrame([string_1,string_2,string_3])

# Patterns are tried in order; the first one that matches wins
patterns = [r'(?P<day>\d{1,2}) (?P<month>(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)) (?P<year>\d{4})',
            r'(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2,4})',
            r'(?P<month>(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sept|Oct|Nov|Dec)[a-z\.]*) (?P<day>\d{2}), (?P<year>\d{4})']


def extract_date(s):
    # Return (year, month, day) from the first matching pattern, or three Nones
    result = None, None, None
    for p in patterns:
        m = re.search(p, s)
        if m:
            result = m.group('year'), m.group('month'), m.group('day')
            break
    return result

# The new columns are added to df, so the original index is preserved
df['year'], df['month'], df['day'] = zip(*df[0].apply(extract_date))
