I have a dataframe data with 2 columns, ID and Text. The goal is to split the values in the Text column into multiple columns based on dates. Typically, a date starts a segment of the string that belongs in its own column, except when the date is at the end of the string (in that case, it is considered part of the segment that started with the preceding date).

data:
ID      Text
10      6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007
20      7/17/06-advil, qui;
10      7/19/06-ibuprofen. 8/31/06-penicilin, tramadol;
40      9/26/06-penicilin, tramadol;
91      5/23/06-penicilin, amoxicilin, tylenol;
84      10/20/06-ibuprofen, tramadol;
17      12/19/06-vit D, tramadol. 12/1/09 -6/18/10 vit D only for 5 months. 3/7/11 f/up
23      12/19/06-vit D, tramadol; 12/1/09 -6/18/10 vit D; 3/7/11 video follow-up
15      Follow up appt. scheduled
69      talk to care giver
32      12/15/06-2/16/07 everyday Follow-up; 6/8/16 discharged after 2 months
70      12/1/06?Follow up but no serious allergies
70      12/12/06-tylenol, vit D,advil; 1/26/07 scheduled surgery but had to cancel due to severe allergic reactions to advil

Expected output:

ID      Text                                                                                    Text2                                                                                   Text3
10      6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007
20      7/17/06-advil, qui;
10      7/19/06-ibuprofen.                                                                      8/31/06-penicilin, tramadol;
40      9/26/06-penicilin, tramadol;
91      5/23/06-penicilin, amoxicilin, tylenol;
84      10/20/06-ibuprofen, tramadol;
17      12/19/06-vit D, tramadol.                                                               12/1/09 -6/18/10 vit D only for 5 months.                                               3/7/11 f/up
23      12/19/06-vit D, tramadol;                                                               12/1/09 -6/18/10 vit D;                                                                 3/7/11 video follow-up
15      Follow up appt. scheduled
69      talk to care giver
32      12/15/06-2/16/07 everyday Follow-up;                                                    6/8/16 discharged after 2 months
70      12/1/06?Follow up but no serious allergies
70      12/12/06-tylenol, vit D,advil;                                                          1/26/07 scheduled surgery but had to cancel due to severe allergic reactions to advil

My code so far:

d = []
for i in data.Text:
    d = list(datefinder.find_dates(i)) #I can get the dates so far but still want to format the date values as %m/%d/%Y

if len(d) > 1:#Checks for every record that has more than 1 date
    for j in range(0,len(d)):
        i = " " + " ".join(re.split(r'[^a-z 0-9 / -]',i.lower())) + " " #cleans the text strings of any special characters
        #data.Text[j] = d[j]r'[/^(.*?)]'d[j+1]'/'#this is not working

        #The goal is for the Text column to retain the string from the first date up to before the second date. Then create a new Text1, get every value from the second date up to before the third date. And if there are more dates, create Textn and so on. 
        #Exception, if a date immediately follows a date (i.e. 12/1/09 -6/18/10) or a date ends a value string (i.e. 6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007), they should be considered to be in the same column

Any thoughts on how to make this work will save my day. Thank you!

  • Will all the relevant date formats be in mm/dd/yy format? Commented Jul 14, 2017 at 18:01
  • @Brad Solomon - It's preferable for them to be in mm/dd/yyyy. Thank you! Commented Jul 14, 2017 at 18:07
  • I mean in your input data Commented Jul 14, 2017 at 18:08
  • @Brad No. The format is inconsistent. That's one of my challenges with the dataset and the primary reason behind using the datefinder.find_dates() Commented Jul 14, 2017 at 18:10
  • There are duplicated IDs, is that normal? Commented Jul 14, 2017 at 18:41

1 Answer

There you go

from itertools import chain, starmap, zip_longest
import itertools
import re
import pandas as pd

ids = [10, 20, 10, 40, 91, 84, 17, 23, 15, 69, 32, 70, 70]

text = [
    "6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007",
    "7/17/06-advil, qui;",
    "7/19/06-ibuprofen. 8/31/06-penicilin, tramadol;",
    "9/26/06-penicilin, tramadol;",
    "5/23/06-penicilin, amoxicilin, tylenol;",
    "10/20/06-ibuprofen, tramadol;",
    "12/19/06-vit D, tramadol. 12/1/09 -6/18/10 vit D only for 5 months. 3/7/11 f/up",
    "12/19/06-vit D, tramadol; 12/1/09 -6/18/10 vit D; 3/7/11 video follow-up",
    "Follow up appt. scheduled",
    "talk to care giver",
    "12/15/06-2/16/07 everyday Follow-up; 6/8/16 discharged after 2 months",
    "12/1/06?Follow up but no serious allergies",
    "12/12/06-tylenol, vit D,advil; 1/26/07 scheduled surgery but had to cancel due to severe allergic reactions to advil"]

# Raw strings keep \d, \s and \S from being read as (invalid) string escapes.
by_date = re.compile(
    r"""((?:0?[1-9]|1[012])/(?:0?[1-9]|[12]\d|3[01])/\d\d\s*"""
    r"""(?:(?:-|to |through )\s*(?:0?[1-9]|1[012])/(?:0?[1-9]|[12]\d|3[01])/\d\d)?\s*\S)""")


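def to_items(line):
    # Offsets where a date (or date range) followed by more text begins.
    starts = [m.start() for m in by_date.finditer(line)]
    # Keep any leading text that comes before the first date.
    if not starts or starts[0] > 0:
        starts.insert(0, 0)
    # Pair each start with the next start; the last slice runs to the end.
    stops = iter(starts)
    next(stops)
    return map(line.__getitem__, starmap(slice, zip_longest(starts, stops)))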


cleaned = zip_longest(*map(to_items, text))
col_names = chain(["Text"], map("Text{}".format, itertools.count(2)))
df = pd.DataFrame(dict(zip(col_names, cleaned), ID=ids))

print(df)
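
To make the slicing in to_items easier to follow, here is a minimal sketch of the same idea written with plain list slicing. The helper name split_on_dates is hypothetical (it is not part of the answer above), and the pattern is the same by_date expression written as a raw string. The sketch only illustrates how each segment runs from one date start to the next; it is not a drop-in replacement.

```python
import re

# Same pattern as by_date above: a m/d/yy date, an optional second date
# joined by "-", "to " or "through ", then at least one non-space character.
by_date = re.compile(
    r"((?:0?[1-9]|1[012])/(?:0?[1-9]|[12]\d|3[01])/\d\d\s*"
    r"(?:(?:-|to |through )\s*(?:0?[1-9]|1[012])/(?:0?[1-9]|[12]\d|3[01])/\d\d)?\s*\S)")


def split_on_dates(line):
    """Hypothetical helper: the slicing done by to_items, spelled out."""
    starts = [m.start() for m in by_date.finditer(line)]
    if not starts or starts[0] > 0:
        starts.insert(0, 0)          # keep text before the first date
    stops = starts[1:] + [None]      # None means "slice to the end"
    return [line[a:b] for a, b in zip(starts, stops)]


line = "12/19/06-vit D, tramadol. 12/1/09 -6/18/10 vit D only for 5 months. 3/7/11 f/up"
parts = split_on_dates(line)
for part in parts:
    print(repr(part))
```

Because the segments are plain slices of the original string, joining them gives back the input unchanged, and each segment except the last keeps its trailing whitespace. Calling .strip() on each piece may be worthwhile before building the dataframe.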

6 Comments

You are a life saver. Thank you! Quick observation: I found that dates at the end of a string are still being pulled into a new column, which isn't supposed to happen. Any date at the end of a string should be considered part of that string, so it should stay in the same column. How do we get rid of such false separations?
Please see the comment above. Thank you.
@CodeLearner are you talking about a line in the records? Sorry, I don't see a date at the end of the string forming a new column. Are you using other data for testing? The regular expression has a \S at the end to make sure there is content after the date.
@frogcoder Sorry, I'm only now responding to this. To answer your question: yes, I'm using other data to test this. There were supposed to be very few false positives, but it turns out I'm getting quite a large number of them. This isn't your code; it's due to other keywords that are used to join 2 dates, i.e. instead of 01/01/09-01/12/09, it could be 01/01/09 to 01/12/09 or 01/01/09 - 01/12/09 or 01/01/09 through 01/12/09 or 01/01/09 through to 01/12/09. And these are just the ones I've seen. For all these instances, the code splits the data on the first space after the date.
@CodeLearner an easy fix is to add what you discovered into the regular expression, and add more as you find out more cases. For what you mentioned so far, replacing (?:- with (?:(?:-|to | through )* would suffice. I'll change the answer to reflect this change.
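
One possible way to cover the separators listed in the comments ("-", " - ", "to", "through", "through to") is sketched below. The names DATE, SEP, by_date_ext and count_segments are hypothetical and introduced only for illustration; the separator list is an assumption based on the cases mentioned above and would need extending for other wordings.

```python
import re

# Building blocks (hypothetical names): a m/d/yy date and the joining words.
DATE = r"(?:0?[1-9]|1[012])/(?:0?[1-9]|[12]\d|3[01])/\d\d"
SEP = r"(?:-|through\s+to\b|through\b|to\b)"

# A date, optionally followed by a separator and a second date,
# then at least one non-space character of content.
by_date_ext = re.compile(rf"{DATE}\s*(?:{SEP}\s*{DATE})?\s*\S")


def count_segments(line):
    """Count the pieces the line would be split into (illustrative helper)."""
    starts = [m.start() for m in by_date_ext.finditer(line)]
    if not starts or starts[0] > 0:
        starts.insert(0, 0)
    return len(starts)


samples = [
    "01/01/09-01/12/09 daily",
    "01/01/09 - 01/12/09 daily",
    "01/01/09 to 01/12/09 daily",
    "01/01/09 through 01/12/09 daily",
    "01/01/09 through to 01/12/09 daily",
]
for s in samples:
    print(count_segments(s), s)
```

Each sample contains a single date range, so each one should stay in a single segment rather than being split at the second date.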