How to use date to Split a dataframe column into multiple columns in python

Question

I have a dataframe named data with 2 columns, ID and Text. The goal is to split the values in the Text column into multiple columns based on dates. Typically, each date starts a substring that should go in its own column, except when the date is at the end of the string (in that case, it's considered part of the substring that started with the preceding date).

data:
ID      Text
10      6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007
20      7/17/06-advil, qui;
10      7/19/06-ibuprofen. 8/31/06-penicilin, tramadol;
40      9/26/06-penicilin, tramadol;
91      5/23/06-penicilin, amoxicilin, tylenol;
84      10/20/06-ibuprofen, tramadol;
17      12/19/06-vit D, tramadol. 12/1/09 -6/18/10 vit D only for 5 months. 3/7/11 f/up
23      12/19/06-vit D, tramadol; 12/1/09 -6/18/10 vit D; 3/7/11 video follow-up
15      Follow up appt. scheduled
69      talk to care giver
32      12/15/06-2/16/07 everyday Follow-up; 6/8/16 discharged after 2 months
70      12/1/06?Follow up but no serious allergies
70      12/12/06-tylenol, vit D,advil; 1/26/07 scheduled surgery but had to cancel due to severe allergic reactions to advil

Expected output:

ID      Text                                                                                    Text2                                                                                   Text3
10      6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007
20      7/17/06-advil, qui;
10      7/19/06-ibuprofen.                                                                      8/31/06-penicilin, tramadol;
40      9/26/06-penicilin, tramadol;
91      5/23/06-penicilin, amoxicilin, tylenol;
84      10/20/06-ibuprofen, tramadol;
17      12/19/06-vit D, tramadol.                                                               12/1/09 -6/18/10 vit D only for 5 months.                                               3/7/11 f/up
23      12/19/06-vit D, tramadol;                                                               12/1/09 -6/18/10 vit D;                                                                 3/7/11 video follow-up
15      Follow up appt. scheduled
69      talk to care giver
32      12/15/06-2/16/07 everyday Follow-up;                                                    6/8/16 discharged after 2 months
70      12/1/06?Follow up but no serious allergies
70      12/12/06-tylenol, vit D,advil;                                                          1/26/07 scheduled surgery but had to cancel due to severe allergic reactions to advil

My code so far:

import re
import datefinder

for i in data.Text:
    d = list(datefinder.find_dates(i))  # I can get the dates so far but still want to format the date values as %m/%d/%Y

    if len(d) > 1:  # checks for every record that has more than 1 date
        for j in range(len(d)):
            i = " " + " ".join(re.split(r'[^a-z 0-9 / -]', i.lower())) + " "  # cleans the text strings of any special characters
            #data.Text[j] = d[j]r'[/^(.*?)]'d[j+1]'/'  # this is not working

# The goal is for the Text column to retain the string from the first date up to before the second date. Then create a new Text2 column holding everything from the second date up to before the third date. And if there are more dates, create Textn and so on.
# Exception: if a date immediately follows a date (e.g. 12/1/09 -6/18/10) or a date ends a value string (e.g. 6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007), they should stay in the same column.

Any thoughts on how to make this work will save my day. Thank you!

@Brad Solomon - It's preferable for them to be in mm/dd/yyyy. Thank you!
– CodeLearner, Commented Jul 14, 2017 at 18:07
@Brad No. The format is inconsistent. That's one of my challenges with the dataset and the primary reason behind using datefinder.find_dates().
– CodeLearner, Commented Jul 14, 2017 at 18:10

frogcoder · Accepted Answer · 2017-07-19 14:49:30Z

There you go

from itertools import chain, starmap, zip_longest
import itertools
import re
import pandas as pd

ids = [10, 20, 10, 40, 91, 84, 17, 23, 15, 69, 32, 70, 70]

text = [
    "6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007",
    "7/17/06-advil, qui;",
    "7/19/06-ibuprofen. 8/31/06-penicilin, tramadol;",
    "9/26/06-penicilin, tramadol;",
    "5/23/06-penicilin, amoxicilin, tylenol;",
    "10/20/06-ibuprofen, tramadol;",
    "12/19/06-vit D, tramadol. 12/1/09 -6/18/10 vit D only for 5 months. 3/7/11 f/up",
    "12/19/06-vit D, tramadol; 12/1/09 -6/18/10 vit D; 3/7/11 video follow-up",
    "Follow up appt. scheduled",
    "talk to care giver",
    "12/15/06-2/16/07 everyday Follow-up; 6/8/16 discharged after 2 months",
    "12/1/06?Follow up but no serious allergies",
    "12/12/06-tylenol, vit D,advil; 1/26/07 scheduled surgery but had to cancel due to severe allergic reactions to advil"]

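# Raw strings keep \d, \s and \S from being read as (invalid) string escapes.
# A match is a date, or a date range joined by "-", "to " or "through ",
# followed by at least one non-whitespace character.
by_date = re.compile(
    r"""((?:0?[1-9]|1[012])/(?:0?[1-9]|[12]\d|3[01])/\d\d\s*"""
    r"""(?:(?:-|to |through )\s*(?:0?[1-9]|1[012])/(?:0?[1-9]|[12]\d|3[01])/\d\d)?\s*\S)""")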


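def to_items(line):
    # Offsets where a date (or date range) followed by content begins
    starts = [m.start() for m in by_date.finditer(line)]
    # Keep any leading text, or the whole line when it contains no dates
    if not starts or starts[0] > 0:
        starts.insert(0, 0)
    # Pair each start with the next one; the last slice runs to the end of the line
    stops = iter(starts)
    next(stops)
    return map(line.__getitem__, starmap(slice, zip_longest(starts, stops)))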


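# Transpose the per-row pieces into columns, padding shorter rows with None
cleaned = zip_longest(*map(to_items, text))
# Column names: Text, Text2, Text3, ...
col_names = chain(["Text"], map("Text{}".format, itertools.count(2)))
# ID goes first so the column order matches the expected output
df = pd.DataFrame({"ID": ids, **dict(zip(col_names, cleaned))})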

print(df)
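
The question also asks for the dates to end up as mm/dd/yyyy. The following is only a sketch of one way that might be done on the split pieces, not part of the answer above. The names SHORT_DATE and normalize_dates are hypothetical, and it assumes every short date is written month/day/two-digit-year; the century is chosen by Python's %y rule (00-68 become 20xx, 69-99 become 19xx), which may not suit every dataset.

```python
import re
from datetime import datetime

# Hypothetical helper pattern: m/d/yy with a two-digit year only.
SHORT_DATE = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{2})\b")


def normalize_dates(text):
    """Rewrite m/d/yy dates in text as mm/dd/yyyy; leave everything else alone."""
    def repl(match):
        try:
            parsed = datetime.strptime(match.group(1), "%m/%d/%y")
        except ValueError:
            # Looks like a date but is not a valid calendar day: keep the original text
            return match.group(0)
        return parsed.strftime("%m/%d/%Y")

    return SHORT_DATE.sub(repl, text)


print(normalize_dates("12/1/09 -6/18/10 vit D; 3/7/11 f/up"))
# 12/01/2009 -06/18/2010 vit D; 03/07/2011 f/up
```

Partial dates such as 11/2007 have no day component, so this pattern does not touch them. Applying it to the frame could be done column by column, for example with df["Text"].map(normalize_dates) on the non-null values.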

edited Jul 19, 2017 at 14:49

answered Jul 14, 2017 at 19:57

frogcoder

6 Comments

CodeLearner Over a year ago

You are a life saver. Thank you! Quick observation: I found dates at the end of a string still being pulled into a new column, which isn't supposed to happen. Any date at the end of a string should be considered part of that string, so it should stay in the same column. How do we get rid of such false separation?

CodeLearner Over a year ago

Please see the comment above. Thank you.

frogcoder Over a year ago

@CodeLearner are you talking about a line in the records? Sorry, I don't see the date at the end of the string forming a new column. Are you using other data for testing? The regular expression used has a \S at the end to make sure there is content after the date.
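
To illustrate the trailing \S point, here is a small sketch using a simplified stand-in for the answer's pattern (the name date_then_text and the two sample strings are made up for this example). A date only counts as the start of a new piece when some non-whitespace character follows it.

```python
import re

# Simplified stand-in: an m/d/yy date, optional whitespace, then one
# non-whitespace character that proves something follows the date.
date_then_text = re.compile(r"\d{1,2}/\d{1,2}/\d{2}\s*\S")

mid_line = "7/19/06-ibuprofen. 8/31/06-penicilin"
end_of_line = "7/19/06-ibuprofen until 8/31/06"

# Both dates are followed by content, so both are split points.
print([m.start() for m in date_then_text.finditer(mid_line)])     # [0, 19]
# The second date ends the string, so only the first one matches.
print([m.start() for m in date_then_text.finditer(end_of_line)])  # [0]
```

Two cases are worth checking against real data: a trailing date followed by punctuation (for example "8/31/06.") still matches because the period is non-whitespace, and a trailing date with a four-digit year (for example "8/31/2006") matches because the third digit of the year satisfies \S.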

CodeLearner Over a year ago

@frogcoder Sorry I'm only now responding to this. To answer your question: yes, I'm using other data to test this. It's supposed to be a very low rate of false positives, but it turns out I'm getting quite a large number of FPs. This isn't your code; it's due to some other keywords that are used for joining 2 dates, i.e. instead of 01/01/09-01/12/09, it could be 01/01/09 to 01/12/09 or 01/01/09 - 01/12/09 or 01/01/09 through 01/12/09 or 01/01/09 through to 01/12/09. And these are just the ones I've seen. For all these instances, the code splits the data on the first space after the date.

frogcoder Over a year ago

@CodeLearner an easy fix is to add what you discovered into the regular expression, and add more as you find out more cases. For what you mentioned so far, replacing (?:- with (?:(?:-|to | through )* would suffice. I'll change the answer to reflect this change.
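
A possible way to write the repeatable connector suggested in this comment is sketched below. It is an illustration rather than the answer's exact pattern: the names DATE and CONNECTOR are introduced here, and the connector list is assumed to be just the variants mentioned in the thread. Allowing the connector to repeat is what lets "through to" count as a single range.

```python
import re

DATE = r"(?:0?[1-9]|1[012])/(?:0?[1-9]|[12]\d|3[01])/\d\d"
# Assumed connector words, taken from the variants listed in the comments.
CONNECTOR = r"(?:-|through|to)"

# A date, optionally followed by one or more connectors and a second date,
# then at least one non-whitespace character of content.
by_date = re.compile(rf"{DATE}\s*(?:(?:{CONNECTOR}\s*)+{DATE})?\s*\S")

samples = [
    "01/01/09-01/12/09 vit D",
    "01/01/09 to 01/12/09 vit D",
    "01/01/09 - 01/12/09 vit D",
    "01/01/09 through 01/12/09 vit D",
    "01/01/09 through to 01/12/09 vit D",
]
for s in samples:
    # Each range is treated as one unit, so the only split point is offset 0.
    print([m.start() for m in by_date.finditer(s)])
```

New connector words can be added to CONNECTOR as they turn up in the data; longer alternatives should come before shorter ones that share a prefix.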
