I have a dataframe data with two columns, ID and Text. The goal is to split the values in the Text column into multiple columns based on dates. Typically, each date starts a new segment of the string, and each segment should go into its own column. The exception is a date at the end of the string: in that case it is considered part of the segment that started with the preceding date.
data:
ID Text
10 6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007
20 7/17/06-advil, qui;
10 7/19/06-ibuprofen. 8/31/06-penicilin, tramadol;
40 9/26/06-penicilin, tramadol;
91 5/23/06-penicilin, amoxicilin, tylenol;
84 10/20/06-ibuprofen, tramadol;
17 12/19/06-vit D, tramadol. 12/1/09 -6/18/10 vit D only for 5 months. 3/7/11 f/up
23 12/19/06-vit D, tramadol; 12/1/09 -6/18/10 vit D; 3/7/11 video follow-up
15 Follow up appt. scheduled
69 talk to care giver
32 12/15/06-2/16/07 everyday Follow-up; 6/8/16 discharged after 2 months
70 12/1/06?Follow up but no serious allergies
70 12/12/06-tylenol, vit D,advil; 1/26/07 scheduled surgery but had to cancel due to severe allergic reactions to advil
Expected output:
ID Text Text2 Text3
10 6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007
20 7/17/06-advil, qui;
10 7/19/06-ibuprofen. 8/31/06-penicilin, tramadol;
40 9/26/06-penicilin, tramadol;
91 5/23/06-penicilin, amoxicilin, tylenol;
84 10/20/06-ibuprofen, tramadol;
17 12/19/06-vit D, tramadol. 12/1/09 -6/18/10 vit D only for 5 months. 3/7/11 f/up
23 12/19/06-vit D, tramadol; 12/1/09 -6/18/10 vit D; 3/7/11 video follow-up
15 Follow up appt. scheduled
69 talk to care giver
32 12/15/06-2/16/07 everyday Follow-up; 6/8/16 discharged after 2 months
70 12/1/06?Follow up but no serious allergies
70 12/12/06-tylenol, vit D,advil; 1/26/07 scheduled surgery but had to cancel due to severe allergic reactions to advil
My code so far:
import re
import datefinder

for i in data.Text:
    # I can get the dates so far, but I still want to format the date values as %m/%d/%Y
    d = list(datefinder.find_dates(i))
    if len(d) > 1:  # only records that contain more than one date
        # clean the text string of any special characters (done once per record)
        i = " " + " ".join(re.split(r'[^a-z0-9 /-]', i.lower())) + " "
        for j in range(len(d)):
            # data.Text[j] = d[j]r'[/^(.*?)]'d[j+1]'/'  # this is not working
            pass
# The goal is for the Text column to retain the string from the first date up to (but not including) the second date. Then create a new column Text2 that holds everything from the second date up to the third date. If there are more dates, create Text3 and so on.
# Exception: if a date immediately follows a date (e.g. 12/1/09 -6/18/10), or a date ends the string (e.g. 6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007), both should stay in the same column.
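To make the rules above concrete, here is a minimal sketch of the splitting logic using a plain regular expression rather than datefinder. It rests on assumptions that are not confirmed by the data: the DATE pattern assumes dates always look like m/d/yy or m/d/yyyy (so 11/2007 is not treated as a date), the helper name split_on_dates is hypothetical, and a date is treated as part of a range only when nothing but spaces or hyphens separates it from the previous date.

```python
import re

# Assumed date shape: month/day/year with a 2- or 4-digit year, e.g. 6/26/06.
# The lookarounds keep the pattern from matching inside a longer number or path.
DATE = re.compile(r"(?<![\d/])\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})(?![\d/])")


def split_on_dates(text):
    """Split text into segments, each starting at a date that opens a new segment.

    The first segment always starts at the beginning of the text. A later date
    does NOT open a new segment when it directly follows another date (a range
    such as "12/1/09 -6/18/10") or when it is the last thing in the string.
    """
    text = text.strip()
    matches = list(DATE.finditer(text))
    cuts = []
    for prev, cur in zip(matches, matches[1:]):
        gap = text[prev.end():cur.start()]
        # Only spaces/hyphens between two dates: treat them as one date range.
        is_range = gap.strip(" -") == ""
        # Nothing but punctuation after the date: it ends the string.
        is_trailing = text[cur.end():].strip(" .;,") == ""
        if not (is_range or is_trailing):
            cuts.append(cur.start())
    bounds = [0] + cuts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


# Try it on two of the sample records.
for sample in [
    "7/19/06-ibuprofen. 8/31/06-penicilin, tramadol;",
    "12/15/06-2/16/07 everyday Follow-up; 6/8/16 discharged after 2 months",
]:
    print(split_on_dates(sample))
```

If this behaves as intended on the real data, the resulting lists could presumably be expanded into columns with something like pd.DataFrame(data["Text"].apply(split_on_dates).tolist()), which pads shorter rows with missing values; the new columns would still need to be renamed to Text, Text2, Text3 and joined back to ID.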
Any thoughts on how to make this work would save my day. Thank you!