I have a bunch of strings entered by users, each made up of one or more comments concatenated together. When there were comments on multiple days, users sometimes prefixed each comment with a date. I'm trying to find a way to split out each date and its corresponding comment. The raw text might look like this:
raw_text = ['3/30: The dog is red. 4/01: The dog is blue', 'there is a green door', '3-25:Foobar baz']
I would like to transform that text to:
import numpy as np
import pandas as pd
df = pd.DataFrame([[0,'3/30','The dog is red.'],[0,'4/01','The dog is blue'],[1,np.nan,'there is a green door'],[2,'3-25','Foobar baz']],columns = ['row_id','date','text'])
print(df)
   row_id  date                   text
0       0  3/30        The dog is red.
1       0  4/01        The dog is blue
2       1   NaN  there is a green door
3       2  3-25             Foobar baz
I think what I need to do is find the colons, then work back to the first number before each colon to identify the dates (sometimes users separate month and day with / and sometimes with -).
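To make the idea concrete, here is a rough, untuned sketch of one way it might work. It assumes every date is one or two digits, a / or -, one or two digits, and then a colon; any other date format would slip through. The names DATE_RE and split_comments are made up for illustration, and missing dates are marked with None rather than NaN to keep the sketch free of dependencies.

```python
import re

# Assumed date shape: 1-2 digits, '/' or '-', 1-2 digits, then a colon.
# The capture group makes re.split keep the matched dates in its result.
DATE_RE = re.compile(r'(\d{1,2}[/-]\d{1,2}):\s*')

def split_comments(raw_text):
    """Return (row_id, date, text) tuples; date is None when no date was found."""
    rows = []
    for row_id, entry in enumerate(raw_text):
        # parts looks like [text before first date, date1, text1, date2, text2, ...]
        parts = DATE_RE.split(entry)
        leading = parts[0].strip()
        if leading:
            # Text that is not preceded by any date.
            rows.append((row_id, None, leading))
        for date, text in zip(parts[1::2], parts[2::2]):
            rows.append((row_id, date, text.strip()))
    return rows

raw_text = ['3/30: The dog is red. 4/01: The dog is blue',
            'there is a green door',
            '3-25:Foobar baz']
rows = split_comments(raw_text)
```

The resulting list of tuples could then be passed to pd.DataFrame(rows, columns=['row_id', 'date', 'text']). I'm not sure how well this holds up if a comment itself contains something like 1/2: that is not meant as a date.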
Any ideas on how to approach this with regex would be appreciated - it's beyond my simple split/findall knowledge.
Thanks!
