I have a bunch of user-entered strings, each made up of several comments concatenated together. Users sometimes prefixed a comment with a date when the comments were made on different days. I'm trying to find a way to split out each date and its corresponding comment. The text might look like this:

raw_text = ['3/30: The dog is red. 4/01: The dog is blue', 'there is a green door', '3-25:Foobar baz'] 

I would like to transform that text to:

df = pd.DataFrame([[0,'3/30','The dog is red.'],[0,'4/01','The dog is blue'],[1,np.nan,'there is a green door'],[2,'3-25','Foobar baz']], columns=['row_id','date','text'])

print(df)

   row_id  date                   text
0       0  3/30        The dog is red.
1       0  4/01        The dog is blue
2       1   NaN  there is a green door
3       2  3-25             Foobar baz

I think what I need to do is find the colons, then work back to the first number before each colon to identify the dates (users sometimes separate month and day with / and sometimes with -).

Any ideas on how to approach this with regex would be appreciated - it's beyond my simple split/findall knowledge.

Thanks!

2 Answers

I do not know regex very well (so there is probably a better solution), but this seems to work:

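import pandas as pd

# sample list (includes a date inside a comment, '5/15.', that is not followed by a colon)
raw_text = ['10-30: The dog is red until 5/15. 4/01: The dog is blue', 'there is a green door',
            '3-25:Foobar baz', '11-25:Foobar baz. 12/20: something else']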

# create regex (e.g., the variable 'n' in the comment below represents a number)
# if 'nn/nn' OR 'nn-nn' OR ' n-nn' OR ' n/nn' OR ' nn-nn' OR ' nn/nn' OR string starts with a number
regex = r'(?=\d\d/\d\d:)|(?=\d\d-\d\d:)|(?= \d-\d\d:)|(?= \d/\d\d:)|(?= \d\d-\d\d:)|(?= \d\d/\d\d:)|(?=^\d)'
# if string starts with alpha characters or there is a ':'
regex2 = r'(?=^\D)|:'

# create a Series by splitting on regex and explode
s = pd.DataFrame(raw_text)[0].str.split(regex).explode()
# boolean indexing to remove blanks
s2 = s[(s != '') & (s != ' ')]

# strip leading or trailing white space then split on regex2
df = s2.str.strip().str.split(regex2, expand=True).reset_index()
# rename columns
df.columns = ['row_id', 'date', 'text']


   row_id   date                         text
0       0  10-30   The dog is red until 5/15.
1       0   4/01              The dog is blue
2       1               there is a green door
3       2   3-25                   Foobar baz
4       3  11-25                  Foobar baz.
5       3  12/20               something else
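
For comparison, the lookahead alternatives in the regex above can be collapsed into a single capturing pattern. This is only a sketch under stated assumptions, not the method used in this answer: it assumes a date is one or two digits, then / or -, then one or two digits, immediately followed by a colon. The names DATE_RE and split_comments are hypothetical. It relies on the fact that Python's re.split returns the captured groups between the split pieces.

```python
import re
import pandas as pd

# Assumed date shape: 1-2 digits, '/' or '-', 1-2 digits, then ':'.
# The lookbehind keeps the match from starting in the middle of a longer number.
DATE_RE = re.compile(r'(?<!\d)(\d{1,2}[/-]\d{1,2}):')

def split_comments(raw_text):
    """Return one row per (row_id, date, text) found in each input string."""
    rows = []
    for row_id, line in enumerate(raw_text):
        # With one capture group, re.split yields
        # [text_before_first_date, date1, text1, date2, text2, ...]
        parts = DATE_RE.split(line)
        if parts[0].strip():
            # Text that is not preceded by any date
            rows.append((row_id, None, parts[0].strip()))
        for date, text in zip(parts[1::2], parts[2::2]):
            rows.append((row_id, date, text.strip()))
    return pd.DataFrame(rows, columns=['row_id', 'date', 'text'])

raw_text = ['10-30: The dog is red until 5/15. 4/01: The dog is blue',
            'there is a green door', '3-25:Foobar baz']
print(split_comments(raw_text))
```

Because the pattern requires the trailing colon, a date-like token inside a comment (such as '5/15.') stays in the text. Undated comments get None in the date column, which pandas treats as missing.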

3 Comments

Close! has to be tolerant of dates in comment strings that aren't demarcated with the colon as the start of the text. If we change raw data to: raw_text = ['10-30: The dog is red until 5/15. 4/01: The dog is blue', 'there is a green door', '3-25:Foobar baz', '11-25:Foobar baz. 12/20: something else'] it breaks
Fixed it by changing your regex to: regex = r'(?=\d\d/\d\d:)|(?=\d\d-\d\d:)|(?= \d-\d\d:)|(?= \d/\d\d:)|(?= \d\d-\d\d:)|(?= \d\d/\d\d:)|(?=^\d:)' - can you edit answer? I'll give credit
Just updated the answer, but you got it before I could correct.

Data

import numpy as np
import pandas as pd

df=pd.DataFrame({'raw_text':['3/30: The dog is red.', '4/01: The dog is blue', 'there is a green door', '3-25:Foobar baz']})
df

Create date column

df['date']=df.raw_text.str.extract(r"([\d+\/\-+\d+]+(?=\:))")
df

Create text column

df['text']=df.raw_text.str.extract(r"((?:-)?[^\s:][A-Za-z\s]+[^s])", expand=True)
df
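
The date and text can also be pulled out in a single pass with named groups. This is a minimal sketch, assuming a date is one or two digits, then / or -, then one or two digits, followed by a colon at the start of the string. The names pattern and out are illustrative and not part of the steps above.

```python
import pandas as pd

df = pd.DataFrame({'raw_text': ['3/30: The dog is red.', '4/01: The dog is blue',
                                'there is a green door', '3-25:Foobar baz']})

# Optional leading date followed by ':', then optional whitespace, then the rest as text.
pattern = r'^(?:(?P<date>\d{1,2}[/-]\d{1,2}):)?\s*(?P<text>.*)$'

# str.extract returns one column per named group; rows without a date get NaN there.
out = df.join(df['raw_text'].str.extract(pattern))
print(out)
```

Since the text group simply takes everything after the optional date, it does not depend on which characters the comment contains or ends with.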

Create a temporary index column by matching the text 'The dog' (stored in k)

k = 'The dog'
dicto = {k: 0}
df['index']=df['raw_text'].str.extract('('+ k + ')', expand=False).map(dicto)
df

Populate row_id using the index column

df['row_id']=df['index'].isna().astype('int64')

Mask the rows containing 'The dog' and number the remaining rows incrementally

m=df['row_id']!=0
df.loc[m,'row_id']=np.arange(start=1, stop=3, step=1)  # note: stop may need to be increased if df is longer
df.drop(columns=['index'], inplace=True)
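
The hard-coded stop=3 can be derived from the mask itself so it does not need adjusting for longer frames. This is a hedged sketch, assuming the same rule as above (rows containing 'The dog' get 0, all other rows are numbered 1, 2, ... in order). The intermediate row_id array is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'raw_text': ['3/30: The dog is red.', '4/01: The dog is blue',
                                'there is a green door', '3-25:Foobar baz']})

k = 'The dog'
# True for rows that do NOT contain the literal text k
m = ~df['raw_text'].str.contains(k, regex=False)

# Start with zeros, then number the masked rows 1..m.sum() in their original order
row_id = np.zeros(len(df), dtype='int64')
row_id[m.to_numpy()] = np.arange(1, m.sum() + 1)
df['row_id'] = row_id
print(df)
```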

