
I have a DataFrame with text in one column. I'm trying to find two strings within each row of that column, and then slice the row's text between those two strings to get a substring. Something like this:

startinds = df[column].str.find("First Event = ")
endinds   = df[column].str.find("\nLast Event = ")

df["first_timestamp"] = df[column].str.slice(startinds,endinds)

Now this doesn't work because startinds and endinds are series, so I can't use them as indices for slicing the strings in column.

Does anyone know a way to access the per-row values so I can take the substring on each row?

Example Input:

    Data
0   "Blahblah
     First Event = 09/20/2017 12:00:00
     Last Event = 09/20/2017 13:00:00
     Blahblahblah"
1   "Blahblahblahblah
     Blahablahblah
     First Event = 09/20/2017 12:30:00
     Last Event = 09/20/2017 12:45:00
     Blahblahblah"

Output:

    first_timestamp
0   "First Event = 09/20/2017 12:00:00"
1   "First Event = 09/20/2017 12:30:00"
  • It's an open issue on GitHub. You'll most likely have to do it manually. Commented Sep 20, 2017 at 14:01
  • Do "First Event = " + df.Data.str.extract('(?<=First Event = )(.*)(?=\\\\nLast Event)', expand=False)? Commented Sep 20, 2017 at 14:02

2 Answers


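To complete your slicing approach, store startinds and endinds as columns in df, then use apply with a lambda across each row (axis=1) to slice that row's string with its own indices. Note that "\\n" in the code matches a literal backslash followed by n, which is how the sample data was originally written; if the data contains real newline characters, use "\n" instead.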

df['startinds'] = df['Data'].str.find("First Event = ")
df['endinds'] = df['Data'].str.find("\\nLast Event = ")

df.apply(lambda x: str(x['Data'])[x['startinds']:x['endinds']], axis=1)

Output:

0    First Event = 09/20/2017 12:00:00
1    First Event = 09/20/2017 12:30:00
dtype: object
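
For reference, here is a minimal, self-contained sketch of the same row-wise slicing idea, assuming the data contains real newline characters. The helper name slice_between and the third sample row (which has no "First Event" line) are hypothetical and only there for illustration. Because str.find returns -1 when the marker is missing, slicing with that value directly would silently return the wrong text, so this sketch returns None in that case instead.

```python
import pandas as pd

# Sample data modeled on the question. The third row is a hypothetical
# case in which the "First Event" marker does not appear at all.
df = pd.DataFrame({
    "Data": [
        "Blahblah\nFirst Event = 09/20/2017 12:00:00\n"
        "Last Event = 09/20/2017 13:00:00\nBlahblahblah",
        "Blahblahblahblah\nBlahablahblah\nFirst Event = 09/20/2017 12:30:00\n"
        "Last Event = 09/20/2017 12:45:00\nBlahblahblah",
        "Blahblah\nno events here",
    ]
})


def slice_between(text, start_marker="First Event = ", end_marker="\nLast Event = "):
    """Return text from start_marker up to end_marker, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    # Search for the end marker only after the start marker.
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]


# Apply the helper to each string in the column.
df["first_timestamp"] = df["Data"].apply(slice_between)
print(df["first_timestamp"])
```

Using Series.apply on the one column avoids storing the intermediate index columns in df; whether that is preferable to the two-column version above is a matter of taste.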

3 Comments

My bad. The \n is a newline character. I just threw them into the sample data instead of typing actual newlines, but it's not a literal backslash. I've edited the original.
A small doubt: is First Event always on the second line?
No. It can be anywhere. Sometimes it might not actually be in the data. I've realized I'll have to use the regex solution, because this string slicing doesn't work when the keyword doesn't show up.

Not unlike the answer in the comments, this approach with Series.str.extract should work:

df['first_timestamp'] = df['Data'].str.extract(r'(First Event = .+)', expand=False)

#                                                 Data  \
# 0  Blahblah\nFirst Event = 09/20/2017 12:00:00\nL...   
# 1  Blahblahblahblah\nFirst Event = 09/20/2017 12:...   
# 
#                      first_timestamp  
# 0  First Event = 09/20/2017 12:00:00  
# 1  First Event = 09/20/2017 12:30:00

The pattern '(First Event = .+)' captures a group (i.e. ()) with "First Event = " followed by one or more characters (i.e. .+), stopping at a newline (the . character matches anything except a newline).
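
As a rough, self-contained sketch of how this pattern behaves, assuming the data contains real newline characters: the third row below is a hypothetical addition with no "First Event" line, included to show that str.extract yields a missing value where the pattern does not match. Passing expand=False is a choice made here so that a Series is returned for the single capture group.

```python
import pandas as pd

# Sample data modeled on the question; the third row is hypothetical
# and contains no "First Event" line.
df = pd.DataFrame({
    "Data": [
        "Blahblah\nFirst Event = 09/20/2017 12:00:00\n"
        "Last Event = 09/20/2017 13:00:00\nBlahblahblah",
        "Blahblahblahblah\nBlahablahblah\nFirst Event = 09/20/2017 12:30:00\n"
        "Last Event = 09/20/2017 12:45:00\nBlahblahblah",
        "Blahblah\nno events here",
    ]
})

# ".+" matches up to, but not including, the next "\n".
# expand=False returns a Series because there is one capture group.
df["first_timestamp"] = df["Data"].str.extract(r"(First Event = .+)", expand=False)
print(df["first_timestamp"])
```

Rows without a match come back as missing values rather than as a wrongly sliced string, which is the main practical difference from the index-based slicing approach.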

3 Comments

@andraiamatrix the . character in regular expressions matches anything except a line break (so .+ matches one or more of anything except a line break). Based on your updated question, it looks like df['Data'].str.extract('(First Event = .+)') will capture your first_timestamp group. I'll update my answer.
So I noticed .+ stops at a newline, but it doesn't stop at a carriage-return, \r (which it turns out is what is in my data). Is there something that will stop at either? I tried (First Event = .+)[\r\n] but that didn't stop the carriage-returns from appearing in my output.
Instead of using ., can you try this? df['Data'].str.extract('(First Event = [^\n\r]+)')
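
To illustrate the carriage-return point, here is a small sketch using Python's re module directly, on the assumption that pandas' str.extract applies the same regex rules. The sample string with \r\n line endings is made up for the example. It also suggests why (First Event = .+)[\r\n] did not help: the greedy .+ consumes the \r itself, and the trailing class then matches the \n.

```python
import re

# Hypothetical sample text using Windows-style "\r\n" line endings.
text = (
    "Blahblah\r\n"
    "First Event = 09/20/2017 12:00:00\r\n"
    "Last Event = 09/20/2017 13:00:00\r\n"
)

# "." excludes only "\n", so the "\r" before it ends up in the capture.
with_dot = re.search(r"(First Event = .+)", text).group(1)

# Greedy ".+" still swallows "\r"; the trailing class then matches "\n".
with_trailing_class = re.search(r"(First Event = .+)[\r\n]", text).group(1)

# A negated class excluding both characters stops before the "\r".
with_negated_class = re.search(r"(First Event = [^\r\n]+)", text).group(1)

print(repr(with_dot))             # 'First Event = 09/20/2017 12:00:00\r'
print(repr(with_trailing_class))  # 'First Event = 09/20/2017 12:00:00\r'
print(repr(with_negated_class))   # 'First Event = 09/20/2017 12:00:00'
```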
