Slicing pandas column using values from another column

Question

I have a DataFrame with text in one column. I'm trying to find two strings within each row of that column, then slice the row's text between them to get a substring. Something like this:

startinds = df[column].str.find("First Event = ")
endinds   = df[column].str.find("\nLast Event = ")

df["first_timestamp"] = df[column].str.slice(startinds,endinds)

This doesn't work because startinds and endinds are Series, and str.slice expects scalar start and stop values, so I can't use them as per-row indices for slicing the strings in the column.

Does anyone know a way to use the per-row values to take the substring in each row?

Example Input:

    Data
0   "Blahblah
     First Event = 09/20/2017 12:00:00
     Last Event = 09/20/2017 13:00:00
     Blahblahblah"
1   "Blahblahblahblah
     Blahablahblah
     First Event = 09/20/2017 12:30:00
     Last Event = 09/20/2017 12:45:00
     Blahblahblah"

Output:

    first_timestamp
0   "First Event = 09/20/2017 12:00:00"
1   "First Event = 09/20/2017 12:30:00"

It's an open issue on GitHub. You'll most likely have to do it manually.
– IanS, Commented Sep 20, 2017 at 14:01
Do "First Event = " + df.Data.str.extract('(?<=First Event = )(.*)(?=\\\\nLast Event)', expand=False)?
– Zero, Commented Sep 20, 2017 at 14:02

Bharath M Shetty · Accepted Answer · 2017-09-20 14:38:11Z

4

To complete your slicing approach, store startinds and endinds as columns in df, then slice each row's string with a lambda applied across rows (axis=1). Note that the backslash must be escaped to match the literal \n in the sample data.

df['startinds'] = df['Data'].str.find("First Event = ")
df['endinds']  = df['Data'].str.find("\\nLast Event = ")

df.apply(lambda x: str(x['Data'])[x['startinds']:x['endinds']], axis=1)

Output:

0    First Event = 09/20/2017 12:00:00
1    First Event = 09/20/2017 12:30:00
dtype: object
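
For reference, here is a self-contained sketch of the same slicing idea. It assumes the data contains real newline characters (as the asker clarifies in the comments) and pairs each string with its own offsets via zip rather than storing them as columns. The frame contents are reconstructed from the question's example, and the names starts and ends are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Data": [
        "Blahblah\nFirst Event = 09/20/2017 12:00:00\n"
        "Last Event = 09/20/2017 13:00:00\nBlahblahblah",
        "Blahblahblahblah\nBlahablahblah\nFirst Event = 09/20/2017 12:30:00\n"
        "Last Event = 09/20/2017 12:45:00\nBlahblahblah",
    ]
})

# Per-row character offsets of the two markers (-1 where a marker is absent).
starts = df["Data"].str.find("First Event = ")
ends = df["Data"].str.find("\nLast Event = ")

# Slice each string with its own pair of offsets.
df["first_timestamp"] = [
    text[start:end] for text, start, end in zip(df["Data"], starts, ends)
]
print(df["first_timestamp"].tolist())
```

This only gives sensible results when both markers are present in every row.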

answered Sep 20, 2017 at 14:38

Bharath M Shetty


3 Comments

andraiamatrix Over a year ago

My bad. The \n is a newline character. I just threw them in the sample data instead of doing actual newlines, but it's not a literal backslash. I've edited the original.

Bharath M Shetty Over a year ago

A small doubt: is First Event always on the second line?

andraiamatrix Over a year ago

No. It can be anywhere. Sometimes it might not actually be in the data. I've realized I'll have to use the regex solution because this string slicing doesn't work when the keyword doesn't show up.
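
The failure mode described here comes from str.find returning -1 when the substring is absent: the slice then quietly uses -1 as an offset instead of raising an error. Below is a minimal sketch of a guard, assuming a missing marker should produce a null value. The helper name slice_between and the sample strings are hypothetical.

```python
import pandas as pd


def slice_between(text, start_marker, end_marker):
    """Return text from start_marker up to end_marker, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    # Search for the end marker only after the start marker.
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]


data = pd.Series([
    "Blahblah\nFirst Event = 09/20/2017 12:00:00\nLast Event = 09/20/2017 13:00:00\nBlah",
    "No events recorded here",
])

# Unguarded slicing on the second row: both find() calls give -1,
# so the slice is text[-1:-1], an empty string rather than an error.
missing = data[1]
unguarded = missing[missing.find("First Event = "):missing.find("\nLast Event = ")]

# Guarded version: rows without the markers come back as a null value.
guarded = data.apply(slice_between, args=("First Event = ", "\nLast Event = "))
```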

cmaher · Accepted Answer · 2017-09-20 15:55:41Z

2

Not unlike the answer in the comments, this approach with Series.str.extract should work:

# expand=False returns a Series for a single capture group
df['first_timestamp'] = df['Data'].str.extract(r'(First Event = .+)', expand=False)

#                                                 Data  \
# 0  Blahblah\nFirst Event = 09/20/2017 12:00:00\nL...   
# 1  Blahblahblahblah\nFirst Event = 09/20/2017 12:...   
# 
#                      first_timestamp  
# 0  First Event = 09/20/2017 12:00:00  
# 1  First Event = 09/20/2017 12:30:00

The pattern '(First Event = .+)' captures a group (i.e. ()) with "First Event = " followed by one or more characters (i.e. .+), stopping at a newline (the . character matches anything except a newline).
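
A small runnable sketch of this behaviour, assuming real newline characters in the data. The second sample row is made up to show that rows without the keyword yield a missing value instead of a wrong slice.

```python
import pandas as pd

data = pd.Series([
    "Blahblah\nFirst Event = 09/20/2017 12:00:00\nLast Event = 09/20/2017 13:00:00",
    "No events recorded here",
])

# With a single capture group, expand=False returns a Series.
# '.' does not match '\n' by default, so the match ends at the line break.
first = data.str.extract(r"(First Event = .+)", expand=False)
print(first)
```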

edited Sep 20, 2017 at 15:55

answered Sep 20, 2017 at 14:30

cmaher


3 Comments

cmaher Over a year ago

@andraiamatrix the . character in regular expressions matches anything except a line break (so .+ matches one or more of anything except a line break). Based on your updated question, it looks like df['Data'].str.extract('(First Event = .+)') will capture your first_timestamp group. I'll update my answer.

andraiamatrix Over a year ago

So I noticed .+ stops at a newline, but it doesn't stop at a carriage-return, \r (which it turns out is what is in my data). Is there something that will stop at either? I tried (First Event = .+)[\r\n] but that didn't stop the carriage-returns from appearing in my output.

cmaher Over a year ago

Instead of using ., can you try this? df['Data'].str.extract('(First Event = [^\n\r]+)')
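
To illustrate the difference, here is a sketch assuming the data uses \r\n line endings (the sample string is made up). With such endings, the greedy .+ consumes the \r itself, which is likely why appending [\r\n] after the group did not help: that class then simply matches the following \n.

```python
import pandas as pd

data = pd.Series([
    "Blah\r\nFirst Event = 09/20/2017 12:00:00\r\nLast Event = 09/20/2017 13:00:00\r\n"
])

# '.' excludes only '\n', so the trailing '\r' ends up inside the capture.
with_dot = data.str.extract(r"(First Event = .+)", expand=False)

# A negated character class stops at either kind of line-break character.
with_class = data.str.extract(r"(First Event = [^\r\n]+)", expand=False)

print(repr(with_dot[0]))
print(repr(with_class[0]))
```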
