1

I'm reading an excel file into a dataframe where one of its columns has 1 to many values, delimited by a space. I need to search the column for a string of text, and if found return the complete value between its delimiters(not the entire cell value).

Input would look something like this –

import pandas as pd

df = pd.DataFrame({'command': ['abc123', 'abcdef', 'hold', 'why'],
                   'name': ['/l1/l2/good/a/b/c /l1/l2/bad/b /l1/l2/fred/x /la/lb/sure/blah',
                            '/l1/l2/fred/a/b/c /l1/l2/bad/b /l1/l2/fred/x',
                            '/l2/l3/betty/a/b/c /l1/l2/bad/b /l1/l2/fred/x /l1/l2/good/ha',
                            '/la/lb2/sure/a/b/c'],
                   'date': ['2020-05', '2020-05', '2020-05', '2020-06']})

With a print it returns -

  command     date                                               name
0  abc123  2020-05  /l1/l2/good/a/b/c /l1/l2/bad/b /l1/l2/fred/x /la/lb/sure/blah
1  abcdef  2020-05       /l1/l2/fred/a/b/c /l1/l2/bad/b /l1/l2/fred/x
2    hold  2020-05  /l2/l3/betty/a/b/c /l1/l2/bad/b /l1/l2/fred/x /l1/l2/good/ha
3     why  2020-06                                 /la/lb2/sure/a/b/c

I'm using the following to parse for the strings I'm after –

terms = ['/l1/l2/good', '/la/lb/sure']
df = df[df['name'].str.contains('|'.join(terms))]

which returns –

  command     date                                               name
0  abc123  2020-05  /l1/l2/good/a/b/c /l1/l2/bad/b /l1/l2/fred/x /la/lb/sure/blah
2    hold  2020-05  /l2/l3/betty/a/b/c /l1/l2/bad/b /l1/l2/fred/x /l1/l2/good/ha 
3     why  2020-06                                  /la/lb/sure/a/b/c

What I would like returned is –

  command     date                                               name
0  abc123  2020-05                  /l1/l2/good/a/b/c  /la/lb/sure/blah
2    hold  2020-05                                       /l1/l2/good/ha 
3     why  2020-06                                    /la/lb/sure/a/b/c

I had tried performing a split on the space delimiter, but then I'm unable to loop through those values to parse them as needed.

Thanks.

2 Answers 2

2

The problem with using replace() is that it will only apply to what matches, and not to the items you want to skip/remove from the results. Instead, use findall():

pat = f"((?:{'|'.join(terms)})[^ ]*)"
# `pat` is ((?:/l1/l2/good|/la/lb/sure)[^ ]*)

That regex attempts to match with each pattern and everything that follows it except a space.

df['name'].str.findall(pat)  # will give all the matches, as a series:

0    [/l1/l2/good/a/b/c, /la/lb/sure/blah]
1                                       []
2                         [/l1/l2/good/ha]
3                      [/la/lb/sure/a/b/c]
Name: name, dtype: object

join the results with a space and assign it back to 'name':

df['name'] = df['name'].str.findall(pat).str.join(' ')

Result df:

  command                                name     date
0  abc123  /l1/l2/good/a/b/c /la/lb/sure/blah  2020-05
1  abcdef                                      2020-05
2    hold                      /l1/l2/good/ha  2020-05
3     why                   /la/lb/sure/a/b/c  2020-06

(Btw, in your DataFrame definition, your last entry is '/la/lb2/sure/a/b/c' instead of '/la/lb/sure/a/b/c'. I've changed that. And date is the last column in your def.)

Sign up to request clarification or add additional context in comments.

1 Comment

Thanks! Looks to do exactly what I'm after.
0

This splits the string in the columns at each space, and for the resulting entries looks if the entries in your terms list are a part of them. If there is a perfect match, the respective strings are rejoined:

df['name'].apply(lambda x: ' '.join([a for a in x.split(' ') for t in terms if a.find(t)!=-1]))

Limitations: this does not do partial matches. For instance:

'/l1/l2/good/a/b/c' contains '/l1/l2/good' and therefore would be retained. 
'/l1/l3/good/a/b/c' does not contain '/l1/l2/good' and therefore would not be retained.

All of that said, this way of storing data does not lead to very easily accessible dataframes.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.