
I have a series of free-text comments in a Pandas dataframe. I want to identify those fields that match a given regex that includes a negative lookbehind. As a trivial example, I have fields such as the following:

frogs seen
green frog seen
no frogs seen
no green frogs seen
frogs not seen
green frogs not seen

I only want to identify those lines where frogs have been seen. In the real dataset, there may be lots of other text, and the phrases shown are contained within a larger text string. The regex I came up with is the following:

(?<!no\s)(?:(?:green\s)?frogs?\s)(?!not\s)(?:seen)?

This almost works. It matches 'frogs seen' and 'green frog seen' as expected. It also does NOT match 'no frogs seen', 'frogs not seen' and 'green frogs not seen' which is exactly what is wanted. However, in the phrase 'no green frogs seen', the regex matches the text 'frogs seen'.
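The behaviour described above can be reproduced with a minimal sketch. It assumes Python's built-in re module and uses re.search so that the pattern may match anywhere in the string; the names `pattern`, `phrases` and `results` are only illustrative.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'(?<!no\s)(?:(?:green\s)?frogs?\s)(?!not\s)(?:seen)?')

phrases = [
    'frogs seen',
    'green frog seen',
    'no frogs seen',
    'no green frogs seen',
    'frogs not seen',
    'green frogs not seen',
]

# search() scans the whole string, as would be needed for longer comments.
results = {}
for phrase in phrases:
    m = pattern.search(phrase)
    results[phrase] = m.group(0) if m else None
    print(repr(phrase), '->', repr(results[phrase]))
```

Only 'no green frogs seen' gives an unwanted result: the reported match is the trailing 'frogs seen'.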

As far as I understand, negative look behinds can only be a fixed number of characters (i.e. it's not possible to use *, + or ? to allow variable string lengths). I thought that including (?:green) in the (?:frogs?) non-capture group would work to find that whole group and negate it if preceded by a fixed length negative-look-behind. However, this does not seem to be the case.
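A quick way to check the fixed-width restriction, assuming Python's built-in re module (the helper name `compiles` is hypothetical and only for illustration):

```python
import re

def compiles(pattern_text):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern_text)
        return True
    except re.error:
        # Raised, for example, when a lookbehind is not fixed-width.
        return False

# Fixed-width lookbehind: 'no' followed by exactly one whitespace character.
fixed_ok = compiles(r'(?<!no\s)frogs?')

# Variable-width lookbehind: '\s+' can match one or more characters.
variable_ok = compiles(r'(?<!no\s+)frogs?')

print(fixed_ok, variable_ok)
```

The first pattern compiles, while the second is rejected at compile time because its lookbehind has no fixed width.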

Any suggestions on how to fix this issue would be very much appreciated.

  • Can there be phrases like 'no red frogs seen'? Commented Aug 18, 2019 at 7:34
  • can't you just split words and check if negations are there, check if "frog" is there? Commented Aug 18, 2019 at 7:44
  • @Austin. Probably not but there may be some variation in spelling e.g. 'gren' or 'grenn'. (There's already some variation in spelling included to allow for frog vs frogs.) I was trying to find a method that matched a group and then negated the match based on the negative look behind. Commented Aug 18, 2019 at 7:45
  • @Jean-FrançoisFabre. In the real-world dataframe, there may be additional text included which may include negation terms that are not related to the phrases being searched for. Commented Aug 18, 2019 at 7:52
  • If you are allowed to use external modules beyond pandas, you might try the regex module (pypi.org/project/regex), which provides variable-length lookbehinds. Commented Aug 18, 2019 at 8:32

2 Answers


I came up with this regex (regex101):

test_cases = [
'frogs seen',
'green frog seen',
'no frogs seen',
'no green frogs seen',
'frogs not seen',
'green frogs not seen'
]

import re

for test_case in test_cases:
    m = re.findall(r'^((?!(?:(?:\bno\b.*frogs?)|(?:frogs?.*\bnot\b.*seen))).)*$', test_case)
    if m:
        print('{} matches!'.format(test_case))

Prints:

frogs seen matches!
green frog seen matches!

1 Comment

Thanks for the response. When I try out your regex at pythex.org, it does still match some components of each phrase. I should have explained more carefully that in the real dataset, there may be additional text included and the phrases shown in the example would be included somewhere within the text; I've edited the original question to be clearer.

I believe your lookbehind fails because (?:green\s)? makes 'green ' optional. At 'green', the lookbehind sees 'no ' and rejects that starting position, but the scanner then moves on. When it arrives at 'frogs', it skips the optional group, looks back three characters for 'no ', finds 'en ' instead, and so matches 'frogs seen' inside 'no green frogs seen'. If you had (?:green\s) instead, so that 'green ' was not optional, this test case would be rejected. So, instead of using negative lookbehind, try negative lookahead:

import re

test_cases = [
'frogs seen',
'green frog seen',
'no frogs seen',
'no green frogs seen',
'frogs not seen',
'green frogs not seen'
]

regex = re.compile(r'(?!no\s+)(?:(?:green\s+)?frogs?)(?=\s+seen)')
for test_case in test_cases:
    # match() only tries the pattern at the start of the string
    if regex.match(test_case):
        print(test_case)

Prints:

frogs seen
green frog seen
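The scanning behaviour described at the start of this answer can be checked with a short sketch. The `stacked` pattern below is not part of the question or of this answer; it is a hypothetical variation, shown only as one possible way to keep fixed-width lookbehinds, that adds a second lookbehind for the case where the optional 'green ' sits between 'no ' and 'frogs'. It uses search() rather than match(), on the assumption that the phrases are embedded in longer text.

```python
import re

# Pattern from the question: 'green ' is optional, so a match may start at 'frogs'.
original = re.compile(r'(?<!no\s)(?:(?:green\s)?frogs?\s)(?!not\s)(?:seen)?')

# Hypothetical variation: a second fixed-width lookbehind also rejects
# a match that starts at 'frogs' when it is preceded by 'no green '.
stacked = re.compile(
    r'(?<!no\s)(?<!no\sgreen\s)(?:(?:green\s)?frogs?\s)(?!not\s)(?:seen)?'
)

test_cases = [
    'frogs seen',
    'green frog seen',
    'no frogs seen',
    'no green frogs seen',
    'frogs not seen',
    'green frogs not seen',
]

# Where does the unwanted match in 'no green frogs seen' begin?
m = original.search('no green frogs seen')
start = m.start()      # index of 'frogs', past the optional 'green '
matched = m.group(0)

# Which test cases does the stacked variation accept?
stacked_hits = [t for t in test_cases if stacked.search(t)]

# The phrases may also sit inside longer text.
embedded_negative = stacked.search('pond 3: no green frogs seen today')
embedded_positive = stacked.search('pond 3: green frogs seen today')

print(start, repr(matched))
print(stacked_hits)
```

A limitation of this sketch is that every optional prefix needs its own fixed-width lookbehind, so spelling variants such as 'gren' would each have to be listed separately.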

2 Comments

Thanks for the response. Although this doesn't produce the answer I needed, I've upvoted because your answer explains why the issue exists. Thanks.
@user1718097 Why does this not satisfy the problem? It's not because I didn't print out the word "matches!" is it?
