Python Regex Negative Lookbehind

Question

I have a large database of CT scan results and impressions. I am attempting to build a regular expression that searches for an integer or floating-point number followed by 'mm' that appears near the word 'nodule', either before or after it. This is the regular expression I have so far:

nodule_4mm_size = "(?s).*?([0-4]*\.*[0-9]+\s*[mM]{2})[\w\W]{0,24}[Nn]odule|(?s)[Nn]odule[\w\W]{0,24}.*?([0-4]*\.*[0-9]+\s*[mM]{2})"

However, I need to ensure that these findings are not previous or prior measurements, since radiologists often refer to previous scans. So I am trying a negative lookbehind, like this:

(?<!previously measured)\?[Nn]odule[\w\W]{0,24}[^\.\d]([0-4]\s*[mM]{2}|[0-3]\.[0-9]\s*[mM]{2}|4\.0+\s*[mM]{2})

However, I can't get it to work. Take for instance the following paragraph.

"For example, the largest nodule which is located in the right lower lobe and currently measures 4.4 mm (image #82, series 3) previously measured 3.6 mm on 09/01/2011."

In this case, I would like the regex to hit on 4.4 mm not 3.6 mm. Furthermore, if multiple hits are found I would like to only keep the largest size found. For example,

"For example, the largest nodule which is located in the right lower lobe and currently measures 4.4 mm (image #82, series 3) previously measured 3.6 mm on 09/01/2011. Another nodule was found measuring 2.2 mm."

In this case I would like to ensure only 4.4 mm is identified.

Any help would truly be appreciated. Just can't get this negative lookbehind to work! Thanks!

[previously measured] means match a character in the following set: p, r, e, v, i, o, u, s, l, y, " ", m, a, and d. [^\.\d] means match a character not in the following set: ., [0-9] (the backslash only escapes the dot).
– Sam, Commented Sep 21, 2015 at 21:16
How is it supposed to know with multiple hits that a hit isn't just a reference to a previous scan? Do the measurements always follow a certain series?
– lintmouse, Commented Sep 21, 2015 at 21:35
Thanks for the feedback. As per Sam's point about character classes, the negative lookbehind should not be in brackets; my mistake, I have edited this. What I mean by taking the max is this: once the pattern finds all hits that do not come right after "previously measured", I can do p = re.findall(nodule_size,row['Result']) and take the max per row from the output. However, there may be a better way, as I am relatively new to regular expressions. Thanks for the help!
– David McCoy, Commented Sep 21, 2015 at 21:48

Vinay Sajip · Accepted Answer · 2018-10-09 17:37:06Z

Let's break it down, keeping the relevant parts. So far you have two options:

Option 1 (number followed by "nodule"):

([0-4]\.\d+\s*[mM]{2})[\s\S]{0,24}[Nn]odule

Option 2 ("nodule" followed by number):

[Nn]odule[\s\S]{0,24}([0-4]\.\d+\s*[mM]{2})

You should know that quantifiers are greedy by default. This means that [\s\S]{0,24} will try to match as much as it can, so the captured number is not necessarily the one closest to "nodule". For example,

Pattern: [Nn]odule[\s\S]{0,24}([0-4]\.\d+\s*[mM]{2})

Text: ... nodule measured 1.4 mm. Another 3.2 mm ...
                                          ^    ^
                                          |    |
          matches this second occurrence. +----+

To fix this, add an extra ? after a quantifier to make it lazy. So, instead of using [\s\S]{0,24}, use [\s\S]{0,24}?.
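The difference can be checked with a small sketch in Python. The sample text below is made up for illustration, with both measurements placed within 24 characters of "nodule" so that the greedy version can reach the second one:

```python
import re

# Hypothetical sample text: both measurements sit within 24 characters of "nodule".
text = "nodule measured 1.4 mm, then 3.2 mm"

# Greedy: [\s\S]{0,24} grabs as much as it can, then backtracks to the last number in range.
greedy = re.compile(r"[Nn]odule[\s\S]{0,24}([0-4]\.\d+\s*[mM]{2})")
# Lazy: [\s\S]{0,24}? grows one character at a time and stops at the first number.
lazy = re.compile(r"[Nn]odule[\s\S]{0,24}?([0-4]\.\d+\s*[mM]{2})")

greedy_hit = greedy.search(text).group(1)
lazy_hit = lazy.search(text).group(1)

print(greedy_hit)  # 3.2 mm
print(lazy_hit)    # 1.4 mm
```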

For example, the largest nodule which is located in the right lower lobe and currently measures 4.4 mm

In this example, "nodule" is separated from the measurement by more than 24 characters. You should increase the number of characters allowed in between, maybe to [\s\S]{0,70}?.

So I am trying a negative lookbehind

Lookbehinds only assert on the text that is immediately before a certain position. To work around this, I recommend matching the text "previously measured" itself, consuming some characters around it. So, how do you know not to consider those cases? Easy, don't create a capture. So you will be matching something like

[\s\S]{0,10}previously measured[\s\S]{0,10}

and discarding the match because it didn't return any groups. Moreover, you could include different exceptions here:

[\s\S]{0,10}(?:previously measured|previous scan|another patient|incorrectly measured)[\s\S]{0,10}
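The point about lookbehinds can be seen in a short Python sketch. The sample sentence and both patterns below are illustrative assumptions, not taken from the question: one places the lookbehind before "nodule" (where the question put it), the other places it directly before the number.

```python
import re

# Hypothetical report sentence containing only a prior measurement.
text = "The nodule previously measured 3.6 mm on 09/01/2011."

# Lookbehind placed before "nodule": it only inspects the text right before
# "nodule" ("The "), so the prior measurement is still captured.
before_nodule = re.compile(
    r"(?<!previously measured )[Nn]odule[\s\S]{0,70}?([0-4]\.\d+\s*[mM]{2})"
)

# Lookbehind placed directly before the number: the assertion now looks at
# the text that immediately precedes the measurement itself.
before_number = re.compile(
    r"[Nn]odule[\s\S]{0,70}?(?<!previously measured )([0-4]\.\d+\s*[mM]{2})"
)

wrong_hit = before_nodule.search(text)
no_hit = before_number.search(text)

print(wrong_hit.group(1))  # 3.6 mm
print(no_hit)              # None
```

The second form only works when the phrase sits directly before the number with exactly one space, which is why matching and discarding the unwanted text is more flexible.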

if multiple hits are found I would like to only keep the largest size found

You can't do that with regex. Loop in your code to find the largest.

Result:

With these conditions, we have:

[\s\S]{0,10}previously measured[\s\S]{0,10}|([0-4]\.\d+\s*[mM]{2})[\s\S]{0,70}?[Nn]odule|[Nn]odule[\s\S]{0,70}?([0-4]\.\d+\s*[mM]{2})

DEMO
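As a rough sketch of how the combined pattern could be used from Python, the snippet below runs it over the second example from the question, skips matches that captured nothing (the "previously measured" alternative), and keeps the largest value. The function name largest_nodule_mm is a hypothetical helper, not something from the question.

```python
import re

# The combined pattern from above, split over three lines for readability.
NODULE_SIZE = re.compile(
    r"[\s\S]{0,10}previously measured[\s\S]{0,10}"
    r"|([0-4]\.\d+\s*[mM]{2})[\s\S]{0,70}?[Nn]odule"
    r"|[Nn]odule[\s\S]{0,70}?([0-4]\.\d+\s*[mM]{2})"
)


def largest_nodule_mm(report):
    """Return the largest captured size in mm, or None if nothing was captured."""
    sizes = []
    for m in NODULE_SIZE.finditer(report):
        hit = m.group(1) or m.group(2)
        if hit is None:
            # The "previously measured" alternative matched: no capture, so discard.
            continue
        # Keep only the numeric part, e.g. "4.4" from "4.4 mm".
        sizes.append(float(re.match(r"[\d.]+", hit).group()))
    return max(sizes, default=None)


report = (
    "For example, the largest nodule which is located in the right lower lobe "
    "and currently measures 4.4 mm (image #82, series 3) previously measured "
    "3.6 mm on 09/01/2011. Another nodule was found measuring 2.2 mm."
)
print(largest_nodule_mm(report))  # 4.4
```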

Extra conditions to check

One of the following checks may help reduce false positives:

Don't allow a match across a newline.
Don't match if there's a full stop between "nodule" and the number.
Look for a date near the measurement.

Casimir et Hippolyte · Accepted Answer · 2015-09-21 22:36:08Z

Two possibilities:

1) using lookbehinds:

(?<!previously measured )(?<![0-9.])([0-9]+(?:\.[0-9]+)?) ?mm

The first lookbehind checks that "previously measured " is not before the number. The second checks that there is no digit or dot before the number; otherwise the digits after the decimal point would match on their own (for example the 6 in 3.6). Keep in mind that a regex engine returns the leftmost match it can find.

2) using capture groups:

previously measured [0-9]+(?:\.[0-9]+)? ?mm|([0-9]+(?:\.[0-9]+)?) ?mm

The idea is to match what you want to avoid first. When capture group 1 exists, you have a result.

As for the largest number, use the re.findall method and take the largest result afterwards (a regex alone can't solve this kind of thing).
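Both possibilities can be tried side by side in Python. This is only a sketch against the example text from the question; the variable names are made up for illustration.

```python
import re

# 1) Lookbehinds: reject numbers right after "previously measured " and
#    numbers that start in the middle of a longer number.
LOOKBEHIND = re.compile(
    r"(?<!previously measured )(?<![0-9.])([0-9]+(?:\.[0-9]+)?) ?mm"
)
# 2) Capture groups: consume the unwanted case first, capture everything else.
SKIP_THEN_CAPTURE = re.compile(
    r"previously measured [0-9]+(?:\.[0-9]+)? ?mm|([0-9]+(?:\.[0-9]+)?) ?mm"
)

report = (
    "For example, the largest nodule which is located in the right lower lobe "
    "and currently measures 4.4 mm (image #82, series 3) previously measured "
    "3.6 mm on 09/01/2011. Another nodule was found measuring 2.2 mm."
)

by_lookbehind = [float(s) for s in LOOKBEHIND.findall(report)]
# findall returns '' where group 1 did not take part in the match, so filter those out.
by_capture = [float(s) for s in SKIP_THEN_CAPTURE.findall(report) if s]

largest = max(by_lookbehind, default=None)

print(by_lookbehind)  # [4.4, 2.2]
print(by_capture)     # [4.4, 2.2]
print(largest)        # 4.4
```

Note that neither pattern requires the word "nodule" to be nearby, so any measurement in mm would be picked up.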


m.cekiera · Accepted Answer · 2015-09-21 23:21:29Z

If the word "nodule" needs to be nearby, you can try:

(?:((?<!previously measured\s)\d+\.\d+\s*mm)(?:[^.?!\n]*?)?nodule|nodule(?:[^.?!\n]*?((?<!previously measured\s)\d+\.\d+\s*mm))?)

DEMO

It will match if:

the value in mm is in the same sentence as "nodule" ([^.?!\n] is meant to keep the match within one sentence, but abbreviations like "Mr.", decimals, etc. will break the match); you can replace it with .+? (DEMO), but then it could match across sentences,
the value is before or after the word "nodule" (in this order: if there is a value before, it will be matched first),
values are captured in groups: before in \1, after in \2,
it should be used with the g and i modes (in Python, re.findall or re.finditer with re.IGNORECASE).
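A possible way to run this pattern from Python is sketched below, under the assumption that the decimal point is escaped (\.) and that re.IGNORECASE with re.finditer stands in for the i and g modes. The names NEAR_NODULE and sizes are hypothetical.

```python
import re

# The pattern from above with the decimal point escaped; re.IGNORECASE plays
# the role of the "i" mode and finditer the role of the "g" mode.
NEAR_NODULE = re.compile(
    r"(?:((?<!previously measured\s)\d+\.\d+\s*mm)(?:[^.?!\n]*?)?nodule"
    r"|nodule(?:[^.?!\n]*?((?<!previously measured\s)\d+\.\d+\s*mm))?)",
    re.IGNORECASE,
)

report = (
    "For example, the largest nodule which is located in the right lower lobe "
    "and currently measures 4.4 mm (image #82, series 3) previously measured "
    "3.6 mm on 09/01/2011. Another nodule was found measuring 2.2 mm."
)

sizes = []
for m in NEAR_NODULE.finditer(report):
    # Group 1 holds a value found before "nodule", group 2 a value found after it.
    hit = m.group(1) or m.group(2)
    if hit:  # "nodule" on its own also matches, with both groups empty
        sizes.append(hit)

print(sizes)  # ['4.4 mm', '2.2 mm']
```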

1 Comment

Mariano Over a year ago

Thank you for contributing with the solution that you actually implemented. It's very constructive to see what a question turns up to.
