I have a large database of CT scan results and impressions. I am attempting to build a regular expression which searches for an integer or floating-point number followed by 'mm' that appears near the word 'nodule', either before or after it. This is the regular expression I have so far:

nodule_4mm_size = r"(?s).*?([0-4]*\.*[0-9]+\s*[mM]{2})[\w\W]{0,24}[Nn]odule|[Nn]odule[\w\W]{0,24}.*?([0-4]*\.*[0-9]+\s*[mM]{2})"

However, I need to ensure that these findings are not preceded by previous or prior measurements, since radiologists often refer to previous scans. So I am trying a negative lookbehind, like this:

(?<!previously measured)\?[Nn]odule[\w\W]{0,24}[^\.\d]([0-4]\s*[mM]{2}|[0-3]\.[0-9]\s*[mM]{2}|4\.0+\s*[mM]{2})

However, I can't get it to work. Take for instance the following paragraph.

"For example, the largest nodule which is located in the right lower lobe and currently measures 4.4 mm (image #82, series 3) previously measured 3.6 mm on 09/01/2011."

In this case, I would like the regex to hit on 4.4 mm not 3.6 mm. Furthermore, if multiple hits are found I would like to only keep the largest size found. For example,

"For example, the largest nodule which is located in the right lower lobe and currently measures 4.4 mm (image #82, series 3) previously measured 3.6 mm on 09/01/2011. Another nodule was found measuring 2.2 mm.

In this case I would like to ensure only 4.4 mm is identified.

Any help would truly be appreciated. Just can't get this negative lookbehind to work! Thanks!

  • [previously measured] means match a single character from the following set: p, r, e, v, i, o, u, s, l, y, " ", m, a, and d. [^\.\d] means match a single character that is neither a dot nor a digit (inside a character class, \. is just a literal dot). Commented Sep 21, 2015 at 21:16
  • Learn what a character class is. Commented Sep 21, 2015 at 21:17
  • How is it supposed to know with multiple hits that a hit isn't just a reference to a previous scan? Do the measurements always follow a certain series? Commented Sep 21, 2015 at 21:35
  • Thanks for the feedback, as per Sam's suggestion on character class the negative lookbehind statement should not be in brackets, my mistake. I have edited this. What I mean by taking the max is that, by getting this statement to work and finding all hits that do not qualify as coming after previously measured within a certain amount of white space, by one method, I can do p = re.findall(nodule_size,row['Result']) and take the max per row from the output. However, there may be a better way, I am relatively new to regular expressions. Thanks for the help! Commented Sep 21, 2015 at 21:48

4 Answers

1

Let's break it down, keeping the relevant parts. So far you have two options:

Option 1 (number followed by "nodule"):

([0-4]\.\d+\s*[mM]{2})[\s\S]{0,24}[Nn]odule

Option 2 ("nodule" followed by number):

[Nn]odule[\s\S]{0,24}([0-4]\.\d+\s*[mM]{2})

Note that quantifiers are greedy by default. This means [\s\S]{0,24} will try to match as much as it can, so the number it captures is not necessarily the one closest to "nodule". For example,

Pattern: [Nn]odule[\s\S]{0,24}([0-4]\.\d+\s*[mM]{2})

Text: ... nodule measured 1.4 mm. Also 3.2 mm ...
                                       ^    ^
                                       |    |
       matches this second occurrence. +----+

To fix this, add an extra ? after a quantifier to make it lazy. So, instead of using [\s\S]{0,24}, use [\s\S]{0,24}?.
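
A quick Python check makes the difference visible. This is only a sketch: the sample sentence is made up for illustration and is short enough that both measurements fall inside the 24-character window.

```python
import re

# Made-up sample: both measurements sit within 24 characters of "nodule".
text = "... nodule measured 1.4 mm. Also 3.2 mm ..."

greedy = r"[Nn]odule[\s\S]{0,24}([0-4]\.\d+\s*[mM]{2})"
lazy = r"[Nn]odule[\s\S]{0,24}?([0-4]\.\d+\s*[mM]{2})"

# Greedy: the gap grabs as much as it can, then backs off to the last number that fits.
greedy_hit = re.search(greedy, text).group(1)
# Lazy: the gap grows one character at a time, so the nearest number wins.
lazy_hit = re.search(lazy, text).group(1)

print(greedy_hit)  # 3.2 mm
print(lazy_hit)    # 1.4 mm
```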


For example, the largest nodule which is located in the right lower lobe and currently measures 4.4 mm

In this example, "nodule" and the measurement are more than 24 characters apart, so you should increase the number of characters allowed in between, maybe [\s\S]{0,70}?.


So I am trying a negative lookbehind

Lookbehinds only assert on the text immediately before a given position. To get around that limitation, I recommend matching the text "previously measured" itself, consuming some characters around it. So how do you know to ignore those cases? Easy: don't create a capture group. You will be matching something like

[\s\S]{0,10}previously measured[\s\S]{0,10}

and discarding the match because it didn't return any groups. Moreover, you could include different exceptions here:

[\s\S]{0,10}(?:previously measured|previous scan|another patient|incorrectly measured)[\s\S]{0,10}

if multiple hits are found I would like to only keep the largest size found

You can't do that with regex. Loop in your code to find the largest.


Result:

With these conditions, we have:

[\s\S]{0,10}previously measured[\s\S]{0,10}|([0-4]\.\d+\s*[mM]{2})[\s\S]{0,70}?[Nn]odule|[Nn]odule[\s\S]{0,70}?([0-4]\.\d+\s*[mM]{2})
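
As a rough sketch of how this could be used from Python, the snippet below runs the combined pattern over the two-sentence example from the question, throws away the matches that captured nothing, and keeps the largest size. The helper name largest_size_mm is hypothetical, and the sizes are compared as floats after stripping the "mm" suffix.

```python
import re

pattern = re.compile(
    r"[\s\S]{0,10}previously measured[\s\S]{0,10}"      # excluded case: no group
    r"|([0-4]\.\d+\s*[mM]{2})[\s\S]{0,70}?[Nn]odule"   # size before "nodule"
    r"|[Nn]odule[\s\S]{0,70}?([0-4]\.\d+\s*[mM]{2})"   # size after "nodule"
)

text = (
    "For example, the largest nodule which is located in the right lower lobe "
    "and currently measures 4.4 mm (image #82, series 3) previously measured "
    "3.6 mm on 09/01/2011. Another nodule was found measuring 2.2 mm."
)


def largest_size_mm(report):
    """Return the largest captured size in mm, or None if nothing was captured."""
    sizes = []
    for before, after in pattern.findall(report):
        hit = before or after
        if hit:  # both groups are empty for the "previously measured" branch
            sizes.append(float(hit[:-2].strip()))  # drop the trailing "mm"
    return max(sizes) if sizes else None


print(pattern.findall(text))  # [('', '4.4 mm'), ('', ''), ('', '2.2 mm')]
print(largest_size_mm(text))  # 4.4
```

The middle tuple is the "previously measured 3.6 mm" match: it consumed the text, so 3.6 can no longer be picked up by the other two alternatives, but it captured nothing and is skipped.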


Extra conditions to check

One of the following options may help reduce false positives:

  1. Don't allow the match to continue past a newline.
  2. Don't match if there's a full stop between "nodule" and the number.
  3. Look for a date near the measurement.

1

Two possibilities:

1) using lookbehinds:

(?<!previously measured )(?<![0-9.])([0-9]+(?:\.[0-9]+)?) ?mm

The first lookbehind checks that "previously measured " does not come right before the number; the second checks that no digit or dot comes right before it (otherwise, once the first lookbehind rejects a number such as 3.6, the digit after the dot would still match on its own, because the regex engine tries each starting position from left to right).

2) using capture groups:

previously measured [0-9]+(?:\.[0-9]+)? ?mm|([0-9]+(?:\.[0-9]+)?) ?mm

The idea is to match what you want to avoid first. When capture group 1 is set, you have a result.

As for the biggest number, use the re.findall method and take the largest result afterwards (a regex can't solve this kind of thing).
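
Here is a minimal Python sketch of both possibilities on the two-sentence example from the question. The variable names are only for illustration; the patterns are the ones given above.

```python
import re

text = (
    "For example, the largest nodule which is located in the right lower lobe "
    "and currently measures 4.4 mm (image #82, series 3) previously measured "
    "3.6 mm on 09/01/2011. Another nodule was found measuring 2.2 mm."
)

# 1) Lookbehinds: only acceptable numbers match at all.
lookbehind = r"(?<!previously measured )(?<![0-9.])([0-9]+(?:\.[0-9]+)?) ?mm"
sizes_a = re.findall(lookbehind, text)

# 2) Capture groups: unwanted numbers match too, but leave group 1 empty.
capture = r"previously measured [0-9]+(?:\.[0-9]+)? ?mm|([0-9]+(?:\.[0-9]+)?) ?mm"
raw_b = re.findall(capture, text)
sizes_b = [s for s in raw_b if s]

# The regex cannot pick the maximum, so do that in code.
largest = max(float(s) for s in sizes_b)

print(sizes_a)  # ['4.4', '2.2']
print(raw_b)    # ['4.4', '', '2.2']
print(largest)  # 4.4
```

Note that Python's re module only accepts fixed-width lookbehinds, which is fine here because "previously measured " has a fixed length.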

1

If the word "nodule" needs to be nearby, you can try:

(?:((?<!previously measured\s)\d+\.\d+\s*mm)(?:[^.?!\n]*?)?nodule|nodule(?:[^.?!\n]*?((?<!previously measured\s)\d+\.\d+\s*mm))?)

It will match if:

  • "nodule" is in the same sentence as the value in mm (the [^.?!\n] class is meant to enforce this, although abbreviations like "Mr.", decimals, etc. will break the match); you can replace it with .+?, but then it could match across sentences,
  • the value comes before or after the word "nodule" (in that order: if there is a value before, it will be matched first),
  • values are captured in groups: before in \1, after in \2,
  • it should be used with the g and i modes (in Python: re.findall or re.finditer with re.IGNORECASE).

Another similar solution would be:

(?=((?<!previously measured\s)\d+\.\d+ mm)[^.?!]+nodule)|(?=nodule[^.?!]+((?<!previously measured\s)\d+\.\d+ mm))

It is based only on lookarounds: it does not consume any text but matches a zero-length position, and it captures the values into groups.
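
To see what "zero-length" means in practice, here is a small Python sketch using re.finditer on the example from the question. It assumes the decimal point is escaped in both alternatives; everything else is the lookaround pattern as written.

```python
import re

pattern = re.compile(
    r"(?=((?<!previously measured\s)\d+\.\d+ mm)[^.?!]+nodule)"
    r"|(?=nodule[^.?!]+((?<!previously measured\s)\d+\.\d+ mm))",
    re.IGNORECASE,
)

text = (
    "For example, the largest nodule which is located in the right lower lobe "
    "and currently measures 4.4 mm (image #82, series 3) previously measured "
    "3.6 mm on 09/01/2011. Another nodule was found measuring 2.2 mm."
)

matches = list(pattern.finditer(text))
for m in matches:
    # start == end: the match itself is empty, only the groups carry text.
    print(m.start() == m.end(), m.group(1), m.group(2))
```

On this text the value always follows "nodule", so group 1 stays unset (None) and the sizes end up in group 2.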

1

In regard to this problem, I ended up tokenizing the reports into individual sentences using the nltk module. The final regular expression, which works for all instances, is:

nodule_search = r"[\s\S]{0,10}(?:previously measured|compared to )[\s\S]{0,10}|(\d[\.,]\d+|\d+|\d\d[\.,]\d+)\s*[mM]{2}[\s\S]{0,40}?[Nn]odule|[Nn]odule[\s\S]{0,40}?(\d[\.,]\d+|\d+|\d\d[\.,]\d+)\s*[mM]{2}"

So in this instance I ended up not using a negative lookbehind but used capture groups instead.
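
For anyone who wants to try the overall flow, here is a rough sketch of it in Python. It is not my exact code: the sentence splitter is a naive regex stand-in for nltk's tokenizer, the sample report is made up, and the helper names are hypothetical. Sizes written with a decimal comma are normalised before comparing.

```python
import re

nodule_search = re.compile(
    r"[\s\S]{0,10}(?:previously measured|compared to )[\s\S]{0,10}"
    r"|(\d[\.,]\d+|\d+|\d\d[\.,]\d+)\s*[mM]{2}[\s\S]{0,40}?[Nn]odule"
    r"|[Nn]odule[\s\S]{0,40}?(\d[\.,]\d+|\d+|\d\d[\.,]\d+)\s*[mM]{2}"
)


def split_sentences(report):
    # Naive stand-in for a real sentence tokenizer:
    # split after ., ! or ? when followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", report.strip())


def largest_nodule_mm(report):
    """Largest captured size across all sentences, or None if there is none."""
    sizes = []
    for sentence in split_sentences(report):
        for before, after in nodule_search.findall(sentence):
            hit = before or after
            if hit:  # empty for the "previously measured" / "compared to" branch
                sizes.append(float(hit.replace(",", ".")))
    return max(sizes) if sizes else None


# Made-up sample report for illustration.
report = (
    "A nodule in the right lower lobe measures 4.4 mm, previously measured "
    "3.6 mm. Another nodule was found measuring 2,2 mm."
)
print(largest_nodule_mm(report))  # 4.4
```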

Thanks everyone for your input.

1 Comment

Thank you for contributing the solution that you actually implemented. It's very constructive to see how a question turns out.
