I have a large database of CT scan results and impressions. I am trying to build a regular expression that finds an integer or floating-point number followed by 'mm' that appears near the word 'nodule', either before or after it. This is the regular expression I have so far:
nodule_4mm_size = "(?s).*?([0-4]*\.*[0-9]+\s*[mM]{2})[\w\W]{0,24}[Nn]odule|(?s)[Nn]odule[\w\W]{0,24}.*?([0-4]*\.*[0-9]+\s*[mM]{2})"
However, I need to exclude measurements that radiologists quote from previous scans, i.e. findings preceded by words such as "previous" or "prior". So I am trying a negative lookbehind, like this:
(?<!previously measured)\?[Nn]odule[\w\W]{0,24}[^\.\d]([0-4]\s*[mM]{2}|[0-3]\.[0-9]\s*[mM]{2}|4\.0+\s*[mM]{2})
However, I can't get it to work. Take, for instance, the following paragraph:
"For example, the largest nodule which is located in the right lower lobe and currently measures 4.4 mm (image #82, series 3) previously measured 3.6 mm on 09/01/2011."
In this case, I would like the regex to match 4.4 mm, not 3.6 mm. Furthermore, if multiple matches are found, I would like to keep only the largest size. For example,
"For example, the largest nodule which is located in the right lower lobe and currently measures 4.4 mm (image #82, series 3) previously measured 3.6 mm on 09/01/2011. Another nodule was found measuring 2.2 mm."
In this case I would like to ensure only 4.4 mm is identified.
Any help would truly be appreciated. Just can't get this negative lookbehind to work! Thanks!
[previously measured] means match a character in the following set: p, r, e, v, i, o, u, s, l, y, " ", m, a, and d.
[^\.\d] means match a character not in the following set: ., [0-9] (inside a character class, \. is just a literal period).
p = re.findall(nodule_size, row['Result']) and take the max per row from the output. However, there may be a better way; I am relatively new to regular expressions. Thanks for the help!
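One possible way to get the behavior described in the question, offered as a rough sketch rather than a definitive answer: instead of a single lookbehind, find every "number mm" in sentences that mention "nodule", drop the ones introduced by a word such as "previously" or "prior", and keep the maximum. Python's `re` only allows fixed-width lookbehinds, so the check for the cue word is done as a separate search on the text before each measurement. The names `MEASUREMENT`, `PRIOR_CUE`, `SENTENCE_SPLIT`, and `largest_current_nodule_mm` are made up for this illustration, and the cue-word list and the naive sentence splitting are assumptions that would need tuning against real reports.

```python
import re

# A number (integer or decimal) followed by "mm"; the lookbehind stops a
# match from starting in the middle of a longer number such as "14.4".
MEASUREMENT = re.compile(r"(?<![\d.])(\d+(?:\.\d+)?)\s*mm\b", re.IGNORECASE)

# Assumed cue words for historical measurements. The cue must be followed
# only by non-digit, non-period text up to the measurement, so an earlier
# cue does not suppress a later, unrelated measurement.
PRIOR_CUE = re.compile(r"\b(?:previously|previous|prior)\b[^.\d]*$", re.IGNORECASE)

# Naive sentence splitter: break on whitespace that follows ., ! or ?.
# "4.4 mm" is not split because the period is followed by a digit.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def largest_current_nodule_mm(text):
    """Return the largest non-historical mm size in sentences mentioning
    'nodule', or None if there is no such measurement."""
    sizes = []
    for sentence in SENTENCE_SPLIT.split(text):
        if "nodule" not in sentence.lower():
            continue
        for m in MEASUREMENT.finditer(sentence):
            # Skip sizes introduced by a cue word such as "previously measured".
            if PRIOR_CUE.search(sentence[:m.start()]):
                continue
            sizes.append(float(m.group(1)))
    return max(sizes) if sizes else None


report = (
    "For example, the largest nodule which is located in the right lower "
    "lobe and currently measures 4.4 mm (image #82, series 3) previously "
    "measured 3.6 mm on 09/01/2011. Another nodule was found measuring 2.2 mm."
)
print(largest_current_nodule_mm(report))  # 4.4
```

Assuming the reports live in a pandas DataFrame with a `Result` column, as the `row['Result']` snippet suggests, the function could be applied per row with something like `df['Result'].apply(largest_current_nodule_mm)`.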