I am learning the ropes with regular expressions in Python. I have the code below:

import re

test = '"(Z101+Z102+Z1034+Z104)/4"'
regex = re.compile(r"[\(\+]([XYZ]\d\d\d)[\)\+]")
regex.findall(test)

It returns:

['Z101', 'Z104']

However, when I change 'Z101' to 'YZ101':

import re

test = '"(YZ101+Z102+Z1034+Z104)/4"'
regex = re.compile(r"[\(\+]([XYZ]\d\d\d)[\)\+]")
regex.findall(test)

It returns:

['Z102', 'Z104']

The purpose is to extract strings consisting of X, Y or Z followed by exactly three digits. The desired output for the first code would therefore be:

['Z101', 'Z102', 'Z104']

How can I fix the pattern to get the correct output?

  • The problem is very common: the left- and right-hand boundaries consume the text they match, so consecutive matches are not detected. Use lookarounds: r"(?<=[(+])([XYZ]\d\d\d)(?=[)+])" Commented May 8, 2021 at 10:36
  • Thank you, @WiktorStribiżew. The second comment is the exact solution and explanation I was looking for. Commented May 8, 2021 at 11:50

3 Answers

3

The left- and right-hand boundary patterns ([\(\+] and [\)\+]) consume the text they match, so consecutive matches are not detected: the + that ends one match has already been used up and cannot also start the next one.
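To see the consumption directly, here is a minimal sketch that prints what each match of the question's original pattern covers. The variable names (pattern, consumed) are only illustrative.

```python
import re

test = '"(Z101+Z102+Z1034+Z104)/4"'
pattern = re.compile(r"[\(\+]([XYZ]\d\d\d)[\)\+]")

# Collect (span, full match, captured group) for every match.
consumed = [(m.span(), m.group(0), m.group(1)) for m in pattern.finditer(test)]

for span, whole, code in consumed:
    print(span, repr(whole), "->", code)
# (1, 7) '(Z101+' -> Z101
# (17, 23) '+Z104)' -> Z104
```

The first match ends at index 7, just past the +, so the search resumes at the Z of Z102, where no ( or + is left to satisfy the leading character class.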

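You can solve the problem using lookarounds. Either of these works with findall (the first keeps the capturing group, the second drops it and uses \d{3}):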

r"(?<=[(+])([XYZ]\d\d\d)(?=[)+])"
r"(?<=[(+])[XYZ]\d{3}(?=[)+])"

Details

  • (?<=[(+]) - a positive lookbehind that matches a location that is immediately preceded with ( or +
  • [XYZ] - X, Y or Z
  • \d{3} - three digits
  • (?=[)+]) - a positive lookahead that makes sure there is ) or + immediately to the right of the current location.
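As a quick check, the sketch below runs the lookaround pattern against both strings from the question. The names lookaround, first and second are chosen here for illustration.

```python
import re

# Lookarounds assert the delimiters without consuming them.
lookaround = re.compile(r"(?<=[(+])[XYZ]\d{3}(?=[)+])")

first = '"(Z101+Z102+Z1034+Z104)/4"'
second = '"(YZ101+Z102+Z1034+Z104)/4"'

print(lookaround.findall(first))   # ['Z101', 'Z102', 'Z104']
print(lookaround.findall(second))  # ['Z102', 'Z104']
```

In the second string, Z101 is skipped because it is preceded by Y rather than ( or +, and Z1034 is skipped in both because a digit, not ) or +, follows the third digit.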

Note that the word boundary, \b, can solve this issue in some situations and might help here, too.

2

Use re.findall with the pattern [XYZ]\d{3}\b:

import re

test = '"(YZ101+Z102+Z1034+Z104)/4"'
matches = re.findall(r'[XYZ]\d{3}\b', test)
print(matches)  # ['Z101', 'Z102', 'Z104']

1

Your pattern is looking for:

  1. Either '(' or '+'
  2. Exactly one of 'X', 'Y', or 'Z'
  3. Exactly three numeric characters
  4. Either ')' or '+'

It's not selecting 'Z101' because, once you add 'Y', that substring is no longer immediately preceded by '(' or '+'.

One option would be to leave 1 and 4 out of the pattern, which gives r'[XYZ]\d\d\d'. In this example that finds 'Z101', 'Z102' and 'Z104', but it also picks up 'Z103' from 'Z1034', so depending on your data it can create a different problem.

Another option would be to allow an optional prefix character with '?'. As a quantifier, '?' means 'zero or one' (it can also modify other quantifiers, but that's a different topic). The pattern would then be r"[(+][XYZ]?([XYZ]\d\d\d)[)+]"
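Both suggestions are easy to try against the question's second string. This is only a sketch, and the variable names are made up for the check.

```python
import re

second = '"(YZ101+Z102+Z1034+Z104)/4"'

# Option 1: no delimiters at all, so any letter plus three digits matches.
no_boundaries = re.findall(r"[XYZ]\d\d\d", second)

# Option 2: optional prefix letter, but the delimiters are still consumed.
optional_prefix = re.findall(r"[(+][XYZ]?([XYZ]\d\d\d)[)+]", second)

print(no_boundaries)    # ['Z101', 'Z102', 'Z103', 'Z104']
print(optional_prefix)  # ['Z101', 'Z104']
```

The optional prefix lets 'Z101' through, but since the trailing '+' is still part of each match, 'Z102' is skipped for the same reason as in the original pattern.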

1 Comment

I added 'Y' on purpose so that 'Z101' would not be selected. However, the second code returns 'Z102' while the first did not, which confuses me. I tried the pattern r"[(+][XYZ]?([XYZ]\d\d\d)[)+]" and it yields the same result as above, still missing 'Z102'.
