python regex: capturing group within OR

Question

I'm using Python and the re module to parse some strings and extract a 4-digit code associated with a prefix. Here are two examples of strings I have to parse:

str1 = "random stuff tokenA1234 more stuff"
str2 = "whatever here tokenB5678 tokenA0123 and more there"

tokenA and tokenB are the prefixes, and 1234, 5678, 0123 are the digits I need to grab. tokenA and tokenB are just examples here. The prefix can be something like an address, http://domain.com/ (tokenA), or a string like Id: ('[Ii]d:?\s?') (tokenB).

My regex looks like:

re.findall('.*?(?:tokenA([0-9]{4})|tokenB([0-9]{4})).*?', str1)

When parsing the 2 strings above, I get:

[('1234','')]
[('','5678'),('0123','')]

I'd like to simply get ['1234'] and ['5678','0123'] instead of lists of tuples. How can I modify the regex to achieve that? Thanks in advance.

Wiktor Stribiżew · Accepted Answer · 2015-12-27 20:34:55Z


You get tuples as a result because your regex contains more than one capturing group. See the re.findall reference:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
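To make the quoted rule concrete, here is a minimal sketch of the three return shapes of re.findall. The sample string and the variable names are made up for illustration and are not from the question.

```python
import re

# Made-up sample string for illustration.
s = "tokenA1234 tokenB5678"

# No capturing group: findall returns the full text of each match.
no_group = re.findall(r'token[AB][0-9]{4}', s)

# One capturing group: findall returns a flat list of that group's text.
one_group = re.findall(r'token[AB]([0-9]{4})', s)

# Two capturing groups: findall returns a list of tuples, with ''
# for the group that did not take part in a given match.
two_groups = re.findall(r'tokenA([0-9]{4})|tokenB([0-9]{4})', s)

print(no_group)    # ['tokenA1234', 'tokenB5678']
print(one_group)   # ['1234', '5678']
print(two_groups)  # [('1234', ''), ('', '5678')]
```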

So, the solution is to use only one capturing group.

Only the tokens differ between your two alternatives; the ([0-9]{4}) part is common to both. So put the tokens into a non-capturing group with an alternation operator between them, and follow that group with a single capturing group for the digits:

(?:tokenA|tokenB)([0-9]{4})
^^^^^^^^^^^^^^^^^

The regex means:

(?:tokenA|tokenB) - match, but do not capture, tokenA or tokenB
([0-9]{4}) - match four digits and capture them into Group 1

Demo:

import re
s = "tokenA1234tokenB34567"
print(re.findall(r'(?:tokenA|tokenB)([0-9]{4})', s))

Result: ['1234', '3456']
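Applied to the two strings from the question, the same pattern can be sketched as follows. The leading and trailing .*? from the original regex are dropped here, since re.findall already scans the whole string for non-overlapping matches; the names pattern, codes1, and codes2 are only illustrative.

```python
import re

# One non-capturing group for the prefixes, one capturing group for the digits.
pattern = re.compile(r'(?:tokenA|tokenB)([0-9]{4})')

str1 = "random stuff tokenA1234 more stuff"
str2 = "whatever here tokenB5678 tokenA0123 and more there"

codes1 = pattern.findall(str1)
codes2 = pattern.findall(str2)

print(codes1)  # ['1234']
print(codes2)  # ['5678', '0123']
```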


Avinash Raj · Accepted Answer · 2015-12-27 16:11:01Z


Simply do this:

re.findall(r"token[AB](\d{4})", s)

[AB] is a character class, so it matches either A or B.
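A character class only helps when the prefixes differ by a single character. For prefixes like the ones the question mentions (an address such as http://domain.com/, or [Ii]d:?\s?), an alternation inside a non-capturing group is likely the better fit. The following is a rough sketch under that assumption; the sample text is invented for illustration.

```python
import re

# Prefixes taken from the examples in the question; the dot in the
# domain is escaped so that it matches a literal '.'.
pattern = re.compile(r'(?:http://domain\.com/|[Ii]d:?\s?)([0-9]{4})')

# Invented sample text, not from the question.
text = "see http://domain.com/1234 and Id: 5678 or id9012"

codes = pattern.findall(text)
print(codes)  # ['1234', '5678', '9012']
```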

edited Dec 27, 2015 at 16:11

answered Dec 27, 2015 at 16:09
