Regex for optional end-part of substring

Question

Consider the following (highly simplified) string:

'a b a b c a b c a b c'

This is a repeating pattern of 'a b c' except at the beginning where the 'c' is missing.

I seek a regular expression which can give me the following matches by the use of re.findall():

[('a', 'b'), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]

The string above thus has 4 matches of 'a b c', although the first match is a special case since the 'c' is missing.

My simplest attempt is where I try to capture 'a' and 'b' and use an optional capture for 'c':

re.findall(r'(a).*?(b).*?(c)?', 'a b a b c a b c a b c')

I get:

[('a', 'b', ''), ('a', 'b', ''), ('a', 'b', ''), ('a', 'b', '')]

Clearly, it has just ignored the c. When using a non-optional capture for 'c', the search skips ahead prematurely and misses 'a' and 'b' in the second 'a b c'-substring. This results in 3 wrong matches:

[('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
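
One way to see why the optional version never captures 'c' is to print the span of each match. The sketch below reuses the string and the first pattern from above; the names `attempt` and `spans` are only illustrative. The lazy `.*?` after `(b)` is satisfied with zero characters, so `(c)?` is tried at the space right after 'b', fails there, and settles for matching nothing.

```python
import re

s = 'a b a b c a b c a b c'
attempt = re.compile(r'(a).*?(b).*?(c)?')

# Record where each match starts and ends, plus the text it actually covers.
spans = [(m.span(), m.group(0)) for m in attempt.finditer(s)]
for span, text in spans:
    print(span, repr(text))
```

Every match covers only 'a b', so the 'c' that follows is never part of a match.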

I have tried several other techniques (for instance, '(?<=c)') to no avail.

Note: The string above is just a skeleton example of my "real-world" problem where the three letters above are themselves strings (from a long log-file) in between other strings and newlines from which I need to extract named groups.

I use Python 3.5.2 on Windows 7.

You need to remove the empty tuple elements "manually" after re.findall does its job.
– Wiktor Stribiżew, Commented Jul 29, 2016 at 16:42
Your example is so simplified that it makes it hard to provide a solid answer. I believe the problem lies with your use of the .*? wildcard in between a, b, and c. For starters, try using .+? instead so that the lazy operator doesn't cause it to match zero characters and start the pattern over again.
– CAustin, Commented Jul 29, 2016 at 17:34
This regex format works in R: ^ab|abc. Example: x = "ababcabcabc"; stringr::str_extract_all(x, "^ab|abc") returns [1] "ab" "abc" "abc" "abc". Not sure how that is implemented in Python.
– mindlessgreen, Commented Jul 29, 2016 at 20:06
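
For reference, a rough Python counterpart of that R call could look like the sketch below, using the space-free string from the comment. It only works when the incomplete occurrence sits at the very start of the string and nothing else appears between the parts, which is not the case for the real log file described in the question.

```python
import re

x = 'ababcabcabc'
# '^ab' can only match at the start of the string; everywhere else 'abc' is required.
matches = re.findall(r'^ab|abc', x)
print(matches)  # ['ab', 'abc', 'abc', 'abc']
```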

Wiktor Stribiżew · Accepted Answer · 2016-07-29 21:55:00Z

Since your a, b, and c are placeholders, and you cannot know whether they stand for single characters, character sequences, or anything else, you need a tempered greedy token to make sure the pattern does not overflow into the other matches in the same string. Since the c is optional, wrap it in an optional non-capturing group, (?:...)?:

(a)(?:(?!a|b).)*(b)(?:(?:(?!a|b|c).)*(c))?
   ^^^^^^^^^^^^^   ^^^ ^^^^^^^^^^^^^^    ^

See the regex demo

Details:

(a) - Group 1 capturing some a
(?:(?!a|b).)* - a tempered greedy token matching any char (other than a newline) that does not start an a or b sequence
(b) - Group 2 capturing some b
(?: - start of an optional non-capturing group, repeated 1 or 0 times
- (?:(?!a|b|c).)* - a tempered greedy token that matches any char (other than a newline) that does not start an a, b or c pattern
- (c) - Group 3 capturing some c pattern
)? - end of the optional non-capturing group.
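
Since the question mentions multi-character strings, newlines and named groups, the pattern above could be assembled from literal markers roughly as follows. This is only a sketch under assumptions: `build_pattern`, the marker strings START/MID/END and the sample `log` are all hypothetical, and `re.DOTALL` is added so that the tempered tokens can also cross newlines, which the pattern above does not do on its own.

```python
import re

def build_pattern(a, b, c):
    """Build the tempered-greedy-token pattern from three literal markers.

    a, b and c are hypothetical literal strings; re.escape keeps any regex
    metacharacters in them from being interpreted.
    """
    template = (
        r'(?P<a>{a})'
        r'(?:(?!{a}|{b}).)*'
        r'(?P<b>{b})'
        r'(?:(?:(?!{a}|{b}|{c}).)*(?P<c>{c}))?'
    )
    source = template.format(a=re.escape(a), b=re.escape(b), c=re.escape(c))
    # Assumption: DOTALL lets '.' match newlines, since the real input is a log file.
    return re.compile(source, re.DOTALL)

# Made-up log text: the first record has no END marker.
log = 'START x\nMID y\nSTART p\nMID q\nEND r\n'
pattern = build_pattern('START', 'MID', 'END')
rows = [m.groupdict() for m in pattern.finditer(log)]
print(rows)
```

With `finditer`, the group c is reported as None for the first record, because the optional group did not take part in that match.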

To obtain the tuple list you need, build it yourself with a list comprehension:

import re
r = r'(a)(?:(?!a|b).)*(b)(?:(?:(?!a|b|c).)*(c))?'
s = 'a b a b c a b c a b c'
# re.findall alone leaves an empty string where the optional c did not match:
# print(re.findall(r, s))
# [('a', 'b', ''), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
# Drop the empty third element to get the desired tuples:
print([(a, b, c) if c else (a, b) for a, b, c in re.findall(r, s)])
# [('a', 'b'), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]

See the Python demo
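
A possible variation, sketched here rather than taken from the demo, uses re.finditer instead of re.findall. Match.groups() reports a group that did not participate as None, so the filter does not depend on the captured text being non-empty. The names `pattern` and `result` are illustrative.

```python
import re

pattern = re.compile(r'(a)(?:(?!a|b).)*(b)(?:(?:(?!a|b|c).)*(c))?')
s = 'a b a b c a b c a b c'

# Keep only the groups that actually took part in each match.
result = [
    tuple(g for g in m.groups() if g is not None)
    for m in pattern.finditer(s)
]
print(result)
# [('a', 'b'), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
```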

2 Comments

O. Th. B. Over a year ago

Thanks. Regular expressions are good for simple things (at least). What I am trying to do cannot, I think, be done by finite state machine rules (it requires more branching logic it seems). I have just tried your approach and it still misses parts. I will seek another approach. Accepted because I learned something new :-)

Wiktor Stribiżew Over a year ago

Well, you only posted a very simplified sample that I tried to generalize as much as I could. Regexps require precision and need exact, precise requirements and specifications. Best regards.
