
Consider the following (highly simplified) string:

'a b a b c a b c a b c'

This is a repeating pattern of 'a b c' except at the beginning where the 'c' is missing.

I seek a regular expression which can give me the following matches by the use of re.findall():

[('a', 'b'), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]

The string above thus has 4 matches of 'a b c', with the first match as a special case since its 'c' is missing.

My simplest attempt is where I try to capture 'a' and 'b' and use an optional capture for 'c':

re.findall(r'(a).*?(b).*?(c)?', 'a b a b c a b c a b c')

I get:

[('a', 'b', ''), ('a', 'b', ''), ('a', 'b', ''), ('a', 'b', '')]

Clearly, it has just ignored the 'c'. When I use a non-optional capture for 'c' instead, the search skips ahead prematurely: the first match runs from the first 'a' to the first 'c' and swallows the 'a' and 'b' of the second substring. This yields only 3 matches, the first of which is wrong:

[('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
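
For reference, both attempts can be reproduced side by side. This is only a minimal sketch of the behaviour described above; the variable names `optional` and `mandatory` are made up for illustration:

```python
import re

s = 'a b a b c a b c a b c'

# Optional group: after (b), the lazy .*? is satisfied with zero characters,
# and (c)? then happily matches nothing, so each match stops right after 'b'.
optional = re.findall(r'(a).*?(b).*?(c)?', s)

# Mandatory group: the lazy .*? has to keep expanding until a 'c' shows up,
# so the first match spans 'a b a b c' and hides the initial 'a b'.
mandatory = re.findall(r'(a).*?(b).*?(c)', s)

print(optional)   # four tuples, each with an empty third element
print(mandatory)  # only three tuples
```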

I have tried several other techniques (for instance, '(?<=c)') to no avail.

Note: The string above is just a skeleton example of my "real-world" problem where the three letters above are themselves strings (from a long log-file) in between other strings and newlines from which I need to extract named groups.

I use Python 3.5.2 on Windows 7.

  • You need to remove the empty tuple elements "manually" after re.findall does its job. Commented Jul 29, 2016 at 16:42
  • Are you sure that you need regexes to parse your logs? Commented Jul 29, 2016 at 16:45
  • @WayneWerner Yes :) Absolutely necessary. Commented Jul 29, 2016 at 16:47
  • Your example is so simplified that it makes it hard to provide a solid answer. I believe the problem lies with your use of the .*? wildcard in between a, b, and c. For starters, try using .+? instead so that the lazy operator doesn't cause it to match zero characters and start the pattern over again. Commented Jul 29, 2016 at 17:34
  • This regex format works in R: ^ab|abc. Example: x = "ababcabcabc"; stringr::str_extract_all(x, "^ab|abc") gives [1] "ab" "abc" "abc" "abc". Not sure how that is implemented in Python. Commented Jul 29, 2016 at 20:06

1 Answer

Since your a, b, and c are placeholders that may stand for single characters, character sequences, or anything else, you need a tempered greedy token to keep each match from running over into the next one in the same string. Since the c is optional, wrap it in an optional non-capturing group, (?:...)?:

(a)(?:(?!a|b).)*(b)(?:(?:(?!a|b|c).)*(c))?
   ^^^^^^^^^^^^^   ^^^ ^^^^^^^^^^^^^^    ^

See the regex demo

Details:

  • (a) - Group 1 capturing some a
  • (?:(?!a|b).)* - a tempered greedy token matching any char (other than a newline) that does not start an a or b sequence
  • (b) - Group 2 capturing some b
  • (?: - start of an optional non-capturing group, repeated 1 or 0 times
    • (?:(?!a|b|c).)* - a tempered greedy token matching any char (other than a newline) that does not start an a, b or c pattern
    • (c) - Group 3 capturing some c pattern
  • )? - end of the optional non-capturing group.
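
When the placeholders are longer strings, the same pattern can be assembled from the real tokens. The following is a minimal sketch, not a definitive implementation: the helper `build_pattern` and the tokens START, MID and END are hypothetical, the tokens are assumed to be literal text (hence `re.escape`), and `re.DOTALL` is an assumption added so that the dot also crosses newlines:

```python
import re

def build_pattern(a, b, c):
    """Build the tempered-greedy-token pattern for three literal tokens.

    Hypothetical helper: assumes a, b and c are plain strings, not regexes.
    """
    a, b, c = (re.escape(t) for t in (a, b, c))
    template = r'({a})(?:(?!{a}|{b}).)*({b})(?:(?:(?!{a}|{b}|{c}).)*({c}))?'
    # re.DOTALL is an assumption: it lets '.' match newlines between tokens.
    return re.compile(template.format(a=a, b=b, c=c), re.DOTALL)

# Made-up log text: the first record has no END token.
log = 'START x MID\nSTART y MID z END\n'
pattern = build_pattern('START', 'MID', 'END')
matches = pattern.findall(log)
print(matches)
```

Without `re.DOTALL` the tokens of one record would have to sit on the same line, because the dot stops at a newline.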

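re.findall() always returns one element per capturing group, with an empty string for a group that did not participate in the match. To obtain the tuple list you need, build it yourself with a list comprehension: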

import re
r = r'(a)(?:(?!a|b).)*(b)(?:(?:(?!a|b|c).)*(c))?'
s = 'a b a b c a b c a b c'
# print(re.findall(r,s))
# That one is bad: [('a', 'b', ''), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
print([(a,b,c) if c else (a,b) for a,b,c in re.findall(r,s)])
# This one is good: [('a', 'b'), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]

See the Python demo
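
Since the question mentions named groups, the same idea can be sketched with re.finditer() and groupdict(), where a group that did not participate comes back as None rather than as an empty string. The group names a, b and c and the variable `records` are illustrative assumptions, not part of the original pattern:

```python
import re

# Same pattern as above, with the three capturing groups given names.
pattern = re.compile(
    r'(?P<a>a)(?:(?!a|b).)*(?P<b>b)(?:(?:(?!a|b|c).)*(?P<c>c))?'
)
s = 'a b a b c a b c a b c'

# groupdict() maps unmatched groups to None, so filter those entries out.
records = [
    {name: value for name, value in m.groupdict().items() if value is not None}
    for m in pattern.finditer(s)
]
print(records)
```

The first record then has only the keys a and b, while the remaining three also carry c.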


2 Comments

Thanks. Regular expressions are good for simple things (at least). What I am trying to do cannot, I think, be done by finite state machine rules (it requires more branching logic it seems). I have just tried your approach and it still misses parts. I will seek another approach. Accepted because I learned something new :-)
Well, you only posted a very simplified sample, which I tried to generalize as much as I could. Regexps require precision and need exact requirements and specifications. Best regards.
