Confusing Behaviour of regex in Python

Question

I'm trying to match a specific pattern using the re module in Python. I wish to match a full sentence (more precisely, alphanumeric string sequences separated by spaces and/or punctuation).

E.g.

"This is a regular sentence."
"this is also valid"
"so is This ONE"

I've tried out various combinations of regular expressions, but I am unable to properly grasp how the patterns work, with each expression giving me a different yet inexplicable result (I do admit I am a beginner, but still).

I've tried:

"((\w+)(\s?))*"

To the best of my knowledge, this should greedily match one or more alphanumerics followed by one or no whitespace character, and then greedily repeat that entire pattern. This is not what it seems to do, so clearly I am wrong, but I would like to know why. (I expected this to return the entire sentence as the result.) The result I get for the first sample string mentioned above is [('sentence', 'sentence', ''), ('', '', ''), ('', '', ''), ('', '', '')].

"(\w+ ?)*"

I'm not even sure how this one should work. The official documentation (Python's help('re')) says that *, + and ? match 0 or more, 1 or more, and 0 or 1 (greedy) repetitions of the preceding RE. In such a case, is the space alone the preceding RE for '?', or is '\w+ ' the preceding RE? And what will be the preceding RE for the '*' operator? The output I get with this is ['sentence'].

Others such as "(\w+\s?)+)", "((\w*)(\s??))", etc., which are basically variations of the same idea: the sentence is a set of alphanumerics followed by a single/finite number of whitespace characters, and this pattern is repeated over and over.

Can someone tell me where I go wrong and why, and why the above expressions do not work the way I was expecting them to?

P.S. I eventually got "[ \w]+" to work for me, but with this I cannot limit the number of consecutive whitespace characters.

How are you retrieving the results? I assume that you are using the capturing groups instead of the whole match (.group(0) or .group())?
– oxc, Commented Jul 6, 2012 at 23:36
@oxc No I'm using findall() for now. I don't really know how the .group() works exactly so I avoid using it.
– ffledgling, Commented Jul 6, 2012 at 23:52
I may be missing this detail somewhere, but can you tell me what the sentence boundary is? Is it multiple spaces or punctuation or ...? How do you know the difference between a word boundary and a sentence boundary?
– ChipJust, Commented Jul 7, 2012 at 15:10

Nolen Royalty · Accepted Answer · 2012-07-07 00:03:48Z

4

Your reasoning about the regex is correct; your problem comes from using capturing groups with * in findall. Here's an alternative:

>>> s="This is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['This ', 'is ', 'a ', 'regular ', 'sentence']

In this case it might make more sense for you to use \b in order to match word boundaries.

>>> re.findall(r'\w+\b', s)
['This', 'is', 'a', 'regular', 'sentence']

Alternatively, you can match the entire sentence via re.match and call .group(0) on the resulting match object to get the whole match:

>>> r = r"((\w+)(\s?))*"
>>> s = "This is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'This is a regular sentence'
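As a rough sketch of why the same pattern looks so different under findall (assuming Python 3.7 or later; the variable names grouped, whole and spans are illustrative, not from the question): when a pattern contains capturing groups, findall returns the groups rather than the whole match, and a group repeated by * only keeps its last iteration. Making the group non-capturing with (?:...) is one way to get the whole match back, and finditer is another.

```python
import re

s = "This is a regular sentence."

# With capturing groups, findall reports the groups, not the whole match.
# A group repeated by * only remembers its final iteration ("sentence").
grouped = re.findall(r"((\w+)(\s?))*", s)

# A non-capturing group makes findall report the whole match instead.
# Using + rather than * also avoids the empty matches that * allows.
whole = re.findall(r"(?:\w+\s?)+", s)

# finditer yields match objects, so group(0) is always available.
spans = [m.group(0) for m in re.finditer(r"(\w+\s?)+", s)]

print(grouped[0])  # ('sentence', 'sentence', '')
print(whole)       # ['This is a regular sentence']
print(spans)       # ['This is a regular sentence']
```

The remaining entries in grouped are empty tuples, produced where * matches zero repetitions at the period and at the end of the string.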

edited Jul 7, 2012 at 0:03

answered Jul 6, 2012 at 23:35

5 Comments

ffledgling Over a year ago

I'm looking to match the entire sentence as one regex instead of words. So the regex should return ['This is a regular sentence'].

ffledgling Over a year ago

It works, and my reasoning would seem to be correct if it does. But why does this not work with findall? This confuses me further. If my reasoning was correct, then why does the same RE fail to work, or give very different results, with findall? Are there some fundamental differences between findall and match?

Nolen Royalty Over a year ago

Have you taken a look at the regex documentation? To answer your question I would essentially be quoting the documentation on the functions you are asking about.

ffledgling Over a year ago

I went through the re man page and also the online HOWTO on docs.python.org, but other than the fact that match matches the regex at the beginning of a string and findall finds all non-overlapping occurrences, I didn't find anything about how or why there are differences in the actual matching.

Nolen Royalty Over a year ago

@Ayos my mistake, I was probably being too harsh. I will write up an explanation in a bit.

Alex W · Accepted Answer · 2012-07-06 23:34:46Z

3

Here's an awesome Regular Expression tutorial website:

http://regexone.com/

Here's a Regular Expression that will match the examples given:

([a-zA-Z0-9,\. ]+)
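A quick way to check this pattern against the question's samples, as a sketch only (the names sentence, samples and results are made up for illustration, and re.fullmatch needs Python 3.4 or later):

```python
import re

# Pattern from the answer above, compiled once for reuse.
sentence = re.compile(r"([a-zA-Z0-9,\. ]+)")

samples = [
    "This is a regular sentence.",
    "this is also valid",
    "so is This ONE",
]

# fullmatch succeeds only if the whole string fits the character class.
results = [bool(sentence.fullmatch(text)) for text in samples]
print(results)  # [True, True, True]

# The class accepts any number of consecutive spaces, so on its own
# it cannot flag doubled spaces.
print(bool(sentence.fullmatch("two  spaces")))  # True
```

Note that this shares the limitation mentioned in the question's P.S.: the character class does not restrict how many spaces appear in a row.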

answered Jul 6, 2012 at 23:34

sean · Accepted Answer · 2012-07-06 23:39:35Z

0

Why do you want to limit the number of consecutive whitespace characters? A sentence can contain any number of words (sequences of alphanumeric characters) and spaces in a row; what ends it is a punctuation mark, or more generally any character that is not alphanumeric or whitespace.

([a-zA-Z0-9\s])*

The above regex matches a run of zero or more alphanumeric or whitespace characters. You can refine it to be the following though:

([a-zA-Z0-9])([a-zA-Z0-9\s])*

This simply states that the sequence must begin with an alphanumeric character.
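To see how the refined pattern treats repeated spaces, here is a small sketch. The second pattern is not from this answer: it is a hypothetical stricter variant that only allows a single space between words, added purely for comparison, and the names text, loose and strict are illustrative.

```python
import re

text = "one two  three"  # note the double space before "three"

# Pattern from the answer: \s sits inside the repeated class,
# so runs of whitespace of any length are accepted.
loose = re.match(r"([a-zA-Z0-9])([a-zA-Z0-9\s])*", text)
print(loose.group(0))  # one two  three

# Hypothetical stricter variant: words joined by exactly one space.
# The non-capturing group keeps findall returning whole matches.
strict = re.findall(r"\w+(?: \w+)*", text)
print(strict)  # ['one two', 'three']
```

With the stricter variant, a double space splits the text into separate matches, which may be one way to detect where more than one space occurs.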

Hope this is what you were looking for.

answered Jul 6, 2012 at 23:39

1 Comment

ffledgling Over a year ago

I used the term "sentence" simply to give a general idea of what I'm working with. I specified exactly what I meant by a sentence in the question. Also, the application I'm using it for requires me to check the number of whitespace characters in between; if there is more than one, a different action needs to be taken. This answer does seem to suit my present needs. But can you tell me what the problem with the logic in my regex was?

ChipJust · Accepted Answer · 2012-07-07 15:42:23Z

0

Maybe this will help:

import re

source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one  followed by this one
"""

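# A sentence starts at any character that is not a space, newline, or period,
# and ends at a period, a newline, or a run of two or more spaces.
re_sentence = re.compile(r'[^ \n.].*?(\.|\n|  +)')

def main():
    # Print each match together with its index.
    for i, s in enumerate(re_sentence.finditer(source)):
        print("%d:%s" % (i, s.group(0)))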

if __name__ == '__main__':
    main()

I am using alternation in the expression (\.|\n|  +) to describe the end-of-sentence condition. Note the use of two spaces in the third alternation. The second space has the '+' meta-character so that two or more spaces in a row will be an end-of-sentence.
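To make the three terminators visible, the following sketch records which alternative ended each sentence. It assumes Python 3 and uses a shortened sample; the name pieces is illustrative only.

```python
import re

source = (
    "This is a regular sentence.\n"
    "this is also valid\n"
    "how about this one  followed by this one\n"
)

# Same pattern as above; the last alternative is two literal spaces
# with + on the second, i.e. "two or more spaces".
re_sentence = re.compile(r"[^ \n.].*?(\.|\n|  +)")

# group(0) is the whole sentence; group(1) is the terminator that ended it.
pieces = [(m.group(0).strip(), m.group(1)) for m in re_sentence.finditer(source)]
for piece in pieces:
    print(piece)
```

The third line of the sample is split in two at the double space, while single spaces never end a sentence.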

answered Jul 7, 2012 at 15:42
