python regex: capture parts of multiple strings that contain spaces

Question

I am trying to capture sub-strings from a string that looks similar to

'some string, another string, '

I want the result match group to be

('some string', 'another string')

my current solution

>>> from re import match
>>> match(2 * '(.*?), ', 'some string, another string, ').groups()
('some string', 'another string')

works, but is not practical. What I am showing here is of course massively reduced in complexity compared to what I'm doing in the real project; I want to use a single 'straight' (non-computed) regex pattern. Unfortunately, my attempts have failed so far:

This doesn't match (None as result), because {2} is applied to the space only, not to the whole string:

>>> match('.*?, {2}', 'some string, another string, ')

adding parentheses around the repeated string has the comma and space in the result

>>> match('(.*?, ){2}', 'some string, another string, ').groups()
('another string, ',)

adding another set of parentheses does fix that, but gets me too much:

>>> match('((.*?), ){2}', 'some string, another string, ').groups()
('another string, ', 'another string')

adding a non-capturing modifier improves the result, but still misses the first string

>>> match('(?:(.*?), ){2}', 'some string, another string, ').groups()
('another string',)

I feel like I'm close, but I can't really seem to find the proper way.

Can anyone help me? Are there any other approaches I'm not seeing?

Update after the first few responses:

First up, thank you very much everyone, your help is greatly appreciated! :-)

As I said in the original post, I have omitted a lot of complexity in my question for the sake of depicting the actual core problem. For starters, in the project I am working on, I am parsing large amounts of files (currently tens of thousands per day) in a number (currently 5, soon ~25, possibly in the hundreds later) of different line-based formats. There is also XML, JSON, binary and some other data file formats, but let's stay focussed.

In order to cope with the multitude of file formats and to exploit the fact that many of them are line-based, I have created a somewhat generic Python module that loads one file after the other, applies a regex to every line and returns a large data structure with the matches. This module is a prototype; the production version will require a C++ version for performance reasons, which will be connected over Boost::Python and will probably add the subject of regex dialects to the list of complexities.

Also, there are not 2 repetitions, but an amount varying between currently zero and 70 (or so), the comma is not always a comma and despite what I said originally, some parts of the regex pattern will have to be computed at runtime; let's just say I have reason to try and reduce the 'dynamic' amount and have as much 'fixed' pattern as possible.

So, in a word: I must use regular expressions.

Attempt to rephrase: I think the core of the problem boils down to: Is there a Python RegEx notation that e.g. involves curly braces repetitions and allows me to capture

'some string, another string, '

into

('some string', 'another string')

?

Hmmm, that probably narrows it down too far - but then, any way you do it is wrong :-D

Second attempt to rephrase: Why do I not see the first string ('some string') in the result? Why does the regex produce a match (indicating there must be 2 of something), but only return 1 string (the second one)?

The problem remains the same even if I use non-numeric repetition, i.e. using + instead of {2}:

>>> match('(?:(.*?), )+', 'some string, another string, ').groups()
('another string',)

Also, it's not the second string that's returned, it is the last one:

>>> match('(?:(.*?), )+', 'some string, another string, third string, ').groups()
('third string',)

Again, thanks for your help, never ceases to amaze me how helpful peer review is while trying to find out what I actually want to know...

It would help if you would say exactly what it is you want. If you want a regex that extracts the string "some stringanother string", that's simply not possible. Regex groupings are always contiguous; they're actually represented as the start and end point in the original string--there's no way to skip substrings within a grouping.
– Glenn Maynard, Commented Mar 1, 2011 at 21:19
Your example makes it look as though [s.strip() for s in mys.split(',') if s.strip()] will work. Is there more to this problem that's not coming across?
– senderle, Commented Mar 1, 2011 at 21:24
I asked what you want because you didn't explain the core problem--if you won't respond to questions asking for clarification, nobody can help you.
– Glenn Maynard, Commented Mar 1, 2011 at 22:15
sorry, my comment you are responding to was accidental, so i deleted it; see post update for clarification
– ssc, Commented Mar 1, 2011 at 22:39
You can't use brace matching to produce more than one grouping. Regex groupings are an exact one-to-one match to each unquoted open parenthesis in the expression. If a whole grouping is evaluated more than once, the result is the final time it matched, which is why you're seeing the last result.
– Glenn Maynard, Commented Mar 1, 2011 at 23:32
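
The behaviour described in the last comment can be checked with a minimal sketch that uses only the standard `re` module. The variable names are illustrative and not part of the original thread: a capturing group inside a repeated group is a single slot, so every repetition overwrites it and only the final repetition is kept.

```python
import re

text = 'some string, another string, third string, '

# One capturing group, repeated three times by the quantifier.
m = re.match(r'(?:(.*?), ){3}', text)

# Only one slot exists for group 1, so it holds the last repetition.
last = m.group(1)

# The match object stores a single (start, end) pair for that group.
start, end = m.span(1)
captured = text[start:end]

# The compiled pattern reports one group, however often it repeats.
group_count = m.re.groups

print(last, group_count)
```

The group count is fixed by the pattern text, not by how many times the quantifier loops, which is why `groups()` can never return more entries than there are opening capture parentheses.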

senderle · Accepted Answer · 2011-03-02 04:33:30Z

5

Unless there's much more to this problem than you've explained, I don't see the point in using regexes. This is very simple to deal with using basic string methods:

[s.strip() for s in mys.split(',') if s.strip()]

Or if it has to be a tuple:

tuple(s.strip() for s in mys.split(',') if s.strip())

The code is more readable too. Please tell me if this fails to apply.

EDIT: Ok, there is indeed more to this problem than it initially seemed. Leaving this for historical purposes though. (Guess I'm not 'disciplined' :) )

edited Mar 2, 2011 at 4:33

answered Mar 1, 2011 at 21:38

senderle


1 Comment

dappawit Over a year ago

I agree. @ssc: unless you have to use regex for some reason, go this route.

phooji · Accepted Answer · 2011-03-01 22:58:11Z

4

As described, I think this regex works fine:

import re
thepattern = re.compile("(.+?)(?:,|$)") # lazy non-empty match 
thepattern.findall("a, b, asdf, d")     # until comma or end of line
# Result:
Out[19]: ['a', ' b', ' asdf', ' d']

The key here is to use findall rather than match. The phrasing of your question suggests you prefer match, but it isn't the right tool for the job here -- it is designed to return exactly one string for each corresponding group ( ) in the regex. Since your 'number of strings' is variable, the right approach is to use either findall or split.

If this isn't what you need, then please make the question more specific.
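
As a rough sketch of the `split` alternative mentioned above (assuming the delimiter is a comma with optional surrounding whitespace, which may not hold for the real data), `re.split` returns the fields directly and avoids the leading spaces seen in the `findall` output:

```python
import re

line = "a, b, asdf, d"

# Split on a comma plus any surrounding whitespace instead of matching fields.
fields = re.split(r'\s*,\s*', line.strip())

# A trailing delimiter leaves an empty last field, so filter empties out.
trailing = [f for f in re.split(r'\s*,\s*', 'some string, another string, ') if f]

print(fields)
print(trailing)
```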

Edit: And if you must use tuples rather than lists:

tuple(Out[19])
# Result
Out[20]: ('a', ' b', ' asdf', ' d')

edited Mar 1, 2011 at 22:58

answered Mar 1, 2011 at 22:49

phooji


2 Comments

Alan Moore Over a year ago

+1: With your match approach, the captured content gets overwritten with each repetition of the capturing group; that's just how regexes work. You need an approach like findall or split that applies the regex multiple times and collects the matches for you.

ssc Over a year ago

Thanks for that! Unfortunately, I can't use findall as the string from the initial question is only a part of the problem, the real string is a lot longer, so findall only works if I do multiple regex findalls / matches / searches.
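
One possible way around this limitation is a two-pass approach: an outer regex isolates the repeated section as a single group, and `findall` then runs on that group only. The sketch below is hypothetical. The line layout (`<id>: <item>, <item>, ... <tail>`) and the names `OUTER`, `INNER` and `parse_line` are assumptions made for illustration, since the real line format is not shown in the thread.

```python
import re

# Hypothetical line layout: "<id>: <item>, <item>, ... <tail>"
OUTER = re.compile(r'(?P<id>\w+): (?P<items>(?:[^,]*, )*)(?P<tail>\w+)$')
INNER = re.compile(r'(.*?), ')


def parse_line(line):
    """Return (id, [items], tail), or None if the line does not match."""
    m = OUTER.match(line)
    if m is None:
        return None
    # Second pass: split the repeated section that the outer regex isolated.
    items = INNER.findall(m.group('items'))
    return m.group('id'), items, m.group('tail')


print(parse_line('ID42: some string, another string, END'))
print(parse_line('ID7: END'))
```

Both patterns stay fixed (non-computed), and the number of repetitions can vary from zero upwards, at the cost of one extra regex call per line.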

dappawit · Accepted Answer · 2011-03-01 21:27:45Z

2

import re

regex = " *((?:[^, ]| +[^, ])+) *, *((?:[^, ]| +[^, ])+) *, *"

print(re.match(regex, 'some string, another string, ').groups())
# ('some string', 'another string')
print(re.match(regex, ' some string, another string, ').groups())
# ('some string', 'another string')
print(re.match(regex, ' some string , another string, ').groups())
# ('some string', 'another string')

answered Mar 1, 2011 at 21:27

dappawit


Alan Moore · Accepted Answer · 2011-03-11 06:02:48Z

1

No offense, but you obviously have a lot to learn about regexes, and what you're going to learn, ultimately, is that regexes can't handle this job. I'm sure this particular task is doable with regexes, but then what? You say you have potentially hundreds of different file formats to parse! You even mentioned JSON and XML, which are fundamentally incompatible with regexes.

Do yourself a favor: forget about regexes and learn pyparsing instead. Or skip Python entirely and use a standalone parser generator like ANTLR. In either case, you'll probably find that grammars for most of your file formats have already been written.

edited Mar 11, 2011 at 6:02

answered Mar 2, 2011 at 1:16

Alan Moore


1 Comment

ssc Over a year ago

no offense taken. thanks for the pyparsing link, hadn't heard of it before. thanks also for showing me that judging someone's skills based on a single question can be perceived as rather premature. please also observe i never said i want to handle all formats with regexes; there are magnificent xml and json parsers out there and i have no intention of re-inventing any wheels. most formats are line-based though, so proper patterns take care of the vast majority of cases already.

eyquem · Accepted Answer · 2011-03-11 09:56:50Z

0

I think the core of the problem boils down to: Is there a Python RegEx notation that e.g. involves curly braces repetitions and allows me to capture 'some string, another string, ' ?

I don't think there is such a notation.

But regexes are not only a matter of NOTATION, that is to say the RE string used to define a regex. They are also a matter of TOOLS, that is to say functions.

Unfortunately, I can't use findall as the string from the initial question is only a part of the problem, the real string is a lot longer, so findall only works if I do multiple regex findalls / matches / searches.

You should have given more information up front, so that we could understand the constraints sooner. In my opinion, for the problem as it has been stated, findall() is indeed fine:

import re

for line in ('string one, string two, ',
             'some string, another string, third string, ',
             # the following two lines are only one string
             'Topaz, Turquoise, Moss Agate, Obsidian, '
             'Tigers-Eye, Tourmaline, Lapis Lazuli, '):

    print(re.findall('(.+?), *', line))

Result

['string one', 'string two']
['some string', 'another string', 'third string']
['Topaz', 'Turquoise', 'Moss Agate', 'Obsidian', 'Tigers-Eye', 'Tourmaline', 'Lapis Lazuli']

Now, since you "have omitted a lot of complexity" in your question, findall() may turn out to be insufficient to handle that complexity. In that case finditer() can be used, because it allows more flexibility in selecting the groups of each match:

import re

for line in ('string one, string two, ',
             'some string, another string, third string, ',
             # the following two lines are only one string
             'Topaz, Turquoise, Moss Agate, Obsidian, '
             'Tigers-Eye, Tourmaline, Lapis Lazuli, '):

    print([mat.group(1) for mat in re.finditer('(.+?), *', line)])

This gives the same result, and can be extended by writing another expression in place of mat.group(1).
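
As an illustration of what such an extension might look like (the group names `field` and `sep` are made up for this sketch and do not appear in the original answer), each match object from `finditer` can expose the field, its position in the line, and the delimiter that followed it:

```python
import re

line = 'Topaz, Turquoise, Moss Agate, '

# Name the field and its delimiter so each match exposes both.
pattern = re.compile(r'(?P<field>.+?)(?P<sep>, *)')

# Collect (field text, start offset of the field, delimiter text) per match.
records = [
    (mat.group('field'), mat.start('field'), mat.group('sep'))
    for mat in pattern.finditer(line)
]

for field, start, sep in records:
    print(field, start, repr(sep))
```

Because every iteration yields a full match object, the same loop could also branch on which named group matched if the delimiter is not always a comma.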

answered Mar 11, 2011 at 9:56

eyquem


5 Comments

ssc Over a year ago

You are correct in what you are saying, but all your sample strings are homogeneous, therefore findall() and finditer() do the trick; in the real-world strings, the 'string one', 'string two' repetitions are just a fragment of the whole regex; that's what I meant by 'omitted complexity'.

ssc Over a year ago

Please also understand that in order to get a solution for my problem and at the same time create universally usable information, I need to find the right amount of detail about my problem: If I put in too much, possibly irrelevant and distracting information, readers will perceive this as a very special case; I may have put in too little information about constraints at first, but in the end, at least I was able to figure out what detail of Python regexes I did not know, so maybe this article is useful to someone after all...

eyquem Over a year ago

@ssc What do you call homogeneous? 'string one' and 'string two' are not fragments of a regex, they are fragments of a string that you want to analyze. What do you mean?

eyquem Over a year ago

@ssc Instead of giving simplified strings, isolated from their contextual file, and then implying that all our solutions don't work because the real thing is so much more complicated, you should give one or several chunks of the file contents that you want to analyze. We would then see something concrete, instead of talking in the abstract.

eyquem Over a year ago

@ssc In my opinion, the length of your question is a sign that there is a weird twist in your approach to the problem (and to the solutions, by the way).

ssc · Accepted Answer · 2011-03-10 12:50:15Z

-1

In order to sum this up, it seems I am already using the best solution by constructing the regex pattern in a 'dynamic' manner:

>>> from re import match
>>> match(2 * '(.*?), ', 'some string, another string, ').groups()
('some string', 'another string')

the

2 * '(.*?), '

is what I mean by dynamic. The alternative approach

>>> match('(?:(.*?), ){2}', 'some string, another string, ').groups()
('another string',)

fails to return the desired result due to the fact that (as Glenn and Alan kindly explained)

with match, the captured content gets overwritten with each repetition of the capturing group

Thanks for your help everyone! :-)
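
A minimal sketch of how the hard-coded 2 could be replaced by a per-line count, assuming every field ends with the same known separator. The helper name `capture_fields` and its `sep` parameter are hypothetical and not from the original post:

```python
import re


def capture_fields(line, sep=', '):
    """Build one capturing group per separator found in the line."""
    # Count the separators so the pattern gets exactly that many groups.
    n = line.count(sep)
    pattern = n * ('(.*?)' + re.escape(sep))
    m = re.match(pattern, line)
    return m.groups() if m else None


print(capture_fields('some string, another string, '))
print(capture_fields('some string, another string, third string, '))
print(capture_fields(''))
```

This only removes the fixed repetition count; the pattern is still computed for each line, so it does not provide the single non-computed pattern the question originally asked for.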

answered Mar 10, 2011 at 12:50

ssc


4 Comments

eyquem Over a year ago

You accept your own answer?? But it doesn't address the constraint you personally described: "Also, there are not 2 repetitions, but an amount varying between currently zero and 70 (or so)". Did you really mean it when you wrote "how helpful peer review is while trying to find out what I actually want to know..."?

ssc Over a year ago

I basically just summarized the essence of the responses in my answer and accepted that in order to 1. 'close the case' and 2. provide easily accessible information for anyone with a similar problem.

ssc Over a year ago

jeee, I keep hitting enter too quickly when writing comments. The '2' in the code can easily be replaced by a variable, so the varying amount is taken care of.

eyquem Over a year ago

"The '2' in the code can easily be replaced by a variable" I am eager to see the code! Well, let me help you: you mean something like this? '(?:(.*?), ){%s}' % n with n being the number of groups to catch in a line. Yes, yes,... yer...er,er,er... you are sure? How do you manage to change n for each line so that all the groups of a particular line are caught? Are all the lines in a given file of the same format?
