Regex matching between two strings?

Question

I can't seem to find a way to extract all of the comments in the following example.

>>> import re
>>> string = '''
... <!-- one 
... -->
... <!-- two -- -- -->
... <!-- three -->
... '''
>>> m = re.findall ( '<!--([^\(-->)]+)-->', string, re.MULTILINE)
>>> m
[' one \n', ' three ']

The block with two -- -- is not matched, most likely because of a bad regex. Can someone please point me in the right direction on how to extract the text between two delimiter strings?

Update: I've tested what was suggested in the comments. Here is a working solution with a small tweak.

>>> m = re.findall ( '<!--(.*?)-->', string, re.MULTILINE)
>>> m
[' two -- -- ', ' three ']
>>> m = re.findall ( '<!--(.*\n?)-->', string, re.MULTILINE)
>>> m
[' one \n', ' two -- -- ', ' three ']

thanks!

Part of the problem is that anything between the [] matches a single character, so (-->) inside it is not treated as a group.
– Joran Beasley, Commented Oct 4, 2012 at 21:20
re.findall('<!--(.*?)-->', string, re.DOTALL) should do. You don't need ^\(-->) here, because the question mark makes it non-greedy.
– BrtH, Commented Oct 4, 2012 at 21:21
It looks like you want just the words? If so, what's wrong with m = re.findall('[\w]+', string, re.MULTILINE)? Also, string is a really bad name for a, um, string.
– Ben, Commented Oct 4, 2012 at 21:24
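
To illustrate the first comment above: inside [], the text \(-->) is read as a set of individual characters (with \(-- forming a range from ( to -), not as the sequence -->. The sketch below reuses the pattern from the question; the names text and pattern are only for illustration, and the warnings filter is there because some Python versions warn about "--" inside a set.

```python
import re
import warnings

text = "\n<!-- one \n-->\n<!-- two -- -- -->\n<!-- three -->\n"

with warnings.catch_warnings():
    # Some Python versions emit a FutureWarning for "--" inside a set.
    warnings.simplefilter("ignore")
    pattern = re.compile(r'<!--([^\(-->)]+)-->')

# The negated set excludes every hyphen, so any comment whose body
# contains "-" can never match.
found = pattern.findall(text)
print(found)                             # [' one \n', ' three ']

# A single hyphen in the body is enough to prevent a match.
single = pattern.findall('<!-- a-b -->')
print(single)                            # []
```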

iruvar · Accepted Answer · 2012-10-05 11:55:25Z

37

This should do the trick:

 m = re.findall ( '<!--(.*?)-->', string, re.DOTALL)
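
As a minimal, self-contained sketch of why the flag matters, the snippet below runs the same pattern with re.MULTILINE and with re.DOTALL on the sample from the question. The variable names text, with_multiline and comments are just for this example.

```python
import re

text = "\n<!-- one \n-->\n<!-- two -- -- -->\n<!-- three -->\n"

# re.MULTILINE only changes how ^ and $ behave; "." still stops at a
# newline, so the comment that spans two lines is missed.
with_multiline = re.findall(r'<!--(.*?)-->', text, re.MULTILINE)
print(with_multiline)   # [' two -- -- ', ' three ']

# re.DOTALL lets "." match newlines as well, so all three are found.
comments = re.findall(r'<!--(.*?)-->', text, re.DOTALL)
print(comments)         # [' one \n', ' two -- -- ', ' three ']
```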

edited Oct 5, 2012 at 11:55

answered Oct 4, 2012 at 21:24

iruvar


2 Comments

Niko Fohr Over a year ago

In case anyone is wondering, the re.DOTALL flag makes the dot (.) match any character, including a newline. The parentheses in (.*?) capture the text between the delimiters, and .*? is the non-greedy version of .* (i.e. it matches as little as possible).

Wiktor Stribiżew Over a year ago

If <!-- and --> should be part of the resulting list items, the capturing parentheses should be removed: re.findall('<!--.*?-->', string, re.DOTALL)
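
The difference between greedy and non-greedy matching, and the effect of dropping the capturing group, can be seen in a short sketch. The input string and the variable names here are made up for illustration.

```python
import re

text = "<!-- a --> x <!-- b -->"

# Greedy: .* runs to the last "-->" and swallows everything in between.
greedy = re.findall(r'<!--(.*)-->', text, re.DOTALL)
print(greedy)   # [' a --> x <!-- b ']

# Non-greedy: .*? stops at the first "-->" it can reach.
lazy = re.findall(r'<!--(.*?)-->', text, re.DOTALL)
print(lazy)     # [' a ', ' b ']

# Without a capturing group, findall returns the whole match,
# delimiters included.
whole = re.findall(r'<!--.*?-->', text, re.DOTALL)
print(whole)    # ['<!-- a -->', '<!-- b -->']
```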

Wilduck · Accepted Answer · 2012-10-04 23:47:22Z

3

In general, it is impossible to do arbitrary matching between two delimiters with a regular grammar.

Specifically, if you allow nesting,

<!-- how do you deal <!-- with nested --> comments? -->

you'll run into issues. So, while you may be able to solve this specific problem with a regular expression, any regular expression you write can be broken by some other strange nesting of comments.

To parse arbitrarily nested comments, you'll need a method for parsing context-free grammars. A simple way to do so is to use a pushdown automaton.
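
As a rough sketch of the pushdown idea, the function below uses a regex only to find the delimiters and keeps an explicit stack to track nesting. The name top_level_comments and its parameters are hypothetical, and the handling of unbalanced input (stray closers are ignored, unclosed openers are dropped) is an assumption made for this example rather than anything prescribed.

```python
import re


def top_level_comments(text, open_tok='<!--', close_tok='-->'):
    """Return the bodies of the outermost comments, allowing nesting."""
    token = re.compile('|'.join(re.escape(t) for t in (open_tok, close_tok)))
    results = []
    stack = []  # start offsets of the comment bodies that are still open
    for m in token.finditer(text):
        if m.group() == open_tok:
            stack.append(m.end())
        elif stack:  # a closer with nothing open is ignored
            start = stack.pop()
            if not stack:  # an outermost comment was just closed
                results.append(text[start:m.start()])
    return results


nested = '<!-- how do you deal <!-- with nested --> comments? -->'
outer = top_level_comments(nested)
print(outer)  # [' how do you deal <!-- with nested --> comments? ']

flat = top_level_comments('<!-- a --><!-- b -->')
print(flat)   # [' a ', ' b ']
```

The stack is what a single regular expression lacks: it remembers how many openers are still waiting for a closer.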

edited Oct 4, 2012 at 23:47

answered Oct 4, 2012 at 21:25

Wilduck


4 Comments

Anuj Gupta Over a year ago

I don't think nested comments are all that common. Kinda defeats the point of commenting if anything inside it is processed?

Wilduck Over a year ago

And it looks like they're not possible in HTML. stackoverflow.com/questions/442786/… I'm going to leave this here, because I think it's important to recognize, but I don't expect any upvotes.

James Thiele Over a year ago

Finite state machines cannot parse context-free grammars; you could use pushdown automata.

Wilduck Over a year ago

@JamesThiele Ahhhhh, of course. I've edited the answer to reflect this
