1

My previous example was not clear, so here is another one:

a = '123 - 48 <!-- 456 - 251 - --> 452 - 348'

If I do something like:

[el for el in re.split(r' - ',a)]

I get:

['123', '48 <!-- 456', '251', '--> 452', '348']

But I want this:

['123', '48 <!-- 456 - 251 - --> 452', '348']

Thanks...

2
  • Which Python version gives you that result? In my experience, el is a string in a list comprehension, as opposed to using dict(...). Commented Oct 3, 2011 at 16:53
  • OK, regarding the update: I still consider non-capturing groups with filter one of the fastest solutions, especially for longer text. (Don't forget to pick an answer.) Commented Oct 4, 2011 at 12:25

3 Answers

5

First remove the comments using something like this:

re.sub("<!--.*?-->", "", your_string)

then use your regex to extract numbers.

You could also use a negative lookahead assertion, (?!...), but that would not be as simple.
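A minimal sketch of the two-step approach above, run against the string from the question. The extraction pattern `\d+` and the names `without_comments` and `numbers` are assumptions for illustration; `re.DOTALL` is added on the assumption that a comment may span several lines.

```python
import re

a = '123 - 48 <!-- 456 - 251 - --> 452 - 348'

# Step 1: drop every <!-- ... --> comment. The non-greedy .*? makes each
# comment end at its own closing -->.
without_comments = re.sub(r'<!--.*?-->', '', a, flags=re.DOTALL)

# Step 2: extract the numbers that remain.
numbers = re.findall(r'\d+', without_comments)
print(numbers)  # ['123', '48', '452', '348']
```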


0

If you want a single regex, you could use something like:

(\d+)(?!(?:[^<]+|<(?!!--))*-->)

This works as long as there are no stray --> sequences outside a comment.

It matches numbers that are not followed by a --> unless a <!-- comes in between.
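A short sketch of how this pattern behaves on the string from the question. The second half, which attaches the same lookahead to the ` - ` separator to get the split the question asks for, is an adaptation and an assumption, not part of the answer above; the names `not_in_comment`, `numbers`, and `parts` are illustrative.

```python
import re

a = '123 - 48 <!-- 456 - 251 - --> 452 - 348'

# A position counts as "inside a comment" if a --> can be reached from it
# without passing an opening <!-- first. The negative lookahead rejects
# exactly those positions.
not_in_comment = r'(?!(?:[^<]+|<(?!!--))*-->)'

# The pattern from the answer: numbers outside comments.
numbers = re.findall(r'(\d+)' + not_in_comment, a)
print(numbers)  # ['123', '48', '452', '348']

# Assumption: the same lookahead applied to the ' - ' separator, so that
# separators inside a comment are not used as split points.
parts = re.split(r' - ' + not_in_comment, a)
print(parts)  # ['123', '48 <!-- 456 - 251 - --> 452', '348']
```

Note that the nested quantifiers in the lookahead can backtrack heavily, so this is likely to be slow on long inputs.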

2 Comments

It works, but it is incredibly slow (Python 2.7), even for a string of length ~100.
If it supports atomic groups or possessive quantifiers you could try (\d+)(?!(?:[^<-]++|<(?!!--)|-(?!->))*+-->)

-1

The result you posted is what re.findall('(\d+)', a) returns. To skip the numbers inside the comment, match the comment in a non-capturing group:

re.findall(r'(?:\<\!--.+\d+.+--\>)|(\d+)', a)

['123', '48', '', '452', '348']

list(filter(None, re.findall(r'(?:\<\!--.+\d+.+--\>)|(\d+)', a)))

['123', '48', '452', '348']
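A hedged variant of the same idea, as a sketch: the comment alternative is written non-greedily as `<!--.*?-->`, so that short comments such as `<!--2-->` and multiple comments on one line are also skipped. This pattern and the helper name `numbers_outside_comments` are assumptions for illustration, not the pattern from the answer.

```python
import re

def numbers_outside_comments(text):
    # Each match is either a whole comment (the group stays empty, so
    # findall yields '') or a number captured by the group.
    matches = re.findall(r'<!--.*?-->|(\d+)', text, flags=re.DOTALL)
    # Drop the empty strings produced by the comment matches.
    return [m for m in matches if m]

print(numbers_outside_comments('123 - 48 <!-- 456 - 251 - --> 452 - 348'))
# ['123', '48', '452', '348']
print(numbers_outside_comments('1 <!--2--> 3'))
# ['1', '3']
```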

1 Comment

1 -- 2 -- 3, 1 <!--2--> 3, couple of examples that would not work.
