python regex replace only part of NOT match

Question

I have many html codes which have <pre> python code </pre>, just like following

html code:

<pre class="c1">
# regex usage
import re
re.findall(r'abc','abcde')
</pre>

python tutorial ...python regex<br>

<pre class="c2">
# regex usage
import re
re.findall(r'abc','abcde')
</pre>

I regard regex as a keyword, and replace it to a link: <a href="link-to-regex">regex</a>,but I don't want to replace contents in label <pre>

output:

<pre class="c1">
# regex usage
import re
re.findall(r'abc','abcde')
</pre>

python tutorial ...python <a href="link-to-regex">regex</a><br>

<pre class="c2">
# regex usage
import re
re.findall(r'abc','abcde')
</pre>

I do it use placeholders

pre_list = re.compile(r'(<pre>.+?</pre>)').findall(html_code)

# use CODE_PLACEHODER to protect code sources
for index,code in enumerate(pre_list):
    html_code = html_code.replace(code, 'CODE_PLACEHOLDER_{}'.format(index))

# replace the html content here
html_code = html_code.replace('regex', '<a href="link-to-regex">regex</a>')

for index,code in enumerate(pre_list):
    html_code = html_code.replace('CODE_PLACEHOLDER_{}'.format(index), code)
    enter code here

Better method to do this?

It doesn't matter, just how to 'protect' the content in <pre> and replace others. — WeizhongTu
– WeizhongTu, Commented Oct 21, 2014 at 8:26

Avinash Raj · Accepted Answer · 2014-10-21 08:56:35Z

1

Use positive lookaround assertions to match the string regex which is not present inside the <pre> tag. And don't forget to enable DOTALL modifier.

>>> import re
>>> s = """<pre>
# regex usage
import re
re.findall(r'abc','abcde')
</pre>

python tutorial ...python regex<br>
<pre>
# regex usage
import re
re.findall(r'abc','abcde')
</pre>"""
>>> m = re.sub(r'(?s)regex(?!(?:(?!<\/?pre[^<>]*>).)*<\/pre>)', r'<a href="link-to-regex">regex</a>', s)
>>> print m
<pre>
# regex usage
import re
re.findall(r'abc','abcde')
</pre>

python tutorial ...python <a href="link-to-regex">regex</a><br>
<pre>
# regex usage
import re
re.findall(r'abc','abcde')
</pre>

DEMO

edited Oct 21, 2014 at 8:56

answered Oct 21, 2014 at 8:28

Avinash Raj

175k32 gold badges247 silver badges289 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

WeizhongTu Over a year ago

It doesn't work well if have more <pre> in html code:regex101.com/r/rQ6mK9/4

Avinash Raj Over a year ago

@WeizhongTu but in your example, you put the ending pre tag as <pre>

WeizhongTu Over a year ago

But in fact, it is more difficult, every <pre> has a class, and they are may be different, regex101.com/r/rQ6mK9/6

Avinash Raj Over a year ago

it's soo easy see regex101.com/r/rQ6mK9/7 , use [^<>]* instead of .* in opening pre tag.

Avinash Raj Over a year ago

even better to use this regex(?!(?:(?!<\/?pre[^<>]*>).)*<\/pre>)

vks · Accepted Answer · 2014-10-21 08:54:24Z

0

regex(?=(?:((?!<pre[^>]*>|<\/pre>).)*<pre[^>]*>(?:(?!<\/pre>).)*<\/pre>)*(?:(?!<pre[^>]*>|<\/pre>).)*$)

Try this.See demo.

http://regex101.com/r/rQ6mK9/8

answered Oct 21, 2014 at 8:54

vks

68.1k11 gold badges96 silver badges132 bronze badges

2 Comments

WeizhongTu Over a year ago

When I add the sentence HTML tutoial: about <code><pre></code>, it doesn't work

PM 2Ring Over a year ago

But you didn't have <code><pre></code> in your question, so how can you expect people to know that the regex has to deal with stuff like that?? Perhaps you should be using a proper HTML / XML parser like lxml, rather than going through the frustration of trying to do this with regexes.

Collectives™ on Stack Overflow

python regex replace only part of NOT match

2 Answers 2

5 Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

5 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related