0

I have many html codes which have <pre> python code </pre>, just like following

html code:

<pre class="c1">
# regex usage
import re
re.findall(r'abc','abcde')
</pre>

python tutorial ...python regex<br>

<pre class="c2">
# regex usage
import re
re.findall(r'abc','abcde')
</pre>

I regard regex as a keyword, and replace it to a link: <a href="link-to-regex">regex</a>,but I don't want to replace contents in label <pre>

output:

<pre class="c1">
# regex usage
import re
re.findall(r'abc','abcde')
</pre>

python tutorial ...python <a href="link-to-regex">regex</a><br>

<pre class="c2">
# regex usage
import re
re.findall(r'abc','abcde')
</pre>

I do it use placeholders

pre_list = re.compile(r'(<pre>.+?</pre>)').findall(html_code)

# use CODE_PLACEHODER to protect code sources
for index,code in enumerate(pre_list):
    html_code = html_code.replace(code, 'CODE_PLACEHOLDER_{}'.format(index))

# replace the html content here
html_code = html_code.replace('regex', '<a href="link-to-regex">regex</a>')

for index,code in enumerate(pre_list):
    html_code = html_code.replace('CODE_PLACEHOLDER_{}'.format(index), code)
    enter code here

Better method to do this?

2
  • What's your expected output? Is this a html file? Commented Oct 21, 2014 at 8:23
  • It doesn't matter, just how to 'protect' the content in <pre> and replace others. Commented Oct 21, 2014 at 8:26

2 Answers 2

1

Use positive lookaround assertions to match the string regex which is not present inside the <pre> tag. And don't forget to enable DOTALL modifier.

>>> import re
>>> s = """<pre>
# regex usage
import re
re.findall(r'abc','abcde')
</pre>

python tutorial ...python regex<br>
<pre>
# regex usage
import re
re.findall(r'abc','abcde')
</pre>"""
>>> m = re.sub(r'(?s)regex(?!(?:(?!<\/?pre[^<>]*>).)*<\/pre>)', r'<a href="link-to-regex">regex</a>', s)
>>> print m
<pre>
# regex usage
import re
re.findall(r'abc','abcde')
</pre>

python tutorial ...python <a href="link-to-regex">regex</a><br>
<pre>
# regex usage
import re
re.findall(r'abc','abcde')
</pre>

DEMO

Sign up to request clarification or add additional context in comments.

5 Comments

It doesn't work well if have more <pre> in html code:regex101.com/r/rQ6mK9/4
@WeizhongTu but in your example, you put the ending pre tag as <pre>
But in fact, it is more difficult, every <pre> has a class, and they are may be different, regex101.com/r/rQ6mK9/6
it's soo easy see regex101.com/r/rQ6mK9/7 , use [^<>]* instead of .* in opening pre tag.
even better to use this regex(?!(?:(?!<\/?pre[^<>]*>).)*<\/pre>)
0
regex(?=(?:((?!<pre[^>]*>|<\/pre>).)*<pre[^>]*>(?:(?!<\/pre>).)*<\/pre>)*(?:(?!<pre[^>]*>|<\/pre>).)*$)

Try this.See demo.

http://regex101.com/r/rQ6mK9/8

2 Comments

When I add the sentence HTML tutoial: about <code><pre></code>, it doesn't work
But you didn't have <code><pre></code> in your question, so how can you expect people to know that the regex has to deal with stuff like that?? Perhaps you should be using a proper HTML / XML parser like lxml, rather than going through the frustration of trying to do this with regexes.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.