In a huge collection of HTML files, I need to rename every occurrence of the class name "wrapper" to "prefixed-wrapper", so that the snippet

<div class="someclass wrapper notarget-wrapper wrapper-notarget"><a id="wrapper" class="wrapper">The term wrapper may occur as text as well</a></div>

gets rewritten to

<div class="someclass prefixed-wrapper notarget-wrapper wrapper-notarget"><a id="wrapper" class="prefixed-wrapper">The term wrapper may occur as text as well</a></div>

Following this SO answer, I partially succeeded in replacing the target class name:

#!/usr/bin/env python

import re

snippet = '''<div class="someclass wrapper notarget-wrapper wrapper-notarget"><a id="wrapper" class="wrapper">The term wrapper may occur as text as well</a></div>'''

# building regular expressions
attrMatcher = r'''(?:class *= *[\\'\"]{0,1})((?:[\w -](?!\w+=|\/))+)[\'\"]*'''
classMatcher = r'''wrapper(?: +|$)'''

match = re.findall(pattern=attrMatcher, string=snippet, flags=re.I)

# defining replacement function
def replaceClassname(match_obj):
    if match_obj.group() is not None:
        targetclassname = re.sub(classMatcher, 'prefixed-wrapper', match_obj.group())
        return targetclassname

# pass replacement function to re.sub()
res_str = re.sub(attrMatcher, replaceClassname, snippet)
print(res_str)

But unfortunately, the output has (at least) two issues:

<div class="someclass prefixed-wrappernotarget-prefixed-wrapperwrapper-notarget"><a id="wrapper" class="wrapper">The term wrapper may occur as text as well</a></div>

Issue 1: the class name notarget-wrapper should stay untouched, but it is rewritten to notarget-prefixed-wrapper; also, the spaces between class names are lost.

Issue 2: the class name in <a id="wrapper" class="wrapper"> should get prefixed, but it isn't.

The cause seems to be a wrong regex for the class matcher. What do I have to change to fix both issues?

1 Answer

I would strongly suggest using an HTML parser (the one I've used most is BeautifulSoup) rather than trying to parse HTML with regex. Parsers are built for exactly the case of "find me the elements with this exact class".
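To illustrate the parser idea, here is a minimal sketch that uses Python's standard-library html.parser as a stand-in for BeautifulSoup, so it runs without extra installs. The names ClassRenamer and rename_class are made up for this example. It assumes that re-serializing the markup is acceptable: tag and attribute names come out lowercased, attribute values are always double-quoted, whitespace inside class values is normalized, and processing instructions are not handled.

```python
from html import escape
from html.parser import HTMLParser


class ClassRenamer(HTMLParser):
    """Re-emit HTML while renaming one exact class name (hypothetical helper)."""

    def __init__(self, old, new):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.old, self.new = old, new
        self.out = []

    def _render(self, tag, attrs, closing):
        parts = [tag]
        for name, value in attrs:
            if value is None:          # attribute without a value, e.g. "disabled"
                parts.append(name)
                continue
            if name == 'class':
                # Compare whole class tokens, so "notarget-wrapper" is left alone.
                value = ' '.join(self.new if c == self.old else c
                                 for c in value.split())
            parts.append(f'{name}="{escape(value, quote=True)}"')
        return '<' + ' '.join(parts) + closing

    def handle_starttag(self, tag, attrs):
        self.out.append(self._render(tag, attrs, '>'))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._render(tag, attrs, ' />'))

    def handle_endtag(self, tag):
        self.out.append(f'</{tag}>')

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f'&{name};')

    def handle_charref(self, name):
        self.out.append(f'&#{name};')

    def handle_comment(self, data):
        self.out.append(f'<!--{data}-->')

    def handle_decl(self, decl):
        self.out.append(f'<!{decl}>')


def rename_class(html, old, new):
    parser = ClassRenamer(old, new)
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)


snippet = '''<div class="someclass wrapper notarget-wrapper wrapper-notarget"><a id="wrapper" class="wrapper">The term wrapper may occur as text as well</a></div>'''

result = rename_class(snippet, 'wrapper', 'prefixed-wrapper')
print(result)
```

Because the class value is split into tokens and compared by equality, neither the id attribute nor the text content is touched. With BeautifulSoup the same token-wise comparison would apply, only with less code for the serialization.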

1 Comment

Thanks, that's a very good suggestion; it just didn't occur to me to use a parser. However, I have to do it with regex because I need the code in a bash script as well, so I updated my question and narrowed its scope to regex.
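For the regex-only route, one possible fix can be sketched as follows. Two things go wrong in the question's code: the class matcher consumes the trailing spaces and has no guard against a preceding hyphen, and the replacement runs on match_obj.group(), which includes the closing quote, so the $ alternative never matches. The sketch below captures the attribute value separately and uses lookarounds instead. The names ATTR_RE, CLASS_RE and prefix_class are made up for this example, and it assumes quoted attribute values; it would also rewrite a literal class="wrapper" that appears in text or script content.

```python
import re

snippet = '''<div class="someclass wrapper notarget-wrapper wrapper-notarget"><a id="wrapper" class="wrapper">The term wrapper may occur as text as well</a></div>'''

# head:  'class', '=', and the opening quote (not preceded by a word char or
#        hyphen, so attributes like data-class are skipped)
# value: everything up to the matching closing quote
ATTR_RE = re.compile(
    r'''(?P<head>(?<![\w-])class\s*=\s*(?P<q>["']))(?P<value>.*?)(?P=q)''',
    re.I | re.S,
)

# "wrapper" as a whole class token: no word character or hyphen on either side.
# The lookarounds consume nothing, so spaces between class names survive.
CLASS_RE = re.compile(r'(?<![\w-])wrapper(?![\w-])')


def prefix_class(match_obj):
    # Rewrite only the attribute value, then reassemble the attribute.
    value = CLASS_RE.sub('prefixed-wrapper', match_obj.group('value'))
    return match_obj.group('head') + value + match_obj.group('q')


result = ATTR_RE.sub(prefix_class, snippet)
print(result)
```

On the snippet from the question this prints the desired output: notarget-wrapper and wrapper-notarget stay untouched, the id attribute and the text are left alone, and both class="wrapper" occurrences are prefixed.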
