In a huge collection of HTML files, I have to rename every occurrence of the class name "wrapper" to "prefixed-wrapper", so that the snippet
<div class="someclass wrapper notarget-wrapper wrapper-notarget"><a id="wrapper" class="wrapper">The term wrapper may occur as text as well</a></div>
gets rewritten to
<div class="someclass prefixed-wrapper notarget-wrapper wrapper-notarget"><a id="wrapper" class="prefixed-wrapper">The term wrapper may occur as text as well</a></div>
Following this SO answer, I partially succeeded in replacing the target class name:
#!/usr/bin/env python
import re
snippet = '''<div class="someclass wrapper notarget-wrapper wrapper-notarget"><a id="wrapper" class="wrapper">The term wrapper may occur as text as well</a></div>'''
# building regular expressions
attrMatcher = r'''(?:class *= *[\\'\"]{0,1})((?:[\w -](?!\w+=|\/))+)[\'\"]*'''
classMatcher = r'''wrapper(?: +|$)'''
match = re.findall(pattern=attrMatcher, string=snippet, flags=re.I)
# defining replacement function
def replaceClassname(match_obj):
    if match_obj.group() is not None:
        targetclassname = re.sub(classMatcher, 'prefixed-wrapper', match_obj.group())
        return targetclassname
# pass replacement function to re.sub()
res_str = re.sub(attrMatcher, replaceClassname, snippet)
print(res_str)
But unfortunately, the output has (at least) two issues:
<div class="someclass prefixed-wrappernotarget-prefixed-wrapperwrapper-notarget"><a id="wrapper" class="wrapper">The term wrapper may occur as text as well</a></div>
Issue 1: the class name notarget-wrapper should stay untouched, but it gets changed to notarget-prefixed-wrapper; also, the spaces between the class names are lost.
Issue 2: the class name in <a id="wrapper" class="wrapper"> should get prefixed, but it doesn't.
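Both issues can be reproduced by running the class matcher on its own, outside the replacement function. This is only a minimal check; the names first and second are made up for illustration, and the second input is the full text that match_obj.group() hands over for the <a> tag (attribute name and quotes included):

```python
import re

classMatcher = r'''wrapper(?: +|$)'''

# Value of the first class attribute: the pattern has no left boundary, so it
# also hits the tail of "notarget-wrapper", and it consumes the trailing space.
first = re.sub(classMatcher, 'prefixed-wrapper',
               'someclass wrapper notarget-wrapper wrapper-notarget')

# Full match for the <a> tag: "wrapper" is followed by the closing quote,
# so neither " +" nor "$" can match and nothing is replaced.
second = re.sub(classMatcher, 'prefixed-wrapper', 'class="wrapper"')

print(first)
print(second)
```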
I assume the reason for both issues is a faulty regex for the class matcher. What do I have to change to fix them?
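One possible direction, as a sketch rather than a definitive fix: capture only the quoted attribute value and match the class name with lookarounds instead of consuming the separator. It assumes that class attributes are always quoted and that class names consist only of word characters and hyphens; attr_re, name_re and prefix_class are hypothetical names, and a real HTML parser would be more robust than any regex.

```python
import re

snippet = '''<div class="someclass wrapper notarget-wrapper wrapper-notarget"><a id="wrapper" class="wrapper">The term wrapper may occur as text as well</a></div>'''

# Group 1: attribute name and "=", group 2: opening quote, group 3: value.
# Assumption: the value is always quoted; \2 requires the same closing quote.
attr_re = re.compile(r'''((?<![\w-])class\s*=\s*)(["'])(.*?)\2''', re.I)

# "wrapper" as a whole class name: not preceded or followed by a word
# character or a hyphen. Lookarounds consume nothing, so spaces survive.
name_re = re.compile(r'(?<![\w-])wrapper(?![\w-])')

def prefix_class(match_obj):
    # Rewrite only the attribute value, then reassemble the attribute.
    new_value = name_re.sub('prefixed-wrapper', match_obj.group(3))
    return match_obj.group(1) + match_obj.group(2) + new_value + match_obj.group(2)

result = attr_re.sub(prefix_class, snippet)
print(result)
```

Because the replacement only runs on text captured inside a class attribute, id="wrapper" and the word "wrapper" in the link text are not touched.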