Regex for replacing a specific HTML class name with a prefixed one

Question

In a huge collection of HTML files, I need to replace every occurrence of the class name "wrapper" with "prefixed-wrapper", so that the snippet

<div class="someclass wrapper notarget-wrapper wrapper-notarget"><a id="wrapper" class="wrapper">The term wrapper may occur as text as well</a></div>

gets rewritten to

<div class="someclass prefixed-wrapper notarget-wrapper wrapper-notarget"><a id="wrapper" class="prefixed-wrapper">The term wrapper may occur as text as well</a></div>

Following this SO answer, I partially succeeded in replacing the target class name:

#!/usr/bin/env python

import re

snippet = '''<div class="someclass wrapper notarget-wrapper wrapper-notarget"><a id="wrapper" class="wrapper">The term wrapper may occur as text as well</a></div>'''

# building regular expressions
attrMatcher = r'''(?:class *= *[\\'\"]{0,1})((?:[\w -](?!\w+=|\/))+)[\'\"]*'''
classMatcher = r'''wrapper(?: +|$)'''

match = re.findall(pattern=attrMatcher, string=snippet, flags=re.I)

# defining replacement function
def replaceClassname(match_obj):
    if match_obj.group() is not None:
        targetclassname = re.sub(classMatcher, 'prefixed-wrapper', match_obj.group())
        return targetclassname

# pass replacement function to re.sub()
res_str = re.sub(attrMatcher, replaceClassname, snippet)
print(res_str)

But unfortunately, the output has (at least) two issues:

<div class="someclass prefixed-wrappernotarget-prefixed-wrapperwrapper-notarget"><a id="wrapper" class="wrapper">The term wrapper may occur as text as well</a></div>

Issue 1: the class name notarget-wrapper should stay untouched, but it becomes notarget-prefixed-wrapper; also, the spaces between the class names are lost.

Issue 2: the class name in <a id="wrapper" class="wrapper"> should get prefixed, but it isn't.

The reason for these issues is a wrong regex for the class matcher. What do I have to change to fix them?

lgaud · Accepted Answer · 2022-10-13 20:35:51Z

I would strongly suggest using an HTML parser tool (the one I've used most is BeautifulSoup) rather than trying to parse HTML with regex. Those are optimized for exactly the case of "find me the elements with this exact class".
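
To illustrate the parser-based approach, here is a minimal sketch. Note that it swaps in Python's standard-library html.parser instead of BeautifulSoup, purely to stay dependency-free. The names ClassPrefixer and prefix_class are hypothetical helpers made up for this example. It assumes reasonably well-formed input; it re-emits end tags in lowercase, drops processing instructions, and only rebuilds (and thereby normalizes the quoting of) start tags that actually carry the target class.

```python
from html import escape
from html.parser import HTMLParser


class ClassPrefixer(HTMLParser):
    """Re-emit HTML as parsed, renaming one class (hypothetical helper)."""

    def __init__(self, old, new):
        # convert_charrefs=False so entities in text are passed through as-is
        super().__init__(convert_charrefs=False)
        self.old, self.new = old, new
        self.out = []

    def _start(self, tag, attrs, closer):
        classes = dict(attrs).get("class") or ""
        if self.old not in classes.split():
            # Nothing to rename: keep the start tag exactly as written.
            self.out.append(self.get_starttag_text())
            return
        parts = [tag]
        for name, value in attrs:
            if name == "class" and value:
                # Compare whole class tokens, never substrings.
                value = " ".join(
                    self.new if c == self.old else c for c in value.split()
                )
            # Attribute values arrive unescaped, so escape them again.
            parts.append(name if value is None else f'{name}="{escape(value)}"')
        self.out.append("<" + " ".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        self._start(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._start(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def prefix_class(html, old="wrapper", new="prefixed-wrapper"):
    parser = ClassPrefixer(old, new)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


snippet = ('<div class="someclass wrapper notarget-wrapper wrapper-notarget">'
           '<a id="wrapper" class="wrapper">The term wrapper may occur as text '
           'as well</a></div>')
print(prefix_class(snippet))
```

Because the class attribute is split into tokens before comparing, notarget-wrapper, wrapper-notarget, the id attribute, and the text content are all left alone.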


1 Comment

Madamadam Over a year ago

Thanks, that's a very good suggestion; it just didn't occur to me to use a parser. But I have to do it with regex (I need the code in a bash script as well), so I updated my question and narrowed its scope to regex.
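
For the regex-only constraint, one possible sketch follows. In the question's code, classMatcher has no boundary on its left side (so it also hits notarget-wrapper), it consumes the trailing spaces, and it is applied to the whole match including the closing quote, so "$" never matches for class="wrapper". The version below captures only the attribute value and replaces "wrapper" as a whole whitespace-delimited token. The names CLASS_ATTR, TOKEN, and prefix_wrapper are made up for this example, and it assumes class values are always quoted. Being a regex, it will also rewrite class="..." strings that appear inside comments, scripts, or text.

```python
import re

# A quoted class attribute: group 1 = 'class=', group 2 = the quote char,
# group 3 = the attribute value. The lookbehind skips names such as data-class.
CLASS_ATTR = re.compile(r'''(?<![\w-])(class\s*=\s*)(["'])(.*?)\2''', re.I | re.S)

# "wrapper" as a complete class token: no non-space character on either side.
TOKEN = re.compile(r'(?<!\S)wrapper(?!\S)')


def prefix_wrapper(html):
    def fix(m):
        # Rewrite only the value; put prefix and quotes back unchanged.
        value = TOKEN.sub('prefixed-wrapper', m.group(3))
        return m.group(1) + m.group(2) + value + m.group(2)
    return CLASS_ATTR.sub(fix, html)


snippet = ('<div class="someclass wrapper notarget-wrapper wrapper-notarget">'
           '<a id="wrapper" class="wrapper">The term wrapper may occur as text '
           'as well</a></div>')
print(prefix_wrapper(snippet))
```

The lookarounds used here are not available in POSIX sed or grep -E, so a bash port would need a PCRE-capable tool or explicit space/quote boundaries instead.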
