Remove all attributes in HTML tag except specified with regex

Question

I'm trying to write a regex that removes all attributes from HTML tags except specified ones.

I have this HTML code:

<p class="someClass" id="someId" style="border: 1px solid black" name="someName" foo="bar"></p>

I want to remove all attributes except class, id and name, so the result should look like this:

<p class="someClass" id="someId" name="someName">Text</p>

I have this regex:

<([a-z][a-z0-9]*)(?:[^>]*(\sid=['"][^'"]*['"]))?[^>]*?(\/?)>

and this replacement pattern:

<$1$2>

It only works for the id attribute. How can I do it for all of the specified attributes?
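
For reference, the behaviour described above can be reproduced with a minimal sketch. The question does not name a language, so Python's `re` module is an assumption here; in Python the replacement `<$1$2>` is written `<\1\2>`.

```python
import re

html = '<p class="someClass" id="someId" style="border: 1px solid black" name="someName" foo="bar"></p>'

# The pattern from the question: group 2 can only ever hold an id attribute.
pattern = re.compile(r"""<([a-z][a-z0-9]*)(?:[^>]*(\sid=['"][^'"]*['"]))?[^>]*?(\/?)>""")

# Python spells the replacement <$1$2> as <\1\2>.
result = pattern.sub(r"<\1\2>", html)
print(result)  # <p id="someId"></p>
```

Only `id` survives because it is the only attribute name the pattern captures; `class` and `name` are consumed by the surrounding `[^>]*` parts and never reach the replacement.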

I agree with WiktorStribiżew. As pointed out many times before, regular expressions are just not a reliable way of parsing HTML. – bytesized, Commented Mar 10, 2016 at 23:45
I know a DOM parser is the better solution, but using regex is a requirement for this project. :/ – Michalowic, Commented Mar 11, 2016 at 8:38

Scott Weaver · Accepted Answer · 2016-03-11 02:01:51Z

You can achieve this with a negative lookahead, which will tell your expression to either 1. eat one character, or 2. match the special sequence, then rinse and repeat:

<(\w+)\s*(?:(?:(?:(?!class=|id=|name=)[^>]))*((?:class|id|name)=['"][^'"]*['"]\s*)?)+>

Explanation:

<(\w+)\s* (match open of tag and tagname)
(?: (begin enclosure of main construct (note that it doesn't remember matches))
(?:(?:(?!class=|id=|name=)[^>]))* (look ahead for no special token, then eat one character, repeat as many times possible, don't bother to remember anything)
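((?:class|id|name)=['"][^'"]*['"]\s*)? (lookahead failed, so a special token is ahead, let's eat it along with any trailing whitespace! note the regular, 'remembering' parens; the group is optional, so the construct also matches when no token follows)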
)+ (end enclosure of main construct; repeat it, it'll match once for each special token)
> (end of tag)
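
The two halves of the main construct can be tried in isolation. The sketch below uses Python's `re` module (an assumption; the answer does not name a language) on a hypothetical attribute string. The lookahead-guarded part skips everything up to the next whitelisted attribute, and the capturing part then consumes that attribute.

```python
import re

# Hypothetical attribute string: one unwanted attribute, then a whitelisted one.
attrs = 'style="x" name="n" foo="bar">'

# Part 1: eat one character at a time, but only while no whitelisted
# attribute name starts at the current position.
skip = re.compile(r"(?:(?!class=|id=|name=)[^>])*")

# Part 2: the whitelisted attribute itself, in an optional capturing group.
keep = re.compile(r"""((?:class|id|name)=['"][^'"]*['"]\s*)?""")

skipped = skip.match(attrs)
kept = keep.match(attrs, skipped.end())  # continue where the skip stopped

print(repr(skipped.group(0)))  # 'style="x" '  (thrown away)
print(repr(kept.group(1)))     # 'name="n" '   (remembered)
```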

At this point you might have the matches you need, if your regex flavor keeps every match of a repeated group. In .NET, for example, the group's Captures collection would give you something similar to this: $1 = 'p', $2[0] = 'class="someClass"', $2[1] = 'id="someId"', etc.

But if you find that only the last match is remembered, you may have to simply repeat the main construct for each token you want to match, like so: (matches will be $1-$4)

<(\w+)\s*(?:(?:(?:(?!class=|id=|name=)[^>]))*((?:class|id|name)=['"][^'"]*['"]\s*)?)(?:(?:(?:(?!class=|id=|name=)[^>]))*((?:class|id|name)=['"][^'"]*['"]\s*)?)(?:(?:(?:(?!class=|id=|name=)[^>]))*((?:class|id|name)=['"][^'"]*['"]\s*)?)[^>]*>
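
Rather than pasting the construct by hand, the repeated pattern can be generated. The following is a minimal sketch in Python (an assumption, since the thread names no language); `ALLOWED`, `build_pattern` and `strip_attributes` are hypothetical names, not part of the answer. It repeats the construct once per whitelisted attribute and rebuilds the tag from whichever groups matched.

```python
import re

ALLOWED = ("class", "id", "name")  # hypothetical whitelist


def build_pattern(allowed):
    """Repeat the answer's main construct once per whitelisted attribute."""
    names = "|".join(re.escape(name) for name in allowed)
    stop = "|".join(re.escape(name) + "=" for name in allowed)
    construct = (
        rf"(?:(?:(?!{stop})[^>])*"                   # skip unwanted characters
        rf"""((?:{names})=['"][^'"]*['"]\s*)?)"""    # capture one attribute
    )
    return re.compile(r"<(\w+)\s*" + construct * len(allowed) + r"[^>]*>")


def strip_attributes(html, allowed=ALLOWED):
    pattern = build_pattern(allowed)

    def rebuild(match):
        tag, *attrs = match.groups()
        kept = [a.strip() for a in attrs if a]  # unmatched groups are None
        return "<" + " ".join([tag] + kept) + ">"

    return pattern.sub(rebuild, html)


html = '<p class="someClass" id="someId" style="border: 1px solid black" name="someName" foo="bar"></p>'
print(strip_attributes(html))
# <p class="someClass" id="someId" name="someName"></p>
```

Using a replacement function instead of a fixed `<$1 $2$3$4>` string avoids stray spaces when some groups are empty. As with the hand-written pattern, the `/` of a self-closing tag is dropped.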

This breaks when the attribute value includes ' or " characters. Regex are not suitable for parsing HTML. – Quentin, Commented over a year ago
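
The failure is easy to reproduce. In the Python sketch below (an assumption, as the thread names no language), the answer's attribute pattern lets either quote character close the value, so a double-quoted value containing an apostrophe is cut short. Matching each quote style separately is one possible mitigation for this particular case. It is not a general fix: a `>` inside a value would still end the tag match early.

```python
import re

html = '<p class="it\'s" foo="x">'

# Attribute pattern from the answer: any quote may close any quote.
loose = re.compile(r"""((?:class|id|name)=['"][^'"]*['"])""")

# Hypothetical stricter variant: the closing quote must match the opening one.
strict = re.compile(r"""((?:class|id|name)=(?:"[^"]*"|'[^']*'))""")

loose_value = loose.search(html).group(1)
strict_value = strict.search(html).group(1)

print(loose_value)   # class="it'      (cut at the apostrophe)
print(strict_value)  # class="it's"    (full value)
```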
