Replace a class by another in an html string

Question

I want to replace one class name with another in an HTML string: class="abc" should become class="xyz". I tried regular expressions (I'm using C#), with no success:

const string input = @"abc class=""abcd abc zabc ab c"" abc";

Regex regex = new Regex(string.Format(@"class="".*(?({0})).*""", "abc")); // change this line ?!!

string output = regex.Replace(input, "xyz");

Assert.AreEqual(@"abc class=""abcd xyz zabc ab c"" abc", output);

PS: if it matters: this isn't homework :p

not the best idea for many reasons. The two that apply best to this situation are the ugly expression syntax required to handle different types of quotes and spaces (the class attribute may be quoted using " or ' quotes and may or may not have spacing, including tabs, to mess with regular parsers) and the fact that the string class='abc' can appear in all sorts of contexts (plain text, etc.). I think your particular problem can be solved purely with regexes, but it will either have false positives or negatives depending upon your exact requirements, or take a LOT more work than you think.
– Code Jockey, Commented Oct 10, 2011 at 16:15
@user93422 it's supposed to match exactly the part I want to replace
– Catalin DICU, Commented Oct 10, 2011 at 16:22
I mean I don't think .NET's regex has a (?()) construct. There is (?(expression)yes|no) alternatives matching, and there is (?<name>) named group capture, but no (?(abc)). I don't think that's the problem in this case, I am just curious if it is an expression new to me.
– THX-1138, Commented Oct 10, 2011 at 17:08

Community · Accepted Answer · 2017-05-23 09:59:41Z

2

No wonder you had no success. Parsing HTML can't be done using regexes.

You should use a proper HTML parser like HTML Agility Pack.

edited May 23, 2017 at 9:59

answered Oct 10, 2011 at 16:06

svick

Community · Accepted Answer · 2017-05-23 09:59:41Z

2

Parsing HTML with Regular Expressions tends to be a futile effort; because most browsers have a fair amount of leeway for badly-formed HTML, you aren't guaranteed to get consistently formed HTML in order to parse with regular expressions easily (and as commented on by svick).

That said, you are better off using a formal HTML parser (I recommend the HTML Agility Pack), changing the values of the attributes after you've parsed the document, and then outputting the changed document if need be.
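For illustration, here is a minimal sketch of that parse, modify, and output flow. It assumes the HtmlAgilityPack NuGet package is referenced. The class name `ClassRenamer` and the method `ReplaceClass` are made up for this example and are not part of the library.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

static class ClassRenamer // hypothetical helper, not part of the library
{
    public static string ReplaceClass(string html, string oldName, string newName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//*[@class]");
        if (nodes == null) return html;

        foreach (var node in nodes)
        {
            // Split the attribute value on whitespace and swap exact token matches only,
            // so "abcd" and "abc-1" are left alone.
            var tokens = node.GetAttributeValue("class", "")
                .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                .Select(t => t == oldName ? newName : t);
            node.SetAttributeValue("class", string.Join(" ", tokens));
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

Note that this sketch collapses runs of whitespace between class names into single spaces, and the parser may serialize other parts of the markup slightly differently from the input.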

edited May 23, 2017 at 9:59

answered Oct 10, 2011 at 16:07

casperOne

1 Comment

svick Over a year ago

Even well-formed HTML can't be parsed using regular expressions. HTML isn't a regular language.

as-cii · Accepted Answer · 2011-10-10 16:08:59Z

1

Is it a real HTML string? I mean, are you sure you are dealing with well formed HTML? Could there be some error inside your string?

Based on your answers to the questions above, you can choose how to solve your problem.

Yep: use HTML Agility Pack or something similar to parse your string correctly;
Nope: consider using an XML parser (like the ones integrated in the .NET assemblies). Make sure, however, that it works well for you (remember XML is not HTML).

Whatever you choose, please: NEVER use Regular Expressions to parse HTML.
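As a rough sketch of the XML parser option, the following uses `System.Xml.Linq` from the .NET base class library. It assumes the markup is well-formed XML with a single root element; `XElement.Parse` throws an `XmlException` otherwise, which is why this route does not suit arbitrary HTML. The names `XmlClassRenamer` and `ReplaceClass` are hypothetical.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class XmlClassRenamer // hypothetical helper name
{
    // Assumes well-formed XML with one root element; throws XmlException otherwise.
    public static string ReplaceClass(string xhtml, string oldName, string newName)
    {
        var root = XElement.Parse(xhtml, LoadOptions.PreserveWhitespace);

        // Materialize the attribute list first, then rewrite each value.
        foreach (var attr in root.DescendantsAndSelf().Attributes("class").ToList())
        {
            // Compare whole whitespace-separated tokens, never substrings.
            var tokens = attr.Value
                .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                .Select(t => t == oldName ? newName : t);
            attr.Value = string.Join(" ", tokens);
        }
        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

Typical HTML features such as unclosed `<br>` tags, unquoted attribute values, or entities like `&nbsp;` make the parse fail, so this only fits input you know to be XHTML-like.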

answered Oct 10, 2011 at 16:08

as-cii

Code Jockey · Accepted Answer · 2011-10-10 21:13:59Z

I've done a best effort attempt at answering this... a REGEX could be used similar to the following:

@"(?<=<[\w-]+\s+([\w-]+=""[^""]*""\s*)*class=""[^""]*)(?<![\w-])abc(?![\w-])(?=[^""]*""\s*([\w-]+=""[^""]*""\s*)*/?>)"

broken down a little bit:

(?<=<[\w-]+\s+([\w-]+=""[^""]*""\s*)*class=""[^""]*)  #Make sure it's inside a tag
(?<![\w-])abc(?![\w-])                                #just the class abc (not abcd, etc)
(?=[^""]*""\s*([\w-]+=""[^""]*""\s*)*/?>)             #Make sure it's really INSIDE a tag

a little further:

(?<=                           #lookbehind
   <[\w-]+\s+                  # match tag name and whitespace
   ([\w-]+=""[^""]*""\s*)*     # match any attributes coming before the class attribute
   class=""[^""]*              # match the class attribute and any other classes before
)                              #end lookbehind
(?<![\w-])abc(?![\w-])         #"abc" at appropriate boundaries
(?=                            #lookahead
   [^""]*""                    # match any remaining classes in the declaration
   \s*([\w-]+=""[^""]*""\s*)*  # match any remaining attributes in the tag
   /?>                         # match the end of the tag
)                              #end lookahead

This will match the string abc inside any class attribute value that is inside a tag (not in text in between tags), and which might or might not have other attributes before or after it.
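A small sketch of how the pattern above could be applied from C#. The wrapper names `LookaroundDemo` and `Rename` are made up for the example. The pattern relies on .NET's support for variable-length lookbehind, so it will not port to most other regex engines unchanged.

```csharp
using System.Text.RegularExpressions;

static class LookaroundDemo // hypothetical wrapper name
{
    // The pattern from above, in C# verbatim-string quoting ("" stands for one ").
    const string Pattern =
        @"(?<=<[\w-]+\s+([\w-]+=""[^""]*""\s*)*class=""[^""]*)" +  // inside a tag, inside class="..."
        @"(?<![\w-])abc(?![\w-])" +                                 // abc as a whole class name
        @"(?=[^""]*""\s*([\w-]+=""[^""]*""\s*)*/?>)";               // rest of the tag up to > or />

    public static string Rename(string html) => Regex.Replace(html, Pattern, "xyz");
}
```

For example, `LookaroundDemo.Rename(@"<p id=""a"" class=""abc-1 abc not-abc"">abc</p>")` should leave `abc-1`, `not-abc`, and the text node alone and change only the middle class name.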

Attention!

IT ONLY HANDLES attribute values in double quotes (")

IT ONLY ALLOWS underscores, letters, numbers and dash symbols in the tag and attribute names - you'll need to add colons and periods if you want them (and make it only match names STARTING with a letter if you want it strict)

EDIT: As discussed in a comment somewhere around here, the earlier \b-based version ALSO MATCHED abc-1 or not-abc in addition to abc, thus turning <p class="abc-1 abc not-abc">text</p> into <p class="xyz-1 xyz not-xyz">text</p>, because \b matches at the dash character... this gets EXTREMELY HARD TO ACCOUNT FOR!! FOLLOW-UP: I added an additional lookahead and lookbehind to hopefully account for the dashes, but who knows... END EDITS

Also, there are bound to be other situations that can break this...

In short - it's probably best not to use this, but instead to use something like HTML Agility Pack - good luck!

Good eye noting the: class="abc-def" special case. Thus: \babc\b is NOT reliable - D'oh!

lurker · Accepted Answer · 2011-10-10 16:13:28Z

0

I'm not sure of the C# version of this regex, but here's how it would be done in Ruby:

regex = / class="[^"]*"/i

input.gsub( regex, ' class="abc"' )

This replaces every class specifier in the input with class="abc" (gsub substitutes all matches; sub would replace only the first). It assumes no spaces around the equals sign, but matches the attribute name in upper or lower case.

I assume C# is very similar in terms of describing the regex, and you might have to escape the double quotes.
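A possible C# counterpart, as a sketch: in a verbatim string the double quotes are doubled, and `RegexOptions.IgnoreCase` plays the role of Ruby's `/i` flag. `Regex.Replace` substitutes every match. The names `ClassAttributeDemo` and `OverwriteClass` are hypothetical.

```csharp
using System.Text.RegularExpressions;

static class ClassAttributeDemo // hypothetical name
{
    // Overwrites every double-quoted class attribute with class="abc".
    // Like the Ruby version, it assumes no spaces around '='.
    public static string OverwriteClass(string input) =>
        Regex.Replace(input, @" class=""[^""]*""", @" class=""abc""", RegexOptions.IgnoreCase);
}
```

Note that this overwrites the whole attribute value rather than swapping one class name among several.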

Are you looking for something more specific? E.g., for a method that takes two inputs (s1 and s2) and replaces class "s1" to class "s2"?

answered Oct 10, 2011 at 16:13

lurker

Buh Buh · Accepted Answer · 2011-10-10 17:22:55Z

0

Obviously Regex is unlikely to be your best choice when working with XML. You will probably get a more consistent result if you try something suggested by the other people. Meanwhile, if you really want some Regex, here it is:

const string input = @"abc class=""abcd abc zabc ab c"" abc"; 

Regex regex = new Regex(string.Format(@"(?<=class\=""[^""]*\b){0}\b", "abc")); // I changed this line ?!! 

string output = regex.Replace(input, "xyz");

Assert.AreEqual(@"abc class=""abcd xyz zabc ab c"" abc", output);

To break it down:

(               #Start a group
    ?<=         #Positive lookbehind
    class\="    #Some characters to match against (without consuming)
    [^"]*       #Any other characters which are not "
                #This stops us from accidentally leaving the class attribute
)               #Close the lookbehind group
\b              #A word boundary (such as whitespace or just before a ")
abc             #Your target
\b              #Another word boundary

Note the positive lookbehind means that we check for "class=" without it being part of our match. That is what we mean by "without consuming".

Note the use of the word boundaries, \b, so that we don't accidentally match abcd.
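One caveat with `\b` is that it also matches next to a dash, so a class such as abc-1 gets rewritten too. The sketch below puts the word-boundary version beside a variant that uses negative lookarounds treating `-` as part of a class name. The names `BoundaryDemo`, `WithWordBoundary`, and `WithLookarounds` are hypothetical. `Regex.Escape` is added so that a class name containing regex metacharacters is matched literally.

```csharp
using System.Text.RegularExpressions;

static class BoundaryDemo // hypothetical name
{
    // Word-boundary version: \b also matches between 'c' and '-', so "abc-1" is hit.
    public static string WithWordBoundary(string input, string oldName, string newName) =>
        Regex.Replace(input,
            @"(?<=class=""[^""]*\b)" + Regex.Escape(oldName) + @"\b", newName);

    // Lookaround version: the name must not touch a word character or a dash.
    public static string WithLookarounds(string input, string oldName, string newName) =>
        Regex.Replace(input,
            @"(?<=class=""[^""]*)(?<![\w-])" + Regex.Escape(oldName) + @"(?![\w-])", newName);
}
```

Both variants still handle only double-quoted attribute values and do not check that the match sits inside a tag. The replacement string is passed through as-is, so a `$` in the new name would be read as a substitution token.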

edited Oct 10, 2011 at 17:22

answered Oct 10, 2011 at 16:44

Buh Buh

8 Comments

THX-1138 Over a year ago

note \b won't deal with dashes and numbers in class name. e.g. \b will match dash in abc-1. [ "'] would be safer.

Code Jockey Over a year ago

I believe the comment in the break-down above (and the explanation) should be "Positive lookbehind" not "Negative lookbehind" - i.e.: you want to ensure that it can be matched, not that it cannot be matched.

Code Jockey Over a year ago

@user93422 as long as the class name abc-1 is literal, that should not make a difference - but just one more reason on the pile of reasons to not recreate the wheel out of sand and try to compress it into sandstone, when there's a perfectly good round block of granite to be carved out perfectly for your wheel.

Buh Buh Over a year ago

@user Yep, that's true... and so the XML/regex headache begins. Lucky for me this isn't my question. Maybe instead of \b we could use [\s"] ?

Buh Buh Over a year ago

@Code fixed the negative/positive comment. Thanks.

ridgerunner · Accepted Answer · 2011-10-10 20:08:15Z

Disclaimer:

As others have pointed out, using regex to parse non-regular languages is fraught with peril! It is best to use a dedicated parser specifically designed for the job, especially when parsing the tag soup that is HTML.

That said...

If you insist on using a regular expression, here is a regex solution that will do a pretty good job:

text = Regex.Replace(text, @"
    # Change HTML element class attribute value: 'abc' to: 'xyz'.
    (                   # $1: Everything up to 'abc'.
      <\w+              # Begin (X)HTML element open tag.
      (?:               # Match any attribute(s) preceding 'class'.
        \s+             # Whitespace required before each attribute.
        (?!class\b)     # Assert this attribute name is not 'class'.
        [\w\-.:]+       # Required attribute name.
        (?:             # Begin optional attribute value.
          \s*=\s*       # Attribute value separated by =.
          (?:           # Group for attrib value alternatives.
            ""[^""]*""  # Either a double quoted value,
          | '[^']*'     # or a single quoted value,
          | [\w\-.:]+   # or an unquoted value.
          )             # End group for attrib value alternatives.
        )?              # End optional attribute value.
      )*                # Zero or more attributes may precede class.
      \s+               # Whitespace required before class attribute.
      class             # Literal class attribute name.
      \s*=\s*           # Attribute value separated by =.
      (?:               # Group for attrib value alternatives.
        ""              # Either a double quoted value.
        [^""]*?         # Zero or more classes may precede 'abc'.
      | '               # Or a single quoted value.
        [^']*?          # Zero or more classes may precede 'abc'.
      )?                # Or 'abc' class attrib value is unquoted.
    )                   # End $1: Everything up to 'abc'.
    (?<=['""\s=])       # Assert 'abc' not part of '123-abc'.
    abc                 # Match the 'abc' in class attribute value.
    (?=['""\s>])        # Assert 'abc' not part of 'abc-123'.",
    "$1xyz", RegexOptions.IgnorePatternWhitespace);

Example input:

class=abc ... class="abc" ... class='abc'
class = abc ... class = "abc" ... class = 'abc'
class="123 abc 456" ... class='123 abc 456'
class="123-abc abc 456-abc" ... class='123-abc abc 456-abc'
class="abc-123 abc abc-456" ... class='abc-123 abc abc-456'

Example output:

class=xyz ... class="xyz" ... class='xyz'
class = xyz ... class = "xyz" ... class = 'xyz'
class="123 xyz 456" ... class='123 xyz 456'
class="123-abc xyz 456-abc" ... class='123-abc xyz 456-abc'
class="abc-123 xyz abc-456" ... class='abc-123 xyz abc-456'

Note that there will always be edge cases where this solution will fail. e.g. Evil strings within CDATA sections, comments, scripts, styles and tag attribute values can trip this up. (See disclaimer above.) That said, this solution will do a pretty good job for many cases (but will never be 100% reliable!)

Edit: 2011-10-10 14:00 MDT Streamlined overall answer. Removed first regex solution. Modified to correctly ignore classes having similar names like: abc-123 and 123-abc.

this will also change <a href="#" class="abc"> my class = abc </a> into <a href="#" class="xyz"> my class = xyz </a> -- if that is desired, then yay! otherwise, it will still need work. I'll grant you the question should be asked more clearly, as it neither requires nor prohibits that normal text be included in the replacement (it's simply an assumption of mine)
@Code Jockey - Yes, you are absolutely correct. Note however, that if required, a more complex regex can be crafted to correctly handle the example case you cite.
