5

I want to remove the style attribute from HTML tags using C#, so that only the plain HTML tags are returned.

For example, if String = <p style="margin: 15px 0px; padding: 0px; border: 0px; outline: 0px;">Hello</p>, then it should return String = <p>Hello</p>.

The same should apply to all HTML tags: <strong></strong>, <b></b>, and so on.

Please help me with this.

4
  • 2
    See: stackoverflow.com/questions/5850718/… Commented Aug 14, 2014 at 11:15
  • 1
    Are you (accidentally) missing the closing quote? Commented Aug 14, 2014 at 11:18
  • @RobP., yes, sorry. Updated post. Commented Aug 14, 2014 at 11:20
  • Probably because this question has been asked a million times. Commented Aug 14, 2014 at 12:11

5 Answers

10

First, as others suggest, an approach using a proper HTML parser is much better. Either use HtmlAgilityPack or CsQuery.
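For the parser route, a minimal sketch with HtmlAgilityPack might look like the following. It assumes the HtmlAgilityPack NuGet package is installed; the class and method names (ParserBasedStripper, Strip) are hypothetical and only there for illustration.

```csharp
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack" (assumed to be installed)

public static class ParserBasedStripper
{
    // Hypothetical helper: removes style and class attributes from every element.
    public static string Strip(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//*[@style or @class]");
        if (nodes != null)
        {
            foreach (var node in nodes)
            {
                node.Attributes.Remove("style");
                node.Attributes.Remove("class");
            }
        }

        // Serialize the modified document back to a string.
        return doc.DocumentNode.OuterHtml;
    }
}
```

Because the parser works on elements rather than raw text, attribute values that contain > or quote characters should not confuse it the way they can confuse a regex.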

If you really want a regex solution, here it is:

Replace this pattern: (<.+?)\s+style\s*=\s*(["']).*?\2(.*?>)
With: $1$3

Demo: http://regex101.com/r/qJ1vM1/1


To remove multiple attributes, since you're using .NET, this should work:

Replace (?<=<[^<>]+)\s+(?:style|class)\s*=\s*(["']).*?\1
With an empty string
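In C#, the second pattern can be written as a verbatim string, where backslashes stay literal and each double quote is doubled. The sketch below is one way to wrap it. The names InlineAttributeStripper and Strip are hypothetical, and the IgnoreCase and Singleline options are an assumption added so that upper-case attribute names and values spanning several lines are also matched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InlineAttributeStripper
{
    // In a verbatim string (@"..."), \s stays as-is and " is written as "".
    private static readonly Regex StyleOrClass = new Regex(
        @"(?<=<[^<>]+)\s+(?:style|class)\s*=\s*([""']).*?\1",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Removes every quoted style/class attribute, together with the
    // whitespace in front of it, from inside any tag.
    public static string Strip(string html)
    {
        return StyleOrClass.Replace(html, string.Empty);
    }
}
```

With this helper, InlineAttributeStripper.Strip applied to the question's <p style="..."> example returns <p>Hello</p>. Writing the quote as "" is also what avoids the "unrecognized escape sequence" compiler error.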


6 Comments

The regex works fine in the demo, but in code it gives an "unrecognized escape sequence" error because of the " in the string. What should I do? I am using it as @"(<.+?)\s+style\s*=\s*(["']).*?\2(.*?>)", "")
Please note that it is not working for other tags, like '<ul style="list-style-type:circle;"> <strong style="font:bold">Endoderm</strong>'. What should I do about this?
I also want to remove class. Should I write another regex, like the one for style?
@CSAT It's working for me, so please show how you used it so I can tell you what's wrong. If you also want to remove class, see my edit.
0

As others said, you can use HTML Agility Pack, which has a nice tool, HTML Agility Pack test, that shows you what you're doing.

Other than that, the options are regex, which is usually not recommended for HTML, or simply looping over all the characters of your input. When you hit a <, read until the first whitespace, then drop every character up to the closing >. That should take care of most basic cases, but you'll have to test it.

Here's a little snippet that will do it:

// Requires: using System.Text; (for StringBuilder)
void Main()
{
    // Your input
    string input = @"<p style=""margin: 15px 0px; padding: 0px; border: 0px; outline: 0px;"">Hello</p>";
    // Output buffer and scanner state
    var sb = new StringBuilder();
    bool inside = false; // true while between '<' and '>'
    bool delete = false; // true while skipping attribute text
    // Analyze the string one character at a time
    foreach (char c in input)
    {
        // Special case: opening bracket starts a tag
        if (c == '<')
        {
            inside = true;
            delete = false;
        }
        // Special case: closing bracket ends the tag
        else if (c == '>')
        {
            inside = false;
            delete = false;
        }
        // Inside a tag: after the first whitespace, ignore the rest until the closing bracket
        else if (inside && char.IsWhiteSpace(c))
        {
            delete = true;
        }
        // Keep the character unless we are skipping attributes
        if (!delete)
            sb.Append(c);
    }
    var result = sb.ToString(); // -> holds: "<p>Hello</p>"
}

3 Comments

This fails for input like </ p>
This also fails if there are < or > characters inside an element, e.g. <math>K_B \cap\left \{ |k| < \beta\right \} </math>
@MonsterMMORPG Yep, it will.
0

I usually use the code below to remove inline styles, classes, images, and comments from an Outlook message before saving it to a database:

    desc = Regex.Replace(desc, "(<style.+?</style>)|(<script.+?</script>)", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    desc = Regex.Replace(desc, "(<img.+?>)", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    desc = Regex.Replace(desc, "(<o:.+?</o:.+?>)", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    desc = Regex.Replace(desc, "<!--.+?-->", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    desc = Regex.Replace(desc, "class=.+?>", ">", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    desc = Regex.Replace(desc, @"class=.+?\s", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline);

2 Comments

Your regex pattern of class=.+?> removes everything between class= and the next > which is more than what you want. class=.+?\" is probably what you were after.
He should use class=".+?" or class='.+?' instead of class=.+?>
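The difference these comments describe can be seen in a small side-by-side sketch. The class and method names below are hypothetical, and the quote-aware pattern is only one possible reading of the commenters' suggestion, not the answer author's code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ClassPatternComparison
{
    // Pattern from the answer: consumes everything from class= up to the next '>'.
    public static string RemoveUpToBracket(string html) =>
        Regex.Replace(html, "class=.+?>", ">",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Quote-aware variant: removes only the class attribute and its quoted value.
    public static string RemoveQuotedValue(string html) =>
        Regex.Replace(html, @"\s+class\s*=\s*([""']).*?\1", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
}
```

For the input <p class="intro" id="first">Hi</p>, RemoveUpToBracket yields <p >Hi</p> and loses the id attribute, while RemoveQuotedValue yields <p id="first">Hi</p>.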
0

All the answers are fine, but it can also be done simply by renaming the attribute: "Your HTML String".Replace("style", "data-tags"); You can replace "class" the same way.


-1
   source = Regex.Replace(source, "(<style.+?</style>)|(<script.+?</script>)", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
   source = Regex.Replace(source, "(<img.+?>)", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
   source = Regex.Replace(source, "(<o:.+?</o:.+?>)", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
   source = Regex.Replace(source, "<!--.+?-->", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
   source = Regex.Replace(source, "class=.+?>", ">", RegexOptions.IgnoreCase | RegexOptions.Singleline);
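   // Convert newlines to <br/>, then replace every tag that is not in the allow-list with a space.
   source = Regex.Replace(source.Replace(System.Environment.NewLine, "<br/>"), @"<(?!/?(?:a|img|br|b|i|u|ul|ol|li)\b)[^>]*>", " ");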

1 Comment

May I request you to please add some context around your source-code. Code-only answers are difficult to understand. It will help the asker and future readers both if you can add more information in your post.
