I want to remove the style attribute from HTML tags using C#, so that only the plain HTML tags are returned.
For example,
if String = <p style="margin: 15px 0px; padding: 0px; border: 0px; outline: 0px;">Hello</p>
then it should return String = <p>Hello</p>
The same should apply to all HTML tags: <strong></strong>, <b></b>, etc.
Please help me with this.
-
See: stackoverflow.com/questions/5850718/… – Chris McAtackney, Aug 14, 2014 at 11:15
-
Are you (accidentally) missing the closing quote? – Rob P., Aug 14, 2014 at 11:18
-
@RobP., yes, sorry. Updated post. – CSAT, Aug 14, 2014 at 11:20
-
Probably because this question has been asked a million times. – Inspector Squirrel, Aug 14, 2014 at 12:11
5 Answers
First, as others suggest, an approach using a proper HTML parser is much better. Either use HtmlAgilityPack or CsQuery.
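For the parser route, a minimal sketch with HtmlAgilityPack might look like the following. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (InlineStyleRemover, RemoveStyleAttributes) are made up for illustration and are not part of the library.

```csharp
using HtmlAgilityPack;

public static class InlineStyleRemover
{
    // Hypothetical helper: parses the fragment, drops every style attribute,
    // and returns the re-serialized markup.
    public static string RemoveStyleAttributes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants() walks every node below the document root.
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            // Only element nodes carry attributes; skip text and comment nodes.
            if (node.NodeType == HtmlNodeType.Element)
            {
                node.Attributes.Remove("style");
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling InlineStyleRemover.RemoveStyleAttributes on the <p style="...">Hello</p> input from the question should give back <p>Hello</p>. Note that the parser may normalize other parts of the markup when it serializes, so test it on your real input.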
If you really want a regex solution, here it is:
Replace this pattern: (<.+?)\s+style\s*=\s*(["']).*?\2(.*?>)
With: $1$3
Demo: http://regex101.com/r/qJ1vM1/1
To remove multiple attributes, this should work, since .NET supports the variable-length lookbehind it relies on:
Replace (?<=<[^<>]+)\s+(?:style|class)\s*=\s*(["']).*?\1
With an empty string
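In C#, the multi-attribute pattern could be wired up roughly as below. This is only a sketch: the class and method names (StyleStripper, RemoveStyleAndClass) are hypothetical. The pattern is written as a verbatim string (@"..."), where a literal double quote is doubled ("") and backslashes need no escaping, which avoids the "unrecognized escape sequence" compiler error.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StyleStripper
{
    // Same pattern as above; the " inside the character class is doubled
    // because this is a verbatim string literal.
    private static readonly Regex AttributePattern = new Regex(
        @"(?<=<[^<>]+)\s+(?:style|class)\s*=\s*([""']).*?\1",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: removes quoted style and class attributes
    // that appear inside a tag.
    public static string RemoveStyleAndClass(string html)
    {
        return AttributePattern.Replace(html, string.Empty);
    }
}
```

Usage would be something like string clean = StyleStripper.RemoveStyleAndClass(input);. Unquoted attribute values (style=foo) are not matched by this pattern, which is one more reason to prefer a parser.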
6 Comments
-
Unrecognized escape sequence because of " in string. What should I do? I am using it as @"(<.+?)\s+style\s*=\s*(["']).*?\2(.*?>)", "")
-
class so will I make another regex the same as for style?
-
class, see my edit.
As others said, you can use HTML Agility Pack, which has this nice tool: HTML Agility Pack test, which shows you what you're doing.
Other than that, your options are regex, which is usually not recommended for HTML, or simply looping over all the characters of your input. When you hit a <, keep characters until the first whitespace, then drop everything up to the closing >. That should take care of most basic cases, but you'll have to test it.
Here's a little snippet that will do it:
// requires: using System.Text;
void Main()
{
    // your input
    string input = @"<p style=""margin: 15px 0px; padding: 0px; border: 0px; outline: 0px;"">Hello</p>";
    // temp variables
    var sb = new StringBuilder();
    bool inside = false; // true while between '<' and '>'
    bool delete = false; // true while skipping a tag's attributes
    // analyze string
    for (int i = 0; i < input.Length; i++)
    {
        char c = input[i];
        // special case, start bracket
        if (c == '<')
        {
            inside = true;
            delete = false;
        }
        // special case, close bracket
        else if (c == '>')
        {
            inside = false;
            delete = false;
        }
        // other characters inside a tag
        else if (inside)
        {
            // once you hit whitespace (space, tab, newline), ignore the rest until the closing bracket
            if (char.IsWhiteSpace(c))
                delete = true;
        }
        // add if needed
        if (!delete)
            sb.Append(c);
    }
    var result = sb.ToString(); // -> holds: "<p>Hello</p>"
}
3 Comments
I usually use the code below to remove style and script blocks, images, Office (o:) tags, comments and class attributes from an Outlook message before saving it to the database:
desc = Regex.Replace(desc, "(<style.+?</style>)|(<script.+?</script>)", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
desc = Regex.Replace(desc, "(<img.+?>)", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
desc = Regex.Replace(desc, "(<o:.+?</o:.+?>)", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
desc = Regex.Replace(desc, "<!--.+?-->", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
desc = Regex.Replace(desc, "class=.+?>", ">", RegexOptions.IgnoreCase | RegexOptions.Singleline);
desc = Regex.Replace(desc, @"class=.+?\s", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline);
2 Comments
-
class=.+?> removes everything between class= and the next >, which is more than what you want. class=.+?\" is probably what you were after.
-
class=".+?" or class='.+?' instead of class=.+?>
source = Regex.Replace(source, "(<style.+?</style>)|(<script.+?</script>)", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
source = Regex.Replace(source, "(<img.+?>)", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
source = Regex.Replace(source, "(<o:.+?</o:.+?>)", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
source = Regex.Replace(source, "<!--.+?-->", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
source = Regex.Replace(source, "class=.+?>", ">", RegexOptions.IgnoreCase | RegexOptions.Singleline);
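// Replace every tag that is not on the allow-list with a space. A negative lookahead is used because a character class
// such as [^(a|img|...)] excludes single characters, not tag names; br is listed so the <br/> tags inserted here survive.
source = Regex.Replace(source.Replace(System.Environment.NewLine, "<br/>"), @"<(?!/?(?:a|img|b|br|i|u|ul|ol|li)\b)[^>]*>", " ");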