C# regular expressions with HTML strings

Question

I'm working on a small assignment that requires the use of regular expressions with HTML strings. My current problem is properly obtaining strings enclosed within HTML tags.

For instance:

I have a string

<p>&lt;Placeholder&gt;</p>

I've been able to obtain the contents with the following regex

private string Unescape(){
    string s = WebUtility.HtmlDecode("<p>&lt;Placeholder&gt;</p>");
    string dec = Regex.Replace(s, "^<.*?>|^<.*?><.*?>", "");
    return Regex.Replace(dec, "</.*?>$|</.*?></.*?>$", "");
}

Which would return:

<Placeholder>

However, should the string contain an additional HTML tag, e.g.:

<p><strong>Placeholder</strong></p>

I would get this

<strong>Placeholder

It appears I'm only able to successfully remove the closing tag(s), but I can't do the same with the opening tag(s). Could anybody tell me where I've gone wrong?

EDIT:

To summarize, is there a way for me to treat the string enclosed within HTML tags as literal? To cover the possibility that the string could contain special characters (e.g. > <)

Try using HtmlAgilityPack http://htmlagilitypack.codeplex.com/
– animaonline, Commented Oct 9, 2012 at 6:59
I'm trying to avoid libraries if possible. But I'll check it out!
– Winz, Commented Oct 9, 2012 at 7:07
In my own project, I included the necessary source code, there's about 15 files. So it's pretty compact ;) I believe regex is an overkill solution. Good luck anyway!
– animaonline, Commented Oct 9, 2012 at 7:09
"I'm trying to avoid libraries if possible." Reinventing the wheel is not good. You should use libraries if possible, unless you have a really good reason not to.
– dan1111, Commented Oct 9, 2012 at 7:29

stema · Accepted Answer · 2012-10-09 07:57:34Z

I am not sure you will be happy using regex on HTML, but I want to explain what causes your "mis"match:

An alternation tries its alternatives from left to right and takes the first one that succeeds; it does not go on to look for a longer match. So when you search at the start for

^<.*?>|^<.*?><.*?>

on the string

<p><strong>Placeholder</strong></p>

the first alternative, ^<.*?>, matches <p>, so the match succeeds there and the second alternative is never tried. If you want to match <p><strong> at the start, you should put the longer alternative first. This only matters for the pattern at the start of the string. For the end of the string your ordering is fine, because the lazy .*? in </.*?>$ keeps expanding until a > is followed by the end of the string, so it already covers </strong></p>.
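The effect of the ordering can be checked directly. The following is a minimal sketch, not part of the original code: the class name AlternationOrderDemo and the helper FirstMatch are hypothetical names chosen for illustration. It runs both orderings against the second example string and prints what each one matches.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationOrderDemo
{
    // Hypothetical helper: returns the text of the first match of pattern in input.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string input = "<p><strong>Placeholder</strong></p>";

        // Short alternative first: the engine succeeds with one tag
        // and never tries the two-tag alternative.
        Console.WriteLine(FirstMatch(input, "^<.*?>|^<.*?><.*?>"));   // <p>

        // Long alternative first: two tags are consumed when two are present.
        Console.WriteLine(FirstMatch(input, "^<.*?><.*?>|^<.*?>"));   // <p><strong>
    }
}
```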

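So for your second example this would work:

private string Unescape(){
    // Second example from the question: two opening and two closing tags
    string s = WebUtility.HtmlDecode("<p><strong>Placeholder</strong></p>");
    // Longer alternative first, so both opening tags are removed
    string dec = Regex.Replace(s, "^<.*?><.*?>|^<.*?>", "");
    return Regex.Replace(dec, "</.*?>$|</.*?></.*?>$", "");
}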

==> The ordering inside an alternation can be important

An alternative would be to use a quantifier instead of an alternation:

string dec = Regex.Replace(s, "^(?:<.*?>)+", "");
return Regex.Replace(dec, "(?:</.*?>)+$", "");

This also works for more than two tags.
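One caveat concerns the order of operations. If the string is decoded first, &lt;Placeholder&gt; becomes <Placeholder>, and the tag patterns can no longer tell it apart from real markup, so the quantifier version would remove it as well. A minimal sketch that strips the tags first and decodes afterwards is shown below. It assumes the input is a single text fragment wrapped in well-formed tags, with no attribute values containing > and no line breaks inside tags; the names TagStripper and StripOuterTags are hypothetical.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Removes the run of tags at the start and the run of closing tags at the end,
    // then decodes HTML entities so that escaped "<" and ">" survive as literal text.
    public static string StripOuterTags(string html)
    {
        string withoutOpening = Regex.Replace(html, "^(?:<.*?>)+", "");
        string withoutClosing = Regex.Replace(withoutOpening, "(?:</.*?>)+$", "");
        return WebUtility.HtmlDecode(withoutClosing);
    }

    public static void Main()
    {
        Console.WriteLine(StripOuterTags("<p>&lt;Placeholder&gt;</p>"));             // <Placeholder>
        Console.WriteLine(StripOuterTags("<p><strong>Placeholder</strong></p>"));    // Placeholder
        Console.WriteLine(StripOuterTags("<div><p><em>Placeholder</em></p></div>")); // Placeholder
    }
}
```

Decoding last means the escaped characters are still entities while the patterns run, which is what lets the enclosed string be treated as literal text.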
