C# Regex matching

Question

I need to replace some text in C# using RegEx:

string strSText = "<P>Bulleted list</P><UL><P><LI>Bullet 1</LI><P></P><P><LI>Bullet 2</LI><P></P><P><LI>Bullet 3</LI><P></UL>";

Basically, I need to get rid of the "<P>" tag(s) introduced within "<UL><P><LI>", "</LI><P></P><P><LI>" and "</LI><P></UL>".

I also need to ignore any spaces between these tags when performing the removal.

So "</LI><P></P><P><LI>", "</LI>    <P></P><P><LI>", "</LI><P></P><P>   <LI>" or "</LI> <P> </P> <P> <LI>" must all be replaced with "</LI><LI>".

I tried using the following RegEx match for this purpose:

strSText = Regex.Replace(strSText, "<UL>.*<LI>", "<UL><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, "</LI>.*<LI>", "</LI><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, "</LI>.*</UL>", "</LI></UL>", RegexOptions.IgnoreCase);

But it performs a "greedy" match and results in:

"<P>Bulleted list</P><UL><LI>Bullet 3</LI></UL>"

I then tried using "lazy" match:

strSText = Regex.Replace(strSText, "<UL>.*?<LI>", "<UL><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, "</LI>.*?<LI>", "</LI><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, "</LI>.*?</UL>", "</LI></UL>", RegexOptions.IgnoreCase);

and this results in:

"<P>Bulleted list</P><UL><LI>Bullet 1</LI></UL>"

But I want the following result, which preserves all other data:

"<P>Bulleted list</P><UL><LI>Bullet 1</LI><LI>Bullet 2</LI><LI>Bullet 3</LI></UL>"

Don't use regular expressions for parsing HTML. See "What is the best way to parse html in C#?"
– Jonathon Reinhart, Commented Sep 11, 2013 at 7:59
Would something like strSText.Replace("<UL><P><LI>", "<UL><LI>"); etc... work?
– DGibbs, Commented Sep 11, 2013 at 8:17

krisku · Accepted Answer · edited by Ed Chapel 2013-09-11 08:52:56Z

1

The following regexp matches one or more <P> or </P> tags:

(?:</?P>\s*)+

So if you place that between the other tags you have, you can get rid of them, i.e.

strSText = Regex.Replace(strSText, @"<UL>\s*(?:</?P>\s*)+<LI>", "<UL><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, @"</LI>\s*(?:</?P>\s*)+<LI>", "</LI><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, @"</LI>\s*(?:</?P>\s*)+</UL>", "</LI></UL>", RegexOptions.IgnoreCase);
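For anyone who wants to try this end to end, here is a minimal sketch that wraps the three replacements above in one helper. The class and method names (ListMarkupCleaner, RemoveStrayParagraphTags) are made up for illustration and are not part of the original answer; the patterns are the ones shown above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class ListMarkupCleaner
{
    // One or more <P> or </P> tags, each optionally followed by whitespace.
    private const string StrayParagraphs = @"(?:</?P>\s*)+";

    // Removes <P>/</P> tags (and the whitespace around them) that sit directly
    // between list tags. Text outside those spots is left untouched.
    public static string RemoveStrayParagraphTags(string html)
    {
        const RegexOptions options = RegexOptions.IgnoreCase;

        // <UL> ... <LI>
        html = Regex.Replace(html, @"<UL>\s*" + StrayParagraphs + "<LI>", "<UL><LI>", options);
        // </LI> ... <LI>
        html = Regex.Replace(html, @"</LI>\s*" + StrayParagraphs + "<LI>", "</LI><LI>", options);
        // </LI> ... </UL>
        html = Regex.Replace(html, @"</LI>\s*" + StrayParagraphs + "</UL>", "</LI></UL>", options);

        return html;
    }
}
```

Calling ListMarkupCleaner.RemoveStrayParagraphTags(strSText) on the string from the question should give the result asked for there. Because each pattern only allows whitespace and <P>/</P> tags between the two anchor tags, it cannot run across a bullet's text the way ".*" or ".*?" does. One side effect to be aware of: with RegexOptions.IgnoreCase, lowercase list tags also match, and the matched tags are rewritten in uppercase by the replacement strings.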

edited Sep 11, 2013 at 8:52 by Ed Chapel

answered Sep 11, 2013 at 8:15 by krisku


Christian HC · Accepted Answer · 2013-09-11 08:58:43Z

1

Not really an answer to your question, but more of a comment to Jonathon: Parse HTML with HTMLAgilityPack

answered Sep 11, 2013 at 8:58 by Christian HC
