I need to replace some text in C# using RegEx:

string strSText = "<P>Bulleted list</P><UL><P><LI>Bullet 1</LI><P></P><P>" +
    "<LI>Bullet 2</LI><P></P><P><LI>Bullet 3</LI><P></UL>";

Basically, I need to get rid of the "<P>" and "</P>" tags that were introduced inside the following sequences:

"<UL><P><LI>", 
"</LI><P></P><P><LI>" and
"</LI><P></UL>"

I also need to ignore any spaces between these tags when performing the removal.

So

"</LI><P></P><P><LI>", "</LI>    <P></P><P><LI>", "</LI><P></P><P>   <LI>" or 
"</LI> <P> </P> <P> <LI>"

must all be replaced with

"</LI><LI>"

I tried using the following RegEx match for this purpose:

strSText = Regex.Replace(strSText, "<UL>.*<LI>", "<UL><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, "</LI>.*<LI>", "</LI><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, "</LI>.*</UL>", "</LI></UL>", RegexOptions.IgnoreCase);

But it performs a "greedy" match and results in:

"<P>Bulleted list</P><UL><LI>Bullet 3</LI></UL>"

I then tried using "lazy" match:

strSText = Regex.Replace(strSText, "<UL>.*?<LI>", "<UL><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, "</LI>.*?<LI>", "</LI><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, "</LI>.*?</UL>", "</LI></UL>", RegexOptions.IgnoreCase);

and this results in:

"<P>Bulleted list</P><UL><LI>Bullet 1</LI></UL>"

But I want the following result, which preserves all other data:

"<P>Bulleted list</P><UL><LI>Bullet 1</LI><LI>Bullet 2</LI><LI>Bullet 3</LI></UL>"

  • Don't use regular expressions for parsing HTML. See "What is the best way to parse html in C#?" Commented Sep 11, 2013 at 7:59
  • Would something like strSText.Replace("<UL><P><LI>", "<UL><LI>"); etc... work? Commented Sep 11, 2013 at 8:17

2 Answers

The following regexp matches one or more <P> or </P> tags, each optionally followed by whitespace:

(?:</?P>\s*)+

If you place that between the tags you want to keep, with \s* to allow whitespace right after the first tag, you can get rid of the stray tags, i.e.

strSText = Regex.Replace(strSText, @"<UL>\s*(?:</?P>\s*)+<LI>", "<UL><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, @"</LI>\s*(?:</?P>\s*)+<LI>", "</LI><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, @"</LI>\s*(?:</?P>\s*)+</UL>", "</LI></UL>", RegexOptions.IgnoreCase);
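
To make this easier to try against the sample string from the question, the three replacements can be wrapped in a small helper. This is only a sketch: the names ListCleaner, RemoveStrayParagraphs and StrayP are made up for illustration and are not part of any library. It assumes the only unwanted markup between the list tags is <P> or </P> plus whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (name invented for this example).
public static class ListCleaner
{
    // One or more <P> or </P> tags, each optionally followed by whitespace.
    private const string StrayP = @"(?:</?P>\s*)+";

    public static string RemoveStrayParagraphs(string html)
    {
        // <UL> ... <LI>: drop the <P> tags before the first list item.
        html = Regex.Replace(html, @"<UL>\s*" + StrayP + "<LI>", "<UL><LI>", RegexOptions.IgnoreCase);
        // </LI> ... <LI>: drop the <P> tags between two list items.
        html = Regex.Replace(html, @"</LI>\s*" + StrayP + "<LI>", "</LI><LI>", RegexOptions.IgnoreCase);
        // </LI> ... </UL>: drop the <P> tags after the last list item.
        html = Regex.Replace(html, @"</LI>\s*" + StrayP + "</UL>", "</LI></UL>", RegexOptions.IgnoreCase);
        return html;
    }
}
```

Calling ListCleaner.RemoveStrayParagraphs(strSText) on the string from the question should give "<P>Bulleted list</P><UL><LI>Bullet 1</LI><LI>Bullet 2</LI><LI>Bullet 3</LI></UL>". Because the pattern only consumes <P> tags and whitespace, it cannot run across a whole list item the way .* and .*? do. Note that with RegexOptions.IgnoreCase, lowercase tags would also match and be replaced with the uppercase tags given in the replacement strings.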

Not really an answer to your question, but more of a follow-up to Jonathon's comment: parse the HTML with Html Agility Pack.
