c# regex parsing

Question

I am trying to parse data from a very long HTML document. I am pasting only the part I am interested in here:

Technical Details

<div class="content">

    <ul style="list-style: disc; padding-left: 25px;">

      <li>1920x1080 Full HD 60p/24p Recording w/7MP still image</li>
      <li>32GB Flash Memory for up to 13 hours (LP mode) of HD recording</li>
      <li>Project your videos on the go anywhere, anytime.</li>
      <li>Wide Angle G lens to capture everything you want.</li>
      <li>Back-illuminated "Exmor R" CMOS sensor for superb low-light video</li>

    </ul>

  <div id="technicalProductFeatures"></div>

I need to start parsing from:

<div class="content">

until

<ul

and then until

</ul>

I have tried the following regex, but it did not work:

Regex specsRegex = new Regex ("<div class=\"content\">[\\s]*<ul.[\\s]*</ul>");

This gives me nothing.

Another issue is that sometimes there is a line break between the initial div and ul tags and sometimes there is not, like:

<div class="content">
<ul style="list-style: disc; padding-left: 25px;">

or

<div class="content">

<ul style="list-style: disc; padding-left: 25px;">

Thanks for any help.

Won't someone think of the children?! stackoverflow.com/questions/1732348/…
– rrhartjr, Commented Oct 3, 2011 at 14:24
I really encourage you to read answers to this question: stackoverflow.com/questions/1732348/…
– Łukasz Wiatrak, Commented Oct 3, 2011 at 14:27
Lucasus has clearly spent three minutes searching for that link :)
– sehe, Commented Oct 3, 2011 at 14:30

Steve Wortham · Accepted Answer · 2011-10-03 15:14:13Z

3

I wouldn't suggest using regular expressions for this. It's like trying to fix a tire with a hammer. The hammer is a good tool, but it's not for everything.

I'd use Html Agility Pack. It's not clear to me exactly what you're looking to extract. But I'll assume it's the list items. So you'd do something like this...

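// Requires the HtmlAgilityPack package; YourHtmlGoesHere is the page source as a string.
var hdoc = new HtmlAgilityPack.HtmlDocument();
hdoc.LoadHtml(YourHtmlGoesHere);

// SelectNodes returns null when nothing matches, so check before iterating.
var MatchingNodes = hdoc.DocumentNode.SelectNodes("/html/body/div/ul/li");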

As you can see, the syntax for the Html Agility Pack is based on XPath and is much simpler for this task. It's also much more robust: something as silly as nested tags or a comment is not going to throw it off. Those types of things can throw off even the most carefully written regular expression in this scenario.
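To make the XPath approach concrete, here is a minimal sketch that pulls the list items out of the fragment from the question. It assumes the HtmlAgilityPack NuGet package is installed and uses C# top-level statements (C# 9 or later). The relative XPath `//div[@class='content']/ul/li` and the variable names are my own choices for illustration, not part of the answer above; a relative path is used here because the input is only a fragment with no html or body element.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

// Sample input modelled on the question (assumption: a fragment, not a full page).
string html =
    "<div class=\"content\">\n\n" +
    "    <ul style=\"list-style: disc; padding-left: 25px;\">\n" +
    "      <li>Project your videos on the go anywhere, anytime.</li>\n" +
    "      <li>Wide Angle G lens to capture everything you want.</li>\n" +
    "    </ul>\n\n" +
    "  <div id=\"technicalProductFeatures\"></div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// '//' searches at any depth, so the div does not have to sit directly under body.
HtmlNodeCollection nodes =
    doc.DocumentNode.SelectNodes("//div[@class='content']/ul/li");

var items = new List<string>();
if (nodes != null) // SelectNodes returns null when nothing matches
{
    foreach (HtmlNode li in nodes)
    {
        // InnerText keeps HTML entities encoded; DeEntitize decodes them.
        items.Add(HtmlEntity.DeEntitize(li.InnerText).Trim());
    }
}

foreach (string item in items)
{
    Console.WriteLine(item);
}
```

Whether a line break follows the div makes no difference here, because the parser works on the element tree rather than on the raw text.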

UPDATE

If you were determined to create a quick & dirty regular expression for this, it'd be something like this...

<div class="content">.*?</ul>

Ordinarily, the .*? part matches any character except a line feed, zero or more times, as few times as possible. So be sure to use RegexOptions.Singleline so that the . will match line feeds as well. This should work for the example you've given, but a commented-out bit of code containing </ul>, or a nested <ul></ul>, could throw it off.
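The effect of RegexOptions.Singleline can be seen by running the same pattern with and without it. This is a small sketch using only System.Text.RegularExpressions and C# top-level statements (C# 9 or later); the sample string and variable names are illustrative assumptions based on the HTML in the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input modelled on the question's HTML, including the blank line after the div.
string html =
    "<div class=\"content\">\n\n" +
    "    <ul style=\"list-style: disc; padding-left: 25px;\">\n" +
    "      <li>Wide Angle G lens to capture everything you want.</li>\n" +
    "    </ul>\n\n" +
    "  <div id=\"technicalProductFeatures\"></div>";

const string pattern = "<div class=\"content\">.*?</ul>";

// Without Singleline, '.' cannot cross the '\n' right after the div,
// so the lazy match never reaches </ul>.
Match withoutOption = Regex.Match(html, pattern);

// With Singleline, '.' also matches '\n', so the match runs from the div
// to the first </ul> that follows it.
Match withOption = Regex.Match(html, pattern, RegexOptions.Singleline);

Console.WriteLine(withoutOption.Success); // False
Console.WriteLine(withOption.Success);    // True
Console.WriteLine(withOption.Value);
```

Note that Singleline is different from RegexOptions.Multiline, which only changes how ^ and $ behave and does not affect the dot.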

UPDATE #2

This will grab everything between the <ul></ul>...

(?<=<div class="content">\s*<ul[^>]*>).*?(?=</ul>)

Again, be sure to use RegexOptions.Singleline.
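As a rough sketch of how that pattern might be used from C#, the snippet below first captures the inside of the list and then, as a second step that goes beyond the answer, splits it into individual items. The helper name ExtractItems, the `<li>` pattern, and the sample strings are assumptions made for illustration. It relies on .NET's support for variable-length lookbehind (the `\s*` inside `(?<=...)`), which many other regex engines do not allow, and it is written with top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the text of each <li> inside the content div's list.
List<string> ExtractItems(string html)
{
    // Step 1: everything between <ul ...> and the first </ul> after the content div.
    // \s* tolerates zero or more line breaks or spaces between the div and the ul.
    Match list = Regex.Match(
        html,
        @"(?<=<div class=""content"">\s*<ul[^>]*>).*?(?=</ul>)",
        RegexOptions.Singleline);

    var items = new List<string>();
    if (!list.Success)
    {
        return items;
    }

    // Step 2: capture the text of each list item inside that block.
    foreach (Match li in Regex.Matches(
        list.Value, @"<li[^>]*>(.*?)</li>", RegexOptions.Singleline))
    {
        items.Add(li.Groups[1].Value.Trim());
    }
    return items;
}

// The two layouts described in the question: with and without a blank line.
string withBlankLine =
    "<div class=\"content\">\n\n" +
    "    <ul style=\"list-style: disc; padding-left: 25px;\">\n" +
    "      <li>Project your videos on the go anywhere, anytime.</li>\n" +
    "      <li>Wide Angle G lens to capture everything you want.</li>\n" +
    "    </ul>";
string withoutBlankLine = withBlankLine.Replace("\n\n", "\n");

List<string> first = ExtractItems(withBlankLine);
List<string> second = ExtractItems(withoutBlankLine);

Console.WriteLine(first.Count);  // 2
Console.WriteLine(second.Count); // 2
```

The same caveats apply as before: a nested list or a commented-out `</ul>` inside the block would still confuse this approach.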

edited Oct 3, 2011 at 15:14

answered Oct 3, 2011 at 14:31


7 Comments

Val Nolav Over a year ago

I don't have HtmlAgilityPack loaded. Also, I wanna learn how I can strip that data with Regex. Regex is not the best tool, but I still wanna learn how I can do that.

Justin Morgan Over a year ago

Based on his description, I think you want <div class="content">.*?(?=<ul) and then <ul.*?</ul>. It sounds like he wants to pull two result sets out of there.

Val Nolav Over a year ago

Thanks Justin. Your code is good but still does not work. I changed it as [\\s]*(?=<ul) however I parsed only the initial line with this change and the result was: <div class="content">

Val Nolav Over a year ago

Steven I am sorry to take your precious time. Your code works but the problem is I need to start parsing after <div class="content"> <ul style="list-style: disc; padding-left: 25px;"> not after <div class="content"> and sometimes there is a line break after <div class="content"> and sometimes there is not.

Val Nolav Over a year ago

Thanks Steve, I am trying your update. Once I learn the Regex style, I will use HtmlAgilityPack in my actual code. I installed HtmlAgilityPack. However, I really need to learn the Regex part too for my future projects.


Hans Keﬆing · Accepted Answer · 2011-10-03 14:24:01Z

2

Regex isn't the best tool to parse html (to put it mildly). Use HtmlAgilityPack.

answered Oct 3, 2011 at 14:24


7 Comments

Val Nolav Over a year ago

probably not but I wanna use Regex

sehe Over a year ago

@ValNolav: nobody is stopping you. Just don't expect many people to help you - this site is about helping and answers are judged by their quality. That means that many people are not about to spend much time writing tedious answers of lower quality...

Justin Morgan Over a year ago

@Val - Why do you want to use regex? If you think it will save you time or effort, it won't. Regex is not the right tool for this.

Val Nolav Over a year ago

I just wanna learn how I can do that. There are many occasions I meet a line break and I wanna parse it with Regex not only with HTML but also in other text files.

Justin Morgan Over a year ago

@Matt - What he's trying to do is not impossible; it's been shown several times that regex can be part of an HTML parser, especially for a limited set of HTML. Full-featured regex engines can incorporate recursion, balanced groups, etc. that make parsing HTML at least possible. However, it is an enormous headache, and far more complicated than it seems. It's really not something anyone should do in most circumstances.
