I am trying to parse data out of a very long HTML document. I am pasting only the part I am interested in:

Technical Details

<div class="content">

    <ul style="list-style: disc; padding-left: 25px;">

      <li>1920x1080 Full HD 60p/24p Recording w/7MP still image</li>
      <li>32GB Flash Memory for up to 13 hours (LP mode) of HD recording</li>
      <li>Project your videos on the go anywhere, anytime.</li>
      <li>Wide Angle G lens to capture everything you want.</li>
      <li>Back-illuminated "Exmor R" CMOS sensor for superb low-light video</li>

    </ul>

  <div id="technicalProductFeatures"></div>

I need to start matching at:

<div class="content">

continue up to

<ul

and then on through

</ul>

I tried the following regex, but it did not work:

Regex specsRegex = new Regex ("<div class=\"content\">[\\s]*<ul.[\\s]*</ul>");

This returns no matches.

Another issue is that the whitespace between the opening div and ul tags varies. Sometimes there is a blank line between them and sometimes there is not:

<div class="content">
<ul style="list-style: disc; padding-left: 25px;">

or

<div class="content">

<ul style="list-style: disc; padding-left: 25px;">

Thanks for any help.

2 Answers

I wouldn't suggest using regular expressions for this. It's like trying to fix a tire with a hammer. The hammer is a good tool, but it's not for everything.

I'd use Html Agility Pack. It's not clear to me exactly what you're looking to extract. But I'll assume it's the list items. So you'd do something like this...

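// Requires the HtmlAgilityPack NuGet package.
var hdoc = new HtmlAgilityPack.HtmlDocument();
hdoc.LoadHtml(YourHtmlGoesHere);

// Select every <li> in a <ul> directly inside <div class="content">, wherever that div sits.
// SelectNodes returns null when nothing matches.
var matchingNodes = hdoc.DocumentNode.SelectNodes("//div[@class='content']/ul/li");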

As you can see, the syntax for the Html Agility Pack is based on XPath and is much simpler for this task. It's also much more robust: something as silly as nested tags or a comment is not going to throw it off. Those kinds of things can throw off even the most carefully written regular expression in this scenario.


UPDATE

If you were determined to create a quick & dirty regular expression for this, it'd be something like this...

<div class="content">.*?</ul>

Ordinarily, . matches any character except a line feed, and .*? repeats it zero or more times, as few times as possible. So be sure to use RegexOptions.Singleline so that . will match line feeds as well. This should work for the example you've given, but a commented-out bit of markup with </ul> in it, or a nested <ul></ul>, could throw it off.
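As a rough sketch of how that could be wired up in C# (the class and method names below are made up for illustration, and this assumes the whole page is already in a string):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class ContentBlockSketch
{
    // Returns the first <div class="content">...</ul> span, or null if there is none.
    public static string FindBlock(string html)
    {
        // Singleline makes '.' match '\n' as well, so the lazy .*? can cross line breaks.
        Match m = Regex.Match(
            html,
            "<div class=\"content\">.*?</ul>",
            RegexOptions.Singleline);

        return m.Success ? m.Value : null;
    }
}
```

Calling ContentBlockSketch.FindBlock(html) on the snippet from the question should give back everything from the opening div through the first </ul>, whether or not there is a blank line before the <ul>.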

UPDATE #2

This will grab everything between the opening <ul ...> tag and the closing </ul>...

(?<=<div class="content">\s*<ul[^>]*>).*?(?=</ul>)

Again, be sure to use RegexOptions.Singleline.
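Here is a minimal sketch of using that pattern from C#. The class and method names are hypothetical, and the second step, which pulls out the individual <li> texts, is an extra assumption about what you want rather than something the pattern does by itself. Note that the \s* inside the lookbehind relies on .NET allowing variable-length lookbehind; many other regex engines do not.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class ListItemSketch
{
    public static List<string> ExtractItems(string html)
    {
        var items = new List<string>();

        // Step 1: grab what sits between <ul ...> and </ul>, but only when the <ul>
        // follows <div class="content"> with nothing but whitespace in between.
        Match inner = Regex.Match(
            html,
            @"(?<=<div class=""content"">\s*<ul[^>]*>).*?(?=</ul>)",
            RegexOptions.Singleline);

        if (!inner.Success)
            return items;

        // Step 2 (assumed goal): collect the text of each plain <li>...</li>.
        foreach (Match li in Regex.Matches(inner.Value, @"<li>(.*?)</li>", RegexOptions.Singleline))
            items.Add(li.Groups[1].Value.Trim());

        return items;
    }
}
```

Because \s* matches zero or more whitespace characters, this should behave the same whether the <ul> comes right after the div, on the next line, or after a blank line. It still breaks on nested lists, or on <li> tags that carry attributes.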

Comments

I don't have HtmlAgilityPack installed, and I also want to learn how to extract that data with Regex. Regex is not the best tool, but I still want to learn how to do it.
Based on his description, I think you want <div class="content">.*?(?=<ul) and then <ul.*?</ul>. It sounds like he wants to pull two result sets out of there.
Thanks Justin. Your code is good but it still does not work. I changed it to [\\s]*(?=<ul), but with this change I matched only the first line, and the result was: <div class="content">
Steven I am sorry to take your precious time. Your code works but the problem is I need to start parsing after <div class="content"> <ul style="list-style: disc; padding-left: 25px;"> not after <div class="content"> and sometimes there is a line break after <div class="content"> and sometimes there is not.
Thanks Steve, I am trying your update. Once I learn the Regex approach, I will use HtmlAgilityPack in my actual code. I have installed HtmlAgilityPack. However, I really need to learn the Regex part too, for my future projects.

Regex isn't the best tool to parse html (to put it mildly). Use HtmlAgilityPack.

Comments

probably not but I wanna use Regex
@ValNolav: nobody is stopping you. Just don't expect many people to help you - this site is about helping and answers are judged by their quality. That means that many people are not about to spend much time writing tedious answers of lower quality...
@Val - Why do you want to use regex? If you think it will save you time or effort, it won't. Regex is not the right tool for this.
I just wanna learn how I can do that. I often run into line breaks that I want to handle with Regex, not only in HTML but also in other text files.
@Matt - What he's trying to do is not impossible; it's been shown several times that regex can be part of an HTML parser, especially for a limited set of HTML. Full-featured regex engines can incorporate recursion, balanced groups, etc. that make parsing HTML at least possible. However, it is an enormous headache, and far more complicated than it seems. It's really not something anyone should do in most circumstances.