So I'm trying to parse the following data into a CSV. From my reading, it sounds like the best way to go about it is to use the Html Agility Pack (HAP), since it has a robust parser.

As of right now, the WPF WebBrowser control content is being accessed by:

dynamic doc = this.wbControl.Document;

Content

        <div class="content">
                <fieldset>
                    <ul class="fieldsetr">
                        <li class="row medium">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">Sender:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>[email protected]</em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium alt">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">Recipient:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>[email protected]</em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">Message ID:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>2342342345235</em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium alt">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">Message size:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>18.74 KB
                                    </em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">Date and time received:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>11/27/2012 6:17:22 AM</em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium alt">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">Date and time filtered:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>11/27/2012 6:17:22 AM</em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium">
                            <!-- Connector Details -->

                        </li>                            
                        <li class="row medium alt">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">First delivery attempt:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>11/27/2012 6:17:23 AM</em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">Final delivery attempt:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>11/27/2012 6:17:23 AM</em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium alt">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">From IP address:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>1.2.3.4 &lt;unknown&gt;</em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">To IP address:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>4.3.2.1 &lt;mail.example2.com&gt; </em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium alt">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">Filtering results:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>Passed Filtering</em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">Delivery result:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <span><em>Delivered: 470 2.4.0 &lt;2342342345235&gt; [InternalId=2321233] Queued mail for delivery</em></span>
                                </div>
                            </div>
                        </li>
                    </ul>
                </fieldset>
        </div>

What is the best way for me to convert this data? This is only one record, but more records would be added.

Edit

Ended up using the following code to test it out:

            HtmlAgilityPack.HtmlDocument docHAP = new HtmlAgilityPack.HtmlDocument();
            docHAP.LoadHtml(doc.Body.InnerHtml.ToString());

            foreach(HtmlNode emNode in docHAP.DocumentNode.SelectNodes("//em"))
            {
                MessageBox.Show(emNode.InnerText.ToString());
            }

If anyone has a more efficient solution, please feel free to let me know.

1 Answer

Yes, use the Html Agility Pack. It is an open-source HTML parser for .NET.

What is exactly the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

You can use this to query HTML and extract whatever data you wish.

Simply by using XPath you can get any particular element/attribute/text data.

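// requires: using HtmlAgilityPack;
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();

// LoadHtml expects a string, and the WebBrowser's Document is an mshtml COM object,
// so pass its markup rather than the object itself
dynamic browserDoc = this.wbControl.Document;
doc.LoadHtml((string)browserDoc.Body.InnerHtml);

// get all the 'em' tags from HTML (SelectNodes returns null when nothing matches)
var emNodes = doc.DocumentNode.SelectNodes("//em");
if (emNodes != null)
{
    foreach (HtmlNode emNode in emNodes)
    {
        if (emNode.Attributes["class"] != null)
        {
            var value = emNode.Attributes["class"].Value;
        }
    }
}

// get all the 'em' tags where 'class' attribute value is 'disable' from HTML
var labelNodes = doc.DocumentNode.SelectNodes("//em[@class='disable']");
if (labelNodes != null)
{
    foreach (HtmlNode emNode in labelNodes)
    {
        // ...
    }
}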
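
For the sample record in the question, one possible way to go from the markup to a CSV line is to walk each li row and pair the label (the em with class 'disable') with the value (the em inside the 'clip' div). This is only a sketch under those assumptions about the markup; the names RecordParser, ParseRecord and ToCsvLine are made up for illustration and are not part of HAP. It uses SelectNodes, SelectSingleNode and HtmlEntity.DeEntitize from HAP, and quotes every CSV field to keep the escaping simple.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class RecordParser
{
    // Returns the (label, value) pairs found in one record's HTML.
    public static List<KeyValuePair<string, string>> ParseRecord(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var pairs = new List<KeyValuePair<string, string>>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//li[contains(@class,'row')]");
        if (rows == null) return pairs;

        foreach (HtmlNode row in rows)
        {
            HtmlNode label = row.SelectSingleNode(".//em[@class='disable']");
            HtmlNode value = row.SelectSingleNode(".//div[@class='clip']//em");

            // Skip rows without both parts, e.g. the empty "Connector Details" row.
            if (label == null || value == null) continue;

            // DeEntitize turns &lt; and &gt; back into < and >; Trim drops the
            // stray whitespace and line breaks around values such as "18.74 KB".
            pairs.Add(new KeyValuePair<string, string>(
                HtmlEntity.DeEntitize(label.InnerText).Trim().TrimEnd(':'),
                HtmlEntity.DeEntitize(value.InnerText).Trim()));
        }
        return pairs;
    }

    // Builds one CSV line: every field is quoted and embedded quotes are doubled.
    public static string ToCsvLine(IEnumerable<string> fields)
    {
        return string.Join(",", fields.Select(f => "\"" + f.Replace("\"", "\"\"") + "\""));
    }
}
```

With the HTML string taken from the browser control (for example via Body.InnerHtml, as in the question's edit), RecordParser.ToCsvLine(pairs.Select(p => p.Key)) would give a header line and RecordParser.ToCsvLine(pairs.Select(p => p.Value)) the data line for that record. For more records, write the header once and append one data line per record.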

2 Comments

Well you hit the nail on the head with the XPATH and XSLT stuff. That part always gets me. Could you perchance provide an example for even just one section above?
It certainly is pointing me in the right direction. The LoadHtml method won't work, though, as this.wbControl.Document is an mshtml document. Trying to find out how to convert it to a HAP HtmlDocument atm to further test. That said, it also threw an error on DocumentElement, but it seems like it will take DocumentNode instead.
