3

I would like to load a dynamic URL that returns an HTML page and read it like an XML file, node by node (by HTML tag). Is this possible?

For example, given this HTML:

            <table class="bidders" cellpadding="0" cellspacing="0"> 

            <tr class="bidRow4"> 
                <td>kucik (automata)</td> 
                <td class="right">9 374 Ft</td> 
                <td class="bidders_date">2010-06-10 18:19:52</td> 
            </tr> 

            <tr class="bidRow4"> 
                <td>macszaf (automata)</td> 
                <td class="right">9 373 Ft</td> 
                <td class="bidders_date">2010-06-10 18:19:52</td> 
            </tr> 

            <tr class="bidRow2"> 
                <td>kucik (automata)</td> 
                <td class="right">9 372 Ft</td> 
                <td class="bidders_date">2010-06-10 18:19:42</td> 
            </tr> 

            <tr class="bidRow2"> 
                <td>macszaf (automata)</td> 
                <td class="right">9 371 Ft</td> 
                <td class="bidders_date">2010-06-10 18:19:42</td> 
            </tr> 

            <tr class="bidRow0"> 
                <td>kucik (automata)</td> 
                <td class="right">9 370 Ft</td> 
                <td class="bidders_date">2010-06-10 18:19:32</td> 
            </tr> 

            <tr class="bidRow0"> 
                <td>macszaf (automata)</td> 
                <td class="right">9 369 Ft</td> 
                <td class="bidders_date">2010-06-10 18:19:32</td> 
            </tr> 

            <tr class="bidRow8"> 
                <td>kucik (automata)</td> 
                <td class="right">9 368 Ft</td> 
                <td class="bidders_date">2010-06-10 18:19:22</td> 
            </tr> 

            <tr class="bidRow8"> 
                <td>macszaf (automata)</td> 
                <td class="right">9 367 Ft</td> 
                <td class="bidders_date">2010-06-10 18:19:22</td> 
            </tr> 

            <tr class="bidRow6"> 
                <td>kucik (automata)</td> 
                <td class="right">9 366 Ft</td> 
                <td class="bidders_date">2010-06-10 18:19:12</td> 
            </tr> 

            <tr class="bidRow6"> 
                <td>macszaf (automata)</td> 
                <td class="right">9 365 Ft</td> 
                <td class="bidders_date">2010-06-10 18:19:12</td> 
            </tr> 

        </table> 

I want to parse this into a ListView (or a Grid), creating one row per record. Each tr is a separate row, and each td within a given tr is a column in that row.

It also needs to be as fast as possible, because it will refresh every 5 seconds.

Is there any library for this?

4 Answers

8

I recommend HTML Agility Pack. It doesn't require valid HTML, and it builds an HtmlDocument that works much like XmlDocument. You'll have to handle the GUI part yourself.
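
A minimal sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced and the markup matches the table in the question. The class name `BidTableReader` and the method `ReadRows` are made up for illustration; `HtmlDocument.LoadHtml`, `SelectNodes`, `Elements` and `HtmlEntity.DeEntitize` are the library calls used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

// Hypothetical helper: turns the bidders table into plain string arrays.
public static class BidTableReader
{
    // Returns one string[] per <tr>, with one entry per <td> in that row.
    public static List<string[]> ReadRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant parser: does not require well-formed markup

        var result = new List<string[]>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//table[@class='bidders']//tr");
        if (rows == null)
        {
            return result;
        }

        foreach (var row in rows)
        {
            // Decode entities such as &amp; and trim the padding around each cell.
            var cells = row.Elements("td")
                           .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                           .ToArray();
            result.Add(cells);
        }
        return result;
    }
}
```

Each returned array can then be added to a ListView or grid as one row. Downloading the page is left out here; the library's `HtmlWeb` class or any HTTP client could supply the HTML string.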

0

Sure, it's possible. But be warned: a compliant XML processor is supposed to treat anything that's not well-formed as a fatal error. That means it's only going to work on documents that pass validation for XHTML Strict.
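
As a rough illustration of the XML route, here is a sketch using LINQ to XML from the .NET base class library. It assumes the input is only the table fragment from the question, which happens to be well-formed; the names `XmlBidTableReader` and `ReadRows` are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: reads the table with a strict XML parser.
public static class XmlBidTableReader
{
    public static List<string[]> ReadRows(string wellFormedTable)
    {
        // XDocument.Parse throws XmlException if the markup is not well-formed XML,
        // for example on an unclosed <br> or an undeclared entity such as &nbsp;.
        XDocument doc = XDocument.Parse(wellFormedTable);

        // Assumes <table> is the root element and the <tr> elements are its direct children.
        return doc.Root
                  .Elements("tr")
                  .Select(tr => tr.Elements("td")
                                  .Select(td => td.Value.Trim())
                                  .ToArray())
                  .ToList();
    }
}
```

This only holds up while the page keeps emitting well-formed markup; a full HTML page with a DOCTYPE, scripts or unclosed tags would likely make the parse fail.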

4 Comments

Not quite. The XHTML strict standard defines additional requirements on things like what attributes are available for what tags, what tags can be placed where, etc. Unless the HTML document links to a schema and the XML parser actually uses that schema, the document only needs to be syntactically valid XML.
This page's syntax never changes, and I only want to read its content. Maybe the best solution would be a regex?
As I said, the syntax never changes, only the data. So this can be parsed with a regex once the file is read into a string. Nothing is added or restructured when the page updates; only those fields change.
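
A sketch of the regex idea from these comments, using only System.Text.RegularExpressions. It assumes every row has exactly three cells in the order shown in the question; the class name `RegexBidTableReader` and the group names `bidder`, `price` and `date` are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: extracts rows with a fixed-layout regular expression.
public static class RegexBidTableReader
{
    // Matches one <tr> holding exactly three <td> cells.
    // Singleline lets '.' span line breaks; the lazy '.*?' stops at the first </td>.
    private static readonly Regex RowPattern = new Regex(
        @"<tr[^>]*>\s*<td[^>]*>(?<bidder>.*?)</td>\s*<td[^>]*>(?<price>.*?)</td>\s*<td[^>]*>(?<date>.*?)</td>\s*</tr>",
        RegexOptions.Singleline | RegexOptions.Compiled);

    public static List<string[]> ReadRows(string html)
    {
        var result = new List<string[]>();
        foreach (Match m in RowPattern.Matches(html))
        {
            result.Add(new[]
            {
                m.Groups["bidder"].Value.Trim(),
                m.Groups["price"].Value.Trim(),
                m.Groups["date"].Value.Trim()
            });
        }
        return result;
    }
}
```

The pattern is tied to this exact layout: if the site adds a column or nests markup inside a cell, rows will silently stop matching, which is the usual trade-off against a real parser.
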
0

I normally use Fast XPath Reader in combination with LINQ to XML for this job. It is rather old (2007), though.

I wasn't aware of the HTML Agility Pack, so I can't say how it compares (in both performance and ease of use).

0

Why not just do string replacement to convert the HTML table into XML:

   <table class="bidders" cellpadding="0" cellspacing="0">

becomes:

   <?xml version="1.0" encoding="UTF-8"?>

and

  <tr class="bidRow4">

becomes

  <item>

and

 <td class="right">

becomes

 <field1>

etc

EDIT 1:

I think the DataSet class also has a ReadXml method, so you could then data-bind to that DataSet:

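    DataSet ds = new DataSet();
    ds.ReadXml("foo.xml");       // infers tables and columns from the XML structure
    dataGrid.DataSource = ds;    // dataGrid: your grid control instance
    dataGrid.DataBind();         // ASP.NET Web Forms only; WinForms grids have no DataBind()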

or something similar
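
To make the conversion idea concrete, here is a sketch under stated assumptions. It uses regex replacement rather than plain string replacement so that each closing `</td>` is renamed together with its opening tag, and it wraps the rows in a root element because XML needs exactly one. The element names `bids`, `item`, `bidder`, `price` and `date`, and the class `BidTableToDataSet`, are invented for this example.

```csharp
using System;
using System.Data;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper: rewrites the HTML table as flat XML, then loads it into a DataSet.
public static class BidTableToDataSet
{
    public static string ToXml(string html)
    {
        string xml = html;
        // Rename each cell according to its class attribute (or lack of one).
        xml = Regex.Replace(xml, @"<td>(.*?)</td>", "<bidder>$1</bidder>", RegexOptions.Singleline);
        xml = Regex.Replace(xml, @"<td class=""right"">(.*?)</td>", "<price>$1</price>", RegexOptions.Singleline);
        xml = Regex.Replace(xml, @"<td class=""bidders_date"">(.*?)</td>", "<date>$1</date>", RegexOptions.Singleline);
        // Rows become <item> elements; the table becomes the single root element.
        xml = Regex.Replace(xml, @"<tr[^>]*>", "<item>");
        xml = xml.Replace("</tr>", "</item>");
        xml = Regex.Replace(xml, @"<table[^>]*>", "<bids>");
        xml = xml.Replace("</table>", "</bids>");
        return xml;
    }

    public static DataSet Load(string html)
    {
        var ds = new DataSet();
        using (var reader = new StringReader(ToXml(html)))
        {
            // With no schema supplied, ReadXml infers one table from the repeated <item> elements.
            ds.ReadXml(reader);
        }
        return ds;
    }
}
```

After `Load`, the rows should be available as `ds.Tables["item"]`, which can be assigned as a grid's data source. All values arrive as strings, so the price and date would still need parsing if they are to be sorted or compared.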

9 Comments

I don't want to convert, as even reading a simple XML document with XmlDocument takes a very long time.
It sounds like you're trying to scrape data off a website; there is never going to be a fast way of doing that. You need to find another method of getting that data. What other access do you have to it?
Only this HTML page, as it is rendered by an unknown script, from an unknown database, in an unknown way. So no other access, unless I can hack my way around this.
One problem with the DataSet method: this file has child nodes, so it throws an exception and, sadly, can't read through them.
Sorry, I'm not sure what you mean. Have you actually tried this method?