I am trying to extract some data from a website using a LINQ to XML query. The XML is in the following form:

<parent> 
  <p>
    <b>
      Title
    </b>
  </p>
  <p>
    blurb
  </p>
  <p>
    <b>
      As Of Date
    </b>
  </p>
  <center>
    <table>
      <tr>
        <th>
          Header
        </th>
      </tr>
      <tr>
        <td>
          Data
        </td>
      </tr>
    </table>
  </center>
  <p>
    <b>
      As Of Date
    </b>
  </p>
  <center>
    <table>
      <tr>
        <th>
          Header
        </th>
      </tr>
      <tr>
        <td>
          Data
        </td>
      </tr>
    </table>
  </center>
</parent>

I would like to get the As Of Date and the Data for each row (the data row repeats several times within a table). The As Of Date and table pair also appears several times in the document; each table is active from its As Of Date.

I can get the rows using the following LINQ query, but how do I get the As Of Date?

Dim l_PricesTable = From rows In l_Xml.Descendants("tr") _
                    Where ((rows.Descendants("td") IsNot Nothing) AndAlso (rows.Descendants("td").Count >= 1)) _
                    Select Data = rows.Descendants("td")(0).Value,
                           AsOfDate = ???

I cannot change the XML, as it comes from a third-party source. No element wraps just one As Of Date together with its table; they are all siblings under the one parent node.

I am comfortable with both C# and VB.NET, so a solution in either is fine.

Any help would be appreciated.

Thanks

Dave

2 Answers

Do not use an XML library to parse HTML. The syntax is similar, but not the same. XHTML is XML, HTML is not.

That being said, the sample data you have above is HTML that is compatible with XML, so if all of the data looks like that (and doesn't use any non-closing tags like img), then you should be able to skate by.

Assuming that the string "As Of Date" in your above sample is a placeholder for what you actually want to retrieve, then:

Dim asOfDate = l_Xml.Elements("p")(2).Element("b").Value

Just be aware that this suffers from the intrinsically brittle nature of screen scraping; if the design is changed at all, your process will break.

1 Comment

Thanks, I am aware of the brittleness of this, and sadly it is the only way. I chose LINQ because the website is neatly structured (it doesn't change often, though I realise I have just cursed this) and it was easier than string iteration. Sadly this doesn't solve the issue, as the As Of Date and table repeat (I have made this clearer in the question).

I have got round this problem in a really messy way, but as no other answers are forthcoming I will post what I have done.

' rows.Parent is the <table>; rows.Parent.Parent is the enclosing <center>.
' ElementsBeforeSelf returns the preceding siblings in document order, so the
' last <p> before the <center> is the one holding that table's As Of Date.
Dim l_PricesTable = From rows In l_Xml.Descendants("tr") _
                    Where rows.Descendants("td").Any() _
                    Select Data = rows.Descendants("td")(0).Value,
                           AsOfDate = rows.Parent.Parent.ElementsBeforeSelf("p").Last().Descendants("b").Value
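
The same idea can probably be written a little more defensively by walking the <center> elements first and looking up the nearest preceding <p> that contains a <b>. What follows is only a minimal sketch under some assumptions: the helper name ExtractRows is hypothetical, the dates and values in Main are made-up sample data, and it assumes each table's As Of Date is always the closest preceding <p><b>...</b></p> sibling of its <center>.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Xml.Linq

Module AsOfDateSketch

    ' Hypothetical helper: pairs every data cell with the As Of Date taken from
    ' the nearest preceding <p><b>...</b></p> sibling of the table's <center>.
    Function ExtractRows(parent As XElement) As List(Of KeyValuePair(Of String, String))
        Dim results As New List(Of KeyValuePair(Of String, String))

        For Each center In parent.Elements("center")
            ' ElementsBeforeSelf yields siblings in document order, so the last
            ' match is the one closest to this <center>.
            Dim heading = center.ElementsBeforeSelf("p").LastOrDefault(Function(p) p.Element("b") IsNot Nothing)
            If heading Is Nothing Then Continue For

            Dim asOfDate = heading.Element("b").Value.Trim()

            For Each row In center.Descendants("tr")
                ' Header rows only contain <th>, so they have no <td> and are skipped.
                Dim cell = row.Elements("td").FirstOrDefault()
                If cell IsNot Nothing Then
                    results.Add(New KeyValuePair(Of String, String)(asOfDate, cell.Value.Trim()))
                End If
            Next
        Next

        Return results
    End Function

    Sub Main()
        ' Made-up sample data in the same shape as the question's document.
        Dim doc = XElement.Parse(
            "<parent>" &
            "<p><b>Title</b></p><p>blurb</p>" &
            "<p><b>2010-01-01</b></p>" &
            "<center><table><tr><th>Header</th></tr><tr><td>1.23</td></tr></table></center>" &
            "<p><b>2010-02-01</b></p>" &
            "<center><table><tr><th>Header</th></tr><tr><td>4.56</td></tr></table></center>" &
            "</parent>")

        For Each pair In ExtractRows(doc)
            Console.WriteLine(pair.Key & ": " & pair.Value)
        Next
    End Sub

End Module
```

Using LastOrDefault with a null check means a table without a preceding date is skipped rather than raising an exception. If the third party ever inserts another bold paragraph between the date and the table, this lookup would pick up the wrong text, so it is no less brittle than the query above.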
