2

I'm writing some little applications that parse the source of a few web pages, extract some data, and save it into another format. Specifically, some of my banks don't provide downloads of transactions/statements but they do provide access to those statements on their websites.

I've done one fine, but another (HSBC UK) is proving a pain in the arse, since its source is not valid XHTML. For example, there is whitespace before the <?xml?> declaration, and there are places where == is used instead of = between an attribute name and its value (e.g. <li class=="lastItem">).

Of course, when I pass this data into my XmlDocument, it throws a wobbly (more accurately an exception).

My question is: is it possible to relax the requirements for XML parsing in C#? I know it's far better to fix these problems at source - that's absolutely my attitude too - but there's roughly zero chance HSBC would change their website which already works in most browsers just for little old me.

3 Answers

7

Take a look at the HTML Agility Pack. It lets you query the elements of a web page that is not valid XHTML through XPath, as if it were a well-formed XHTML document.
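As a rough illustration, here is a minimal sketch of that approach. It assumes the HtmlAgilityPack NuGet package is referenced; the class name StatementScraper, the method ExtractText, and the XPath you pass in are made up for the example and have nothing to do with HSBC's real markup.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the "HtmlAgilityPack" NuGet package is installed

public static class StatementScraper
{
    // Hypothetical helper: returns the trimmed text of every node matched by the XPath.
    public static List<string> ExtractText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // lenient HTML parser, unlike XmlDocument.LoadXml

        var results = new List<string>();
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return results; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode node in nodes)
        {
            // Decode entities such as &amp; and strip surrounding whitespace.
            results.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return results;
    }
}
```

You would call it with the downloaded page source and an XPath for the cells you care about, e.g. StatementScraper.ExtractText(pageSource, "//td"). How the parser copes with oddities like class=="lastItem" is something to check against the real page rather than take on trust.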

And for the love of Kleene, don't try to regexp an HTML page of any real complexity!


1 Comment

+1. If the fools at HSBC are serving a file that isn't well-formed to browsers as text/html, it's a legacy HTML file you need to parse using an HTML parser, and not XHTML at all, even if it superficially looks like it.
3

I don't believe you can relax the parsing, but you could run it through something like HTML Tidy first to let that deal with the mess.

1 Comment

I gave HTML Tidy a go, but the HTML is so badly formed that it says it can't fix it without me fixing parts manually. Quite how HSBC ever employed a web developer capable of writing such a terrible website is beyond me.
0

If the pages are not XHTML compliant, you cannot shove the HTML into an XmlDocument object, no matter how hard you try.

If this is low volume, you can use the WebBrowser control to create an empty HtmlDocument object, then call the Write() method of HtmlDocument with the string you retrieved and scrape from that document.

Another option is mshtml.HTMLDocument, which is a bit of a pain to work with in .NET, as it is interop.

The most common way to screen scrape is with Regex, however. Once you determine the pattern you are looking for, you can reuse it to scrape over and over again.
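For what it's worth, a minimal sketch of the Regex route might look like the following. The markup it targets (a td with class "amount") and the names RegexScraper and ExtractAmounts are assumptions made up for the example, and a pattern like this only holds while the page keeps emitting exactly that fragment.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexScraper
{
    // Hypothetical pattern: an amount such as "-12.50" or "1,234.56"
    // inside a <td class="amount"> cell. Adjust to the real page.
    private static readonly Regex AmountCell = new Regex(
        @"<td\s+class=""amount""\s*>\s*(?<amount>-?[\d,]+\.\d{2})\s*</td>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns every captured amount, in document order, as raw strings.
    public static List<string> ExtractAmounts(string html)
    {
        var amounts = new List<string>();
        foreach (Match match in AmountCell.Matches(html))
        {
            amounts.Add(match.Groups["amount"].Value);
        }
        return amounts;
    }
}
```

Because the Regex is built once and stored in a static field, the same pattern can be reused across every statement page you download. Any change in attribute order, quoting, or nesting will silently stop it matching, so it is worth checking that the number of matches is what you expect.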
