Parsing an XML/XHTML document but ignoring errors in C#

Question

I'm writing some little applications that parse the source of a few web pages, extract some data, and save it into another format. Specifically, some of my banks don't provide downloads of transactions/statements but they do provide access to those statements on their websites.

I've done one fine, but another (HSBC UK) is proving a pain in the arse, since its source is not valid XHTML. For example there is whitespace before the <?xml?> tag, and there are places where == is used instead of = between an attribute name and its value (e.g. <li class=="lastItem">).

Of course, when I pass this data into my XmlDocument, it throws a wobbly (more accurately an exception).

My question is: is it possible to relax the requirements for XML parsing in C#? I know it's far better to fix these problems at source - that's absolutely my attitude too - but there's roughly zero chance HSBC would change their website which already works in most browsers just for little old me.

Pontus Gagge · Accepted Answer · 2009-03-11 14:29:56Z

7

Take a look at the HTML Agility Pack. It allows you to extract elements of a non-XHTML-compliant web page through XPath, as if it were a well-formed XHTML document.

And for the love of Kleene, don't try to regexp an HTML page with any kind of complexity!
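As a rough illustration of the approach above, here is a minimal sketch that loads malformed markup into the HTML Agility Pack and queries it with XPath. It assumes the third-party HtmlAgilityPack NuGet package is referenced. The class name StatementScraper, the method ExtractListItems and the "//li" query are made up for the example and are not taken from the HSBC page. How the parser surfaces a broken attribute such as class=="lastItem" is something you would need to check against the real page, so the sketch does not filter on it.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of the .NET base class library

public static class StatementScraper
{
    // Hypothetical helper: returns the trimmed text of every <li> element in the page.
    public static string[] ExtractListItems(string html)
    {
        var doc = new HtmlDocument();
        // LoadHtml is tolerant: it builds a tree from malformed markup
        // instead of throwing the way XmlDocument.LoadXml does.
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//li");
        if (nodes == null)
        {
            return new string[0];
        }

        var items = new string[nodes.Count];
        for (int i = 0; i < nodes.Count; i++)
        {
            items[i] = nodes[i].InnerText.Trim();
        }
        return items;
    }
}
```

You would pass in the page source you have already downloaded and then narrow the XPath to whichever elements hold the transaction data.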

1 Comment

bobince Over a year ago

+1. If the fools at HSBC are serving a file that isn't well-formed to browsers as text/html, it's a legacy HTML file you need to parse using an HTML parser, and not XHTML at all, even if it superficially looks like it.

Jon Skeet · Answer · 2009-03-11 14:21:31Z

3

I don't believe you can relax the parsing, but you could run it through something like HTML Tidy first to let that deal with the mess.

1 Comment

Ben Hymers Over a year ago

I gave HTML Tidy a go, but the HTML is so badly formed that it says it can't fix it without me fixing parts manually. Quite how HSBC ever employed a web developer capable of writing such a terrible website is beyond me.

Gregory A Beamer · Answer · 2009-03-11 14:23:17Z

0

If the pages are not XHTML compliant, you cannot shove the HTML into an XmlDocument object, no matter how hard you try.

If this is low volume, you can use the WebBrowser control to create an empty HtmlDocument object and then use the HtmlDocument's Write() method to load the string you retrieved, so you can scrape from that document.

Another option is mshtml.HTMLDocument, which is a bit of a pain to work with in .NET, as it is COM interop.

The most common way to screen-scrape is with Regex, however. Once you determine the pattern you are looking for, you can scrape over and over again.
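To make the Regex option concrete, here is a small sketch using only System.Text.RegularExpressions. The markup it targets (a td cell with class "amount" holding a two-decimal number) is purely hypothetical, as are the names RegexScraper and ExtractAmounts. The pattern accepts one or more = signs so that it survives the == typo described in the question. Bear in mind that this kind of pattern is brittle and breaks as soon as the page layout changes, which is why the other answer warns against it for anything complex.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class RegexScraper
{
    // Hypothetical pattern: <td class="amount">12.34</td>, tolerating "==" and stray whitespace.
    private static readonly Regex AmountCell = new Regex(
        @"<td\s+class\s*=+\s*""amount""\s*>\s*(-?\d+\.\d{2})\s*</td>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns every amount found in the page source, in document order.
    public static List<decimal> ExtractAmounts(string html)
    {
        var amounts = new List<decimal>();
        foreach (Match m in AmountCell.Matches(html))
        {
            // Group 1 is the numeric text; parse it culture-independently.
            amounts.Add(decimal.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture));
        }
        return amounts;
    }
}
```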
