1

I want to make an XML document from an HTML one so I can use the XML parsing tools. My problem is that my HTML is not guaranteed to be XHTML, or even valid. How can I bypass the exceptions? In this string, <p> is not terminated, and neither is <br> or <meta>.

var poorHtml:String = "<html><meta content=\"stuff\" name=\"description\"><p>Hello<br></html>";
var html:XML = new XML(poorHtml);

TypeError: Error #1085: The element type "meta" must be terminated by the matching end-tag "</meta>".

3 Answers

1

I did some searching and couldn't come up with anything; this doesn't really seem to be possible out of the box. The major issue is how the parser should correct the input when the format is not valid.

Every browser handles this according to its own rules about what should happen when a closing tag isn't found. It might insert the closing tag wherever that produces valid XML (and subsequently a DOM tree), self-terminate the tag, or remove the tag. It also has to decide what to do with a closing tag that has no opening tag, with unclosed attributes, and so on.

Unfortunately, I don't know of anything in the specification that explains what should be done in this case. XHTML, just like Flex, treats these as fatal errors that result in no functionality, unlike HTML4, which tolerated them under its quirks and transitional DTD options.

To avoid the uncaught error, or to report it more helpfully, you can wrap the parse in a try/catch:

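var poorHtml:String = "<html><meta content=\"stuff\" name=\"description\"><p>Hello<br></html>";
var html:XML; // stays null if parsing fails

try
{
    html = new XML(poorHtml);
}
catch (e:TypeError)
{
    // Error #1085 and similar well-formedness failures end up here.
    trace("Could not parse HTML as XML: " + e.message);
}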

However, you will likely be best off using a server-side script to validate or correct the XML before passing it over to the client.


2 Comments

Regrouping, I really only cared about the DOM parsing tools of the XML format in ActionScript, e.g. returning a list of all elements with an href attribute. Is there an HTML parser that can search like this?
If you just want to extract all the links in a page, you're better off doing this manually. I'm personally not a big fan of regular expressions, but if you're comfortable with them, that's the way to go. Otherwise you could go the ultra-manual route and search in a loop for href=": var startPoint:int = myString.indexOf('href="', lastEndpoint), then lastEndpoint = myString.indexOf('"', startPoint + 6), then myString.substring(startPoint + 6, lastEndpoint). Or else have a look through the code here, which uses regexp: sourceforge.net/projects/as3htmlparser/develop
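The manual scan described in the comment above can be sketched roughly as follows. This is only an illustration: extractHrefs is a hypothetical name, the function assumes double-quoted, lowercase href=" attributes, and it is written as a standalone function (in an AIR project it would live inside a class or script block). Note that it uses substring, which takes a start and end index, rather than substr, which takes a length.

```actionscript
// Hypothetical helper: collects every double-quoted href value by scanning
// the raw string, so it does not care whether the markup is well formed.
function extractHrefs(source:String):Array
{
    var marker:String = 'href="';
    var hrefs:Array = [];
    var searchFrom:int = 0;

    while (true)
    {
        var start:int = source.indexOf(marker, searchFrom);
        if (start == -1) break; // no more href attributes

        var valueStart:int = start + marker.length;
        var valueEnd:int = source.indexOf('"', valueStart);
        if (valueEnd == -1) break; // unterminated attribute: stop scanning

        // substring(startIndex, endIndex) excludes the closing quote.
        hrefs.push(source.substring(valueStart, valueEnd));
        searchFrom = valueEnd + 1;
    }
    return hrefs;
}

trace(extractHrefs('<a href="a.html">A<p><a href="b.html">B')); // a.html,b.html
```

Because it never builds a tree, this approach cannot tell which element an href belongs to, and it will miss single-quoted or unquoted values.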
1

There is probably an implementation of HTML Tidy in just about any language you might happen to be working with. This looks promising for your situation: http://code.google.com/p/as3htmltidylib/

If you don't want to drag in a whole library (I wouldn't), you could just write your own XML parser that handles errors in whatever way suits you. I'd suggest auto-closing tags until the document makes sense again, ignoring end tags with no start tags, and maybe un-closing certain special tags such as "body" and "html". This has the added advantage that you can optimize it for whatever jobs you need it for, e.g. by storing a list of all elements with the attribute "href" as you come to them.
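As a rough illustration of the auto-closing idea, the sketch below rewrites the markup as a string before handing it to new XML(), rather than being a parser in its own right. Everything in it is an assumption made for the example: the name repairHtml, the partial list of void elements, and the simple tag regular expression. It does not handle unquoted attributes, a ">" inside an attribute value, doctypes, HTML entities such as &nbsp;, or script content.

```actionscript
// Hypothetical helper: rewrites tag soup into something new XML() is more
// likely to accept. A sketch only, not a complete HTML parser.
function repairHtml(source:String):String
{
    // HTML elements that never take an end tag (partial list, an assumption).
    var voidTags:Array = ["br", "hr", "img", "input", "link", "meta"];
    // Group 1: "/" for an end tag, group 2: tag name, group 3: attribute text.
    var tagPattern:RegExp = /<(\/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>/g;
    var openTags:Array = []; // names of elements still waiting for an end tag
    var out:String = "";
    var last:int = 0;
    var m:Object;

    while ((m = tagPattern.exec(source)) != null)
    {
        out += source.substring(last, m.index); // copy the text between tags
        last = tagPattern.lastIndex;

        var name:String = String(m[2]).toLowerCase();
        var attrs:String = String(m[3]);

        if (m[1] == "/")
        {
            // End tag: close everything opened since the matching start tag,
            // or drop the end tag entirely if nothing matches.
            var depth:int = openTags.lastIndexOf(name);
            if (depth != -1)
            {
                while (openTags.length > depth)
                {
                    out += "</" + openTags.pop() + ">";
                }
            }
            continue;
        }

        // Start tag: note whether the source already self-closed it.
        var selfClosed:Boolean = attrs.charAt(attrs.length - 1) == "/";
        if (selfClosed)
        {
            attrs = attrs.substring(0, attrs.length - 1);
        }

        if (selfClosed || voidTags.indexOf(name) != -1)
        {
            out += "<" + name + attrs + "/>";
        }
        else
        {
            out += "<" + name + attrs + ">";
            openTags.push(name);
        }
    }

    out += source.substring(last); // trailing text after the last tag
    while (openTags.length > 0)
    {
        out += "</" + openTags.pop() + ">"; // close whatever is still open
    }
    return out;
}
```

For the string in the question, this should produce <html><meta content="stuff" name="description"/><p>Hello<br/></p></html>, which new XML() can parse. Once parsing succeeds, an E4X query such as html..@href returns every href attribute in the document. Real-world pages will still trip it up in the ways listed above, which is where a Tidy port earns its keep.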

0

You could try passing your HTML through HTML Tidy on the server before loading it. I believe HTML Tidy does a good job of cleaning up broken HTML.

2 Comments

This is an AIR app that is GETing the HTML.
By GETing, do you mean an HTTP GET from a server?
