Create XML object from poorly formatted HTML

Question

I want to build an XML document from an HTML one so that I can use the XML parsing tools. The problem is that my HTML is not guaranteed to be XHTML, or even valid. How can I get around the exceptions? In this string, <p>, <br>, and <meta> are all left unterminated.

var poorHtml:String = "<html><meta content=\"stuff\" name=\"description\"><p>Hello<br></html>";
var html:XML = new XML(poorHtml);

TypeError: Error #1085: The element type "meta" must be terminated by the matching end-tag "</meta>".

shaunhusain · Accepted Answer · 2010-12-29 22:44:21Z

1

I did some searching and found nothing that does this; it doesn't really seem possible. The main issue is deciding how the parser should correct the markup when it is not well-formed.

Every browser handles this according to its own rules for what should happen when a closing tag isn't found: insert it wherever it would make the markup valid XML (and hence a valid DOM tree), self-terminate the tag, or remove the tag. The same question comes up for a closing tag with no opening tag, for unclosed attributes, and so on.

Unfortunately, I don't know of anything in the specification that explains what should be done in this case. In XHTML, just as in Flex, these are fatal errors that leave you with nothing usable, whereas HTML 4 tolerated them through its quirks-mode and transitional DTD options.

To avoid the uncaught error, or to report a better error message, you can wrap the parse in a try/catch:

var poorHtml:String = "<html><meta content=\"stuff\" name=\"description\"><p>Hello<br></html>";
var html:XML;

try
{
    html = new XML(poorHtml);
}
catch (e:TypeError)
{
    // Malformed markup ends up here; html stays null.
    trace("error caught: " + e.message);
}

However, you will probably be best off using some sort of server-side script to validate or correct the markup before passing it over to the client.

2 Comments

ojreadmore Over a year ago

Regrouping: all I really care about is the DOM-style parsing tools that ActionScript provides for XML, e.g. returning a list of all elements with an href attribute. Is there an HTML parser that can search like this?

shaunhusain Over a year ago

If you just want to extract all the links in a page, you're better off doing it manually. I'm personally not a big fan of regular expressions, but if you're comfortable with them, that's the way to go. Otherwise you could go the ultra-manual route and search in a loop: find the next href=" with var startPoint:int = myString.indexOf('href="', lastEndpoint), stop if that returns -1, find the closing quote with lastEndpoint = myString.indexOf('"', startPoint + 6) (6 being the length of href="), then take myString.substring(startPoint + 6, lastEndpoint). Or else have a look through the code here, which uses regexp: sourceforge.net/projects/as3htmlparser/develop
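The regular-expression route mentioned in the comment above could look roughly like the sketch below. It is an illustration rather than code from the thread: the function name extractHrefs and the sample string are made up, and it assumes every href value is wrapped in single or double quotes. It matches text patterns only, so it will also pick up href-like text inside comments or scripts.

```actionscript
// Collect the value of every quoted href attribute found in a string.
// This is plain pattern matching, not HTML parsing.
function extractHrefs(source:String):Array
{
    var hrefs:Array = [];
    // Group 1 captures the opening quote character, group 2 the value;
    // \1 requires the same quote character to close the value.
    var pattern:RegExp = /\bhref\s*=\s*(["'])(.*?)\1/gi;
    var match:Object;

    // With the g flag, exec() resumes from pattern.lastIndex on each call
    // and returns null once there are no more matches.
    while ((match = pattern.exec(source)) != null)
    {
        hrefs.push(match[2]);
    }
    return hrefs;
}

// Hypothetical input: unclosed <p> and <br>, mixed case and quote styles.
var page:String = "<p><a href=\"a.html\">A</a><br><a HREF='b.html'>B</a>";
trace(extractHrefs(page)); // a.html,b.html
```

Because nothing here goes through new XML(), unterminated tags in the input do not matter.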

Brilliand · Accepted Answer · 2011-05-24 18:27:21Z

1

There is probably an implementation of HTML Tidy for just about any language you might happen to be working with. This one looks promising for your situation: http://code.google.com/p/as3htmltidylib/

If you don't want to drag in a whole library (I wouldn't), you could just write your own XML parser that handles errors in whatever way suits you (I'd suggest auto-closing tags until the document makes sense again, ignoring end tags with no start tags, maybe un-closing certain special tags such as "body" and "html"). This has the added advantage that you can optimize it for whatever jobs you need it for, e.g. by storing a list of all elements with the attribute "href" as you come to them.
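As a rough sketch of the recovery rules suggested above (auto-close open tags, ignore end tags that have no start tag), the function below rewrites the string so that new XML() can accept it. It is not a full parser: the name balanceTags and the list of void tags are assumptions made for illustration, and it assumes quoted attribute values with no ">" inside them. Comments, doctypes, scripts, and HTML-only entities such as &nbsp; are not handled.

```actionscript
// Rewrite tag soup so that every element is closed, without touching text.
function balanceTags(source:String):String
{
    // Elements that never take an end tag in HTML; emitted as <tag/>.
    var voidTags:Object = { br:true, hr:true, img:true, input:true, link:true, meta:true };
    var output:String = "";
    var openTags:Array = [];   // stack of currently open element names
    var tagPattern:RegExp = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>/g;
    var lastIndex:int = 0;
    var match:Object;

    while ((match = tagPattern.exec(source)) != null)
    {
        // Copy the text between the previous tag and this one unchanged.
        output += source.substring(lastIndex, match.index);
        lastIndex = tagPattern.lastIndex;

        var tagName:String = String(match[2]).toLowerCase();
        var rest:String = match[3];   // attributes, possibly ending in "/"

        if (match[1] == "/")
        {
            var depth:int = openTags.lastIndexOf(tagName);
            if (depth == -1) continue;   // stray end tag: ignore it
            // Auto-close everything opened since the matching start tag,
            // then the matching tag itself.
            while (openTags.length > depth)
            {
                output += "</" + openTags.pop() + ">";
            }
        }
        else if (voidTags[tagName] === true || rest.charAt(rest.length - 1) == "/")
        {
            output += "<" + tagName + rest.replace(/\/$/, "") + "/>";
        }
        else
        {
            openTags.push(tagName);
            output += "<" + tagName + rest + ">";
        }
    }

    output += source.substring(lastIndex);
    // Close whatever is still open at the end of the input.
    while (openTags.length > 0)
    {
        output += "</" + openTags.pop() + ">";
    }
    return output;
}

var poorHtml:String = "<html><meta content=\"stuff\" name=\"description\"><p>Hello<br></html>";
var fixed:String = balanceTags(poorHtml);
trace(fixed); // <html><meta content="stuff" name="description"/><p>Hello<br/></p></html>

var html:XML = new XML(fixed);
// E4X filter: all descendant elements that carry an href attribute.
var withHref:XMLList = html..*.(hasOwnProperty("@href"));
```

Closing unmatched tags at the next enclosing end tag is only one possible policy; browsers apply more elaborate, element-specific rules, so the resulting tree may differ from what a browser would build.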

Luke · Accepted Answer · 2010-12-30 07:00:10Z

0

You could try passing your HTML through HTML Tidy on the server before loading it. I believe that HTML Tidy does a good job of cleaning up broken HTML.

2 Comments

ojreadmore Over a year ago

This is an AIR app that is GETing the HTML.

Luke Over a year ago

By GETing, do you mean an HTTP GET from a server?
