NSXMLParser with HTML Containing Javascript and "bad" characters

Question

I am using NSXMLParser to parse HTML from web sites. The test site is under my control, but the sites the app will run against in operation will not be.

The problem occurs when the parser encounters JavaScript that contains "bad" characters, for example if(screen.width<=521). The culprit is the < in the code. I can see the problem but am unsure whether there is any good way round it. (NSXMLParser reports NSXMLParserErrorDomain error 68, and I can see why: it treats the <= as the start of a new tag, but = is not a valid tag-name character.) But then what would I do with, e.g., if(var<20)?

I am actually not interested in the specific content, so I could do things like a global replace or removal of e.g. "<=" and ">=" (etc.), but that seems a bit of a mess, as I was using NSXMLParser precisely to avoid having to mess around with the content. If substitution is the best way forward, I can envisage handling "<=" and ">=", but are there any other sequences I should include?

I am new to Cocoa so may easily have missed something obvious - in which case many apologies. I did see that others have found similar problems but could not get a good way forward from the questions.

I am handling the error OK (in a tidy manner) but it is preventing my app from doing what it is meant to do - i.e. I need to avoid the error rather than handle it.

Background: the application does a "before" and "after" comparison of the HTML and looks for changes. I could swap "<=" for something really unusual, then swap it back when necessary. I could even check the data for the replacement content first to eliminate possible ambiguities (e.g. find a UID sequence not in the downloaded page, replace "<=" with the UID sequence, parse the page, and if need be replace the UID with "<=" again; ditto for ">=").
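
A minimal sketch of the placeholder idea described above, written in Python purely for illustration (the app itself is Cocoa). The function names `make_placeholder`, `protect`, and `restore` are hypothetical. Note that it only covers the listed sequences; a bare < as in if(var<20) would still reach the parser untouched.

```python
import uuid


def make_placeholder(text: str) -> str:
    """Return a token that does not occur anywhere in `text`."""
    while True:
        token = "PLACEHOLDER" + uuid.uuid4().hex
        if token not in text:
            return token


def protect(text: str, sequences=("<=", ">=")):
    """Swap each troublesome sequence for a unique token.

    Returns the rewritten text and a token -> original-sequence mapping
    so the substitution can be undone after parsing.
    """
    mapping = {}
    for seq in sequences:
        # The check runs against the current text, so the token is also
        # guaranteed to differ from tokens inserted on earlier passes.
        token = make_placeholder(text)
        mapping[token] = seq
        text = text.replace(seq, token)
    return text, mapping


def restore(text: str, mapping) -> str:
    """Put the original sequences back in place of their tokens."""
    for token, seq in mapping.items():
        text = text.replace(token, seq)
    return text
```

Usage would be along the lines of `safe, mapping = protect(page)`, parsing `safe`, then calling `restore(fragment, mapping)` on any text that needs the original characters back.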

(I have looked at e.g. libtidy or libxml2 but cannot find easy documentation, and am wary of going down such a route if it will not solve the issues.)

omz · Accepted Answer · 2012-07-28 13:42:48Z

2

NSXMLParser, as its name implies, is not meant for parsing HTML. XML is much stricter than HTML, and the errors you've encountered are certainly not the only ones that are possible with real-world HTML. There are HTML documents that are also valid XML, but that is the exception, rather than the norm.

I would suggest using a proper HTML parser instead, such as this one, which is an Objective-C wrapper around libxml's HTML parsing functions.
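
To illustrate the difference in strictness, here is a small sketch in Python (not Objective-C, and not the wrapper linked above): the standard library's strict XML parser rejects the script from the question, while its lenient HTML parser passes the script body through as plain text. The names `PAGE`, `strict_parse_ok`, and `ScriptCollector` are made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Sample page containing the kind of script that trips up an XML parser.
PAGE = (
    "<html><head><script>if(screen.width<=521){go();}</script></head>"
    "<body><p>Hi</p></body></html>"
)


def strict_parse_ok(markup: str) -> bool:
    """Return True if the markup is well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class ScriptCollector(HTMLParser):
    """Collect the raw text found inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script content arrives as ordinary character data, "<=" included.
        if self.in_script:
            self.scripts.append(data)


collector = ScriptCollector()
collector.feed(PAGE)
collector.close()
script_text = "".join(collector.scripts)
```

Here `strict_parse_ok(PAGE)` is false because of the `<=`, whereas the HTML parser finishes without complaint and `script_text` holds the untouched script source.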

edited Jul 28, 2012 at 13:42

answered Jul 28, 2012 at 13:33


3 Comments

Stuart Mycroft Over a year ago

Many thanks for that. I looked at and tried the code you linked to, and it helped a lot, though it did not quite fit with what my app was trying to do. More precisely, parsing to a tree in libxml2 did not fit with what my app was doing. However, the SAX parser in libxml2 (HTML version) did fit well. I had seen a lot of comments about poor libxml2 documentation, and it does look that way. However, with the code you linked to and a bit of work with tests in the debugger, it is not nearly as bad as it appears at first (particularly with the SAX parsing).
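
A rough sketch of why an event-based (SAX-style) HTML parser can suit a "before" and "after" comparison: the parser reports tags and text as a flat stream of events, so no tree has to be built or walked. Python's `html.parser` stands in for libxml2's HTML SAX interface here; this is not the libxml2 API, and `EventRecorder`, `events_for`, and `changed_events` are hypothetical names.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Record start-tag, end-tag, and text events in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, tuple(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only runs between tags
            self.events.append(("text", text))


def events_for(markup: str):
    """Parse the markup and return its list of events."""
    recorder = EventRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.events


def changed_events(before: str, after: str):
    """Return events from `after` that never occur in `before`.

    A deliberately naive comparison: it ignores position and repetition.
    """
    old = events_for(before)
    return [event for event in events_for(after) if event not in old]
```

For two versions of a page that differ in one paragraph, `changed_events(before, after)` returns just the text event for the changed paragraph, and a script containing `<=` does not interrupt the parse.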

Stuart Mycroft Over a year ago

From a beginner's perspective, one aspect of the libxml2 tree output that still confuses me is the nodes below a tag called "text" and "entity refs", particularly when your tag has the content. Of course, that became a non-issue when I switched to the SAX processor, but it would probably help beginners if the libxml2 people gave a bit more explanation of the tree (the example that prints tree tags is a bit too straightforward). That said, they have made excellent work available for free, so this comment is meant more as an "icing on the cake" wish.
