I'm working on a page that needs to fetch information from other pages and then display parts of that data on the current page.

I have the HTML source code that I need to parse in a string, and I'm looking for a library that makes this easy. I just need to extract specific tags and the text they contain. The HTML is well formed (all closing tags are present).

I've looked at some options, but each has turned out to be very difficult to work with for one reason or another.

I've tried the following solutions:

  1. jkl-parsexml library (the library's JS file itself throws HTTPError 101)
  2. jQuery.parseXML utility (I couldn't find enough documentation or examples to figure out how to use it)
  3. XPath (the Execute statement doesn't work, yet the JS error console shows no errors)

So I'm looking for a more user-friendly library, or any tutorials, books, references, or documentation that would help me use the tools above more easily and efficiently.

An ideal solution would be something like Python's BeautifulSoup.

  • You could add it to the DOM, hide it, then access your elements with plain JS or jQuery. That way the browser parses it for you, and you use JS to traverse the DOM. Commented Sep 11, 2012 at 22:53
  • The HTML I have is heavily nested (10-12 levels deep) and lacks class, name and id attributes, so getElementById and similar functions are effectively useless. Recovering the required data that way would be a real bother. Commented Sep 11, 2012 at 22:56
  • And how would a custom parser address that? Commented Sep 11, 2012 at 22:58
  • Hm. Take a look at jQuery selectors. They should be powerful enough. Something like "div p span" will find all spans located inside a p that is itself inside a div. "div>p>span" does the same, except that the p must be a direct child of the div and the span a direct child of that p. And there are a lot of other helpful selectors/functions in jQuery. Commented Sep 11, 2012 at 23:00
  • @bfavaretto I can't say for sure that a custom parser will make the job easier, but this was the first approach I tried and it was extremely time consuming. I was hoping that the parser would give me nested dictionaries which I could loop through more easily. Commented Sep 11, 2012 at 23:03

2 Answers

Using jQuery, it would be as simple as $(HTMLstring); to create a jQuery object with the HTML data from the string inside it (this DOM would be disconnected from your document). From there it's very easy to do whatever you want with it--and traversing the loaded data is, of course, a cinch with jQuery.
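
For comparison, here is a minimal sketch of the same idea without jQuery, assuming a browser that provides the standard DOMParser API. The helper names parseHtml and textOf and the sample markup are invented for illustration. The "text/html" argument selects the HTML parser rather than the stricter XML one, and the resulting document stays detached from the page.

```javascript
// Parse an HTML string into a detached document (browser only).
function parseHtml(htmlString) {
  return new DOMParser().parseFromString(htmlString, "text/html");
}

// Collect the trimmed text of every element matching a CSS selector.
// `root` can be a document or any element.
function textOf(root, selector) {
  return Array.from(root.querySelectorAll(selector), function (el) {
    return el.textContent.trim();
  });
}

// The demo only runs where DOMParser exists (i.e. in a browser).
if (typeof DOMParser !== "undefined") {
  var doc = parseHtml("<div><p><span> hello </span></p></div>");
  console.log(textOf(doc, "div > p > span"));
}
```

Because the selector describes a path of tag names, this kind of lookup does not depend on the markup having class, name or id attributes.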

4 Comments

I'm not sure if this is a problem with my code or the HTML itself but I get "Error: Invalid XML" when I try this. Here is the code I used ` htmlDoc = $.parseXML(pagetext);$html = $( htmldoc );$html.find("body");`
@Ayos: I would guess it's because what you're passing into .parseXML is not valid XML. What are the contents of pagetext?
The page contains HTML with CSS in the head and JavaScript within the <script> tags. It's basically the entire source code of a website, obtained via XHR's responseText.
Try var $html = $(pagetext) directly, then.

You can do something like this:

$("string with html here").find("jquery selector")

$("string with html here") parses the HTML string and builds the resulting elements in a detached document fragment. find then searches for matching elements inside that fragment, and only inside it. Nothing is inserted into the page's DOM.
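
The asker also hoped for nested dictionaries to loop through. As a rough sketch of how a parsed tree could be turned into that shape, the function below (toNested is a made-up name, not part of jQuery or the DOM) walks any element-like object exposing tagName, children and textContent. The demo assumes a browser with DOMParser, and the sample markup is invented.

```javascript
// Convert an element into a plain nested object: { tag, text, children }.
// Simplification: only leaf elements keep their text, so text that sits
// directly inside an element that also has child elements is dropped.
function toNested(el) {
  var kids = Array.from(el.children);
  return {
    tag: el.tagName.toLowerCase(),
    text: kids.length === 0 ? el.textContent.trim() : "",
    children: kids.map(function (kid) {
      return toNested(kid);
    })
  };
}

// The demo only runs where DOMParser exists (i.e. in a browser).
if (typeof DOMParser !== "undefined") {
  var parsed = new DOMParser().parseFromString(
    "<ul><li>one</li><li>two</li></ul>",
    "text/html"
  );
  console.log(JSON.stringify(toNested(parsed.body.firstElementChild), null, 2));
}
```

With jQuery, the same function could presumably be fed a raw element taken from a jQuery object, for example the result of .get(0).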
