I'm attempting to write some JavaScript code (in particular, a Chrome extension) which does the following:
1. Retrieve some web page's contents via AJAX.
2. Get some content from that page by locating certain elements inside of the HTML string and getting their contents.
3. Do a thing with that data.
I have 1) and 3) working, but I'm having some trouble achieving step 2) in a reasonable way.
I currently have 2) implemented via jQuery(htmlString), followed by normal jQuery selectors to extract the data I want. The problem is that jQuery creates the retrieved HTML's elements in the context of the current page, loading and executing external resources and scripts in the process. This is obviously bad.
So I'm looking for a way to get the text and HTML in certain tags inside my HTML string without:
- Loading or executing ANY scripts or resources (images, CSS, etc.) referenced in the HTML string.
- Trying to remove external resources with regular expressions, since we all know what happens when you parse [X]HTML with regex.
I believe that I could achieve what I want using jsdom and jQuery, since jsdom has a FetchExternalResources option which can be set to false. However, jsdom seems to work only in Node.js, not in the browser.
Is there any reasonable way to do this?
jQuery.parseHTML still attempts to load external images and other resources, and its attempts at not executing scripts are trivially thwarted. From the documentation: "However, it is still possible in most environments to execute script indirectly, for example via the <img onerror> attribute."
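For reference, here is a minimal sketch of the kind of thing I'm after, assuming the browser's built-in DOMParser is an acceptable tool. As far as I understand it, parsing with the "text/html" type yields a separate document that is not attached to the current page, so scripts in it should not run and images should not be fetched. The helper name extractFromHtml and its optional parser parameter are hypothetical, added only for illustration.

```javascript
// Sketch only: parse an HTML string into a separate document and read
// the text and inner HTML of every element matching a CSS selector.
// extractFromHtml is a hypothetical helper, not part of any library.
function extractFromHtml(htmlString, selector, parser = new DOMParser()) {
  // With "text/html", parseFromString returns a new Document that is
  // not attached to the current page.
  const doc = parser.parseFromString(htmlString, "text/html");

  // Collect plain data so nothing from the parsed document leaks out
  // as a live element.
  return Array.from(doc.querySelectorAll(selector), (el) => ({
    text: el.textContent,
    html: el.innerHTML,
  }));
}
```

Usage would look something like extractFromHtml(responseText, "h1.title"), where the selector is just a placeholder. Note that inserting the returned html strings back into the live page would bring the original problem back, so they should be treated as data only.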