4

Let's say I have a bit of JavaScript code that is passed a string from PHP containing an entire HTML page. I write the string to the current document and then alter one of the elements it contains. Something like this:

<script type="text/javascript">
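// $html_document must already be a valid JavaScript string literal (for example, the output of json_encode())
var foo = <?php echo $html_document;?>;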
document.open();
document.write(foo);
document.close();
document.getElementById("some_id_within_html_document").innerHTML = "some stuff";
</script>

This gives me my desired output and everything looks great, except when you view the source of the page. If I wanted to scrape this page later and do the same thing, I would get the JavaScript instead of the HTML as interpreted by the browser. Using this method, how could I scrape the desired HTML instead of the JavaScript that generates it? I have already worked around the issue by processing the string in PHP instead, but I am still curious whether it is possible to get the interpreted HTML this way when viewing the source or scraping the page.

Edit: Great responses across the board. I learned a lot about what is actually going on here and which practices I should stay away from. The simplest solution to my original problem, and the one that takes the least effort, was given by Justin Wood.

1
  • You realise that's an oxymoron? If the page is generated by script, it has no source markup. However, the innerHTML property is supposed to be a markup equivalent based on the HTML fragment serialisation algorithm. Note that serialising a document fragment and then turning the result back into a fragment with an HTML parser may not produce exactly the same result as the original. Commented Oct 3, 2012 at 0:55

5 Answers

6

Not exactly sure what you are trying to do, but you can see the HTML equivalent of the generated/modified DOM using something like:

document.documentElement.innerHTML

or:

document.getElementById("some_id").innerHTML

See DEMO.

You can create a bookmarklet that includes this code:

alert(document.documentElement.innerHTML);

to see the HTML of the DOM modified by JavaScript on every page that you view.
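As a rough sketch of the same idea, the snippet below builds a string closer to what "view source" would show for the live DOM, including the doctype and the <html> tag itself (innerHTML omits both). The function names serializeDoctype and serializeDocument are made up for this illustration; the properties used (doctype.name, doctype.publicId, doctype.systemId, documentElement.outerHTML) are standard DOM properties.

```javascript
// Hypothetical helper: rebuild a <!DOCTYPE ...> declaration from a DocumentType node.
function serializeDoctype(doctype) {
  if (!doctype) return "";
  let out = "<!DOCTYPE " + doctype.name;
  if (doctype.publicId) {
    out += ' PUBLIC "' + doctype.publicId + '"';
  }
  if (doctype.systemId) {
    // SYSTEM is only written when there is no PUBLIC identifier before it.
    out += (doctype.publicId ? "" : " SYSTEM") + ' "' + doctype.systemId + '"';
  }
  return out + ">";
}

// Hypothetical helper: serialize the current state of a Document-like object.
// In a browser, pass the global `document`.
function serializeDocument(doc) {
  const doctype = serializeDoctype(doc.doctype);
  // outerHTML includes the <html> element itself, unlike innerHTML.
  return (doctype ? doctype + "\n" : "") + doc.documentElement.outerHTML;
}
```

In a browser console, serializeDocument(document) should return the markup for the DOM as it stands at that moment, after any script has modified it.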

Update:

If you want to do Web scraping on your server, that is, download an external Web page, execute its JavaScript, and then see the HTML that corresponds to the DOM after the JavaScript has run (document.write calls and all), then try using Zombie or Phantom. See also Mink for a PHP tool that supports Zombie.

More generally, search for a headless browser with a JavaScript engine.

Contrary to what people write in other answers here, it is actually possible.
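For illustration only, here is a minimal sketch of the headless-browser approach. It is written against Puppeteer, which is not one of the tools named above; it is swapped in here as an assumption because it is a widely used headless-Chrome library for Node. The function name renderedHtml and the example URL are hypothetical.

```javascript
// Sketch: fetch the HTML of a page *after* its scripts have run.
// `browser` is assumed to be a Puppeteer Browser object (e.g. from puppeteer.launch()).
async function renderedHtml(browser, url) {
  const page = await browser.newPage();
  try {
    // Wait until network activity settles, so document.write() calls
    // and later DOM edits have had a chance to happen.
    await page.goto(url, { waitUntil: "networkidle0" });
    // page.content() returns the serialized DOM, not the original response body.
    return await page.content();
  } finally {
    // Close the tab even if navigation fails.
    await page.close();
  }
}

// Example wiring, inside an async function (requires `npm install puppeteer`):
//   const puppeteer = require("puppeteer");
//   const browser = await puppeteer.launch();
//   console.log(await renderedHtml(browser, "http://example.com/page.php")); // hypothetical URL
//   await browser.close();
```

Taking the browser as a parameter keeps the function easy to try with a stub object, and lets one browser instance be reused across many pages.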

1 Comment

When I try this I get the code that generates the HTML (between <script> tags in the <head> of the document); I don't get the HTML that it would generate.
1

Don't pass your PHP variable into the JavaScript. Just output the variable itself, then use JavaScript to edit whatever you want to edit.

<?php
$html = "<html><head><title></title></head><body><p id='p'>Something</p></body></html>";

echo $html;
?>

<script type="text/javascript">
  document.getElementById("p").innerHTML = "blah";
</script>

Something like that should work for you.

NOTE: I have only tested this in Chrome, Firefox, and Safari.

0

You don't. The HTML is not in the source, period. The original HTML contains JavaScript that needs to be executed. That JavaScript manipulates the DOM of the page to add more things to it. The original HTML doesn't change; it still contains only the JavaScript.

If you want to "scrape" JavaScript-generated content, you always need to parse the whole page, execute its JavaScript against a DOM, and then evaluate the resulting changed DOM.
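A toy sketch of that point, assuming Node.js: the raw source below contains only a script, and the markup appears only once the script has been executed. The stub document object records document.write() output and nothing else, and the regex extraction is deliberately naive, so this is an illustration rather than a usable scraper. The names runInlineScripts and rawSource are made up for the example.

```javascript
const vm = require("vm");

// Toy illustration only: run inline scripts against a stub "document"
// that records document.write() output. A real scraper needs a real DOM.
function runInlineScripts(rawSource) {
  const written = [];
  const sandbox = {
    document: {
      open() { written.length = 0; }, // opening the document discards earlier output
      write(markup) { written.push(String(markup)); },
      close() {},
    },
  };
  // Naive extraction of inline <script> bodies; this is not an HTML parser.
  const scriptPattern = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
  let match;
  while ((match = scriptPattern.exec(rawSource)) !== null) {
    vm.runInNewContext(match[1], sandbox);
  }
  return written.join("");
}

// What a scraper downloads: JavaScript only, no <p> element in the markup.
const rawSource =
  '<script type="text/javascript">' +
  "var foo = \"<p id='p'>Something</p>\";" +
  "document.open(); document.write(foo); document.close();" +
  "</script>";

// What exists only after execution:
console.log(runInlineScripts(rawSource)); // <p id='p'>Something</p>
```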

1 Comment

Curious. I am running PHP with the CodeIgniter framework, and I am sure there is a way to do this. I'll look into it, thanks!
-1

Since JavaScript is a client-side language, it doesn't get executed when you view the source of a page; hence the discrepancy between the visual result and the source. You would have to replace the JS with PHP or another server-side language to achieve the same result.

Moreover, if you still wanted to use JavaScript, you would have to view the DOM, or document object, which holds all the HTML nodes, after the JavaScript has been executed. One way to do this is to use the inspector in Chrome (Ctrl + Shift + I, or Right Click -> Inspect this element).

-2

Stepping aside from the JavaScript reference, are you really trying to "view source", which used to be a simple option in browsers? A vanilla look that helps find typos etc.?

In Chrome that is Ctrl-U. Not a menu option anymore, but working as of 2022-10-29.

2 Comments

He wants to get the code programmatically.
This does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. - From Review
