I am trying to use PhantomJS to get the HTML generated by a dynamic page. I assumed this would be easy, but after a few hours of trying I have had no luck.

The page has the following source code, which is also what eventually gets saved in 1.html:

<!doctype html>
<html lang="cs" ng-app="appId">
<head ng-controller="MainCtrl">
     (omitted some lines)
    <script src="/js/conf/config.js?pars"></script>
    <script src="/js/all.js?pars"></script>
</head>
<body>
<!--<![endif]-->
    <div site-loader></div>
    <div page-layout>
        <div ng-view></div>
    </div>
</body>
</html>

All of the site's content gets loaded inside the site-loader div, but I cannot get at it, even though I use a timeout before PhantomJS scrapes the HTML. Here is the code I am using:

var url = 'http:...';
var page = require('webpage').create();
var fs = require('fs');

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Fail');
        phantom.exit();
    } else {
        window.setTimeout(function () {
            fs.write('1.html', page.content, 'w');
            phantom.exit();
        }, 2000); // Change timeout as required to allow sufficient time
    }
});

What am I doing wrong?

EDIT: I decided to try the PJscrapper framework and configured it to scrape all contents of the div block. All I got was this:

["","\n\t\tif (window.DOT) {\n\t\t\tDOT.cfg({service: 'sreality', impress: false});\n\t\t}\n\t","","Loader.load()","",""]

It seems I am still missing something: I always get the code as it is before Loader.load() runs, and the timeout obviously does not solve that.

  • You didn't show what is written to 1.html. Please register handlers for the onConsoleMessage and onError events; maybe there are errors. If bind is an issue, you need a shim. Commented Sep 9, 2014 at 9:00
  • Hi, the resulting 1.html contains the same HTML I put in my question. It is the same code that is shown when I hit Ctrl+U in the browser. But my understanding is that it gets manipulated by JavaScript somehow, because when I manually inspect the page's elements I can see them in the mentioned div block. I will check the onError event and see what happens. Thanks for your help. Commented Sep 9, 2014 at 9:07
  • When I registered onError as you suggested, I got two errors for missing variables: ERROR: ReferenceError: Can't find variable. One variable is named Loader and the other JAK (both can be found in the JavaScript that generates the page). Am I doing something fundamentally wrong? Commented Sep 9, 2014 at 9:45
  • A couple more troubleshooting ideas: 1. Try your script with SlimerJS, which uses a different rendering engine. If the results differ, maybe something is not supported in PhantomJS 1.9. 2. Also call page.render so you can see a screenshot. Is it dull, or does it look like the page you see in your own browser? Commented Sep 10, 2014 at 7:42
  • Thanks for the ideas, I will try SlimerJS. Right now, though, I am getting 'Fail', which means that page.open(url, function (status) resulted in a status other than success. When I took a screenshot, it was dull... Commented Sep 10, 2014 at 8:57
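
The handlers suggested in the comments can be registered roughly as in the sketch below. The helper name attachDiagnostics and its log parameter are hypothetical and exist only for illustration; onConsoleMessage, onError and onResourceError are PhantomJS WebPage callbacks. The onResourceError hook is not mentioned in the thread and is added here on the assumption that a script failing to load could explain the missing Loader and JAK variables.

```javascript
// Sketch only: attachDiagnostics and log are hypothetical names.
// The three callbacks assigned below are PhantomJS WebPage hooks.
function attachDiagnostics(page, log) {
    // Messages the page itself writes with console.log
    page.onConsoleMessage = function (msg) {
        log('CONSOLE: ' + msg);
    };
    // Uncaught JavaScript errors raised inside the page
    page.onError = function (msg, trace) {
        var lines = ['ERROR: ' + msg];
        (trace || []).forEach(function (t) {
            lines.push('  at ' + t.file + ':' + t.line);
        });
        log(lines.join('\n'));
    };
    // Resources (scripts, styles, XHR) that failed to load
    page.onResourceError = function (err) {
        log('RESOURCE ERROR: ' + err.url + ' (' + err.errorString + ')');
    };
    return page;
}

// Runs only under PhantomJS; elsewhere the helper is merely defined.
if (typeof phantom !== 'undefined') {
    var page = attachDiagnostics(require('webpage').create(), function (line) {
        console.log(line);
    });
    // ...then call page.open(url, ...) exactly as in the question
}
```

Attaching the handlers before page.open matters, since errors raised while the page loads would otherwise go unreported.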

1 Answer

This will do the trick:

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to load the url!');
        phantom.exit();
    } else {
        // Give the page's scripts some time to run before reading the DOM
        window.setTimeout(function () {
            // Serialize the live DOM from inside the page context
            var results = page.evaluate(function () {
                return document.documentElement.innerHTML;
            });
            console.log(results);
            phantom.exit();
        }, 200);
    }
});
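
A fixed delay may be too short or needlessly long, so one alternative is to poll until the content appears. The sketch below is not part of the answer: waitFor, its parameters, and the 10000 ms / 250 ms values are hypothetical, and the '[site-loader]' selector only assumes that the rendered content ends up as child elements of that div, as the question describes. It also assumes the page's scripts run without errors; if they throw, the wait simply times out.

```javascript
// Sketch only: waitFor is a hypothetical helper, not a PhantomJS API.
// check:     returns true once the content is ready
// onReady:   called once when check() passes
// onTimeout: called once if maxWaitMs elapses first
function waitFor(check, onReady, onTimeout, maxWaitMs, intervalMs) {
    var start = Date.now();
    var timer = setInterval(function () {
        if (check()) {
            clearInterval(timer);
            onReady();
        } else if (Date.now() - start >= maxWaitMs) {
            clearInterval(timer);
            onTimeout();
        }
    }, intervalMs);
}

// Runs only under PhantomJS; elsewhere the helper is merely defined.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var url = 'http:...';            // elided in the question, kept as-is
    var selector = '[site-loader]';  // assumption: element that receives the content

    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('Unable to load the url!');
            phantom.exit(1);
            return;
        }
        waitFor(function () {
            // Evaluated inside the page: has the element been populated yet?
            return page.evaluate(function (sel) {
                var el = document.querySelector(sel);
                return !!el && el.children.length > 0;
            }, selector);
        }, function () {
            console.log(page.content);
            phantom.exit();
        }, function () {
            console.log('Timed out waiting for ' + selector);
            phantom.exit(1);
        }, 10000, 250);
    });
}
```

If a different element receives the rendered markup, only the selector value needs to change.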