
I need to scrape a website whose content is 'inserted' by Angular, and it needs to be done in Java.

I have tried Selenium WebDriver (as I have used Selenium before to scrape less dynamic web pages), but I have no idea how to deal with the Angular part. Apart from the script tags in the head section of the page, there is only one place in the site where there are Angular attributes:

<div data-ng-module="vindeenjob"><div data-ng-view=""></div></div>

I found this article here, but honestly... I can't figure it out. It seems like the author is selecting (let's call them) 'ng-attributes' like this

WebElement theForm = wd.findElement(By.cssSelector("div[ng-controller='UserForm']"));

but he fails to explain why he does what he does. In the source code of his demo page, I can't find anything called 'UserForm' (if it is the name of an Angular controller, it would live in the page's JavaScript rather than in the HTML), so the why remains a mystery.
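If I understand the trick, applied to the markup quoted above it would look something like this (a sketch; wd is a WebDriver instance, and I am selecting by the data-ng-view attribute instead of a controller name):

import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;

// analogous to the article's div[ng-controller='UserForm'] selector:
// select the div that carries the data-ng-view attribute
WebElement view = wd.findElement(By.cssSelector("div[data-ng-view]"));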

Then I tried setting a time interval for Selenium, in the hope that the page would be rendered and that I could grab the results after the wait period, like this:

    WebDriver webdriver = new HtmlUnitDriver();
    webdriver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
    webdriver.get("https://www.myurltoscrape.com");

But to no avail; an implicit wait only changes how long findElement polls for an element, and HtmlUnit never rendered the Angular view. Then there is also this article, which produces some interesting exceptions, such as Cannot set property [HTMLStyleElement].media that has only a getter, which basically means that something in the JavaScript goes wrong. However, HtmlUnit does seem to realize that there is JavaScript on the page, which is more than I got before. I do know (as I searched on the exceptions) that HtmlUnit has a setting which should suppress the JavaScript exceptions; I turned the throwing off, but I get exceptions anyway. Here is the code:

webClient.getOptions().setThrowExceptionOnScriptError(false); 
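For reference, a fuller WebClient setup would look something like this sketch (the URL is the one from above; the AJAX controller and the 10-second background-JavaScript wait are just the options usually suggested for JavaScript-heavy pages):

import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

WebClient webClient = new WebClient();
// log script errors instead of throwing them
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setJavaScriptEnabled(true);
// resynchronize AJAX calls so dynamically loaded content can arrive
webClient.setAjaxController(new NicelyResynchronizingAjaxController());

HtmlPage page = webClient.getPage("https://www.myurltoscrape.com");
// give background JavaScript up to 10 seconds to finish
webClient.waitForBackgroundJavaScript(10_000);
System.out.println(page.asXml());
webClient.close();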

I would post more code, but basically nothing scrapes the dynamic content, and I am pretty sure that it is not the code that is wrong; it is merely not the correct solution yet.

Can I get some help please?

  • Selenium is not designed to do this, although it is capable of doing it. If your requirement is pure scraping, I would suggest Jsoup or something more advanced like Apache Nutch. Commented Mar 30, 2015 at 9:07
  • @Madusudanan Thanks for the comment. I have tried Jsoup already, and then I found this on Stack Overflow: link . It is from quite some time ago, so Jsoup might have been updated, but I can't find anything concerning Jsoup and Angular actually... But I'll look into Nutch. Thanks! Commented Mar 30, 2015 at 9:15
  • Be advised that Nutch is actually pretty advanced, much more than just a web crawler. Also see PhantomJS, a headless browser that can be driven by Selenium. Commented Mar 30, 2015 at 9:32

2 Answers


In the end, I followed Madusudanan's excellent advice and looked into the PhantomJS/Selenium combination. And there actually is a solution! It's called PhantomJSDriver.

You can find the Maven dependency here. Here is more info on GhostDriver.

The setup in Maven: I have added the following dependencies:

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.41.0</version>
</dependency>
<dependency>
    <groupId>com.github.detro</groupId>
    <artifactId>phantomjsdriver</artifactId>
    <version>1.2.0</version>
</dependency>

It also runs with Selenium version 2.45, which is the latest version at the time of writing. I am mentioning this because of some articles I read in which people say that the PhantomJS driver isn't compatible with every version of Selenium, but it seems that problem has been addressed in the meantime.

If you are already using a Selenium/PhantomJSDriver combination and you are getting 'strict javascript errors' on a certain site, update your version of Selenium; that will fix it.

And here is some sample code:

import java.util.List;

import org.openqa.selenium.WebElement;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriverService;
import org.openqa.selenium.remote.DesiredCapabilities;

public void testPhantomDriver() throws Exception {
    DesiredCapabilities options = new DesiredCapabilities();
    // the website I am scraping uses SSL, but I don't know which version
    options.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS, new String[] {
        "--ssl-protocol=any"
    });

    PhantomJSDriver driver = new PhantomJSDriver(options);

    driver.get("https://www.mywebsite");

    // PhantomJS has executed the page's JavaScript by now, so the
    // Angular-rendered elements are present in the DOM
    List<WebElement> elements = driver.findElementsByClassName("media-title");

    for (WebElement element : elements) {
        System.out.println(element.getText());
    }

    driver.quit();
}
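One caveat: even with PhantomJS, Angular needs a moment to render the view, so grabbing elements right after get() can race the page. If findElementsByClassName comes back empty, an explicit wait usually helps; a minimal sketch, reusing the media-title class from above (the 10-second timeout is an arbitrary choice):

import org.openqa.selenium.By;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

// wait up to 10 seconds for the Angular view to render at least one result
new WebDriverWait(driver, 10)
        .until(ExpectedConditions.presenceOfElementLocated(By.className("media-title")));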

2 Comments

And where do you set the path to phantomjs.exe?
From what I can remember, you don't have to set the path to phantomjs.exe. The only thing that needs to be done is installing PhantomJS on the system where you want to run this code (a Windows box, in this case).
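If the executable is not picked up automatically, the driver also honors the phantomjs.binary.path system property; a one-line sketch, with a hypothetical install location:

// hypothetical path; point it at wherever phantomjs(.exe) actually lives
System.setProperty("phantomjs.binary.path", "C:/tools/phantomjs/phantomjs.exe");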

Here is a solution to scrape any web page with Jsoup and WebDriver in Java:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.addArguments("--headless");
WebDriver driver = new ChromeDriver(chromeOptions);
driver.get(bean.getDomainQuery().trim());
// parse the JavaScript-rendered DOM with Jsoup
Document doc = Jsoup.parse(driver.getPageSource());

And then use Jsoup selectors to read any tag info, for example:
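A minimal sketch (the media-title class is borrowed from the accepted answer and is an assumption about your page):

import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// select every element with class "media-title" and print its text
Elements titles = doc.select(".media-title");
for (Element title : titles) {
    System.out.println(title.text());
}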

1 Comment

Cool, thanks! I'll look into that!
