
I need to scrape a website whose content is 'inserted' by Angular, and it needs to be done in Java.

I have tried Selenium WebDriver (as I have used Selenium before to scrape less dynamic web pages), but I have no idea how to deal with the Angular part. Apart from the script tags in the head section of the page, there is only one place in the site where there are Angular attributes:

<div data-ng-module="vindeenjob"><div data-ng-view=""></div></div>

I found this article here, but honestly... I can't figure it out. It seems like the author is selecting (let's call them) 'ng-attributes' like this

WebElement theForm = wd.findElement(By.cssSelector("div[ng-controller='UserForm']"));

but he fails to explain why he does what he does. In the source code of his demo page, I can't find anything called 'UserForm' (if it is the name of an Angular controller, it would live in the page's JavaScript rather than in the HTML), so the why remains a mystery.
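If I understand the trick, applied to the markup quoted above it would look something like this (a sketch; wd is a WebDriver instance, and I am selecting by the data-ng-view attribute instead of a controller name):

import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;

// analogous to the article's div[ng-controller='UserForm'] selector:
// select the div that carries the data-ng-view attribute
WebElement view = wd.findElement(By.cssSelector("div[data-ng-view]"));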

Then I tried setting a time interval for Selenium, in the hope that the page would be rendered and that I could grab the results after the wait period, like this:

    WebDriver webdriver = new HtmlUnitDriver();
    webdriver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
    webdriver.get("https://www.myurltoscrape.com");

But to no avail; an implicit wait only changes how long findElement polls for an element, and HtmlUnit never rendered the Angular view. Then there is also this article, which produces some interesting exceptions, such as Cannot set property [HTMLStyleElement].media that has only a getter, which basically means that something in the JavaScript goes wrong. However, HtmlUnit does seem to realize that there is JavaScript on the page, which is more than I got before. I do know (as I searched on the exceptions) that HtmlUnit has a setting which should suppress the JavaScript exceptions; I turned the throwing off, but I get exceptions anyway. Here is the code:

webClient.getOptions().setThrowExceptionOnScriptError(false); 
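For reference, a fuller WebClient setup would look something like this sketch (the URL is the one from above; the AJAX controller and the 10-second background-JavaScript wait are just the options usually suggested for JavaScript-heavy pages):

import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

WebClient webClient = new WebClient();
// log script errors instead of throwing them
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setJavaScriptEnabled(true);
// resynchronize AJAX calls so dynamically loaded content can arrive
webClient.setAjaxController(new NicelyResynchronizingAjaxController());

HtmlPage page = webClient.getPage("https://www.myurltoscrape.com");
// give background JavaScript up to 10 seconds to finish
webClient.waitForBackgroundJavaScript(10_000);
System.out.println(page.asXml());
webClient.close();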

I would post more code, but basically nothing scrapes the dynamic content, and I am pretty sure that it is not the code that is wrong; it is merely not the correct solution yet.

Can I get some help please?

  • Selenium is not designed to do this, although it is capable of doing it. If your requirement is pure scraping, I would suggest Jsoup or something more advanced like Apache Nutch. Commented Mar 30, 2015 at 9:07
  • @Madusudanan Thanks for the comment. I have tried Jsoup already, and then I found this on Stack Overflow: link . It is from quite some time ago, so Jsoup might have been updated, but I can't find anything concerning Jsoup and Angular actually... But I'll look into Nutch. Thanks! Commented Mar 30, 2015 at 9:15
  • Be advised that Nutch is actually pretty advanced, much more than just a web crawler. Also see PhantomJS, a headless browser that can be driven by Selenium. Commented Mar 30, 2015 at 9:32

2 Answers


In the end, I followed Madusudanan's excellent advice and looked into the PhantomJS/Selenium combination. And there actually is a solution! It's called PhantomJSDriver.

You can find the Maven dependency here. Here is more info on GhostDriver.

The setup in Maven: I have added the following dependencies:

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.41.0</version>
</dependency>
<dependency>
    <groupId>com.github.detro</groupId>
    <artifactId>phantomjsdriver</artifactId>
    <version>1.2.0</version>
</dependency>

It also runs with Selenium version 2.45, which is the latest version at the time of writing. I am mentioning this because of some articles I read in which people say that the PhantomJS driver isn't compatible with every version of Selenium, but it seems that problem has been addressed in the meantime.

If you are already using a Selenium/PhantomJSDriver combination and you are getting 'strict javascript errors' on a certain site, update your version of Selenium; that will fix it.

And here is some sample code:

import java.util.List;

import org.openqa.selenium.WebElement;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriverService;
import org.openqa.selenium.remote.DesiredCapabilities;

public void testPhantomDriver() throws Exception {
    DesiredCapabilities options = new DesiredCapabilities();
    // the website I am scraping uses SSL, but I don't know which version
    options.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS, new String[] {
        "--ssl-protocol=any"
    });

    PhantomJSDriver driver = new PhantomJSDriver(options);

    driver.get("https://www.mywebsite");

    // PhantomJS has executed the page's JavaScript by now, so the
    // Angular-rendered elements are present in the DOM
    List<WebElement> elements = driver.findElementsByClassName("media-title");

    for (WebElement element : elements) {
        System.out.println(element.getText());
    }

    driver.quit();
}
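One caveat: even with PhantomJS, Angular needs a moment to render the view, so grabbing elements right after get() can race the page. If findElementsByClassName comes back empty, an explicit wait usually helps; a minimal sketch, reusing the media-title class from above (the 10-second timeout is an arbitrary choice):

import org.openqa.selenium.By;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

// wait up to 10 seconds for the Angular view to render at least one result
new WebDriverWait(driver, 10)
        .until(ExpectedConditions.presenceOfElementLocated(By.className("media-title")));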

2 Comments

And where do you set the path to phantomjs.exe?
From what I can remember, you don't have to set the path to phantomjs.exe. The only thing that needs to be done is installing PhantomJS on the system where you want to run this code (a Windows box, in this case).
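If the executable is not picked up automatically, the driver also honors the phantomjs.binary.path system property; a one-line sketch, with a hypothetical install location:

// hypothetical path; point it at wherever phantomjs(.exe) actually lives
System.setProperty("phantomjs.binary.path", "C:/tools/phantomjs/phantomjs.exe");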

Here is a solution to scrape any web page with Jsoup and WebDriver in Java:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.addArguments("--headless");
WebDriver driver = new ChromeDriver(chromeOptions);
driver.get(bean.getDomainQuery().trim());
// parse the JavaScript-rendered DOM with Jsoup
Document doc = Jsoup.parse(driver.getPageSource());

And then use Jsoup selectors to read any tag info, for example:
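A minimal sketch (the media-title class is borrowed from the accepted answer and is an assumption about your page):

import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// select every element with class "media-title" and print its text
Elements titles = doc.select(".media-title");
for (Element title : titles) {
    System.out.println(title.text());
}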

1 Comment

Cool, thanks! I'll look into that!
