I need to scrape a website whose content is 'inserted' by Angular, and it needs to be done in Java.
I have tried Selenium WebDriver (as I have used Selenium before for scraping less dynamic webpages), but I have no idea how to deal with the Angular part. Apart from the script tags in the head section of the page, there is only one place on the page where there are Angular attributes:
<div data-ng-module="vindeenjob"><div data-ng-view=""></div>
I found this article here, but honestly, I can't figure it out. It seems like the author is selecting (let's call them) 'ng-attributes' like this:
WebElement theForm = wd.findElement(By.cssSelector("div[ng-controller='UserForm']"));
but fails to explain why he does what he does. In the source code of his demo page, I can't find anything called 'UserForm', so the why remains a mystery.
Then I tried setting an implicit wait timeout in Selenium, hoping that the page would be rendered and that I could grab the results after the wait period, like this:
WebDriver webdriver = new HtmlUnitDriver();
webdriver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
webdriver.get("https://www.myurltoscrape.com");
But to no avail. Then there is also this article, whose approach gives some interesting exceptions, such as "Cannot set property [HTMLStyleElement].media that has only a getter to all.", which presumably means that something might be wrong with the JavaScript. However, HtmlUnit does seem to realize that there is JavaScript on the page, which is more than I got before. I do realize (as I searched for the exceptions) that HtmlUnit has an option that should suppress the JavaScript exceptions. I turned the exceptions off, but I get them anyway. Here is the code:
webClient.getOptions().setThrowExceptionOnScriptError(false);
I would post more code, but basically nothing scrapes the dynamic content, and I am pretty sure the code itself is not wrong; it is merely not the right solution yet.
Can I get some help please?