Get html changes from javascript link using Java

Question

I have been using JSOUP for all my HTML scraping so far. I have, however, run into a roadblock. Kickass loads the full list of files for each torrent when you click a JavaScript link: <a href="javascript:getFiles('52261EB9480EDFD83B5B85C8C4817D28F3AE0C95', 1);" class="showmore folded">. I have traced the JavaScript function back to a *.js file the page uses, but I am not sure how to mimic this behaviour. Ideally I would just grab the JavaScript link from the main page and fetch the list as I would from any other website, but everything I can find for JSOUP follows HTML links rather than JavaScript ones.

So I tried with HtmlUnit. I inspected the site with chrome: https://kickass.to/australian-aria-top-50-singles-13-10-2014-t9702189.html

and copied the XPath expression. The code below does not work. I would prefer not to pull in this library for a single function, but at the moment I can't get it to work at all.

My Test Code:

    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    HtmlPage page = webClient.getPage("https://kickass.to/australian-aria-top-50-singles-13-10-2014-t9702189.html");

    HtmlElement htmlElement = page.getFirstByXPath("//*[@id=\"ul_top\"]/tbody/tr[31]/td[2]/a");
    System.out.println(htmlElement.toString());
    htmlElement.click(); 
    webClient.waitForBackgroundJavaScript(1000);

    //get changes here
    webClient.closeAllWindows();

Using in-built libraries actually. Purely checking torrent information with JSOUP and HtmlUnit.
– Larry, Commented Apr 4, 2015 at 4:51
Is JavaScript enabled for HtmlUnit? I have posted an alternative solution, but this question might also help: stackoverflow.com/questions/10136873/…
– LittlePanda, Commented Apr 4, 2015 at 5:15

LittlePanda · Accepted Answer · 2015-04-04 05:12:45Z

2

Jsoup is an HTML parser and does not execute JavaScript, so it never sees content that a script adds to the page. Consider using Selenium with HtmlUnitDriver, which runs headless. I tried the sample code below, and the page source contains the content that appears after the JavaScript runs.

Sample code:

    import java.util.logging.Level;
    import org.apache.commons.logging.LogFactory;
    import org.openqa.selenium.htmlunit.HtmlUnitDriver;

    // Pass true to enable JavaScript
    HtmlUnitDriver driver = new HtmlUnitDriver(true);

    // Turn off HtmlUnit's verbose logging
    LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");
    java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.OFF);
    java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);

    // Navigate to the page, then call the function the link would have invoked
    driver.get("https://kickass.to/australian-aria-top-50-singles-13-10-2014-t9702189.html");
    driver.executeScript("getFiles('52261EB9480EDFD83B5B85C8C4817D28F3AE0C95', 1);");

    // These file names appear only after the JavaScript has run
    String source = driver.getPageSource();
    System.out.println(source.contains("Australian ARIA Top 50 Singles 13.10.2014.pdf"));
    System.out.println(source.contains("47. Sheppard - Geronimo.mp3"));
    //System.out.println(source);
    driver.quit();


4 Comments

Larry Over a year ago

Yeah, that worked a lot better! The HtmlUnit initialisation takes a while, but if I pass the driver through my functions and loops I can get away with paying the startup cost once. Would it be worth dropping jsoup if I am only using it to go through the page source, and using HtmlUnit/Selenium instead so I don't have to download the page twice?

Larry Over a year ago

I was looking at PhantomJS as well, but I thought it was platform dependent?

LittlePanda Over a year ago

The PhantomJS developers run the GhostDriver project, which can be used from Java. GhostDriver is an implementation of WebDriver, and there are a few Stack Overflow questions about it. Since you are just scraping websites, GhostDriver or HtmlUnitDriver are good options.

Larry Over a year ago

As it turns out, jsoup could do a POST to the server and retrieve a JSON result.
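For anyone taking the POST route, the first step is getting the arguments out of the `javascript:getFiles(...)` href. Below is a minimal sketch of that step using only the standard library. The thread does not give the endpoint or the parameter names the site's script sends, so `GetFilesLink`, `toFormBody`, and the `hash` and `page` field names are all assumptions for illustration; check the traced *.js file for the real ones.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GetFilesLink {

    /** Arguments of one getFiles('<hash>', <page>) call. */
    static final class Call {
        final String hash;
        final int page;

        Call(String hash, int page) {
            this.hash = hash;
            this.page = page;
        }
    }

    // Matches getFiles('<40 hex chars>', <number>) as it appears in the question's link.
    private static final Pattern GET_FILES = Pattern.compile(
            "getFiles\\(\\s*'([0-9A-Fa-f]{40})'\\s*,\\s*(\\d{1,9})\\s*\\)");

    /** Returns the parsed call, or null if the href is not a getFiles link. */
    static Call parse(String href) {
        Matcher m = GET_FILES.matcher(href);
        if (!m.find()) {
            return null;
        }
        return new Call(m.group(1), Integer.parseInt(m.group(2)));
    }

    /**
     * Builds a form-encoded request body. The field names "hash" and "page"
     * are hypothetical; the values are hex and digits, so no escaping is needed.
     */
    static String toFormBody(Call call) {
        return "hash=" + call.hash + "&page=" + call.page;
    }

    public static void main(String[] args) {
        Call call = parse("javascript:getFiles('52261EB9480EDFD83B5B85C8C4817D28F3AE0C95', 1);");
        // prints hash=52261EB9480EDFD83B5B85C8C4817D28F3AE0C95&page=1
        System.out.println(toFormBody(call));
    }
}
```

With jsoup, the request itself could then probably be sent with something along the lines of `Jsoup.connect(endpoint).data("hash", call.hash).ignoreContentType(true).method(Connection.Method.POST).execute().body()`, where `endpoint` is whatever URL the site's script posts to. `ignoreContentType(true)` matters here because jsoup otherwise rejects non-HTML responses such as JSON.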
