htmlunit java - How to parse a content results from javascript? and a htmlunit error

Question

This is one of the pages that I am going to scrape: https://www.tokopedia.com/berkahcell2/promo-termurah-vr-virtual-reality-box-v-2-0-remote-bluetooth-gamepad/review?src=topads

I want to scrape the comment text under "ulasan terbaru" ("latest reviews"), which I suspect is generated by JavaScript (I might be wrong, though; I am not sure how to check this through Inspect Element). Apart from that, I am also unsure about several things in HtmlUnit.

I have read that to scrape JavaScript-generated content I need to use HtmlUnit rather than Jsoup. I followed http://htmlunit.10904.n7.nabble.com/Selecting-a-div-by-class-name-td25787.html to try to select the comment div by its class, but I got zero output.

    public static void comment(String url) throws IOException{

        WebClient client = new WebClient();
        client.setCssEnabled(true);
        client.setJavaScriptEnabled(true);
        
        try {
            HtmlPage page = client.getPage(url);
            List<?> date = page.getByXPath("//div/@class='list-box-comment'");
            System.out.println(date.size());
            for(int i =0 ; i<date.size();i++){
                System.out.println(date.get(i).asText());
            }
        }
        catch(Exception e){
                e.printStackTrace();
            }

    }

This is the part of my code that handles the comment scraping. Am I doing it right? I have two problems:

1. At "asText()", the IDE reports "can't resolve method asText()".
2. Even if I run it without "asText()", I get this error:

com.gargoylesoftware.htmlunit.ObjectInstantiationException: unable to create HTML parser
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.<init>(HTMLParser.java:418)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.<init>(HTMLParser.java:342)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:203)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:179)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:221)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:106)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:433)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:373)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:358)
    at ReviewScraping.comment(ReviewScraping.java:86)
    at ReviewScraping.main(ReviewScraping.java:108)
Caused by: org.xml.sax.SAXNotRecognizedException: Feature 'http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe' is not recognized.
    at org.apache.xerces.parsers.AbstractSAXParser.setFeature(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.<init>(HTMLParser.java:411)
    ... 11 more

I hope to be able to print all of the comments.

Edit: I use IntelliJ as my IDE, and the HtmlUnit dependencies are added to my IntelliJ project through Maven.

RBRi · Accepted Answer · 2019-05-19 11:48:07Z

Regarding your code:

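    import java.io.IOException;
    import java.util.List;

    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.DomNode;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class ReviewScraping {

        public static void main(String[] args) throws IOException {
            final String url = "https://www.tokopedia.com/berkahcell2/promo-termurah-vr-virtual-reality-box-v-2-0-remote-bluetooth-gamepad/review?src=topads";

            // try-with-resources closes the WebClient when done
            try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
                // Keep going when a script on the page fails instead of throwing
                webClient.getOptions().setThrowExceptionOnScriptError(false);

                HtmlPage page = webClient.getPage(url);
                // Give background JavaScript up to 40 seconds to finish
                webClient.waitForBackgroundJavaScript(40_000);

                // Dump the resulting DOM to see what HtmlUnit actually built
                System.out.println(page.asXml());

                // Predicate form [@class='...'] selects the div elements themselves;
                // typing the list as DomNode makes asText() resolvable
                List<DomNode> date = page.getByXPath("//div[@class='list-box-comment']");
                System.out.println(date.size());

                for (int i = 0; i < date.size(); i++) {
                    System.out.println(date.get(i).asText());
                }
            }
        }
    }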
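
One change in the code above is the XPath expression. The question used `//div/@class='list-box-comment'`, which is a comparison and evaluates to a boolean, while `//div[@class='list-box-comment']` is a predicate that selects the matching div elements. The sketch below illustrates the difference using only the JDK's `javax.xml.xpath` API instead of HtmlUnit; the class name `XPathPredicateDemo` and the tiny sample markup are made up for illustration and are not taken from the real page.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathPredicateDemo {
    // Hypothetical sample markup, only loosely shaped like a review list
    static final String MARKUP =
        "<html><body>"
        + "<div class='list-box-comment'>first comment</div>"
        + "<div class='other'>unrelated</div>"
        + "<div class='list-box-comment'>second comment</div>"
        + "</body></html>";

    /** Evaluates the expression as a node set and returns how many nodes it selects. */
    static int countNodes(String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            expression, new InputSource(new StringReader(MARKUP)), XPathConstants.NODESET);
        return nodes.getLength();
    }

    /** Evaluates the expression as a boolean (true/false, no elements). */
    static boolean evaluateBoolean(String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (Boolean) xpath.evaluate(
            expression, new InputSource(new StringReader(MARKUP)), XPathConstants.BOOLEAN);
    }

    public static void main(String[] args) throws XPathExpressionException {
        // Predicate: selects the two matching div elements
        System.out.println(countNodes("//div[@class='list-box-comment']"));   // prints 2
        // Comparison: only answers whether some div has that class value
        System.out.println(evaluateBoolean("//div/@class='list-box-comment'")); // prints true
    }
}
```

With the predicate form you get elements back whose text you can read; with the comparison form there is nothing to call `asText()` on.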

Now the problems with the page itself:

I have done some tests, and it looks like the page produces errors in real browsers as well (check the browser console). With HtmlUnit you get more problems (maybe because some JavaScript features are not supported). Pages like this usually use many, many lines of JS code, so it would be really time-consuming for me to figure out what is going wrong. If you would like to get this fixed, try to find the real cause of the problem (see http://htmlunit.sourceforge.net/submittingJSBugs.html for some hints) and file a bug report.
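
As for the "unable to create HTML parser" stack trace in the question: a SAXNotRecognizedException on a parser feature is often a sign that the HtmlUnit jar and the parser jars it depends on come from mismatched versions. That is only an assumption about the asker's setup; the trace alone does not prove it. One way to check is to print where the classes named in the trace are loaded from. The sketch below does that with plain JDK calls; `ClasspathCheck` and `locationOf` are hypothetical names, and the class names passed in are copied from the stack trace.

```java
import java.security.CodeSource;

class ClasspathCheck {
    /** Returns the jar or directory a class was loaded from, or a short note if that is unavailable. */
    static String locationOf(String className) {
        try {
            // Look the class up by name without initializing it
            Class<?> type = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return "(no code source, e.g. a JDK class)";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Class names taken from the stack trace in the question
        String[] names = {
            "com.gargoylesoftware.htmlunit.WebClient",
            "com.gargoylesoftware.htmlunit.html.HTMLParser",
            "org.apache.xerces.parsers.AbstractSAXParser"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locationOf(name));
        }
    }
}
```

If the printed jars carry unexpected or inconsistent versions, running `mvn dependency:tree` on the project should show which dependency pulls them in.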
