I am trying to use JSoup to scrape some content from a website. Here is some sample HTML from the page I am interested in:

<div class="sep_top shd_hdr pb2 luna">
    <div class="KonaBody" style="padding-left:0px;">
        <div class="lunatext results_content frstluna">
            <div class="luna-Ent">
                <div class="header">
                <div class="body">
                    <div class="pbk">
                        <div id="rltqns">
                    <div class="pbk">
                        <span class="pg">
                            <span id="hotword">
                                <span id="hotword">Fizz</span>
                            </span>
                        </span>
                        <div class="luna-Ent">
                        <div class="luna-Ent">
                        <div class="luna-Ent">
                        <div class="luna-Ent">
                    </div>
                    <div class="pbk">
                        <span class="sectionLabel">
                        <span class="pg">
                            <span id="hotword">
                                <span id="hotword">Buzz</span>
                            </span>
                        </span>
                        <span class="pg">
                            <span id="hotword">
                                <span id="hotword">Foo</span>
                            </span>
                        </span>
                        <span class="pg">
                            <span id="hotword">
                                <span id="hotword">Bar</span>
                            </span>
                        </span>
                    </div>
                <div class="tail">
            </div>
            <div class="rcr">
        <!-- ... rest of content omitted for brevity -->

I am interested in obtaining a list of all the hotwords on the page (so "Fizz", "Buzz", "Foo" and "Bar"). But I can't just query for hotword, because the page uses hotword all over the place to decorate lots of different elements. Specifically, I need all the hotwords that sit inside a pbk, then a pg, then a hotword element. Note that a pbk can contain zero or more pgs, a pg can contain zero or more hotwords, and a hotword can contain one or more other hotwords. I have the following code:

// Update, per PShemo:
Document doc = Jsoup.connect("http://somesite.example.com").get();

System.out.println("Starting to crawl...");

// Get the document's .pbk elements.
Elements pbks = doc.select(".pbk");

List<String> hotwords = new ArrayList<String>();

System.out.println(String.format("Found %s pbks.", pbks.size()));
int pbkCount = 0;
for(Element pbk : pbks) {
    pbkCount++;

    // Get the .pbk element's .pg elements.
    for(Element pg : pbk.getElementsByClass("pg")) {
        System.out.println(String.format("PBK #%s has %s pgs.", pbkCount, pbk.getElementsByClass("pg").size()));
        Element hotword = pg.getElementById("hotword");

        System.out.println("Adding hotword: " + hotword.text());
        hotwords.add(hotword.text());
    }
}

Running that code produces the following output:

Starting to crawl...
Found 3 pbks.

I am either not using the JSoup API correctly, or not using the right selectors, or both. Any thoughts as to where I'm going awry?

2 Answers

If you are using getElementsByClass, you don't need a . before the class name. Pass the bare name, as in getElementsByClass("pg"), not getElementsByClass(".pg").

The same goes for getElementById: don't put a # before the id value. Just use getElementById("hotword").
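
To make the difference concrete, here is a minimal sketch that parses a small inline fragment instead of fetching a page. The fragment and the class name PrefixDemo are made up for illustration; only the Jsoup calls themselves (Jsoup.parse, getElementsByClass, getElementById, select) are real API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PrefixDemo {
    // Hypothetical fragment, loosely modeled on the markup in the question.
    static final String HTML =
        "<div class=\"pbk\"><span class=\"pg\"><span id=\"hotword\">Fizz</span></span></div>";

    // Counts elements whose class attribute contains the given name.
    static int byClassName(String name) {
        Document doc = Jsoup.parse(HTML);
        return doc.getElementsByClass(name).size();
    }

    // Counts elements matched by a CSS selector.
    static int bySelector(String css) {
        return Jsoup.parse(HTML).select(css).size();
    }

    // Reports whether getElementById finds anything for the given id string.
    static boolean idFound(String id) {
        return Jsoup.parse(HTML).getElementById(id) != null;
    }

    public static void main(String[] args) {
        System.out.println(byClassName("pg"));   // bare class name
        System.out.println(byClassName(".pg"));  // looks for a class literally named ".pg"
        System.out.println(bySelector(".pg"));   // CSS selector syntax does need the dot
        System.out.println(idFound("hotword"));  // bare id value
        System.out.println(idFound("#hotword")); // looks for an id literally named "#hotword"
    }
}
```

The prefixes belong only to the CSS selector syntax accepted by select; the getElementsBy... methods take the raw attribute value.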

Also, it seems that your divs with the pbk class are nested, so calling getElementsByClass on each of them could give you duplicate results.
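
Here is a rough sketch of how that duplication can arise. It assumes a made-up fragment in which one .pbk div wraps another; the class name NestedDemo and the fragment are illustrative and not taken from the real page.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NestedDemo {
    // Hypothetical fragment: an outer .pbk wrapping an inner .pbk.
    static final String HTML =
        "<div class=\"pbk\">"
      + "<div class=\"pbk\"><span class=\"pg\">Fizz</span></div>"
      + "</div>";

    // Mirrors the loop in the question: visit every .pbk, then every .pg below it.
    static List<String> nestedLoops() {
        Document doc = Jsoup.parse(HTML);
        List<String> out = new ArrayList<>();
        for (Element pbk : doc.select(".pbk")) {
            for (Element pg : pbk.getElementsByClass("pg")) {
                out.add(pg.text()); // the same .pg is reached via both .pbk divs
            }
        }
        return out;
    }

    // A single descendant selector visits each matching .pg once.
    static List<String> singleSelector() {
        Document doc = Jsoup.parse(HTML);
        List<String> out = new ArrayList<>();
        for (Element pg : doc.select(".pbk .pg")) {
            out.add(pg.text());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(nestedLoops());    // [Fizz, Fizz]
        System.out.println(singleSelector()); // [Fizz]
    }
}
```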


Now that I know which page you are trying to parse, you can do it with one selector. Try something like this:

for (Element element:doc.select("div.body div.pbk span.pg")){
    System.out.println(element.text());
}
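
If you need the values in a List<String> rather than printed, the same selector can feed a list. The sketch below runs against a small inline fragment that I made up to resemble the sample in the question, so it does not depend on the live site; the class name SelectorDemo is illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorDemo {
    // Hypothetical fragment: two .pg spans inside div.body, one outside it.
    static final String HTML =
        "<div class=\"body\">"
      + "<div class=\"pbk\"><span class=\"pg\"><span id=\"hotword\">Fizz</span></span></div>"
      + "<div class=\"pbk\"><span class=\"pg\">Buzz</span></div>"
      + "</div>"
      + "<div class=\"tail\"><span class=\"pg\">ignored</span></div>";

    // Collects the text of every span.pg that sits under div.body and div.pbk.
    static List<String> collect() {
        Document doc = Jsoup.parse(HTML);
        List<String> words = new ArrayList<>();
        for (Element pg : doc.select("div.body div.pbk span.pg")) {
            words.add(pg.text());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(collect()); // [Fizz, Buzz]
    }
}
```

The span under div.tail is skipped because the selector requires a div.body ancestor.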
2 Comments

Thanks @Pschemo (+1) - that helped a little, but now it's telling me that there are no hotwords in the document, which I know is false. The URL I'm actually trying to hit is http://dictionary.reference.com/browse/quick?s=t, and I'm trying to accumulate a list of all the different "word types" (adjective, noun, verb) that a particular word can be. For example, on that link, the word "quick" is 3 different types: adjective, noun and adverb. How could I tweak my JSoup selectors to obtain a List<String> with the values "adjective", "noun" and "adverb" in it?
@TicketMonster I updated my code a little. It seems to work as you want. I came up with this solution after seeing the HTML that JSoup gets from that site (you can see it with System.out.println(doc);).
0
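// Select by id; the sample markup uses id="hotword" (singular).
Elements hotwords = doc.select("#hotword");

for (Element hotword : hotwords) {
    // Element exposes its text via text(); there is no getText() in Jsoup.
    String word = hotword.text();
}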
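
One caveat with an id selector on markup like the sample in the question: the hotword spans are nested and share the same id, so a selector can match both the outer and the inner span and report each word twice. A possible workaround, sketched below on a made-up fragment, is to keep only the spans that have no child elements. The class name LeafHotwords and the fragment are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LeafHotwords {
    // Hypothetical fragment: each hotword span wraps another, as in the question.
    static final String HTML =
        "<div class=\"pbk\">"
      + "<span class=\"pg\"><span id=\"hotword\"><span id=\"hotword\">Fizz</span></span></span>"
      + "<span class=\"pg\"><span id=\"hotword\"><span id=\"hotword\">Buzz</span></span></span>"
      + "</div>";

    // Returns the text of the innermost hotword spans under .pbk .pg.
    static List<String> leafHotwords() {
        Document doc = Jsoup.parse(HTML);
        List<String> words = new ArrayList<>();
        for (Element hotword : doc.select(".pbk .pg #hotword")) {
            // Keep only the innermost spans, i.e. those with no child elements.
            if (hotword.children().isEmpty()) {
                words.add(hotword.text());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(leafHotwords());
    }
}
```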

1 Comment

Thanks @William Falcon, but that doesn't work either. The hotwords collection has a size of 0.
