Using JSoup CSS selectors

Question

I am trying to use JSoup to scrape some content off of a website. Here is some sample HTML content from the page I am interested in:

<div class="sep_top shd_hdr pb2 luna">
    <div class="KonaBody" style="padding-left:0px;">
        <div class="lunatext results_content frstluna">
            <div class="luna-Ent">
                <div class="header">
                <div class="body">
                    <div class="pbk">
                        <div id="rltqns">
                    <div class="pbk">
                        <span class="pg">
                            <span id="hotword">
                                <span id="hotword">Fizz</span>
                            </span>
                        </span>
                        <div class="luna-Ent">
                        <div class="luna-Ent">
                        <div class="luna-Ent">
                        <div class="luna-Ent">
                    </div>
                    <div class="pbk">
                        <span class="sectionLabel">
                        <span class="pg">
                            <span id="hotword">
                                <span id="hotword">Buzz</span>
                            </span>
                        </span>
                        <span class="pg">
                            <span id="hotword">
                                <span id="hotword">Foo</span>
                            </span>
                        </span>
                        <span class="pg">
                            <span id="hotword">
                                <span id="hotword">Bar</span>
                            </span>
                        </span>
                    </div>
                <div class="tail">
            </div>
            <div class="rcr">
        <!-- ... rest of content omitted for brevity -->

I am interested in obtaining a list of all the hotwords in the page (so "Fizz", "Buzz", "Foo" and "Bar"). But I can't just query for hotword, because the page uses the hotword id all over the place to decorate lots of different elements. Specifically, I need only the hotwords that sit inside a pg element that is itself inside a pbk element. Note that a pbk can contain zero or more pgs, a pg can contain zero or more hotwords, and a hotword can contain one or more other hotwords. I have the following code:

// Update, per PShemo:
Document doc = Jsoup.connect("http://somesite.example.com").get();

System.out.println("Starting to crawl...");

// Get the document's .pbk elements.
Elements pbks = doc.select(".pbk");

List<String> hotwords = new ArrayList<String>();

System.out.println(String.format("Found %s pbks.", pbks.size()));
int pbkCount = 0;
for(Element pbk : pbks) {
    pbkCount++;

    // Get the .pbk element's .pg elements.
    for(Element pg : pbk.getElementsByClass("pg")) {
        System.out.println(String.format("PBK #%s has %s pgs.", pbkCount, pbk.getElementsByClass("pg").size()));
        Element hotword = pg.getElementById("hotword");

        System.out.println("Adding hotword: " + hotword.text());
        hotwords.add(hotword.text());
    }
}

Running that code produces the following output:

Starting to crawl...
Found 3 pbks.

I am either not using the JSoup API correctly, or not using the right selectors, or both. Any thoughts as to where I'm going awry?

Pshemo · Accepted Answer · 2013-11-12 21:15:23Z

2

If you are using getElementsByClass, you don't need to put a . before the class name. Pass the bare class name, as in getElementsByClass("pg"), not getElementsByClass(".pg").

The same goes for getElementById: don't add # before the id value. Just use getElementById("hotword").
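To make the difference concrete, here is a minimal sketch. It assumes a small hand-written fragment (the HTML constant and the class and method names are hypothetical, not taken from the real page) and parses it in memory instead of fetching a URL. It suggests that getElementsByClass and getElementById expect bare names, while the . and # prefixes belong to select().

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LookupByNameDemo {
    // Hypothetical fragment loosely modeled on the question's markup.
    static final String HTML =
        "<div class=\"pbk\"><span class=\"pg\"><span id=\"hotword\">Fizz</span></span></div>";

    // Number of elements whose class attribute contains the given name.
    static int countByClass(String className) {
        Document doc = Jsoup.parse(HTML);
        return doc.getElementsByClass(className).size();
    }

    // Text of the element with the given id, or null when no element matches.
    static String textById(String id) {
        Document doc = Jsoup.parse(HTML);
        Element found = doc.getElementById(id);
        return found == null ? null : found.text();
    }

    // Text of everything matched by a CSS selector, where prefixes are expected.
    static String textBySelector(String css) {
        return Jsoup.parse(HTML).select(css).text();
    }

    public static void main(String[] args) {
        System.out.println(countByClass("pg"));            // 1
        System.out.println(countByClass(".pg"));           // 0, no class is literally named ".pg"
        System.out.println(textById("hotword"));           // Fizz
        System.out.println(textById("#hotword"));          // null
        System.out.println(textBySelector(".pg #hotword")); // Fizz
    }
}
```

Note that getElementById returns null rather than an empty collection when nothing matches, so calling text() on the result directly can throw a NullPointerException.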

Also, it seems that your divs with the pbk class are nested, so calling getElementsByClass on each of them could give you duplicate results.
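The duplication is easy to reproduce on a tiny example. The sketch below assumes a hypothetical fragment in which one .pbk div is nested inside another (the HTML constant and the method names are made up for illustration). Looping over every .pbk and searching beneath each one visits the same .pg twice, whereas a single descendant selector returns it once.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class NestedPbkDemo {
    // Hypothetical fragment: an outer .pbk wrapping an inner .pbk.
    static final String HTML =
        "<div class=\"pbk\">"
      +   "<div class=\"pbk\"><span class=\"pg\">Fizz</span></div>"
      + "</div>";

    // Mirrors the question's approach: loop over .pbk, then search below each one.
    static List<String> nestedLoops() {
        Document doc = Jsoup.parse(HTML);
        List<String> words = new ArrayList<>();
        for (Element pbk : doc.select(".pbk")) {
            // The same span is a descendant of both the outer and the inner div.
            for (Element pg : pbk.getElementsByClass("pg")) {
                words.add(pg.text());
            }
        }
        return words;
    }

    // One descendant selector: each matching element appears once.
    static List<String> singleSelector() {
        Document doc = Jsoup.parse(HTML);
        List<String> words = new ArrayList<>();
        for (Element pg : doc.select(".pbk .pg")) {
            words.add(pg.text());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(nestedLoops());    // [Fizz, Fizz]
        System.out.println(singleSelector()); // [Fizz]
    }
}
```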

Now that I know which page you are trying to parse, you can do it with one selector. Try something like this:

for (Element element : doc.select("div.body div.pbk span.pg")) {
    System.out.println(element.text());
}
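If the goal is a List<String> rather than console output, the same selector can feed a list. This is only a sketch: the HTML constant is a cleaned-up, well-formed stand-in for the sample in the question, not the markup the real site returns, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class HotwordListDemo {
    // Hypothetical, well-formed stand-in for the question's sample markup.
    static final String HTML =
        "<div class=\"body\">"
      +   "<div class=\"pbk\">"
      +     "<span class=\"pg\"><span id=\"hotword\">Fizz</span></span>"
      +   "</div>"
      +   "<div class=\"pbk\">"
      +     "<span class=\"sectionLabel\"></span>"
      +     "<span class=\"pg\"><span id=\"hotword\">Buzz</span></span>"
      +     "<span class=\"pg\"><span id=\"hotword\">Foo</span></span>"
      +     "<span class=\"pg\"><span id=\"hotword\">Bar</span></span>"
      +   "</div>"
      + "</div>";

    // Collects the text of every span.pg found under div.body div.pbk, in document order.
    static List<String> hotwords(String html) {
        Document doc = Jsoup.parse(html);
        List<String> words = new ArrayList<>();
        for (Element pg : doc.select("div.body div.pbk span.pg")) {
            words.add(pg.text());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(hotwords(HTML)); // [Fizz, Buzz, Foo, Bar]
    }
}
```

Against the live site, the Jsoup.parse(html) call would be replaced by the Jsoup.connect(...).get() call from the question, and the selector may need adjusting to whatever markup the server actually sends.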

edited Nov 12, 2013 at 21:15

answered Nov 12, 2013 at 20:35

Pshemo


2 Comments

IAmYourFaja Over a year ago

Thanks @Pshemo (+1) - that helped a little, but now it's telling me that there are no hotwords in the document, which I know is false. The URL I'm actually trying to hit is http://dictionary.reference.com/browse/quick?s=t, and I'm trying to accumulate a list of all the different "word types" (adjective, noun, verb) that a particular word can be. For example, on that link, the word "quick" has 3 different types: adjective, noun and adverb. How could I tweak my JSoup selectors to obtain a List<String> with the values "adjective", "noun" and "adverb" in it?

Pshemo Over a year ago

@TicketMonster I updated my code a little, and it seems to work as you want. I came up with this solution after looking at the HTML that JSoup actually gets from that site (you can see it with System.out.println(doc);).

Prasad Khode · Accepted Answer · 2016-05-23 09:11:11Z

0

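// The sample markup uses id="hotword" (singular); select() takes the CSS "#" prefix.
Elements hotwords = document.select("#hotword");

for (Element hotword : hotwords) {
    // Element exposes text(); there is no getText() method.
    String word = hotword.text();
}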

edited May 23, 2016 at 9:11

Prasad Khode


answered Nov 12, 2013 at 20:32

William Falcon


1 Comment

IAmYourFaja Over a year ago

Thanks @William Falcon, but that doesn't work either. The hotword variable has a size of 0.
