I have the following HTML, which I am trying to parse into Java objects using jsoup.

I'm trying to traverse the elements and extract every "Class" as an object to generate timetable data. Each "Class" has a time, location, lecturer, description and so on, but that is not the issue. All of the top-level elements have the class tt_details, and a day is not a parent of its "Classes", so there is no parent-child relationship to follow. I can, however, extract the day names using Elements dayNames = content.getElementsByClass("tt_day");

Each day can have a different number of "Classes": as you can see, Monday has 3 and Tuesday has 4, so a fixed-size loop won't work. How can I achieve this?

<div class='tt_details'>
    <div class='tt_day'>Mon</div>
</div>
<div class='tt_details'>
    <div class='tt_timeslot'>11:00 - 13:00
        <div class='tt_day_small'> (Mon)</div>
    </div>
    <div class='tt_detail'>Internet of Things<br/>E1010 - MAC Lab <br/></div>
    <div class='tt_lecturer'>Loftus, M</div>
</div>
<div class='tt_details'>
    <div class='tt_timeslot'>13:00 - 14:00
        <div class='tt_day_small'> (Mon)</div>
    </div>
    <div class='tt_detail'>Computer Systems & Networking<br/>A0004 - Tiered Lecture Theatre (132) <br/></div>
    <div class='tt_lecturer'>Lang, D</div>
</div>
<div class='tt_details'>
    <div class='tt_timeslot'>16:00 - 18:00
        <div class='tt_day_small'> (Mon)</div>
    </div>
    <div class='tt_detail'>Intro.to Programming L8<br/>D2005 - Computer Laboratory (32) <br/></div>
    <div class='tt_lecturer'>Kinsella,V</div>
</div>
<div class='tt_details'>
    <div class='tt_day'>Tue</div>
</div>
<div class='tt_details'>
    <div class='tt_timeslot'>09:00 - 10:00
        <div class='tt_day_small'> (Tue)</div>
    </div>
    <div class='tt_detail'>Mathematics 2<br/>A0004 - Tiered Lecture Theatre (132) <br/></div>
    <div class='tt_lecturer'>O'Regan,D</div>
</div>
<div class='tt_details'>
    <div class='tt_timeslot'>10:00 - 11:00
        <div class='tt_day_small'> (Tue)</div>
    </div>
    <div class='tt_detail'>Mathematics 2<br/>E0017 - Tiered Classroom (106) <br/></div>
    <div class='tt_lecturer'>O'Regan,D</div>
</div>
<div class='tt_details'>
    <div class='tt_timeslot'>11:00 - 12:00
        <div class='tt_day_small'> (Tue)</div>
    </div>
    <div class='tt_detail'>Intro to Programming<br/>A0006 - Tiered Lecture Theatre (152) <br/></div>
    <div class='tt_lecturer'>Kinsella,V</div>
</div>
<div class='tt_details'>
    <div class='tt_timeslot'>16:00 - 17:00
        <div class='tt_day_small'> (Tue)</div>
    </div>
    <div class='tt_detail'>Computer Systems & Networking<br/>A0006 - Tiered Lecture Theatre (152) <br/></div>
    <div class='tt_lecturer'>Lang, D</div>
</div>
  • Can you use jQuery with your project? This would make it much, much easier. Commented Feb 10, 2016 at 18:16
  • HTML is like XML, so you can create an object with XML annotations and use marshal and unmarshal. Commented Feb 10, 2016 at 18:24
  • It's a scraper to feed an Android app, so I'm not using any jQuery. Commented Feb 10, 2016 at 18:28

3 Answers

Something like this could help:

String html = ""
        +"<div class='tt_details'>"
        +"    <div class='tt_day'>Mon</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_timeslot'>11:00 - 13:00"
        +"        <div class='tt_day_small'> (Mon)</div>"
        +"    </div>"
        +"    <div class='tt_detail'>Internet of Things<br/>E1010 - MAC Lab <br/></div>"
        +"    <div class='tt_lecturer'>Loftus, M</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_timeslot'>13:00 - 14:00"
        +"        <div class='tt_day_small'> (Mon)</div>"
        +"    </div>"
        +"    <div class='tt_detail'>Computer Systems & Networking<br/>A0004 - Tiered Lecture Theatre (132) <br/></div>"
        +"    <div class='tt_lecturer'>Lang, D</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_timeslot'>16:00 - 18:00"
        +"        <div class='tt_day_small'> (Mon)</div>"
        +"    </div>"
        +"    <div class='tt_detail'>Intro.to Programming L8<br/>D2005 - Computer Laboratory (32) <br/></div>"
        +"    <div class='tt_lecturer'>Kinsella,V</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_day'>Tue</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_timeslot'>09:00 - 10:00"
        +"        <div class='tt_day_small'> (Tue)</div>"
        +"    </div>"
        +"    <div class='tt_detail'>Mathematics 2<br/>A0004 - Tiered Lecture Theatre (132) <br/></div>"
        +"    <div class='tt_lecturer'>O'Regan,D</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_timeslot'>10:00 - 11:00"
        +"        <div class='tt_day_small'> (Tue)</div>"
        +"    </div>"
        +"    <div class='tt_detail'>Mathematics 2<br/>E0017 - Tiered Classroom (106) <br/></div>"
        +"    <div class='tt_lecturer'>O'Regan,D</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_timeslot'>11:00 - 12:00"
        +"        <div class='tt_day_small'> (Tue)</div>"
        +"    </div>"
        +"    <div class='tt_detail'>Intro to Programming<br/>A0006 - Tiered Lecture Theatre (152) <br/></div>"
        +"    <div class='tt_lecturer'>Kinsella,V</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_timeslot'>16:00 - 17:00"
        +"        <div class='tt_day_small'> (Tue)</div>"
        +"    </div>"
        +"    <div class='tt_detail'>Computer Systems & Networking<br/>A0006 - Tiered Lecture Theatre (152) <br/></div>"
        +"    <div class='tt_lecturer'>Lang, D</div>"
        +"</div>"
        ;
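// Needs org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
// org.jsoup.select.Elements, and java.util.{ArrayList, HashMap, List, Map}.
Document doc = Jsoup.parse(html);
// Select only the course blocks: tt_details divs that do not contain a tt_day header.
Elements courseEls = doc.select("div.tt_details:not(:has(div.tt_day))");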
// Simple value object holding one timetable entry.
class Course {
    public final String day;
    public final String time;
    public final String lecturer;
    public final String subject;

    public Course(String day, String time, String lecturer, String subject) {
        this.day = day;
        this.time = time;
        this.lecturer = lecturer;
        this.subject = subject;
    }

    @Override
    public String toString() {
        return day + " : " + time + " : " + lecturer + " : " + subject;
    }
}
Map<String,List<Course>> coursesByDay = new HashMap<>();
for (Element courseEl : courseEls){
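    Element timeSlotEl = courseEl.select(".tt_timeslot").first();
    // ownText() returns only this element's own text (the time), not the nested tt_day_small text.
    String timeSlotStr = timeSlotEl.ownText();
    // The nested div reads "(Mon)"; strip the parentheses to get "Mon".
    String dayStr = timeSlotEl.select(".tt_day_small").first().text().trim().replace("(", "").replace(")", "");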
    String detailStr = courseEl.select(".tt_detail").first().text();
    String lecturerStr = courseEl.select(".tt_lecturer").first().text();

    Course course = new Course(dayStr, timeSlotStr, lecturerStr, detailStr);
    List<Course> courses = coursesByDay.get(dayStr);
    if (courses == null){
        courses = new ArrayList<>();
        coursesByDay.put(dayStr, courses);
    }
    courses.add(course);
}

//get all courses on Tue
List<Course> courses = coursesByDay.get("Tue");
for (Course c : courses){
    System.out.println(c);
}

This creates a map of courses by day: each key is a day name, and each value is the list of Course objects for that day.

Some remarks about this:

  • I use a custom class to hold the course information.
  • I use the selector div.tt_details:not(:has(div.tt_day)) to get only the course divs, leaving out the day divs. This is possible because the day is repeated within each course div.
  • CSS selectors are used to get the details.
  • Note the difference between ownText() and text(): ownText() is used here to get only the time, without the day text from the nested div.
  • The Map is filled dynamically, with a new list created the first time each day is seen.
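
One caveat with the HashMap above: its iteration order is unspecified, so looping over the whole map may not give the days in timetable order. A minimal sketch, assuming you want the days in the order they first appear, swaps in a LinkedHashMap. The class name OrderedCoursesByDay and the {day, description} string pairs are hypothetical stand-ins for the Course objects, used only to keep the example free of jsoup:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; the names are illustrative and not from the answer.
class OrderedCoursesByDay {

    // Groups {day, description} pairs by day, keeping days in first-seen order.
    static Map<String, List<String>> group(String[][] rows) {
        // LinkedHashMap iterates in insertion order, unlike HashMap.
        Map<String, List<String>> byDay = new LinkedHashMap<>();
        for (String[] row : rows) {
            String day = row[0];
            List<String> entries = byDay.get(day);
            if (entries == null) {
                entries = new ArrayList<>();
                byDay.put(day, entries);
            }
            entries.add(row[1]);
        }
        return byDay;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"Mon", "11:00 - 13:00 Internet of Things"},
            {"Mon", "13:00 - 14:00 Computer Systems & Networking"},
            {"Tue", "09:00 - 10:00 Mathematics 2"},
        };
        // Days are printed in the order they were first added.
        for (Map.Entry<String, List<String>> e : group(rows).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The only change relative to the code above is the map type; the get/put logic is the same.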

3 Comments

I'm unable to iterate through the set of elements. The line for (Element courseEl : courseEls){ is giving a compile error saying Unknown class: courselEls
courseEls should be of class Elements, which implements the List interface. You should be able to iterate over this. Did I or you make a typo?
Apologies, it was caused by a typo.

Try this

static final String[] DETAILS = { "tt_timeslot", "tt_day_small", "tt_detail", "tt_lecturer" };

and

     Document doc = Jsoup.parse(html);
     String day = null;
     for (Element e : doc.select("div.tt_details")) {
         Elements days = e.select("div.tt_day");
         if (days.size() > 0) {
             day = days.get(0).text();
             System.out.printf("    *** %s ***%n", day);
         } else {
             System.out.printf("        --------%n");
             for (String cls : DETAILS) {
                 Elements elements = e.select("div." + cls);
                 if (elements.size() > 0)
                     System.out.printf("%24s : %s%n", cls, elements.get(0).text());
             }
         }
     }

result

*** Mon ***
    --------
         tt_timeslot : 11:00 - 13:00 (Mon)
        tt_day_small : (Mon)
           tt_detail : Internet of Things E1010 - MAC Lab
         tt_lecturer : Loftus, M
    --------
         tt_timeslot : 13:00 - 14:00 (Mon)
        tt_day_small : (Mon)
           tt_detail : Computer Systems & Networking A0004 - Tiered Lecture Theatre (132)
         tt_lecturer : Lang, D
    --------
         tt_timeslot : 16:00 - 18:00 (Mon)
        tt_day_small : (Mon)
           tt_detail : Intro.to Programming L8 D2005 - Computer Laboratory (32)
         tt_lecturer : Kinsella,V
*** Tue ***
    --------
         tt_timeslot : 09:00 - 10:00 (Tue)
        tt_day_small : (Tue)
           tt_detail : Mathematics 2 A0004 - Tiered Lecture Theatre (132)
         tt_lecturer : O'Regan,D
    --------
         tt_timeslot : 10:00 - 11:00 (Tue)
        tt_day_small : (Tue)
           tt_detail : Mathematics 2 E0017 - Tiered Classroom (106)
         tt_lecturer : O'Regan,D
    --------
         tt_timeslot : 11:00 - 12:00 (Tue)
        tt_day_small : (Tue)
           tt_detail : Intro to Programming A0006 - Tiered Lecture Theatre (152)
         tt_lecturer : Kinsella,V
    --------
         tt_timeslot : 16:00 - 17:00 (Tue)
        tt_day_small : (Tue)
           tt_detail : Computer Systems & Networking A0006 - Tiered Lecture Theatre (152)
         tt_lecturer : Lang, D
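
The loop above prints each field as it goes. The same idea, remembering the most recent day header and attaching it to every block that follows, can also build objects. Below is a library-independent sketch, assuming each tt_details block has already been reduced to a small string array; the Lesson type and the "day"/"class" row tags are hypothetical and not part of the answer:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class showing the "carry the current day forward" loop.
class DayCarryForward {

    // Hypothetical value type for one timetable entry.
    static final class Lesson {
        final String day;
        final String time;
        final String detail;

        Lesson(String day, String time, String detail) {
            this.day = day;
            this.time = time;
            this.detail = detail;
        }

        @Override
        public String toString() {
            return day + " " + time + " " + detail;
        }
    }

    // Each row is either {"day", name} (a header) or {"class", time, detail}.
    static List<Lesson> parse(String[][] rows) {
        List<Lesson> lessons = new ArrayList<>();
        String currentDay = null; // stays null until the first header is seen
        for (String[] row : rows) {
            if ("day".equals(row[0])) {
                currentDay = row[1];          // header: update the running day
            } else {
                lessons.add(new Lesson(currentDay, row[1], row[2]));
            }
        }
        return lessons;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"day", "Mon"},
            {"class", "11:00 - 13:00", "Internet of Things"},
            {"day", "Tue"},
            {"class", "09:00 - 10:00", "Mathematics 2"},
            {"class", "10:00 - 11:00", "Mathematics 2"},
        };
        for (Lesson lesson : parse(rows)) {
            System.out.println(lesson);
        }
    }
}
```

Because the day is carried in a variable, it does not matter how many blocks follow each header.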

1 Comment

Selecting elements by class is better done with element.select(".classname"), because if an element carries multiple classes you can't be sure about the order of the class names. With the dot syntax you can handle this easily. You can also chain classes like this: el.select(".className1.className2"). Still, I like your simple approach. +1

If this comes from the HTML source of an online page, you can use Selenium for this; you will need to add the Selenium JARs to your project.

My suggestion:

String datentime = driver.findElement(By.className("tt_timeslot")).getText();

If several elements share the same name, use unique ids, CSS selectors, or XPath expressions instead.

2 Comments

The only information I have to work with is what's posted in the HTML; there are no other ids or CSS selectors to work with.
You can build XPath expressions using the existing <div> elements as locations, for example: //div/div[@class='tt_day']
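
For illustration, the XPath above can be tried without Selenium using the JDK's built-in javax.xml.xpath API. This is only a sketch: it assumes the markup is well-formed XML (real pages often are not, and an unescaped & as in the question's HTML would be rejected by the XML parser), and the wrapping <root> element and the class name XPathDayNames are made up for the example:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical demo class; evaluates the XPath from the comment on a small fragment.
class XPathDayNames {

    // Returns the text of every div with class tt_day that is a child of a div.
    static List<String> dayNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//div/div[@class='tt_day']", doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getTextContent());
            }
            return names;
        } catch (Exception e) {
            // Parsing fails if the input is not well-formed XML.
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Assumed, simplified fragment wrapped in a single root element.
        String xml = "<root>"
                + "<div class='tt_details'><div class='tt_day'>Mon</div></div>"
                + "<div class='tt_details'><div class='tt_lecturer'>Loftus, M</div></div>"
                + "<div class='tt_details'><div class='tt_day'>Tue</div></div>"
                + "</root>";
        System.out.println(dayNames(xml));
    }
}
```

Note that this XPath only finds the day headers; the classes belonging to each day still have to be associated with them by walking the following blocks.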
