Parsing Information from URL Using Jsoup

Question

I need help with my Java project using Jsoup (if there is a more efficient way to achieve this, please let me know). The program should parse certain useful information from different URLs and write it to a text file. I am not an expert in HTML or JavaScript, so it has been difficult to express in Java exactly what I want to parse. On the website used in the code below as one example, the information I want to parse with Jsoup is everything in the table under “Routing” (Route, Location, Vessel/Voyage, Container Arrival Date, Container Departure Date; for example: Origin, Seattle SSA Terminal T18, 26 Jun 15 A, 26 Jun 15 A, and so on). So far, Jsoup only gives me the title of the page, and I have been unable to get any of the body. Here is the code I used, which I got from an online source:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Jsouptest71115 {

    public static void main(String[] args) throws Exception {
        String url = "http://homeport8.apl.com/gentrack/trackingMain.do"
                + "?trackInput01=999061985";
        Document document = Jsoup.connect(url).get();

        String title = document.title();
        System.out.println("title : " + title);

        String body = document.select("body").text();
        System.out.println("Body: " + body);
    }
}

Community · Accepted Answer · 2020-06-20 09:12:55Z

Working code:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;

public class Sample {
    public static void main(String[] args) {
        String url = "http://homeport8.apl.com/gentrack/blRoutingPopup.do";

        try {
            Connection.Response response = Jsoup.connect(url)
                    .data("blNbr", "999061985")  // tracking number
                    .method(Connection.Method.POST)
                    .execute();

            Element tableElement = response.parse().getElementsByTag("table")
                    .get(2).getElementsByTag("table")
                    .get(2);

            Elements trElements = tableElement.getElementsByTag("tr");
            ArrayList<ArrayList<String>> tableArrayList = new ArrayList<>();

            for (Element trElement : trElements) {
                ArrayList<String> columnList = new ArrayList<>();
                for (int i = 0; i < 5; i++) {
                    columnList.add(i, trElement.children().get(i).text());
                }
                tableArrayList.add(columnList);
            }

            System.out.println("Origin/Location: "
                    + tableArrayList.get(1).get(1)); // row and column index

            System.out.println("Discharge Port/Container Arrival Date: "
                    + tableArrayList.get(5).get(3));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Output:

Origin/Location: SEATTLE SSA TERMINAL (T18), WA

Discharge Port/Container Arrival Date: 23 Jul 15 E
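
Since the stated goal is to put the parsed values in a text file, here is a minimal sketch of that last step using only the standard library. The class name RoutingFileWriter, the method writeRows, the tab separator, and the file name routing.txt are all assumptions made for illustration; the sample rows stand in for tableArrayList.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RoutingFileWriter {

    // Joins the cells of each row with tabs and writes one row per line.
    // The wildcard lets an ArrayList<ArrayList<String>> be passed in directly.
    static void writeRows(List<? extends List<String>> rows, Path target)
            throws IOException {
        List<String> lines = new ArrayList<>();
        for (List<String> row : rows) {
            lines.add(String.join("\t", row));
        }
        Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Illustrative rows only; in the program above you would pass tableArrayList.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Route", "Location"),
                Arrays.asList("Origin", "SEATTLE SSA TERMINAL (T18), WA"));
        writeRows(rows, Paths.get("routing.txt"));
    }
}
```

Files.write replaces the file if it already exists; to collect results from several URLs in one file, passing StandardOpenOption.CREATE and StandardOpenOption.APPEND as extra arguments would be one option.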

user2009750 · Accepted Answer · 2015-07-15 06:28:13Z

You need to use the document.select("body") method, whose argument is a CSS selector. To learn more about CSS selectors, search for them online or read a reference on them. With CSS selectors you can easily identify parts of the page body.

In your particular case there is a further problem, though: the table you are after is inside an iframe. If you look at the HTML of the page you are visiting, the iframe's URL is "http://homeport8.apl.com/gentrack/blRoutingFrame.do". If you visit this URL directly, you will get an exception, perhaps because of some restriction on the server. To get the content properly you need to visit two URLs via Jsoup, in this order:

1. http://homeport8.apl.com/gentrack/trackingMain.do?trackInput01=999061985
2. http://homeport8.apl.com/gentrack/blRoutingFrame.do?trackInput01=999061985

The first URL returns nothing useful, but the second returns the tables you are interested in. Then use document.select("table"), which gives you a list of tables; iterate over this list to find the table you want. Once you have the table, use Element.select("tr") to get its rows, and for each "tr" use Element.select("td") to get the cell data.
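
The table, row, and cell selection described above can be sketched against an inline HTML string, so it runs without a network connection. The class name TableCells, the method extractRows, and the sample markup are made up for illustration; the real page's structure may differ, and the right table index has to be found by inspection.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class TableCells {

    // Returns the text of the "td" cells, row by row, for the table at
    // tableIndex in document order. Rows without "td" cells are skipped.
    static List<List<String>> extractRows(String html, int tableIndex) {
        Document document = Jsoup.parse(html);
        Elements tables = document.select("table");
        Element table = tables.get(tableIndex);

        List<List<String>> rows = new ArrayList<>();
        for (Element row : table.select("tr")) {
            List<String> cells = new ArrayList<>();
            for (Element cell : row.select("td")) {
                cells.add(cell.text());
            }
            if (!cells.isEmpty()) {
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample markup standing in for the fetched page.
        String html = "<table>"
                + "<tr><th>Route</th><th>Location</th></tr>"
                + "<tr><td>Origin</td><td>SEATTLE SSA TERMINAL (T18), WA</td></tr>"
                + "</table>";
        for (List<String> row : extractRows(html, 0)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Note that select("tr") also matches rows of any tables nested inside the chosen one, so on a page with nested tables it is safer to pick the innermost table that holds the data.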

The page you are visiting does not use CSS classes or ids, which would have made reading it with Jsoup a lot easier, so I am afraid iterating over document.select("table") is your best and easiest option.

Good Luck.
