JAVA parsing table data

Question

I would like to extract some HTML data from a page source. The reference page is view-source:http://www.4icu.org/reviews/index2.htm. How can I extract only the university name and the country name with Java? I already know how to extract just the university names, as each one sits between a pair of tags, but how could I make the program faster by scanning only the table cells with class="i", and also extract the country (e.g. United States) from the <...alt="United States" /> attribute?

<tr>
<td><a name="UNIVERSITIES-BY-NAME"></a><h2>A-Z list of world Universities and Colleges</h2>
</tr>

<tr>
<td class="i"><a href="/reviews/9107.htm"> A.T. Still University</a></td>
<td width="50" align="right" nowrap>us <img src="/i/bg.gif" class="fl flag-us" alt="United States" /></td>
</tr>

Thanks in advance.

EDIT: Following what @11thdimension suggested, here is my .java file:

public class University {
    public static void main(String[] args) throws Exception {
        System.out.println("Started");

        URL url = new URL ("http://www.4icu.org/reviews/index2.htm");

        URLConnection spoof = url.openConnection();        
        // Spoof the connection so we look like a web browser
        spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");

        String connect = url.toString();
        Document doc = Jsoup.connect(connect).get();

        Elements cells = doc.select("td.i");

        Iterator<Element> iterator = cells.iterator();

        while (iterator.hasNext()) {
            Element cell = iterator.next();
            String university = cell.select("a").text();
            String country = cell.nextElementSibling().select("img").attr("alt");

            System.out.printf("country : %s, university : %s %n", country, university);
        }
    }
}

However, when I run it, it gives me the following error.

Started
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://www.4icu.org/reviews/index2.htm

EDIT2: I have created the following program to print the response headers of the HTML page.

import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Map;

public class Get_Header {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://www.4icu.org/reviews/index2.htm");
    URLConnection connection = url.openConnection();

    // Header name -> values; the HTTP status line is stored under the null key.
    Map<String, List<String>> responseMap = connection.getHeaderFields();
    for (Map.Entry<String, List<String>> entry : responseMap.entrySet()) {
      System.out.println(entry.getKey() + " = ");

      for (String value : entry.getValue()) {
        System.out.println(value + ", ");
      }
    }
  }
}

It returns the following result.

X-Frame-Options = 
SAMEORIGIN, 
Transfer-Encoding = 
chunked, 
null = 
HTTP/1.1 403 Forbidden, 
CF-RAY = 
2ca61c7a769b1980-HKG, 
Server = 
cloudflare-nginx, 
Cache-Control = 
max-age=10, 
Connection = 
keep-alive, 
Set-Cookie = 
__cfduid=d4f8d740e0ae0dd551be15e031359844d1469853403; expires=Sun, 30-Jul-17 04:36:43 GMT; path=/; domain=.4icu.org; HttpOnly, 
Expires = 
Sat, 30 Jul 2016 04:36:53 GMT, 
Date = 
Sat, 30 Jul 2016 04:36:43 GMT, 
Content-Type = 
text/html; charset=UTF-8,

I can get the headers, but how should I combine the code in EDIT and EDIT2 into one complete program? Thanks.

How long should the solution be to justify putting the question on hold?
– 11thdimension, Commented Jul 28, 2016 at 18:36
I have edited the question to narrow down my issue. Thanks.
– Kennedy Kan, Commented Jul 29, 2016 at 16:19
I also tried to do it using URL, but the site seems to block scripted download attempts, probably because it expects certain headers. If you copy the headers that your browser sends and set them on the connection, it should work with URL as well.
– 11thdimension, Commented Jul 30, 2016 at 1:45
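
The header suggestion in the comment above could be sketched roughly as follows, using only java.net. The class name, the helper methods, and all header values are assumptions made for illustration; the real values would have to be copied from the browser's network tab, and there is no guarantee the site will accept them.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.LinkedHashMap;
import java.util.Map;

class BrowserLikeRequest {
    // Assumed header values; replace them with the ones your own browser sends.
    static Map<String, String> browserHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
        headers.put("Accept", "text/html,application/xhtml+xml");
        headers.put("Accept-Language", "en-US,en;q=0.5");
        return headers;
    }

    // Creates a connection object carrying the given request headers.
    // Nothing is sent over the network until the response is first read.
    static URLConnection open(String address, Map<String, String> headers) {
        try {
            URLConnection connection = new URL(address).openConnection();
            headers.forEach(connection::setRequestProperty);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection connection =
                open("http://www.4icu.org/reviews/index2.htm", browserHeaders());
        // For HTTP, header field 0 is the status line, the same value that
        // appears under the null key in the header dump from EDIT2.
        System.out.println(connection.getHeaderField(0));
    }
}
```

If the status line no longer reports 403, the stream from connection.getInputStream() could then be handed to Jsoup.parse(InputStream, String, String) so that the td.i selection from the first EDIT runs on the downloaded page.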

11thdimension · Accepted Answer · 2016-07-28 18:17:09Z

1

If this is going to be a one-time task, you should probably use JavaScript for it.

The following code logs the required names to the console. You'll have to run it in the browser console.

(function () {
    var a = [];
    document.querySelectorAll("td.i a").forEach(function (anchor) { a.push(anchor.textContent.trim());});

    console.log(a.join("\n"));
})();

The following is a Java example using Jsoup selectors.

Maven Dependency

<dependencies>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.8.3</version>
    </dependency>
</dependencies>

Java Code

import java.io.File;
import java.util.Iterator;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TestJsoup {
    public static void main(String[] args) throws Exception {
        System.out.println("Started");

        File file = new File("A-Z list of 11930 World Colleges & Universities.html");
        Document doc = Jsoup.parse(file, "UTF-8");

        Elements cells = doc.select("td.i");

        Iterator<Element> iterator = cells.iterator();

        while (iterator.hasNext()) {
            Element cell = iterator.next();
            String university = cell.select("a").text();
            String country = cell.nextElementSibling().select("img").attr("alt");

            System.out.printf("country : %s, university : %s %n", country, university);
        }
    }
}
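
Jsoup.parse(File, String) only reads a local file, so passing a web address to new File(...) ends in a FileNotFoundException. To fetch the page over HTTP instead, a minimal sketch could use Jsoup.connect with a browser-like user agent. The class name, the extract helper, the user-agent string, and the timeout are assumptions for illustration, and the site may still answer 403 if it checks more than the User-Agent header.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class FetchUniversities {
    // Assumed browser-like User-Agent; copy the value your own browser sends.
    static final String USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";

    // Builds one "country : X, university : Y" line per td.i cell.
    static List<String> extract(Document doc) {
        List<String> rows = new ArrayList<>();
        for (Element cell : doc.select("td.i")) {
            String university = cell.select("a").text();
            // The flag image with the country name lives in the next cell of the row.
            Element next = cell.nextElementSibling();
            String country = (next == null) ? "" : next.select("img").attr("alt");
            rows.add(String.format("country : %s, university : %s", country, university));
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://www.4icu.org/reviews/index2.htm")
                .userAgent(USER_AGENT)   // sent as the User-Agent request header
                .timeout(10_000)         // milliseconds
                .get();
        extract(doc).forEach(System.out::println);
    }
}
```

Keeping the selection logic in extract means the same method works on a document parsed from a saved file, so the index number in the URL can be changed without touching the parsing code.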

edited Jul 28, 2016 at 18:17

answered Jul 28, 2016 at 16:39


2 Comments

Kennedy Kan Over a year ago

Thanks. The program will run multiple times, as there are various index numbers to be changed in the HTTP link. I am just curious how I can grab only the country name from alt="United States" with Java. Thanks.

Kennedy Kan Over a year ago

Thanks for the help. However, when I put the link 4icu.org/reviews/index2.htm in place of A-Z list of 11930 World Colleges & Universities.html, it gives me Exception in thread "main" java.io.FileNotFoundException: www.4icu.org\reviews\index2.htm. I have modified my question to make it clearer.
