I would like to extract some HTML data from a page's source. The reference page is view-source:http://www.4icu.org/reviews/index2.htm. How can I extract only the university name and the country name with Java? I know how to extract just the university name, since it sits between a pair of tags, but how can I make the program faster by scanning only the table cells with class="i", and also extract the country (e.g. United States) from the <...alt="United States" /> attribute?

<tr>
<td><a name="UNIVERSITIES-BY-NAME"></a><h2>A-Z list of world Universities and Colleges</h2>
</tr>

<tr>
<td class="i"><a href="/reviews/9107.htm"> A.T. Still University</a></td>
<td width="50" align="right" nowrap>us <img src="/i/bg.gif" class="fl flag-us" alt="United States" /></td>
</tr>

Thanks in advance.

EDIT: Following what @11thdimension suggested, here is my .java file:

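import java.net.URL;
import java.net.URLConnection;
import java.util.Iterator;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class University {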
    public static void main(String[] args) throws Exception {
        System.out.println("Started");

        URL url = new URL ("http://www.4icu.org/reviews/index2.htm");

        URLConnection spoof = url.openConnection();        
        // Spoof the connection so we look like a web browser
        spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");

        String connect = url.toString();
        Document doc = Jsoup.connect(connect).get();

        Elements cells = doc.select("td.i");

        Iterator<Element> iterator = cells.iterator();

        while (iterator.hasNext()) {
            Element cell = iterator.next();
            String university = cell.select("a").text();
            String country = cell.nextElementSibling().select("img").attr("alt");

            System.out.printf("country : %s, university : %s %n", country, university);
        }
    }
}

However, when I run it, it gives me the following error.

Started
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://www.4icu.org/reviews/index2.htm

EDIT2: I have written the following program to print the HTTP response headers of the site.

import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Map;

public class Get_Header {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://www.4icu.org/reviews/index2.htm");
    URLConnection connection = url.openConnection();

    // Maps each header name to its values; the status line is stored under the null key
    Map<String, List<String>> responseMap = connection.getHeaderFields();
    for (Map.Entry<String, List<String>> entry : responseMap.entrySet()) {
      System.out.println(entry.getKey() + " = ");

      for (String value : entry.getValue()) {
        System.out.println(value + ", ");
      }
    }
  }
}

It returns the following result.

X-Frame-Options = 
SAMEORIGIN, 
Transfer-Encoding = 
chunked, 
null = 
HTTP/1.1 403 Forbidden, 
CF-RAY = 
2ca61c7a769b1980-HKG, 
Server = 
cloudflare-nginx, 
Cache-Control = 
max-age=10, 
Connection = 
keep-alive, 
Set-Cookie = 
__cfduid=d4f8d740e0ae0dd551be15e031359844d1469853403; expires=Sun, 30-Jul-17 04:36:43 GMT; path=/; domain=.4icu.org; HttpOnly, 
Expires = 
Sat, 30 Jul 2016 04:36:53 GMT, 
Date = 
Sat, 30 Jul 2016 04:36:43 GMT, 
Content-Type = 
text/html; charset=UTF-8, 

I can get the headers, but how should I combine the code in EDIT and EDIT2 into one complete program? Thanks.

  • Do you need to do it once, or will it be a repetitive task? Commented Jul 28, 2016 at 16:16
  • How long should the solution be to justify putting the question on hold? Commented Jul 28, 2016 at 18:36
  • I have edited the question to narrow down my issue. Thanks. Commented Jul 29, 2016 at 16:19
  • I also tried to do it using URL, but the site seems to block script download attempts; it must be because of certain headers that it expects. If you copy the headers that your browser sends and set them on the connection, it should work with URL as well. Commented Jul 30, 2016 at 1:45
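
The suggestion in the last comment can be sketched with the plain java.net classes. This is a minimal sketch, not a guaranteed fix: the class name BrowserLikeRequest, the helper names, and the header values are assumptions made for illustration. The real values should be copied from the browser's network tab, and the site may still refuse the request if it checks more than the headers.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

class BrowserLikeRequest {
    /** Headers resembling a browser request; these particular values are illustrative assumptions. */
    static Map<String, String> browserHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0");
        headers.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        headers.put("Accept-Language", "en-US,en;q=0.5");
        return headers;
    }

    /** Opens a connection object carrying the headers above; nothing is sent yet. */
    static HttpURLConnection prepare(URL url) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        for (Map.Entry<String, String> header : browserHeaders().entrySet()) {
            connection.setRequestProperty(header.getKey(), header.getValue());
        }
        return connection;
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection connection = prepare(new URL("http://www.4icu.org/reviews/index2.htm"));
        // getResponseCode() is the call that actually sends the request
        System.out.println("Status: " + connection.getResponseCode());
        connection.disconnect();
    }
}
```

If the printed status is still 403, comparing these headers with the ones the browser really sends is a reasonable next step.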

1 Answer

If it's going to be a one-time task, you should probably use JavaScript for it.

The following code logs the required names to the console. You'll have to run it in the browser console with the page open.

(function () {
    var a = [];
    document.querySelectorAll("td.i a").forEach(function (anchor) { a.push(anchor.textContent.trim());});

    console.log(a.join("\n"));
})();

The following is a Java example that uses Jsoup selectors on a saved copy of the page.

Maven Dependency

<dependencies>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.8.3</version>
    </dependency>
</dependencies>

Java Code

import java.io.File;
import java.util.Iterator;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TestJsoup {
    public static void main(String[] args) throws Exception {
        System.out.println("Started");

        // Parse a copy of the page saved from the browser
        File file = new File("A-Z list of 11930 World Colleges & Universities.html");
        Document doc = Jsoup.parse(file, "UTF-8");

        // Each university name sits in a <td class="i"> cell
        Elements cells = doc.select("td.i");

        Iterator<Element> iterator = cells.iterator();

        while (iterator.hasNext()) {
            Element cell = iterator.next();
            String university = cell.select("a").text();
            // The country name is the alt text of the flag image in the next cell
            String country = cell.nextElementSibling().select("img").attr("alt");

            System.out.printf("country : %s, university : %s %n", country, university);
        }
    }
}
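
For a page fetched over HTTP rather than read from disk, a minimal sketch follows. Jsoup.parse(File, ...) only reads local files, which is why passing it a URL ends in a FileNotFoundException. Jsoup.connect(url) opens its own connection, so a User-Agent set on a separate URLConnection (as in the question's EDIT) never reaches the server. The class name UniversityScraper, the helper extractRows, and the User-Agent string are assumptions made for illustration, and the site may still answer 403 if it checks more than the User-Agent.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class UniversityScraper {
    // Assumed browser-like value; any current browser's User-Agent could be used instead
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0";

    /** Returns one "country : X, university : Y" line per td.i cell in the document. */
    static List<String> extractRows(Document doc) {
        List<String> rows = new ArrayList<>();
        for (Element cell : doc.select("td.i")) {
            String university = cell.select("a").text();
            // The flag image lives in the cell right after the name cell
            Element next = cell.nextElementSibling();
            String country = (next == null) ? "" : next.select("img").attr("alt");
            rows.add(String.format("country : %s, university : %s", country, university));
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        String url = "http://www.4icu.org/reviews/index2.htm";
        // The User-Agent is set on Jsoup's own connection, the one that performs the request
        Document doc = Jsoup.connect(url)
                .userAgent(USER_AGENT)
                .timeout(10000)
                .get();
        for (String row : extractRows(doc)) {
            System.out.println(row);
        }
    }
}
```

Because the extraction lives in extractRows, the same method also works on a document parsed from a saved file or from an HTML string.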

2 Comments

Thanks. The program will run multiple times, as there are various index numbers to change in the HTTP link. I'm just curious how I can grab only the country name from alt="United States" with Java. Thanks.
Thanks for the help. However, when I put the link 4icu.org/reviews/index2.htm in place of A-Z list of 11930 World Colleges & Universities.html, it gives me Exception in thread "main" java.io.FileNotFoundException: www.4icu.org\reviews\index2.htm. I have modified my question to make it clearer.
