
Edit: I have apparently solved the problem by forcing the code to retry fetching the HTML. The issue is that, at random, the HTML is not retrieved. To force a retry I added:

                int intento = 0;

                while (document == null) {
                    intento++;
                    System.out.println("Attempt number: " + intento);
                    document = getHtmlDocument(urlPage);
                }
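Note that this loop retries forever if the URL never comes back. A safer variant caps the number of attempts and waits between tries. Below is a minimal sketch of that idea; the generic helper `RetryFetch.withRetry`, the attempt count, and the sleep interval are my own choices, not part of the original code:

```java
import java.util.function.Supplier;

public class RetryFetch {
    // Retries a fetch up to maxAttempts times, sleeping between tries.
    // Returns null if every attempt fails, so the caller can decide
    // whether to skip the URL or abort.
    public static <T> T withRetry(Supplier<T> fetch, int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            T result = fetch.get();
            if (result != null) {
                return result;
            }
            System.out.println("Attempt " + attempt + " failed, retrying...");
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
        return null; // all attempts exhausted
    }
}
```

Usage with the method from the question would look like `Document document = RetryFetch.withRetry(() -> getHtmlDocument(urlPage), 5, 2000);`, after which a `null` document means the URL should be skipped rather than retried forever.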

I am experiencing this issue at random. Sometimes fetching a URL fails, and once the timeout is reached the program execution stops. The code:

public static int getStatusConnectionCode(String url) {

    Response response = null;

    try {
        response = Jsoup.connect(url).userAgent("Mozilla/5.0").timeout(100000).ignoreHttpErrors(true).execute();
    } catch (IOException ex) {
        System.out.println("Exception while getting the status code: " + ex.getMessage());
    }
    // Guard against a failed request: calling statusCode() on a null
    // response would throw a NullPointerException.
    return (response != null) ? response.statusCode() : -1;
}

/**
 * Returns a Document object with the HTML content of the page,
 * so that it can be parsed with the JSoup library's methods.
 * @param url
 * @return Document with the HTML
 */
public static Document getHtmlDocument(String url) {

    Document doc = null;

    try {
        doc = Jsoup.connect(url).userAgent("Mozilla/5.0").timeout(100000).get();
    } catch (IOException ex) {
        System.out.println("Exception while getting the page HTML: " + ex.getMessage());
    }

    return doc;

}
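One thing worth noting about the two methods above: they hit the server twice per URL, once for the status code and once for the HTML, which doubles the chance of a random failure. Jsoup's `Connection.Response` exposes both the status code and the body from a single request. A sketch combining them (the class name, the 30-second timeout, and the `null`-on-failure convention are my own assumptions):

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SingleRequestFetch {
    // One request yields both the status code and the parsed document,
    // instead of calling execute() and get() separately.
    public static Document fetchIfOk(String url) {
        try {
            Connection.Response response = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0")
                    .timeout(30_000)          // a long timeout only delays the failure
                    .ignoreHttpErrors(true)   // report non-200 codes instead of throwing
                    .execute();
            if (response.statusCode() != 200) {
                System.out.println("Status " + response.statusCode() + " for " + url);
                return null;
            }
            return response.parse();
        } catch (IOException ex) {
            System.out.println("Could not fetch " + url + ": " + ex.getMessage());
            return null;
        }
    }
}
```

A `null` return then covers both failure modes (network error or non-200 status), so the calling loop only has one case to handle.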

Should I use another method or increase the timeout? The program run takes roughly 10 hours, and the failure is not deterministic: one run fails at URL number 500, another at number 250. This makes no sense to me: if there really were a problem with link number 250, why does a second run fail at, say, link number 450 instead? I considered internet problems, but I have ruled that out.

The solution to a similar question does not solve my problem: Java JSoup error fetching URL

Thanks in advance.

  • It may be that the specific link was down when you tried to access it and up again when you verified it from the browser. It is also possible that the site blocked your request after identifying it as coming from a bot. There can be multiple reasons, and it is hard to be certain why it occurs. My coding advice is to skip such failing links and proceed with the next ones; you can re-run the code later just for those that failed. Commented Jan 23, 2017 at 10:06
  • Have you tried using HttpUrlConnection instead? Commented Jan 23, 2017 at 10:09
  • @PavanKumar It could be that, but it's a pity when it happens. In my code, if the connection code is not "200" I just write "-" for the prices (the program parses prices), and in some cases the URL won't exist. Commented Jan 23, 2017 at 11:20
  • @SteveSmith Where should I use HttpUrlConnection? Commented Jan 23, 2017 at 11:21
  • @JetLagFox HttpUrlConnection will replace both your methods since it will get a response code and can download the response. You will then need to pass the response to Jsoup. Google "HttpUrlConnection example". Commented Jan 23, 2017 at 11:29
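The suggestion in the comments, using `HttpURLConnection` to get the status code and body in one request and handing the body to Jsoup, could look something like the sketch below. The class name, the specific timeout values, and the hard-coded UTF-8 charset are assumptions for illustration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HttpFetch {
    // Single request: read the status code and, if it is 200,
    // parse the response body with Jsoup.
    public static Document fetch(String pageUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        conn.setConnectTimeout(10_000);  // fail fast instead of hanging for 100 s
        conn.setReadTimeout(30_000);
        try {
            int status = conn.getResponseCode();
            if (status != 200) {
                System.out.println("Status " + status + " for " + pageUrl);
                return null;
            }
            try (InputStream in = conn.getInputStream()) {
                // Assumes UTF-8; a fuller version would read the
                // charset from the Content-Type header.
                return Jsoup.parse(in, "UTF-8", pageUrl);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

The separate connect and read timeouts mean a dead host is detected in seconds rather than waiting out one long combined timeout.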

