Duplicate:

How do you Programmatically Download a Webpage in Java?

How to fetch html in Java

I'm developing an application in which the user enters the URL of a website, and the application then has to analyze the page at that URL.

How can I access the page's HTML using Java? Do I need to use HttpRequest? How does that work?

Thanks.

5 Answers

5

URLConnection is fine for simple cases. When things like redirects are involved, you are better off using Apache's HttpClient.

4

You could just use a URLConnection. See this Java Tutorial from Sun
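As a minimal sketch of that approach, the following opens a URLConnection and reads the response into a String. The class name FetchPage, the method name fetch, the timeout values, and the assumption that the page is UTF-8 encoded are all illustrative choices, not part of the tutorial.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for illustration only.
class FetchPage {

  /** Reads the whole response body as text, assuming it is UTF-8 encoded. */
  static String fetch(String address) throws IOException {
    URLConnection connection = URI.create(address).toURL().openConnection();
    connection.setConnectTimeout(5000); // milliseconds; arbitrary example value
    connection.setReadTimeout(5000);    // milliseconds; arbitrary example value

    StringBuilder html = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        html.append(line).append('\n'); // line endings are normalized to \n
      }
    }
    return html.toString();
  }

  public static void main(String[] args) throws IOException {
    String address = args.length > 0 ? args[0] : "http://stackoverflow.com";
    System.out.println(fetch(address));
  }
}
```

Because URLConnection is protocol-agnostic, the same method also works with a file: URL, which is handy for trying it out without network access.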

1

This code downloads data from a URL, treating it as binary content:

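import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

public class Download {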

  private static void download(URL input, File output)
      throws IOException {
    InputStream in = input.openStream();
    try {
      OutputStream out = new FileOutputStream(output);
      try {
        copy(in, out);
      } finally {
        out.close();
      }
    } finally {
      in.close();
    }
  }

  private static void copy(InputStream in, OutputStream out)
      throws IOException {
    byte[] buffer = new byte[1024];
    while (true) {
      int readCount = in.read(buffer);
      if (readCount == -1) {
        break;
      }
      out.write(buffer, 0, readCount);
    }
  }

  public static void main(String[] args) {
    try {
      URL url = new URL("http://stackoverflow.com");
      File file = new File("data");
      download(url, file);
    } catch (IOException e) {
      e.printStackTrace();
    }
  }

}

The downside of this approach is that it ignores any metadata, such as the Content-Type header, which you would get from using HttpURLConnection (or a more sophisticated API, like the Apache one).
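As a rough sketch of what using that metadata could look like, the following reads the Content-Type through URLConnection and uses its charset parameter to decode the body as text. The class and method names are made up for illustration, and falling back to UTF-8 when no usable charset is declared is an assumption, not something the header guarantees.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for illustration only.
class ContentTypeAwareFetch {

  /** Extracts the charset parameter of a Content-Type value; UTF-8 is an assumed default. */
  static Charset charsetOf(String contentType) {
    if (contentType != null) {
      for (String part : contentType.split(";")) {
        String trimmed = part.trim();
        if (trimmed.regionMatches(true, 0, "charset=", 0, 8)) {
          String name = trimmed.substring(8).replace("\"", "").trim();
          try {
            return Charset.forName(name);
          } catch (IllegalArgumentException e) {
            break; // illegal or unsupported charset name: use the default
          }
        }
      }
    }
    return StandardCharsets.UTF_8;
  }

  /** Downloads the resource and decodes it using the charset the server declared. */
  static String fetch(String address) throws IOException {
    URLConnection connection = URI.create(address).toURL().openConnection();
    String contentType = connection.getContentType(); // may be null
    Charset charset = charsetOf(contentType);

    try (InputStream in = connection.getInputStream()) {
      ByteArrayOutputStream body = new ByteArrayOutputStream();
      byte[] buffer = new byte[8192];
      int readCount;
      while ((readCount = in.read(buffer)) != -1) {
        body.write(buffer, 0, readCount);
      }
      return new String(body.toByteArray(), charset);
    }
  }
}
```

Note that pages can also declare their encoding in a meta tag inside the HTML itself, which this sketch does not look at.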

To parse the HTML data, you'll need either a specialized HTML parser that can handle poorly formed markup, or a tool that tidies the markup first so that it can then be parsed with an XML parser.
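To illustrate the first option without any third-party dependency, here is a small sketch that uses the lenient HTML parser bundled with the JDK (in the javax.swing.text.html packages) to collect the href values of links. The class and method names are hypothetical, and this parser targets an old HTML dialect, so a dedicated HTML parsing library is likely a better fit for modern pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named for illustration only.
class LinkExtractor {

  /** Returns the href attribute of every <a> start tag, in document order. */
  static List<String> extractLinks(String html) throws IOException {
    final List<String> links = new ArrayList<String>();

    HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
      @Override
      public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        if (tag == HTML.Tag.A) {
          Object href = attributes.getAttribute(HTML.Attribute.HREF);
          if (href != null) {
            links.add(href.toString());
          }
        }
      }
    };

    // The third argument tells the parser to ignore charset declarations in the
    // markup, since the input here is already decoded text.
    new ParserDelegator().parse(new StringReader(html), callback, true);
    return links;
  }

  public static void main(String[] args) throws IOException {
    // Deliberately sloppy markup: unclosed <p>, unclosed second <a>, no </html>.
    String html = "<html><body><p>Unclosed paragraph <a href=\"a.html\">one</a>"
        + "<br><a href=\"b.html\">two</body>";
    System.out.println(extractLinks(html));
  }
}
```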

0

You can use java.net.URL and then open an input stream to read the HTML from the server. See the example here.

0

Funnily enough, I wrote a utility method that does just that the other week:

/**
 * Retrieves the file specified by <code>fileURL</code> and writes it to
 * <code>out</code>.
 * <p>
 * Does not close <code>out</code>, but does flush.
 * @param fileURL The URL of the file.
 * @param out An output stream to capture the contents of the file
 * @param batchWriteSize The number of bytes to write to <code>out</code>
 *                       at once (larger files than this will be written
 *                       in several batches)
 * @throws IOException If call to web server fails
 * @throws FileNotFoundException If the call to the web server does not
 *                               return status code 200. 
 */
public static void getFileStream(String fileURL, OutputStream out, int batchWriteSize)
                            throws IOException{
    GetMethod get = new GetMethod(fileURL);
    HttpClient client = new HttpClient();
    HttpClientParams params = client.getParams();
    params.setSoTimeout(2000);
    client.setParams(params);
    try {
        try {
            client.executeMethod(get);
        } catch (ConnectException e) {
            // Add some context to the exception and rethrow
            throw new IOException("ConnectException trying to GET " +
                    fileURL, e);
        }

        if (get.getStatusCode() != 200) {
            throw new FileNotFoundException(
                    "Server returned " + get.getStatusCode());
        }

        // Get the input stream
        BufferedInputStream bis =
            new BufferedInputStream(get.getResponseBodyAsStream());
        try {
            // Read the file and stream it out in batches
            byte[] b = new byte[batchWriteSize];
            int bytesRead = bis.read(b, 0, batchWriteSize);
            while (bytesRead != -1) {
                out.write(b, 0, bytesRead);
                bytesRead = bis.read(b, 0, batchWriteSize);
            }
            out.flush();
        } finally {
            bis.close(); // Release the input stream.
        }
    } finally {
        // Hand the connection back, even if the request or the copy failed
        get.releaseConnection();
    }
}

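This uses the Apache Commons HttpClient library, in addition to the standard java.io and java.net classes, i.e.

import java.io.BufferedInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.OutputStream;
import java.net.ConnectException;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpClientParams;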
