46

I want to parse a simple web site and scrape information from that web site.

I used to parse XML files with DocumentBuilderFactory. I tried to do the same thing for the HTML file, but it always gets stuck in an infinite loop.

    URL url = new URL("http://www.deneme.com");
    URLConnection uc = url.openConnection();

    InputStreamReader input = new InputStreamReader(uc.getInputStream());
    BufferedReader in = new BufferedReader(input);
    String inputLine;

    FileWriter outFile = new FileWriter("orhancan");
    PrintWriter out = new PrintWriter(outFile);

    while ((inputLine = in.readLine()) != null) {
        out.println(inputLine);
    }

    in.close();
    out.close();

    File fXmlFile = new File("orhancan");
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    Document doc = dBuilder.parse(fXmlFile);


    NodeList prelist = doc.getElementsByTagName("body");
    System.out.println(prelist.getLength());

What is the problem? Or is there an easier way to scrape data from a web site for a given HTML tag?

1
  • Firstly, you can use a String instead of a File. Where exactly does it enter an infinite loop? Maybe the problem is the input stream from the URL, which doesn't seem to end. Commented Jan 30, 2012 at 22:19
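
A minimal sketch of what the comment suggests: read the page into a String instead of writing it to a temporary file. The names `PageText` and `readAll` are made up for illustration. In the question's code the reader passed in would presumably be `new InputStreamReader(uc.getInputStream())`; a `StringReader` stands in for it here so the example runs without a network connection.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of the original question.
final class PageText {

    // Reads everything from the given reader into one String,
    // normalizing each line ending to '\n'. The reader is closed afterwards.
    static String readAll(Reader source) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the URL's input stream.
        String html = readAll(new StringReader("<html>\n<body>hi</body>\n</html>"));
        System.out.print(html);
    }
}
```

The resulting String can then be handed to whichever parser is used, so no file named "orhancan" is needed.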

3 Answers

91

There is a much easier way to do this. I suggest using JSoup. With JSoup you can do things like:

    Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
    Elements newsHeadlines = doc.select("#mp-itn b a");

Or if you want the body:

    Elements body = doc.select("body");

Or if you want all links:

    Elements links = doc.select("body a");

You no longer need to open connections or handle streams yourself. Simple. If you have ever used jQuery, the selector syntax will feel very familiar.

6 Comments

First, thank you! But what is #mp-itn b a?
#mp-itn is just the container with id="mp-itn"; "b a" then matches the links inside bold text within that container.
See my edit. Understanding how CSS selectors work would really help you.
OK, jsoup.org/cookbook/extracting-data/dom-navigation was exactly what I needed, thanks.
A library is a better choice than raw code. I would go for it.
22

Definitely JSoup is the answer. ;-)

5

HTML is not always valid, well-formed XML, so an XML parser can fail on it. Try a dedicated HTML parser instead of an XML parser. There are a couple of different ones available:

http://java-source.net/open-source/html-parsers
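
To illustrate the point, here is a small sketch using only the JDK's built-in XML parser. It shows the same `DocumentBuilder` from the question rejecting markup that browsers accept without complaint. The class name `HtmlVsXmlDemo`, the method `parsesAsXml`, and the sample strings are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not taken from any answer above.
final class HtmlVsXmlDemo {

    // Returns true if the markup is well-formed XML, false if the parser rejects it.
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Raised for markup that is not well-formed, e.g. an unclosed <br>.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Typical HTML: <br> and <p> are never closed.
        String html = "<html><body><p>Hello<br>world</body></html>";
        // The same content written as well-formed XML.
        String xhtml = "<html><body><p>Hello<br/>world</p></body></html>";

        System.out.println("HTML parses as XML:  " + parsesAsXml(html));
        System.out.println("XHTML parses as XML: " + parsesAsXml(xhtml));
    }
}
```

An HTML parser tolerates unclosed tags like these and still builds a document tree, which is why it is the better fit for pages fetched from the web.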
