Parse Web Site HTML with JAVA [duplicate]

Question

I want to parse a simple web site and scrape information from that web site.

I used to parse XML files with DocumentBuilderFactory. I tried to do the same thing for the HTML file, but it always gets into an infinite loop.

    URL url = new URL("http://www.deneme.com");
    URLConnection uc = url.openConnection();

    InputStreamReader input = new InputStreamReader(uc.getInputStream());
    BufferedReader in = new BufferedReader(input);
    String inputLine;

    FileWriter outFile = new FileWriter("orhancan");
    PrintWriter out = new PrintWriter(outFile);

    while ((inputLine = in.readLine()) != null) {
        out.println(inputLine);
    }

    in.close();
    out.close();

    File fXmlFile = new File("orhancan");
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    Document doc = dBuilder.parse(fXmlFile);

    NodeList prelist = doc.getElementsByTagName("body");
    System.out.println(prelist.getLength());

What is the problem? Or is there an easier way to scrape data from a web site for a given HTML tag?

First, you can use a String instead of a File. Where does it enter an infinite loop? Maybe the problem is that the input stream from the URL never seems to end.
– Horatiu Jeflea, Commented Jan 30, 2012 at 22:19
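
The suggestion in this comment, keeping the page in a String instead of writing it to a file first, might look like the sketch below. `PageReader` and `readAll` are hypothetical names that are not part of the original post, and a `StringReader` stands in for the URL stream so the example runs offline.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class PageReader {
    // Reads everything from the reader into one String, ending each line with '\n'.
    static String readAll(Reader source) throws IOException {
        StringBuilder page = new StringBuilder();
        // try-with-resources closes the reader even if readLine() throws
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: in real use this would be new InputStreamReader(uc.getInputStream())
        String html = readAll(new StringReader("<html>\n<body>hello</body>\n</html>"));
        System.out.print(html);
    }
}
```

The resulting String can then be handed to whichever parser is used, with no temporary file involved.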

Amir Raminfar · Accepted Answer · 2014-08-15 00:43:46Z

91

There is a much easier way to do this. I suggest using JSoup. With JSoup you can do things like:

    Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
    Elements newsHeadlines = doc.select("#mp-itn b a");

Or if you want the body:

    Elements body = doc.select("body");

Or if you want all links:

    Elements links = doc.select("body a");

You no longer need to get connections or handle streams. Simple. If you have ever used jQuery then it is very similar to that.
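
To try the selectors without a network connection, a minimal sketch along these lines may help. It assumes the jsoup library is on the classpath and parses an inline string with `Jsoup.parse` instead of calling `Jsoup.connect`. The class name `JsoupSelectSketch`, the helper `textsOf`, and the sample markup are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupSelectSketch {
    // Hypothetical sample page; the unclosed <br> would be rejected by a strict XML parser.
    static final String SAMPLE_HTML =
        "<html><body>"
        + "<div id=\"mp-itn\"><b><a href=\"/wiki/A\">First headline</a></b><br>"
        + "<b><a href=\"/wiki/B\">Second headline</a></b></div>"
        + "<a href=\"/about\">About</a>"
        + "</body></html>";

    // Returns the text of every element matched by the CSS selector.
    static List<String> textsOf(String html, String cssSelector) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element match : doc.select(cssSelector)) {
            texts.add(match.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Links inside <b> inside the element with id="mp-itn"
        System.out.println(textsOf(SAMPLE_HTML, "#mp-itn b a"));
        // Every link anywhere under <body>
        System.out.println(textsOf(SAMPLE_HTML, "body a"));
    }
}
```

The selector `#mp-itn b a` reads right to left as "an `a` inside a `b` inside the element whose id is `mp-itn`", which is the same descendant syntax CSS and jQuery use.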

edited Aug 15, 2014 at 0:43

answered Jan 30, 2012 at 22:14

Amir Raminfar


6 Comments

CanCeylan Over a year ago

First, thank you ! But what is #mp-itn b a ?

Amir Raminfar Over a year ago

#mp-itn is just a container with id="mp-itn"

Amir Raminfar Over a year ago

See my edit. Understanding how css selectors work would really help you.

CanCeylan Over a year ago

OK, jsoup.org/cookbook/extracting-data/dom-navigation this was really what I need, thanks.

Horatiu Jeflea Over a year ago

A library is a better choice than raw code, I would go for it


Diego Palomar · Accepted Answer · 2013-05-08 13:31:07Z

22

Definitely JSoup is the answer. ;-)

answered May 8, 2013 at 13:31

Diego Palomar


Jan · Accepted Answer · 2012-01-30 22:16:33Z

5

HTML is not always valid, well-formed XML. Try a dedicated HTML parser instead of an XML parser. There are a couple of different ones available:

http://java-source.net/open-source/html-parsers
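
The mismatch can be seen with the JDK's own XML parser. The sketch below uses hypothetical names (`XmlVersusHtml`, `isWellFormedXml`) and two made-up snippets: one with a bare `<br>`, which is ordinary HTML, and one with the self-closed `<br/>` that XML requires.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class XmlVersusHtml {
    // Returns true if the markup parses as well-formed XML, false if the XML parser rejects it.
    static boolean isWellFormedXml(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Replace the default error handler, which prints fatal errors to stderr
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Thrown for markup that is not well-formed, such as a tag that is never closed
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>line one<br>line two</p></body></html>";
        String xhtml = "<html><body><p>line one<br/>line two</p></body></html>";
        System.out.println(isWellFormedXml(html));  // false: <br> is never closed
        System.out.println(isWellFormedXml(xhtml)); // true
    }
}
```

An HTML parser accepts both inputs, which is why it is the safer choice for pages you do not control.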

answered Jan 30, 2012 at 22:16

Jan
