
My Java program stores the content of a web page in the string sb, and I want to parse that string into an HTML DOM. How do I do that?

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class Scraper {
    public static void main(String[] args) throws IOException, SAXException {
        try {
            URL u = new URL("https://twitter.com/ssjsatish");
            URLConnection cn = u.openConnection();
            System.out.println("content type: " + cn.getContentType());

            // getContentLengthLong() returns -1 when the length is unknown,
            // so only 0 means the response body is definitely empty.
            long l = cn.getContentLengthLong();
            StringBuilder sb = new StringBuilder();
            if (l != 0) {
                // Read the response as characters (assuming UTF-8) rather than
                // casting raw bytes to char, which would break multi-byte characters.
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(cn.getInputStream(), StandardCharsets.UTF_8))) {
                    int c;
                    while ((c = reader.read()) != -1) {
                        sb.append((char) c);
                    }
                }
                System.out.println(sb);

                // Try to parse the string with the standard XML DocumentBuilder.
                DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
                InputSource source = new InputSource(new StringReader(sb.toString()));
                Document doc = db.parse(source);
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        }
    }
}
    It is important to use consistent style when writing code. I have edited your code to use one of the several styles that were present. Also, I moved is.close() earlier so you don't leave your connection open longer than absolutely necessary. Commented Nov 29, 2014 at 18:13

1 Answer


You don't want to use an XML parser to parse HTML, because not all valid HTML is valid XML. I would recommend using a library specifically designed to parse "real-world" HTML; for example, I have had good results with jsoup, but there are others. Another advantage of this sort of library is that its API is designed with web scraping in mind and provides much simpler ways of accessing data in the HTML document.
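For a rough idea of what that looks like with jsoup (a minimal sketch; the HTML literal here just stands in for the string sb from the question):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExample {
    public static void main(String[] args) {
        // In the question's code this would be sb.toString();
        // a small literal keeps the example self-contained.
        String html = "<html><head><title>Demo</title></head>"
                + "<body><p>Hello <a href='https://example.com'>link</a></p></body></html>";

        // Jsoup.parse() builds an HTML DOM even from messy, non-XML markup.
        Document doc = Jsoup.parse(html);

        System.out.println(doc.title());              // "Demo"

        // CSS-style selectors replace manual DOM traversal.
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
    }
}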


5 Comments

That's what I am trying to build, but I don't want to use a library like jsoup.org or jaunt-api.com. I have tried jsoup.
@SatishPatel Why not? If you won't use a library, you will probably have to write one yourself.
@ColinvH Yeah, I want to write it myself. I want to learn how they have done it.
@SatishPatel Then you should look at their source code: github.com/jhy/jsoup If you want to learn how to build a parser, you should start somewhere simple, like JSON.
Yeah, there's a huge amount of corner-case stuff in parsing HTML. If you're interested in the String -> Tree part (parsing), you probably want to start with something a lot simpler. Parsing is a really neat topic, but HTML is not a good way to learn it. If you want to learn what they do with the tree (for example, searching it for elements that match a selector), the jsoup source is a good place to start, especially since it's self-contained. A toy parser illustrating the String -> Tree idea follows these comments.
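To make the String -> Tree idea concrete on something far simpler than HTML, here is a toy recursive-descent parser for nested integer lists like "[1,[2,3],4]". It only illustrates the general technique; it is not how jsoup or a real JSON library is implemented.

import java.util.ArrayList;
import java.util.List;

// Grammar:  value := list | number
//           list  := '[' (value (',' value)*)? ']'
public class TinyListParser {
    private final String src;
    private int pos;

    private TinyListParser(String src) {
        this.src = src;
    }

    public static Object parse(String src) {
        TinyListParser p = new TinyListParser(src);
        Object value = p.parseValue();
        if (p.pos != src.length()) {
            throw new IllegalArgumentException("Trailing input at " + p.pos);
        }
        return value;
    }

    private Object parseValue() {
        return peek() == '[' ? parseList() : parseNumber();
    }

    private List<Object> parseList() {
        expect('[');
        List<Object> items = new ArrayList<>();
        if (peek() != ']') {
            items.add(parseValue());
            while (peek() == ',') {
                expect(',');
                items.add(parseValue());
            }
        }
        expect(']');
        return items;
    }

    private Integer parseNumber() {
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("Expected a digit at " + start);
        }
        return Integer.parseInt(src.substring(start, pos));
    }

    private char peek() {
        return pos < src.length() ? src.charAt(pos) : '\0';
    }

    private void expect(char c) {
        if (peek() != c) {
            throw new IllegalArgumentException("Expected '" + c + "' at " + pos);
        }
        pos++;
    }

    public static void main(String[] args) {
        System.out.println(parse("[1,[2,3],4]"));   // prints [1, [2, 3], 4]
    }
}

An HTML parser has the same overall shape (tokenize, then build a tree), but it also has to handle error recovery, implied and unclosed tags, entities, and scripts, which is why the comments above recommend learning on something smaller first.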
