
My Java program stores the content of a web page in the string sb, and I want to parse that string into an HTML DOM. How do I do that?

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class Scraper {
    public static void main(String[] args) throws IOException, SAXException {
        try {
            URL u = new URL("https://twitter.com/ssjsatish");
            URLConnection cn = u.openConnection();
            System.out.println("content type: " + cn.getContentType());

            // getContentLengthLong() returns -1 when the length is unknown,
            // so only 0 means the response body is definitely empty.
            long l = cn.getContentLengthLong();
            StringBuilder sb = new StringBuilder();
            if (l != 0) {
                // Read the response as characters (assuming UTF-8) rather than
                // casting raw bytes to char, which would break multi-byte characters.
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(cn.getInputStream(), StandardCharsets.UTF_8))) {
                    int c;
                    while ((c = reader.read()) != -1) {
                        sb.append((char) c);
                    }
                }
                System.out.println(sb);

                // Try to parse the string with the standard XML DocumentBuilder.
                DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
                InputSource source = new InputSource(new StringReader(sb.toString()));
                Document doc = db.parse(source);
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        }
    }
}
    It is important to use consistent style when writing code. I have edited your code to use one of the several styles that were present. Also, I moved is.close() earlier so you don't leave your connection open longer than absolutely necessary. Commented Nov 29, 2014 at 18:13

1 Answer


You don't want to use an XML parser to parse HTML, because not all valid HTML is valid XML. I would recommend using a library specifically designed to parse "real-world" HTML; for example, I have had good results with jsoup, but there are others. Another advantage of this sort of library is that its API is designed with web scraping in mind and provides much simpler ways of accessing data in the HTML document.
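For a rough idea of what that looks like with jsoup (a minimal sketch; the HTML literal here just stands in for the string sb from the question):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExample {
    public static void main(String[] args) {
        // In the question's code this would be sb.toString();
        // a small literal keeps the example self-contained.
        String html = "<html><head><title>Demo</title></head>"
                + "<body><p>Hello <a href='https://example.com'>link</a></p></body></html>";

        // Jsoup.parse() builds an HTML DOM even from messy, non-XML markup.
        Document doc = Jsoup.parse(html);

        System.out.println(doc.title());              // "Demo"

        // CSS-style selectors replace manual DOM traversal.
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
    }
}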


5 Comments

That's what I am trying to build, but I don't want to use a library like jsoup.org or jaunt-api.com. I have tried jsoup.
@SatishPatel Why not? If you won't use a library, you will probably have to write one yourself.
@ColinvH Yeah, I want to write it myself. I want to learn how they have done it.
@SatishPatel Then you should look at their source code: github.com/jhy/jsoup If you want to learn how to build a parser, you should start somewhere simple, like JSON.
Yeah, there's a huge amount of corner-case stuff in parsing HTML. If you're interested in the String -> Tree part (parsing), you probably want to start with something a lot simpler. Parsing is a really neat topic, but HTML is not a good way to learn it. If you want to learn what they do with the tree (for example, searching it for elements that match a selector), the jsoup source is a good place to start, especially since it's self-contained. A toy parser illustrating the String -> Tree idea follows these comments.
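To make the String -> Tree idea concrete on something far simpler than HTML, here is a toy recursive-descent parser for nested integer lists like "[1,[2,3],4]". It only illustrates the general technique; it is not how jsoup or a real JSON library is implemented.

import java.util.ArrayList;
import java.util.List;

// Grammar:  value := list | number
//           list  := '[' (value (',' value)*)? ']'
public class TinyListParser {
    private final String src;
    private int pos;

    private TinyListParser(String src) {
        this.src = src;
    }

    public static Object parse(String src) {
        TinyListParser p = new TinyListParser(src);
        Object value = p.parseValue();
        if (p.pos != src.length()) {
            throw new IllegalArgumentException("Trailing input at " + p.pos);
        }
        return value;
    }

    private Object parseValue() {
        return peek() == '[' ? parseList() : parseNumber();
    }

    private List<Object> parseList() {
        expect('[');
        List<Object> items = new ArrayList<>();
        if (peek() != ']') {
            items.add(parseValue());
            while (peek() == ',') {
                expect(',');
                items.add(parseValue());
            }
        }
        expect(']');
        return items;
    }

    private Integer parseNumber() {
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("Expected a digit at " + start);
        }
        return Integer.parseInt(src.substring(start, pos));
    }

    private char peek() {
        return pos < src.length() ? src.charAt(pos) : '\0';
    }

    private void expect(char c) {
        if (peek() != c) {
            throw new IllegalArgumentException("Expected '" + c + "' at " + pos);
        }
        pos++;
    }

    public static void main(String[] args) {
        System.out.println(parse("[1,[2,3],4]"));   // prints [1, [2, 3], 4]
    }
}

An HTML parser has the same overall shape (tokenize, then build a tree), but it also has to handle error recovery, implied and unclosed tags, entities, and scripts, which is why the comments above recommend learning on something smaller first.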
