Convert html String to org.w3c.dom.Document in Java

Question

To convert an HTML String to an org.w3c.dom.Document, I'm using jtidy-r938.jar. Here is my code:

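import java.io.ByteArrayInputStream;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

public static Document getDoc(String html) {
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setWraplen(Integer.MAX_VALUE);
    // tidy.setPrintBodyOnly(true);
    tidy.setXmlOut(false);
    tidy.setShowErrors(0);
    tidy.setShowWarnings(false);
    // tidy.setForceOutput(true);
    tidy.setQuiet(true);
    // Send JTidy's diagnostics to a throwaway writer instead of System.err.
    Writer out = new StringWriter();
    PrintWriter dummyOut = new PrintWriter(out);
    tidy.setErrout(dummyOut);
    tidy.setSmartIndent(true);
    // Encode with the charset declared in setInputEncoding above;
    // the no-argument getBytes() would use the platform default instead.
    ByteArrayInputStream inputStream =
            new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
    Document doc = tidy.parseDOM(inputStream, null);
    return doc;
}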

But sometimes the library works incorrectly and some tags are lost.

Can you recommend a good open-source library for this task?

Thanks very much!

davidxxx · Accepted Answer · 2015-06-07 11:14:12Z


You don't say why the library sometimes gives the wrong result. Nevertheless, I work regularly with HTML files that I have to extract data from, and the main problem I run into is that some tags are invalid, for example because they are not closed. The best solution I have found for this is the HtmlCleaner API (HtmlCleaner website).

It lets you make your HTML well formed. Converting it to a W3C Document or another strict format is then easier.

With HtmlCleaner, you could do something like this:

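// Requires org.htmlcleaner.{HtmlCleaner, TagNode, DomSerializer} and org.w3c.dom.Document.
HtmlCleaner cleaner = new HtmlCleaner();
// Parse the (possibly malformed) HTML into HtmlCleaner's own tree.
TagNode node = cleaner.clean(html);
DomSerializer ser = new DomSerializer(cleaner.getProperties());
// createDOM declares ParserConfigurationException.
Document myW3cDoc = ser.createDOM(node);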

Here DomSerializer is the class from HtmlCleaner.
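To illustrate why well-formed markup makes the conversion easier: once the HTML has been cleaned into well-formed XHTML, even the JDK's built-in XML parser can build an org.w3c.dom.Document from it. The sketch below is not part of HtmlCleaner; it assumes the input string is already well formed and carries no DOCTYPE, and the names WellFormedToDom and parse are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class WellFormedToDom {
    // Hypothetical helper: parses markup that is ALREADY well formed.
    // Malformed input (e.g. an unclosed tag) makes parse() throw a SAXException.
    public static Document parse(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><body><p>Hello</p></body></html>");
        // Prints the name of the root element: html
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Real-world pages rarely pass this strict parser as they are, which is why a cleaning step such as HtmlCleaner comes first.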

edited Jun 7, 2015 at 11:14

answered Jun 7, 2015 at 11:02

davidxxx


3 Comments

Minh Le Over a year ago

Hi, I tried HtmlCleaner. It converts to a Document well, but I can't get a Node via XPath in this case. Please tell me how to do it. With jTidy I use this code:

public static Node getNodeViaXpath(Node doc, String xpath) throws XPathExpressionException {
    XPathFactory xPathFactory = XPathFactory.newInstance();
    XPath xPath = xPathFactory.newXPath();
    String expression = xpath;
    XPathExpression xPathExpression = xPath.compile(expression);
    Object result = xPathExpression.evaluate(doc, XPathConstants.NODE);
    Node node = (Node) result;
    return node;
}

davidxxx Over a year ago

Hi, I would like to help, but could you explain in more detail what problem you ran into with the code you posted?

Minh Le Over a year ago

Here is an example website: bongdaso.com/…. I want to get the tag <a href="Indonesia-Philippines-2015_06_09-_Fix_43073.aspx.aspx?LeagueID=42&SeasonID=237&Data=odds"><h4>Tỷ lệ cược</h4></a> via XPath. The tag is lost if I use jTidy. How can I get the tag via XPath with HtmlCleaner?
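The standard javax.xml.xpath API works on any org.w3c.dom.Node, whichever library produced it, so a helper like getNodeViaXpath from the comment above should in principle work on a Document built by HtmlCleaner too, provided the element survived cleaning. A minimal sketch of the idea follows. It assumes a small hand-written, well-formed snippet in place of the real page (which is not reproduced here), a placeholder href, and an illustrative expression //a[h4] that selects a link wrapping an h4. The class name XPathOnDom is made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XPathOnDom {
    // Same idea as the helper in the comment: first matching node, or null if none.
    public static Node getNodeViaXpath(Node doc, String xpath) throws XPathExpressionException {
        XPath xPath = XPathFactory.newInstance().newXPath();
        return (Node) xPath.compile(xpath).evaluate(doc, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        // Hand-written stand-in for cleaned markup; the href is a placeholder.
        String xhtml = "<html><body><a href=\"odds.aspx\"><h4>Odds</h4></a></body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        Node link = getNodeViaXpath(doc, "//a[h4]");
        // Prints the href attribute of the matched link: odds.aspx
        System.out.println(link.getAttributes().getNamedItem("href").getNodeValue());
    }
}
```

If an expression that looks right matches nothing, two things worth checking are whether the element is actually present in the generated Document and whether the elements carry an XML namespace, since namespaced elements are not matched by plain unprefixed names in XPath 1.0.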
