To convert an HTML string to an org.w3c.dom.Document, I'm using jtidy-r938.jar. Here is my code:

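import java.io.ByteArrayInputStream;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

public static Document getDoc(String html) {
        Tidy tidy = new Tidy();
        tidy.setInputEncoding("UTF-8");
        tidy.setOutputEncoding("UTF-8");
        tidy.setWraplen(Integer.MAX_VALUE);
        // tidy.setPrintBodyOnly(true);
        tidy.setXmlOut(false);
        // Silence all diagnostics
        tidy.setShowErrors(0);
        tidy.setShowWarnings(false);
        // tidy.setForceOutput(true);
        tidy.setQuiet(true);
        // Send any remaining error output to a throwaway buffer
        PrintWriter dummyOut = new PrintWriter(new StringWriter());
        tidy.setErrout(dummyOut);
        tidy.setSmartIndent(true);
        // Encode explicitly as UTF-8 so the bytes match the input encoding set above
        ByteArrayInputStream inputStream = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        // A null output stream means: build the DOM only, write nothing
        return tidy.parseDOM(inputStream, null);
    }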

But sometimes the library works incorrectly: some tags are lost.

Can you recommend a good open-source library for this task?

Thanks very much!

1 Answer

You don't say why the library sometimes gives a bad result. Nevertheless, I regularly work with HTML files that I have to extract data from, and the main problem I run into is that some tags are invalid, for example because they are never closed. The best solution I have found is the HtmlCleaner API (HtmlCleaner website).

It lets you make your HTML well formed. After that, transforming it into a W3C document or another strict format is easier.

With HtmlCleaner, you could do something like this:

HtmlCleaner cleaner = new HtmlCleaner();
TagNode node = cleaner.clean(html);
DomSerializer ser = new DomSerializer(cleaner.getProperties());
Document myW3cDoc = ser.createDOM(node);

The DomSerializer used here is the one from HtmlCleaner (org.htmlcleaner.DomSerializer).
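To illustrate the second step with nothing but the JDK, here is a minimal sketch of what becomes possible once the markup is well formed. It does not use HtmlCleaner at all: the input string is a hand-written stand-in for cleaned output, and the class name WellFormedHtmlToDom and the helper names parse and selectNode are made up for this example. It assumes the input really is well-formed XML; raw, broken HTML would make the parser throw.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class WellFormedHtmlToDom {

    /** Parses markup that is already well formed into a W3C DOM Document. */
    public static Document parse(String wellFormedHtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Keep namespace awareness off so plain XPath names such as //a match.
        factory.setNamespaceAware(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(wellFormedHtml)));
    }

    /** Returns the first node matching the XPath expression, or null if none matches. */
    public static Node selectNode(Node context, String expression) throws Exception {
        XPath xPath = XPathFactory.newInstance().newXPath();
        return (Node) xPath.evaluate(expression, context, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample input: a hand-written, well-formed fragment.
        String html = "<html><body>"
                + "<a href=\"odds.aspx?LeagueID=42&amp;SeasonID=237\"><h4>Odds</h4></a>"
                + "</body></html>";
        Document doc = parse(html);
        // Select the first <a> element that has an <h4> child.
        Node link = selectNode(doc, "//a[h4]");
        System.out.println(link.getAttributes().getNamedItem("href").getNodeValue());
        System.out.println(link.getTextContent());
    }
}
```

The same selectNode call should work on any org.w3c.dom.Document, whichever library produced it, as long as the element names in the expression match how that document was built (namespaces included).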

3 Comments

Hi, I tried HtmlCleaner. It converts to a Document fine, but I can't use XPath to get a Node in this case. Please tell me how to do it. With jTidy I use this code: public static Node getNodeViaXpath(Node doc, String xpath) throws XPathExpressionException { XPathFactory xPathFactory = XPathFactory.newInstance(); XPath xPath = xPathFactory.newXPath(); XPathExpression xPathExpression = xPath.compile(xpath); return (Node) xPathExpression.evaluate(doc, XPathConstants.NODE); }
Hi, I would like to help, but could you explain exactly what problem you encountered with the code you posted?
Here is an example website: bongdaso.com/…. I want to get the tag <a href="Indonesia-Philippines-2015_06_09-_Fix_43073.aspx.aspx?LeagueID=42&amp;SeasonID=237&amp;Data=odds"><h4>Tỷ lệ cược</h4></a> via XPath. The tag is lost if I use jTidy. How can I get the tag via XPath with HtmlCleaner?
