I've been learning about iText and its beauty for the past few days.

I managed to convert HTML source code to PDF successfully. However, I've been wondering whether it's possible to convert broken HTML (missing tags, etc.) to PDF without XMLWorker throwing an exception, just like HTMLWorker used to do. I know XMLWorker is very sensitive and only works with correctly written HTML or (X)HTML, but I am getting the HTML from another party, and it will most likely be broken.

I would like to know if there is a way to just convert what's possible and leave the errors floating around just like a browser would do.

1 Answer

Use TagSoup before passing the broken HTML to iText. It will clean up the broken HTML and return valid X(HT)ML.

TagSoup implements the SAX parser interface. There are some examples of how to use it, but it lacks "real" documentation.

You will probably have to serialize the XML again and dump it to a file to feed it to iText. I don't know iText's interface, so I can't say for sure.

Serializing a SAX stream is possible using XMLWriter. Conveniently, it is already included with TagSoup, so you don't need to add an extra dependency.

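import java.io.StringWriter;
import java.net.URL;
import org.ccil.cowan.tagsoup.Parser;
import org.ccil.cowan.tagsoup.XMLWriter;
import org.xml.sax.InputSource;

// TagSoup's Parser is a SAX XMLReader that tolerates malformed HTML.
final Parser parser = new Parser();
final StringWriter writer = new StringWriter();

// XMLWriter serializes the repaired SAX events as well-formed XML.
parser.setContentHandler(new XMLWriter(writer));
parser.parse(new InputSource(
        new URL("http://oregonstate.edu/instruct/phl302/texts/hobbes/leviathan-c.html")
                .openConnection().getInputStream()));
System.out.println(writer.toString());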

Decide based on iText's API whether to dump writer's output to a file or pass it another way.
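If the HTML arrives as a String rather than from a URL, the same SAX pipeline can read it through a StringReader. The sketch below is only an illustration under stated assumptions: the class and method names (SaxToString, serialize, serializeWithJdkParser) are made up for this example, and it serializes with the JDK's TransformerHandler instead of TagSoup's XMLWriter so that it runs without extra jars. The JDK parser used as a stand-in does not repair broken HTML; passing a TagSoup Parser (which is an XMLReader) to serialize is what would add that tolerance.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper class, not part of TagSoup or iText.
class SaxToString {

    /** Parses markup held in a String with the given SAX reader and serializes the events. */
    static String serialize(XMLReader reader, String markup) throws Exception {
        // Assumes the default JDK TransformerFactory, which supports SAX handlers.
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        // The handler plays the role XMLWriter plays in the TagSoup snippet.
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return out.toString();
    }

    /** Stand-in that uses the strict JDK parser; it accepts only well-formed input. */
    static String serializeWithJdkParser(String markup) throws Exception {
        SAXParserFactory parserFactory = SAXParserFactory.newInstance();
        parserFactory.setNamespaceAware(true);
        XMLReader reader = parserFactory.newSAXParser().getXMLReader();
        return serialize(reader, markup);
    }

    public static void main(String[] args) throws Exception {
        String xhtml = serializeWithJdkParser("<html><body><p>Hello</p></body></html>");
        System.out.println(xhtml);
    }
}
```

With TagSoup on the classpath, calling serialize(new org.ccil.cowan.tagsoup.Parser(), brokenHtml) should give a cleaned-up string in the same way. That string can then be wrapped in a StringReader or written to a file, whichever iText's API expects.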

5 Comments

Just tried JTidy. I got a long list of warnings and errors when I tried to parse my broken html. After this long list there is a message that says "InputStream: Document content looks like HTML 4.01 Transitional 80 warnings, 33 errors were found! This document has errors that must be fixed before using HTML Tidy to generate a tidied up version." Apparently to be able to use JTidy the html has to be somewhat perfect which is not what I'm looking for here.
Aye, sorry, I wrote about the wrong library. I really meant TagSoup, which does a great job; I have never had a document it couldn't tidy. I removed JTidy from my answer and wrote about the correct one. ;)
Well, thanks for your answer. I have yet to find out how to transform HTML from a string to XHTML using TagSoup and, like you mentioned, TagSoup lacks documentation. I'll keep searching, and if I find an answer I will post it here. Unless perhaps you already know a way?
Dumping tagsoup's output is really easy, I added a short example.
Thank you for your help. It worked like a charm. Too bad XMLWorker doesn't handle ftp URLs but only http URLs :(. But all in all, thanks!
