Convert broken HTML to PDF with XMLWorker using Java

Question

I've been learning about iText and its beauty for the past few days.

I managed to convert HTML source code to PDF successfully. However, I've been wondering whether it's possible to convert broken HTML (missing tags, etc.) to PDF without XMLWorker throwing an exception, just as HTMLWorker used to do. I know XMLWorker is very sensitive and only works with correctly written HTML or (X)HTML, but I am getting the HTML from another party, and it will most likely be broken.

I would like to know if there is a way to just convert what's possible and leave the errors floating around just like a browser would do.

Jens Erat · Accepted Answer · 2013-02-12 00:30:51Z

Use TagSoup before passing the broken HTML to iText. It will clean up the broken HTML and return valid X(HT)ML.

TagSoup implements the SAX parser interface. There are some examples on how to use it, but it lacks some "real" documentation.

You will probably have to serialize the XML again and dump it to a file to feed it to iText; I don't know its interface.

Serializing a SAX stream is possible using XMLWriter. By chance it is already included with TagSoup, so you don't need to add an extra dependency.

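import java.io.StringWriter;
import java.net.URL;
import org.ccil.cowan.tagsoup.Parser;
import org.ccil.cowan.tagsoup.XMLWriter;
import org.xml.sax.InputSource;

final Parser parser = new Parser();
final StringWriter writer = new StringWriter();

// XMLWriter serializes the SAX events emitted by TagSoup into well-formed markup.
parser.setContentHandler(new XMLWriter(writer));
parser.parse(new InputSource(
        new URL("http://oregonstate.edu/instruct/phl302/texts/hobbes/leviathan-c.html")
                .openConnection().getInputStream()));
System.out.println(writer.toString());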
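If the HTML is already held in a String rather than fetched from a URL, the same approach should work by wrapping the string in a StringReader. The following is only a sketch: the class name TagSoupStringExample, the method toXhtml, and the sample markup are made up for illustration, and it assumes the TagSoup jar is on the classpath.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;

import org.ccil.cowan.tagsoup.Parser;
import org.ccil.cowan.tagsoup.XMLWriter;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of TagSoup or iText.
class TagSoupStringExample {

    /** Runs TagSoup over an HTML string and returns the serialized markup. */
    static String toXhtml(String brokenHtml) throws IOException, SAXException {
        Parser parser = new Parser();
        StringWriter out = new StringWriter();

        // XMLWriter receives TagSoup's SAX events and writes them out as text.
        parser.setContentHandler(new XMLWriter(out));

        // InputSource accepts a character stream, so a String needs no temp file.
        parser.parse(new InputSource(new StringReader(brokenHtml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample input with unclosed <p> and <b> tags (made up for this sketch).
        String broken = "<p>First paragraph<p>Second <b>bold";
        System.out.println(toXhtml(broken));
    }
}
```

The returned string should contain the closing tags that were missing from the input, which is what XMLWorker needs.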

Decide based on iText's API whether to dump writer's output to a file or pass it another way.
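One way to pass the output along without a temporary file is to keep it in memory and hand XMLWorker a Reader. The sketch below assumes iText 5 with the XMLWorker add-on and TagSoup on the classpath, and uses the XMLWorkerHelper.parseXHtml overload that takes a Reader. The class name BrokenHtmlToPdf, the method convert, and the output file name cleaned.pdf are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.tool.xml.XMLWorkerHelper;
import org.ccil.cowan.tagsoup.Parser;
import org.ccil.cowan.tagsoup.XMLWriter;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of TagSoup or iText.
class BrokenHtmlToPdf {

    /** Cleans the HTML with TagSoup, then renders the result to PDF bytes. */
    static byte[] convert(String brokenHtml)
            throws IOException, SAXException, DocumentException {
        // Step 1: let TagSoup repair the markup and serialize it to a string.
        Parser parser = new Parser();
        StringWriter xhtml = new StringWriter();
        parser.setContentHandler(new XMLWriter(xhtml));
        parser.parse(new InputSource(new StringReader(brokenHtml)));

        // Step 2: feed the repaired markup to XMLWorker through a Reader.
        ByteArrayOutputStream pdf = new ByteArrayOutputStream();
        Document document = new Document();
        PdfWriter pdfWriter = PdfWriter.getInstance(document, pdf);
        document.open();
        XMLWorkerHelper.getInstance()
                .parseXHtml(pdfWriter, document, new StringReader(xhtml.toString()));
        document.close();
        return pdf.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Sample input with unclosed tags (made up for this sketch).
        byte[] pdf = convert("<p>First paragraph<p>Second <b>bold");
        Files.write(Paths.get("cleaned.pdf"), pdf);
    }
}
```

Using a Reader sidesteps character-encoding guesses, since the markup never leaves Java strings. If the cleaned markup is large, writing it to a file and passing an InputStream may be the better fit.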

edited Feb 12, 2013 at 0:30

answered Feb 11, 2013 at 16:32

5 Comments

Y_Y Over a year ago

Just tried JTidy. I got a long list of warnings and errors when I tried to parse my broken html. After this long list there is a message that says "InputStream: Document content looks like HTML 4.01 Transitional 80 warnings, 33 errors were found! This document has errors that must be fixed before using HTML Tidy to generate a tidied up version." Apparently to be able to use JTidy the html has to be somewhat perfect which is not what I'm looking for here.

Jens Erat Over a year ago

Aye, sorry, I wrote about the wrong library. I really meant TagSoup, which does a great job; I have never had a document it couldn't tidy. I removed JTidy from my answer and wrote about the correct one. ;)

Y_Y Over a year ago

Well, thanks for your answer. I have yet to find out how to transform HTML from a string to XHTML using TagSoup and, as you mentioned, TagSoup lacks documentation. I'll keep searching, and if I find an answer I will post it here. Unless perhaps you already know a way?

Jens Erat Over a year ago

Dumping TagSoup's output is really easy; I added a short example.

Y_Y Over a year ago

Thank you for your help. It worked like a charm. Too bad XMLWorker doesn't handle FTP URLs, only HTTP URLs :(. But all in all, thanks!
