Are there any C++ libraries available to read HTML in Linux?
2 Answers
libcurl is your friend for fetching the page, plus tidy (HTML Tidy) if you've got broken HTML to fix.
Edit: Here is the full sequence:
1. Start with the HTML in a file.
2. Run it through tidy, which will clean up the malformed HTML.
3. Apply an XSLT transformation using libxml2/libxslt (http://xmlsoft.org/). You'll need to provide an XSL file that translates your HTML to LaTeX.
4. Process the resulting LaTeX document with latex, by forking out to the latex command.
Alternatively, you could download the source code for LyX and see how they do it (http://www.lyx.org/). Unfortunately, the sequence is too complex to write up as a single example; all I can give you is the outline.
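For what it's worth, the sequence can be roughed out by shelling out to the command-line tools instead of calling the library APIs. This is only a sketch under assumptions that are not part of the answer above: it uses the `tidy`, `xsltproc` (the command-line front end that ships with libxslt), and `latex` executables, assumes all three are on the PATH, and the names `shellQuote`, `Step`, `buildPipeline`, and `runPipeline` are made up for illustration. You still have to write the XSL file yourself.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>
#include <sys/wait.h>  // WIFEXITED / WEXITSTATUS (POSIX, so Linux is fine)

// Wrap a path in single quotes so the shell treats it as one word.
// An embedded single quote is written as '\'' (close, escaped quote, reopen).
std::string shellQuote(const std::string& s) {
    std::string out = "'";
    for (char c : s) {
        if (c == '\'') out += "'\\''";
        else out += c;
    }
    out += "'";
    return out;
}

// One external command plus the highest exit code still counted as success.
struct Step {
    std::string command;
    int maxOkExit;
};

// Build the three commands: HTML -> XHTML (tidy) -> LaTeX (xsltproc) -> latex.
// `base` is a hypothetical name used for the intermediate files.
std::vector<Step> buildPipeline(const std::string& html,
                                const std::string& xsl,
                                const std::string& base) {
    assert(!base.empty());  // precondition: need a name for the intermediates
    const std::string xhtml = base + ".xhtml";
    const std::string tex = base + ".tex";
    return {
        // -asxml emits well-formed XHTML, -numeric avoids named entities,
        // and tidy exits with 1 when it only issued warnings.
        {"tidy -q -asxml -numeric -o " + shellQuote(xhtml) + " " +
             shellQuote(html), 1},
        // --novalid skips loading the DTD named in the XHTML doctype.
        {"xsltproc --novalid -o " + shellQuote(tex) + " " +
             shellQuote(xsl) + " " + shellQuote(xhtml), 0},
        {"latex -interaction=nonstopmode " + shellQuote(tex), 0},
    };
}

// Run the steps in order and stop at the first one that fails.
bool runPipeline(const std::vector<Step>& steps) {
    for (const Step& step : steps) {
        int status = std::system(step.command.c_str());
        if (status == -1 || !WIFEXITED(status) ||
            WEXITSTATUS(status) > step.maxOkExit) {
            return false;
        }
    }
    return true;
}
```

A call such as `runPipeline(buildPipeline("page.html", "html2latex.xsl", "out"))` would then try to produce `out.xhtml`, `out.tex`, and the latex output in the current directory (the file names here are placeholders). One thing to watch for: tidy's XHTML output puts the elements in the XHTML namespace, so the templates in your XSL file need to match namespaced elements.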