2

I thought BeautifulSoup could do that, but it does not seem to do the trick.

What method have you already used, and has it proven reliable in the long term?

Did you try uTidylib (utidylib.berlios.de)? I haven't used it, but Tidy is able to convert ugly HTML into well-formed XML, so maybe its Python wrapper can do it too. Commented Sep 13, 2010 at 15:10

2 Answers

4

You could use the lxml library, specifically lxml.html, which parses the HTML into an lxml element tree that you can then serialize as XML with (amongst others) the lxml.etree.tostring() function.

If this fails on your HTML because it is too broken, you can use ElementSoup (lxml's interface to BeautifulSoup) to build an lxml.html tree.


2

You can try uTidylib (http://utidylib.berlios.de/), a Python wrapper for the Tidy library. Tidy works well in most cases.

For something more robust (or at least more browser-like), I guess you could try WebKit or Gecko. I'm not sure their Python wrappers expose the parts responsible for cleaning up HTML, but you can have a look.
