Is there an HTML cleaner for .NET that can parse HTML and (for instance) convert it to a more machine-friendly format such as XHTML?

I've tried the HTML Agility Pack, but that fails to correctly parse even fairly simple examples.

To give an example of HTML that should be parsed correctly:

<html><title>test</title>
<body>
    <ul><li>TestElem1
        <li>TestElem2
        <li>TestElem3 List:
            <ul><li>Nested1
                <li>Nested2</li>
                <li>Nested3
            </ul>
        <li>TestElem4
    </ul>
    <p>paragraph 1
    <p>paragraph 2
    <p>paragraph 3
</body></html>

The end tags of li elements are optional (see specification), and so are those of p elements. In other words, the sample above should be parsed as:

<html><title>test</title>
<body>
    <ul><li>TestElem1</li>
        <li>TestElem2</li>
        <li>TestElem3 List:
            <ul><li>Nested1</li>
                <li>Nested2</li>
                <li>Nested3</li>
            </ul></li>
        <li>TestElem4</li>
    </ul>
    <p>paragraph 1</p>
    <p>paragraph 2</p>
    <p>paragraph 3</p>
</body></html>
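
To make the expected behaviour concrete, here is a minimal, language-neutral sketch of the implied-end-tag rule, written in Python only because its standard library ships a plain HTML tokenizer. It is not a .NET solution and not a full HTML parser: the rule table `IMPLIED_CLOSE`, the `VOID` set, and the names `EndTagFixer` and `close_optional_tags` are assumptions made up for this illustration, and they cover only what the sample above needs. Comments and doctypes are silently dropped, and badly nested input is not handled.

```python
import html
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Assumption: a deliberately tiny rule table. Key = start tag being opened,
# value = open elements that this start tag implicitly closes.
IMPLIED_CLOSE = {"li": {"li"}, "p": {"p"}, "ul": {"p"}, "ol": {"p"}}
# Assumption: a short, incomplete list of elements that never have content.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class EndTagFixer(HTMLParser):
    """Re-emits the token stream with the optional end tags written out."""

    def __init__(self):
        super().__init__()
        self.stack = []  # names of currently open elements
        self.out = []    # output fragments

    def _pop(self):
        self.out.append("</%s>" % self.stack.pop())

    def handle_starttag(self, tag, attrs):
        # Close an open <li>/<p> that this start tag implicitly ends.
        closes = IMPLIED_CLOSE.get(tag, set())
        while self.stack and self.stack[-1] in closes:
            self._pop()
        rendered = "".join(
            ' %s="%s"' % (name, html.escape(value or "")) for name, value in attrs
        )
        if tag in VOID:
            self.out.append("<%s%s />" % (tag, rendered))
        else:
            self.out.append("<%s%s>" % (tag, rendered))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: ignore it
        # Close any still-open children first, then the element itself.
        while self.stack[-1] != tag:
            self._pop()
        self._pop()

    def handle_data(self, data):
        self.out.append(html.escape(data, quote=False))


def close_optional_tags(markup):
    fixer = EndTagFixer()
    fixer.feed(markup)
    fixer.close()
    while fixer.stack:  # close whatever is still open at end of input
        fixer._pop()
    return "".join(fixer.out)


SAMPLE = """
<html><title>test</title>
<body>
    <ul><li>TestElem1
        <li>TestElem2
        <li>TestElem3 List:
            <ul><li>Nested1
                <li>Nested2</li>
                <li>Nested3
            </ul>
        <li>TestElem4
    </ul>
    <p>paragraph 1
    <p>paragraph 2
    <p>paragraph 3
</body></html>
""".strip()

# The result should now be well-formed enough for a strict XML parser.
root = ET.fromstring(close_optional_tags(SAMPLE))
```

The generated end tags land just before the next start tag rather than directly after the text, so whitespace differs from the hand-written version above, but the element tree is the same: four top-level list items, three nested ones inside the third, and three sibling paragraphs.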

Since the aim is to use the library on various machines, falling back to native code (such as a wrapper around HTML Tidy) is a big disadvantage: it adds deployment hassle, sacrifices platform independence, and is impossible in sandboxed scenarios.

Any suggestions? To recap, I'm looking for:

  • An HTML cleaner à la HTML Tidy
  • Must be able to deal with real-world HTML, not just XHTML, at the very least correctly reading valid HTML 4
  • Must be able to convert to a more easily processable XML format
  • Should be a purely managed library

1 Answer

Try TidyManaged.

4 Comments

I haven't seen TidyManaged; if I ever need something similar again, I'll take a peek. However, your timing is uncanny, as I wrote a patch to add support for optional end tags to HTML Agility Pack just two weeks ago: htmlagilitypack.codeplex.com/workitem/29218. I'm hoping they'll integrate it and that'll be that.
It looks like TidyManaged is a wrapper rather than a port; that's slightly unhandy, as it won't work in things like Silverlight, and it requires that you know at compile time which platform you'll be executing on. Still, for many uses those restrictions aren't an issue.
I've listed a few more implementations of HTML TidyLib for .NET on my blog: geekswithblogs.net/mnf/archive/2011/06/08/…
:-) Nice summary. It is a tricky problem; I'm not quite sure I fully trust HTML Agility Pack. There's also the Majestic 12 parser, which sounds robust, but it's really more like a tokenizer: it's not going to fix or hide things like missing end tags. And there are lots of Tidy variants, but even Tidy isn't perfect; if the HTML is malformed (not just wrongly nested), Tidy may refuse to process the input entirely, which makes it tricky to use without human interaction (it's fine for a website editor, less so for a search engine).
