Technically, something like <div>12 < 8</div> is invalid HTML. It should be written <div>12 < 8</div> instead. (The example mentioned in the comments <div>12 > 8</div> is actually valid HTML, though it's more usual to write it escaped as <div>12 > 8</div>.) However, I gather you aren't interested in writing a perfectly correct HTML parser and would like your parser to accept < characters within content that aren't part of a valid start or end tag.
So, you'd like to accept each of these examples:
<div>12 < 8</div>
<p>x<y</div>
<pre><<<>>></pre>
but would probably like to reject:
<p>x<y>z</p>
on the basis that <y> is a valid start tag, but it's missing the matching </y> and also reject:
<div>x</dvi>
on the basis that </dvi> is an end tag that doesn't match the current active start tag.
I would start by writing separate parsers for start and end tags:
startTag :: Parser String
startTag = char '<' *> many1 letter <* char '>'
endTag :: Parser String
endTag = string "</" *> many1 letter <* char '>'
Then, write a parser for a String of text content. This can be tricky. Here's a straightforward implementation, even though it's not ideal performance-wise:
contentString :: Parser String
contentString = do
-- fail if start or end tag (so caller will handle them)
notFollowedBy startTag
notFollowedBy endTag
-- otherwise, parse either '<' as content, or some non-empty '<'-free text
txt <- string "<" <|> many1 (noneOf "<")
-- and possibly more text
rest <- contentString <|> pure ""
return (txt ++ rest)
Note how we first ensure we aren't looking at a < that's part of a valid (start or end) tag. Then, we allow ourselves to parse a single non-tag < as content, or else some entirely <-free content, before looping to parse more. This is a tough function to get right, so testing is key. (It took me two or three tries before I got something that handled all my test cases.)
Now, we can rewrite element to use the startTag and endTag parsers like so:
element :: Parser Html
element = do
name <- startTag
children <- many $ try element <|> Content <$> contentString
name' <- endTag
when (name /= name') $ unexpected ("</" ++ name' ++ ">, expected </" ++ name ++ ">")
return $ Element name children
and now we get:
λ> parseTest element "<div>12 < 8</div>"
Element "div" [Content "12 < 8"]
λ> parseTest element "<div>x<y</div>"
Element "div" [Content "x<y"]
λ> parseTest element "<pre><<<>>></pre>"
Element "pre" [Content "<<<>>>"]
λ> parseTest element "<p>x<y>z</p>"
parse error at (line 1, column 5):
unexpected "y"
expecting "</"
λ> parseTest element "<div>x</dvi>"
parse error at (line 1, column 13):
unexpected </dvi>, expected </div>
We can slightly improve error reporting in the fourth test case by fiddling with the trys in element:
element :: Parser Html
element = do
-- add "try" here
name <- try startTag
-- remove "try" here
children <- many $ element <|> Content <$> contentString
name' <- endTag
when (name /= name') $ unexpected ("</" ++ name' ++ ">, expected </" ++ name ++ ">")
return $ Element name children
which gives:
λ> parseTest element "<p>x<y>z</p>"
parse error at (line 1, column 13):
unexpected </p>, expected </y>
There's probably some more testing to do, but it seems to work okay on the above test cases, plus a couple more as given below. The full code:
import Text.Parsec
import Text.Parsec.String
import Control.Monad
data Html = Element String [Html] | Content String
deriving Show
startTag :: Parser String
startTag = char '<' *> many1 letter <* char '>'
endTag :: Parser String
endTag = string "</" *> many1 letter <* char '>'
element :: Parser Html
element = do
name <- try startTag
children <- many $ element <|> Content <$> contentString
name' <- endTag
when (name /= name') $ unexpected ("</" ++ name' ++ ">, expected </" ++ name ++ ">")
return $ Element name children
contentString :: Parser String
contentString = do
-- fail if start or end tag (so caller will handle them)
notFollowedBy startTag
notFollowedBy endTag
-- otherwise, parse either '<' as content, or some non-empty text
txt <- string "<" <|> many1 (noneOf "<")
-- and possibly more text
rest <- contentString <|> pure ""
return (txt ++ rest)
main = do
mapM_ (parseTest element)
[ "<div>12 < 8</div>"
, "<div>x<y</div>"
, "<pre><<<>>></pre>"
, "<p>x<y>z</p>"
, "<div>x</dvi>"
, "<table><tr><td>1</td><td>2</td></tr></table>"
, "<empty></empty>"
]
<sign in a content string? How should your parser know when a<sign is part of an HTML element and when not?<and>when you want to use them as content is<and>respectively.<div>12 > 8</div>is malformed HTML (though some HTML parsers may still be able to parse it)