0

I'm currently writing a html parser in haskell. I'm using the parsec library. At this time the parser just considers standard elements with an opening and closing tag and no attributes. The code looks like this:

data Html = Element String [Html] | Content String
    deriving Show

element :: Parser Html
element = do
    name <- char '<' *> many1 letter <* char '>'
    children <- many $ (try element) <|> content
    string "</" >> string name >> char '>'
    return $ Element name children

content :: Parser Html
content = fmap Content $ many1 $ satisfy (\x -> x /='<')

If I'm using alphabetic and numeric characters for the content, all is working fine. But if I'm using the "less"-sign (<) I just get bad results. Thats why I excluded the "less"-sign for now. Does anybody have an idea how I can fix this behaviour? I tried different things but I just can't get it working.

Thanks and regards Philipp

4
  • 5
    “I just get bad results” – such as ...? Please always be clear about what's actually the problem, when asking a question on StackOverflow. Commented Aug 15, 2020 at 17:09
  • What do you want to happen when you use a < sign in a content string? How should your parser know when a < sign is part of an HTML element and when not? Commented Aug 15, 2020 at 21:01
  • With "bad results" I just meant that my parser returned a parsing error that it expected "</" and not "<". E.g. I want to parse this string "<div>12 > 8</div>" Commented Aug 15, 2020 at 22:11
  • 3
    The standard HTML encoding for < and > when you want to use them as content is &lt; and &gt; respectively. <div>12 > 8</div> is malformed HTML (though some HTML parsers may still be able to parse it) Commented Aug 17, 2020 at 13:44

1 Answer 1

2

Technically, something like <div>12 < 8</div> is invalid HTML. It should be written <div>12 &lt; 8</div> instead. (The example mentioned in the comments <div>12 > 8</div> is actually valid HTML, though it's more usual to write it escaped as <div>12 &gt; 8</div>.) However, I gather you aren't interested in writing a perfectly correct HTML parser and would like your parser to accept < characters within content that aren't part of a valid start or end tag.

So, you'd like to accept each of these examples:

<div>12 < 8</div>
<p>x<y</div>
<pre><<<>>></pre>

but would probably like to reject:

<p>x<y>z</p>

on the basis that <y> is a valid start tag, but it's missing the matching </y> and also reject:

<div>x</dvi>

on the basis that </dvi> is an end tag that doesn't match the current active start tag.

I would start by writing separate parsers for start and end tags:

startTag :: Parser String
startTag = char '<' *> many1 letter <* char '>'

endTag :: Parser String
endTag = string "</" *> many1 letter <* char '>'

Then, write a parser for a String of text content. This can be tricky. Here's a straightforward implementation, even though it's not ideal performance-wise:

contentString :: Parser String
contentString = do
  -- fail if start or end tag (so caller will handle them)
  notFollowedBy startTag
  notFollowedBy endTag
  -- otherwise, parse either '<' as content, or some non-empty '<'-free text
  txt <- string "<" <|> many1 (noneOf "<")
  -- and possibly more text
  rest <- contentString <|> pure ""
  return (txt ++ rest)

Note how we first ensure we aren't looking at a < that's part of a valid (start or end) tag. Then, we allow ourselves to parse a single non-tag < as content, or else some entirely <-free content, before looping to parse more. This is a tough function to get right, so testing is key. (It took me two or three tries before I got something that handled all my test cases.)

Now, we can rewrite element to use the startTag and endTag parsers like so:

element :: Parser Html
element = do
    name <- startTag
    children <- many $ try element <|> Content <$> contentString
    name' <- endTag
    when (name /= name') $ unexpected ("</" ++ name' ++ ">, expected </" ++ name ++ ">")
    return $ Element name children

and now we get:

λ> parseTest element "<div>12 < 8</div>"
Element "div" [Content "12 < 8"]
λ> parseTest element "<div>x<y</div>"
Element "div" [Content "x<y"]
λ> parseTest element "<pre><<<>>></pre>"
Element "pre" [Content "<<<>>>"]
λ> parseTest element "<p>x<y>z</p>"
parse error at (line 1, column 5):
unexpected "y"
expecting "</"
λ> parseTest element "<div>x</dvi>"
parse error at (line 1, column 13):
unexpected </dvi>, expected </div>

We can slightly improve error reporting in the fourth test case by fiddling with the trys in element:

element :: Parser Html
element = do
    -- add "try" here
    name <- try startTag
    -- remove "try" here
    children <- many $ element <|> Content <$> contentString
    name' <- endTag
    when (name /= name') $ unexpected ("</" ++ name' ++ ">, expected </" ++ name ++ ">")
    return $ Element name children

which gives:

λ> parseTest element "<p>x<y>z</p>"
parse error at (line 1, column 13):
unexpected </p>, expected </y>

There's probably some more testing to do, but it seems to work okay on the above test cases, plus a couple more as given below. The full code:

import Text.Parsec
import Text.Parsec.String
import Control.Monad

data Html = Element String [Html] | Content String
    deriving Show

startTag :: Parser String
startTag = char '<' *> many1 letter <* char '>'

endTag :: Parser String
endTag = string "</" *> many1 letter <* char '>'

element :: Parser Html
element = do
  name <- try startTag
  children <- many $ element <|> Content <$> contentString
  name' <- endTag
  when (name /= name') $ unexpected ("</" ++ name' ++ ">, expected </" ++ name ++ ">")
  return $ Element name children

contentString :: Parser String
contentString = do
  -- fail if start or end tag (so caller will handle them)
  notFollowedBy startTag
  notFollowedBy endTag
  -- otherwise, parse either '<' as content, or some non-empty text
  txt <- string "<" <|> many1 (noneOf "<")
  -- and possibly more text
  rest <- contentString <|> pure ""
  return (txt ++ rest)

main = do
  mapM_ (parseTest element)
    [ "<div>12 < 8</div>"
    , "<div>x<y</div>"
    , "<pre><<<>>></pre>"
    , "<p>x<y>z</p>"
    , "<div>x</dvi>"
    , "<table><tr><td>1</td><td>2</td></tr></table>"
    , "<empty></empty>"
    ]
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.