Parsing simple markup language with Haskell

Question

I'm trying to implement a very simple markup language. I have an intermediate representation that looks like:

data Token = Str Text
           | Explode Text

type Rep = [Token]

So, the idea is to turn an arbitrary text of the form:

The quick brown %%fox%% %%jumps%% over the %%lazy%% dog.

into:

[Str "The quick brown", Explode "fox", Explode "jumps", Str "over the", Explode "lazy", Str "dog"]

for further processing. Also, it is important that we treat:

%%fox%% %%jumps%%

differently than

%%fox jumps%%

The latter should yield a single token, Explode "fox jumps".

I tried to implement this using attoparsec, but I don't think I have the tools I need. But I'm not so good with parsing theory (I studied math, not CS). What kind of grammar is this? What kind of parser combinator library should I use? I considered using Parsec with a stateful monad transformer stack to keep track of the context. Does that sound sensible?

Is it allowed to use %% over multiple words? The %%quick brown%% fox, for example. Because if not, the standard library would be good enough for now: parse = map toToken . words; isExplode s = isPrefixOf "%%" s && isSuffixOf "%%" s; toToken s | isExplode s = Explode s | otherwise = Str s
– vek, Commented May 12, 2014 at 3:45
It is allowed, but I want to treat %%foo bar%% differently than %%foo%% %%bar%%. Thanks for the question, I'll edit mine.
– nomen, Commented May 12, 2014 at 3:46
How can you get a literal %%X%% in the output, i.e. what input is required to get [Str "%%X%%", Explode "X"]?
– Frerich Raabe, Commented May 12, 2014 at 8:36
Frerich: Good question. It's not specified (i.e., it doesn't really matter if those Tokens are ever emitted).
– nomen, Commented May 12, 2014 at 12:56

kqr · Accepted Answer · 2014-05-12 08:34:08Z

You can take the cheap and easy way, without a proper parser. The important thing to recognise is that this grammar is actually fairly simple – it has no recursion or such. It is just a flat listing of Strs and Explodes.

The easy way

So we can start by breaking the string down into a list containing the text and the separators as separate values. We need a data type to separate the separators (%%) from actual text (everything else.)

data ParserTokens = Sep | T Text

Breaking it down

Then we need to break the input text into its constituents.

tokenise = intersperse Sep . map T . Text.splitOn "%%"

This will first split the string on %%, so in your example it'll become

["The quick brown ","fox"," ","jumps"," over the ","lazy"," dog."]

then we map T over it, to turn it from a [Text] to a [ParserTokens]. Finally, we intersperse Sep over it, to reintroduce the %% separators but in a shape that's easier to deal with. The result is, in your example,

[T "The quick brown ",Sep,T "fox",Sep,T " ",Sep,T "jumps",Sep,T " over the ",Sep,T "lazy",Sep,T " dog."]
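For reference, here is a minimal self-contained sketch of the tokenising step. The imports, the type signature, the OverloadedStrings pragma and the derived Show/Eq instances are assumptions added so that the snippet compiles on its own; they are not part of the original answer.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.List (intersperse)
import Data.Text (Text)
import qualified Data.Text as Text

-- Show/Eq are derived only so results can be printed and compared (assumption).
data ParserTokens = Sep | T Text
  deriving (Show, Eq)

-- Split on the "%%" delimiter, wrap each piece in T, then place a Sep
-- between neighbouring pieces to mark where each "%%" was.
tokenise :: Text -> [ParserTokens]
tokenise = intersperse Sep . map T . Text.splitOn "%%"
```

In GHCi, tokenise "a %%b%% c" should give [T "a ",Sep,T "b",Sep,T " c"].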

Building it up

With this done, all that remains is parsing this thing into the shape you want it. Parsing this amounts to finding the 1-2-3 punch of Sep–T "something"–Sep and replacing it with Explode "something". We write a recursive function to do this.

construct [] = []
construct (T s : rest) = Str s : construct rest
construct (Sep : T s : Sep : rest) = Explode s : construct rest
construct _ = error "Mismatched '%%'!"

This converts T s to Str s and the combination of separators and a T s into an Explode s. If the pattern matching fails, it's because there was a stray separator somewhere, so I've just set it to crash the program. You might want better error handling there – such as wrapping the result in Either String or something similar.
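The Either String idea could look roughly like the sketch below. The name constructE is hypothetical, and the type declarations are repeated (with derived Show/Eq as an assumption) only to keep the snippet self-contained.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)

data Token = Str Text | Explode Text
  deriving (Show, Eq)

data ParserTokens = Sep | T Text
  deriving (Show, Eq)

-- Same pattern matching as construct, but a stray separator yields
-- a Left with a message instead of crashing the program.
-- constructE is a hypothetical name, not from the original answer.
constructE :: [ParserTokens] -> Either String [Token]
constructE []                       = Right []
constructE (T s : rest)             = (Str s :) <$> constructE rest
constructE (Sep : T s : Sep : rest) = (Explode s :) <$> constructE rest
constructE _                        = Left "Mismatched '%%'!"
```

Callers can then pattern match on Left and Right and decide for themselves how to report a malformed template.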

With this done, we can create the function

parseTemplate = construct . tokenise

and in the end, if we run your example through parseTemplate, we get the expected result

[Str "The quick brown ",Explode "fox",Str " ",Explode "jumps",Str " over the ",Explode "lazy",Str " dog."]
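Note that this output keeps the surrounding whitespace and the Str " " between the two Explodes, whereas the list in the question has neither. If that matters, a post-processing pass along the following lines might do. This is only a sketch: tidy is a hypothetical helper, and it assumes that whitespace-only Str tokens can be dropped safely.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as Text

data Token = Str Text | Explode Text
  deriving (Show, Eq)

-- Trim whitespace around every Str, then drop the Strs that end up empty.
-- Explode tokens are passed through untouched.
tidy :: [Token] -> [Token]
tidy = filter (not . blank) . map strip
  where
    strip (Str s) = Str (Text.strip s)
    strip t       = t
    blank (Str s) = Text.null s
    blank _       = False
```

It would be composed as tidy . parseTemplate, so %%fox%% %%jumps%% still produces two separate Explode tokens while %%fox jumps%% produces one.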

arrowd · Accepted Answer · 2014-05-12 03:42:51Z

For such a simple parser, even Attoparsec seems to be overkill:

parse = map (\w -> case w of 
              '%':'%':expl -> Explode $ init $ init expl
              str -> Str str) . words

Of course, this code needs some sanity checks for the Explode case.
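One possible shape for those sanity checks is sketched below, using Data.Text so that it fits the Token type from the question. The names toToken and parseWords are hypothetical. Like the code above, it works word by word, so it still cannot treat %%fox jumps%% as a single Explode.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as T

data Token = Str Text | Explode Text
  deriving (Show, Eq)

-- A word becomes an Explode only if it both starts and ends with "%%";
-- anything else (including a lone "%%") stays a plain Str.
toToken :: Text -> Token
toToken w = case T.stripPrefix "%%" w >>= T.stripSuffix "%%" of
  Just inner -> Explode inner
  Nothing    -> Str w

-- Hypothetical word-based parser built on the checked toToken.
parseWords :: Text -> [Token]
parseWords = map toToken . T.words
```

Because the prefix is stripped before the suffix is checked, a word such as %%oops with no closing marker falls through to Str instead of losing its last two characters.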


Jeremy List · Accepted Answer · 2014-05-12 06:24:54Z

This doesn't handle whitespace the way you specified, but it should get you on the right track.

parseMU = zipWith ($) (cycle [Str,Explode]) . splitps where
  splitps :: String -> [String]
  splitps [] = [[]]
  splitps ('%':'%':r) = [] : splitps r
  splitps (c:r) = let
    (a:r') = splitps r
    in ((c:a):r')
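As written, parseMU applies Str and Explode to String, so it only type-checks if Token is defined over String. A variant that fits the Text-based Token from the question might look like the sketch below; the T.pack conversions, the type signatures and the case expression (in place of the lazy let pattern) are my assumptions, not part of the original answer.

```haskell
import Data.Text (Text)
import qualified Data.Text as T

data Token = Str Text | Explode Text
  deriving (Show, Eq)

-- Split the input at every "%%"; the result always has at least one element.
splitps :: String -> [String]
splitps []          = [[]]
splitps ('%':'%':r) = [] : splitps r
splitps (c:r)       = case splitps r of
  (a:r') -> (c:a) : r'
  []     -> [[c]]  -- unreachable, since splitps never returns []

-- Pieces at even positions are plain text, pieces at odd positions
-- were enclosed in "%%".
parseMU :: String -> [Token]
parseMU = zipWith ($) (cycle [Str . T.pack, Explode . T.pack]) . splitps
```

For example, parseMU "a %%b%% c" gives a Str for "a ", an Explode for "b" and a Str for " c". An unmatched %% is not reported: everything after it simply becomes an Explode.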
