I'm trying to implement a very simple markup language. I have an intermediate representation that looks like:

data Token = Str Text
           | Explode Text

type Rep = [Token]

So, the idea is to turn an arbitrary text of the form:

The quick brown %%fox%% %%jumps%% over the %%lazy%% dog.

into:

[Str "The quick brown", Explode "fox", Explode "jumps", Str "over the", Explode "lazy", Str "dog"]

for further processing. Also, it is important that we treat:

%%fox%% %%jumps%%

differently than

%%fox jumps%%

The latter should become a single token, Explode "fox jumps".

I tried to implement this using attoparsec, but I don't think I have the tools I need. But I'm not so good with parsing theory (I studied math, not CS). What kind of grammar is this? What kind of parser combinator library should I use? I considered using Parsec with a stateful monad transformer stack to keep track of the context. Does that sound sensible?

4
  • Is it allowed to use %% over multiple words? The %%quick brown%% fox, for example. Because if not, the standard library would be good enough for now: parse = map toToken . words; isExplode s = isPrefixOf "%%" s && isSuffixOf "%%" s; toToken s | isExplode s = Explode s | otherwise = Str s Commented May 12, 2014 at 3:45
  • It is allowed, but I want to treat %%foo bar%% differently than %%foo%% %%bar%%. Thanks for the question, I'll edit mine. Commented May 12, 2014 at 3:46
  • How can you get a literal %%X%% in the output, i.e. what input is required to get [Str "%%X%%", Explode "X"]? Commented May 12, 2014 at 8:36
  • Frerich: Good question. It's not specified (i.e., it doesn't really matter if those Tokens are ever emitted). Commented May 12, 2014 at 12:56

3 Answers

1

You can take the cheap and easy way, without a proper parser. The important thing to recognise is that this grammar is actually fairly simple: it has no nesting or recursion. It is just a flat sequence of Strs and Explodes.

The easy way

So we can start by breaking the string down into a list that holds the text and the separators as separate values. We need a data type to distinguish the separators (%%) from the actual text (everything else).

data ParserTokens = Sep | T Text

Breaking it down

Then we need to break the input text into its constituents.

tokenise = intersperse Sep . map T . Text.splitOn "%%"

This will first split the string on %%, so in your example it'll become

["The quick brown ","fox"," ","jumps"," over the ","lazy"," dog."]

then we map T over it, to turn it from a [Text] to a [ParserTokens]. Finally, we intersperse Sep over it, to reintroduce the %% separators but in a shape that's easier to deal with. The result is, in your example,

[T "The quick brown ",Sep,T "fox",Sep,T " ",Sep,T "jumps",Sep,T " over the ",Sep,T "lazy",Sep,T " dog."]
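For reference, here is a minimal self-contained sketch of the tokenising step with the imports it needs. It assumes the OverloadedStrings extension for the "%%" literal; the deriving clause is an addition so the result can be printed and compared.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.List (intersperse)
import Data.Text (Text)
import qualified Data.Text as Text

-- Intermediate tokens: a "%%" separator or a run of ordinary text.
data ParserTokens = Sep | T Text
  deriving (Show, Eq)

-- Split on "%%", wrap each piece in T, then put a Sep between the pieces.
tokenise :: Text -> [ParserTokens]
tokenise = intersperse Sep . map T . Text.splitOn "%%"
```

Note that input which starts or ends with %% produces an empty T "" at that end, because splitOn returns an empty piece there.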

Building it up

With this done, all that remains is parsing this list into the shape you want. Parsing it amounts to finding the 1-2-3 punch of Sep–T "something"–Sep and replacing it with Explode "something". We write a recursive function to do this.

construct [] = []
construct (T s : rest) = Str s : construct rest
construct (Sep : T s : Sep : rest) = Explode s : construct rest
construct _ = error "Mismatched '%%'!"

This converts T s to Str s and the combination of separators and a T s into an Explode s. If the pattern matching fails, it's because there was a stray separator somewhere, so I've just set it to crash the program. You might want better error handling there, such as wrapping the result in Either String or something similar.
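The Either String idea could look roughly like the sketch below. The name constructSafe is hypothetical, and the two data types are repeated (with an added deriving clause) only so the snippet stands on its own.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)

data Token = Str Text | Explode Text
  deriving (Show, Eq)

data ParserTokens = Sep | T Text
  deriving (Show, Eq)

-- Same cases as construct, but a stray separator is reported as a Left
-- value instead of crashing the program.
constructSafe :: [ParserTokens] -> Either String [Token]
constructSafe [] = Right []
constructSafe (T s : rest) = (Str s :) <$> constructSafe rest
constructSafe (Sep : T s : Sep : rest) = (Explode s :) <$> constructSafe rest
constructSafe _ = Left "Mismatched '%%'!"
```

The caller can then pattern match on Left and Right and decide how to report a malformed template.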

With this done, we can create the function

parseTemplate = construct . tokenise

and in the end, if we run your example through parseTemplate, we get the expected tokens, with the whitespace around each piece preserved:

[Str "The quick brown ",Explode "fox",Str " ",Explode "jumps",Str " over the ",Explode "lazy",Str " dog."]
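The output in the question has no surrounding whitespace and no Str " " between neighbouring Explodes. If that is what you are after, a post-processing pass along these lines might do. This is only a sketch: the name tidy is hypothetical, and it assumes that whitespace-only Str tokens can simply be dropped.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import qualified Data.Text as Text

data Token = Str Text | Explode Text
  deriving (Show, Eq)

-- Strip surrounding whitespace from Str tokens and drop the ones that
-- end up empty. Explode tokens are passed through unchanged.
tidy :: [Token] -> [Token]
tidy = filter (not . isEmptyStr) . map strip
  where
    strip (Str s) = Str (Text.strip s)
    strip t       = t
    isEmptyStr (Str s) = Text.null s
    isEmptyStr _       = False
```

It would be applied after parsing, for example as tidy . parseTemplate.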
0

For such a simple parser, even Attoparsec seems to be overkill:

parse = map (\w -> case w of 
              '%':'%':expl -> Explode $ init $ init expl
              str -> Str str) . words

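Of course, this code needs some sanity checks for the Explode case. As written, it only checks the leading %% and assumes each %%...%% span is a single word, so it does not handle %%fox jumps%%.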
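One possible shape for those sanity checks is sketched below. It uses Text rather than String to match the Token type in the question, and the names toToken and parseWords are hypothetical. It still works word by word, so a multi-word %%fox jumps%% is not recognised.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import qualified Data.Text as Text

data Token = Str Text | Explode Text
  deriving (Show, Eq)

-- A word counts as an Explode only if it both starts and ends with "%%";
-- anything else is kept as a plain Str, markers included.
toToken :: Text -> Token
toToken w =
  case Text.stripPrefix "%%" w >>= Text.stripSuffix "%%" of
    Just inner -> Explode inner
    Nothing    -> Str w

parseWords :: Text -> [Token]
parseWords = map toToken . Text.words
```

Because stripSuffix runs on what is left after stripPrefix, a lone %% is kept as Str "%%" rather than being cut down to nothing.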

0

This doesn't handle whitespace the way you specified, but it should get you on the right track.

parseMU = zipWith ($) (cycle [Str,Explode]) . splitps where
  splitps :: String -> [String]
  splitps [] = [[]]
  splitps ('%':'%':r) = [] : splitps r
  splitps (c:r) = let
    (a:r') = splitps r
    in ((c:a):r')
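The same alternating idea can also be written against Data.Text, which is what the Token type in the question uses. The following is a sketch under that assumption; parseAlternating is a hypothetical name, and Text.splitOn stands in for the hand-written splitps.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import qualified Data.Text as Text

data Token = Str Text | Explode Text
  deriving (Show, Eq)

-- After splitting on "%%", pieces at even positions lie outside the markers
-- and pieces at odd positions lie inside them, so the constructors alternate.
parseAlternating :: Text -> [Token]
parseAlternating = zipWith ($) (cycle [Str, Explode]) . Text.splitOn "%%"
```

Be aware that this approach does not report an unmatched %%: the text after a dangling marker is silently turned into an Explode.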
