I have been trying to get my head around Scala's parser combinators. They seem pretty powerful, but the only tutorial examples I can find deal with mathematical expressions. There are very few real-world examples of parsing a DSL and mapping the result to different entities.

For the sake of this example, let's say I have a BNF with an entity named Model, which is written as a string like this: [model [name <name> ]]. This is a simplified extract of a much larger BNF; in reality there are more entities.

So I defined my own class Model, which takes the name as a constructor argument, and then defined my own ModelParser object, which extends JavaTokenParsers. I then defined the following parsers, following the BNF (I know a single regex matcher might be simpler, but I preferred to follow the BNF exactly for other reasons).

def model : Parser[Model] = "[model" ~> "[name" ~> name <~ "]]" ^^ ( Model(_) )
def name : Parser[String] = (letter ~ (anyChar*)) ^^ { case text => text.toString }
def anyChar = letter | digit | "_".r | "-".r
def letter = """[a-zA-Z]""".r
def digit = """\d""".r

The toString of Model looks like this:

override def toString : String = "[model " + name + "]"

When I run it with a string like [model [name helloWorld]], I get [model [h~List(e, l, l, o, W, o, r, l, d)]] instead of the [model helloWorld] I am expecting.

How do I get those individual characters to join back in the string they were originally in?

I am also confused by the individual parsers and the use of .r. Sometimes I saw examples that had just the following as a parser (to parse "hello"):

def hello = "hello"

Isn't that just a String? How on Earth did it suddenly become a parser that can be combined with other parsers? And what is the .r actually doing? I have read at least 3 tutorials but I am still totally lost about what is actually happening.

1 Answer

The problem is that anyChar* parses a List[String] (where in this case each string is a single character), and the result of calling toString on a list of strings is "List(...)", not the string you'd get by concatenating the contents. In addition, the case text => pattern is matching on the entire letter ~ (anyChar*), not just the anyChar* part.
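The toString part can be seen without the parser library at all. Here is a minimal sketch using only the standard collections (the object and value names are made up for illustration):

```scala
object JoinDemo {
  // What anyChar* hands back: one single-character String per match.
  val rest: List[String] = List("e", "l", "l", "o")

  // toString describes the collection itself rather than joining it.
  val viaToString: String = rest.toString

  // mkString with no separator concatenates the elements.
  val viaMkString: String = ("h" :: rest).mkString
}
```

Here viaToString is "List(e, l, l, o)" while viaMkString is "hello", which is the difference between the output you are seeing and the output you want.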

It's possible to address both of these issues pretty straightforwardly:

case class Model(name: String) {
  override def toString : String = "[model " + name + "]"
}

import scala.util.parsing.combinator._

object ModelParser extends RegexParsers {
  def model: Parser[Model] = "[model" ~> "[name" ~> name <~ "]]" ^^ (Model(_))

  def name: Parser[String] = letter ~ (anyChar*) ^^ {
    case first ~ rest => (first :: rest).mkString
  }

  def anyChar = letter | digit | "_".r | "-".r
  def letter = """[a-zA-Z]""".r
  def digit = """\d""".r
}

We just prepend the first character string to the list of the rest, and then call mkString on the entire list, which will concatenate the contents. This works as expected:

scala> ModelParser.parseAll(ModelParser.model, "[model [name helloWorld]]")
res0: ModelParser.ParseResult[Model] = [1.26] parsed: [model helloWorld]

As you note, it would be possible (and possibly clearer and more performant) to let the regular expressions do more of the work:

object ModelParser extends RegexParsers {
  def model: Parser[Model] = "[model" ~> "[name" ~> name <~ "]]" ^^ (Model(_))

  // A leading letter, then any mix of letters, digits, underscores and hyphens.
  def name: Parser[String] = """[a-zA-Z][a-zA-Z\d_-]*""".r
}

This example also illustrates the way that the parser combinator library uses implicit conversions to cut down on some of the verbosity of writing parsers. As you say, def hello = "hello" defines a string, and "[a-zA-Z]+".r defines a Regex (via the r method on StringOps), but either can be used as a parser because RegexParsers defines implicit conversions from String (this one's named literal) and from Regex (named regex) to Parser[String].
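To make the conversions visible, here is a small sketch that writes each parser twice, once relying on the implicits and once calling literal and regex by hand. The object and method names (ImplicitDemo, helloImplicit and so on) are invented for this illustration, and it assumes the scala-parser-combinators library is on the classpath:

```scala
import scala.util.parsing.combinator._

object ImplicitDemo extends RegexParsers {
  // The declared result type is Parser[String], so the compiler inserts
  // the implicit conversion for the String and for the Regex.
  def helloImplicit: Parser[String] = "hello"
  def wordImplicit: Parser[String] = """[a-zA-Z]+""".r

  // The same two parsers with the conversions written out explicitly.
  def helloExplicit: Parser[String] = literal("hello")
  def wordExplicit: Parser[String] = regex("""[a-zA-Z]+""".r)

  // String has no ~> method, so "hello" is converted to a parser here too.
  def greeting: Parser[String] = "hello" ~> wordImplicit
}
```

Note that a bare def hello = "hello" with no type annotation really is just a String. It only turns into a parser at the point where it is used somewhere a Parser is expected, for example as an operand of ~ or ~>.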


2 Comments

Thanks a lot for your clarifications, especially the .r confusion and the implicit conversion between a String literal to Parser[String]. The name parser is working fine now!
@Travis I noticed that for some reason, even [model [name hello World]] is being accepted and reproduced after parsing as helloWorld just the same. How do I force it to not accept the name part if it has a whitespace? The ~ seems to allow it just fine. I don't want to disable it completely for the parser because it is quite useful.
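On the whitespace question: RegexParsers skips leading whitespace before every literal and regex parser, so when name is built from one-character parsers, the space in "hello World" is skipped before the W is matched. One way to keep whitespace skipping switched on and still reject that input is to match the whole name with a single regex, since whitespace is skipped before a token but never inside it. A sketch under that assumption (StrictModelParser is a name made up here):

```scala
import scala.util.parsing.combinator._

case class Model(name: String) {
  override def toString: String = "[model " + name + "]"
}

object StrictModelParser extends RegexParsers {
  def model: Parser[Model] = "[model" ~> "[name" ~> name <~ "]]" ^^ (Model(_))

  // The name is a single token, so it stops at the first space.
  def name: Parser[String] = """[a-zA-Z][a-zA-Z\d_-]*""".r
}
```

With this version, parsing "[model [name helloWorld]]" still succeeds, while "[model [name hello World]]" fails because the parser expects "]]" after hello and finds World instead.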
