First of all, credits: this code is based on the solution from the question "Use Scala parser combinator to parse CSV files".

The CSV files I want to parse can contain comments, which are lines starting with #. To avoid confusion: the CSV files are tab-separated. There are more constraints that would make the parser a lot simpler, but since I am completely new to Scala I thought it best to stay as close to the (working) original as possible.

The problem I have is that I get a type mismatch. Obviously the regex for a comment does not yield a list. I was hoping that Scala would interpret a comment as a 1-element-list, but this is not the case.

So how do I need to modify my code so that it can handle these comment lines? And closely related: is there an elegant way to query the parser result, so that I can write something like this in myfunc:

if (isComment(a)) continue

So here is the actual code:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import scala.util.parsing.combinator._

object MyParser extends RegexParsers {

    override val skipWhitespace = false   // meaningful spaces in CSV

    def COMMA   = ","
    def TAB     = "\t"
    def DQUOTE  = "\""
    def HASHTAG = "#"
    def DQUOTE2 = "\"\"" ^^ { case _ => "\"" }  // combine 2 dquotes into 1
    def CRLF    = "\r\n" | "\n"
    def TXT     = "[^\",\r\n]".r
    def SPACES  = "[ ]+".r

    def file: Parser[List[List[String]]] = repsep((comment|record), CRLF) <~ (CRLF?)
    def comment: Parser[List[String]] = HASHTAG<~TXT
    def record: Parser[List[String]] = "[^#]".r<~repsep(field, TAB)
    def field: Parser[String] = escaped|nonescaped

    def escaped: Parser[String] = {
        ((SPACES?)~>DQUOTE~>((TXT|COMMA|CRLF|DQUOTE2)*)<~DQUOTE<~(SPACES?)) ^^ {
            case ls => ls.mkString("")
        }
    }
    def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }

    def applyParser(s: String) = parseAll(file, s) match {
        case Success(res, _) => res
        case e => throw new Exception(e.toString)
    }

    def myfunc( a: (String, String)) = {
        val parserResult = applyParser(a._2)
        println("APPLY PARSER FOR " + a._1)
        for( a <- parserResult ){
            a.foreach { println }
        }
    }

    def main(args: Array[String]) {
        val filesPath = "/home/user/test/*.txt"
        val conf = new SparkConf().setAppName("Simple Application")
        val sc = new SparkContext(conf)
        val logData = sc.wholeTextFiles(filesPath).cache()
        logData.foreach( x => myfunc(x))
    }
}

1 Answer

Since the comment parser and the record parser are combined with | ("or-ed" together), they must have the same result type.
You need to make the following changes:

// TXT matches a single character, so match the rest of the line instead
def comment: Parser[List[String]] = HASHTAG <~ "[^\r\n]*".r ^^^ List()

^^^ discards the parser's own result (here, the string matched by the HASHTAG parser) and replaces it with a constant, in this case an empty List.
Also change:

def record: Parser[List[String]] = repsep(field, TAB)

Note that because comment and record parser are or-ed and because comment comes first, if the row begins with a "#" it will be parsed by the comment parser.
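
To make the effect of this first fix concrete, here is a minimal, self-contained sketch. It is not the code from the question: the object name EmptyListCommentParser, the simplified field rule (no quoting, no tabs inside a field) and the "rest of the line" comment rule are assumptions made for illustration, and it assumes the scala-parser-combinators module is on the classpath.

```scala
import scala.util.parsing.combinator._

// Hypothetical stripped-down parser: comment lines become empty lists.
object EmptyListCommentParser extends RegexParsers {
  override val skipWhitespace = false   // tabs and newlines are significant

  // Assumption: a field is plain text without quotes, tabs or line breaks.
  def field: Parser[String] = "[^\"\t\r\n]*".r
  def eol: Parser[String] = "\r\n" | "\n"

  // Assumption: a comment is "#" followed by the rest of the line.
  // ^^^ throws the matched text away and returns an empty list instead.
  def comment: Parser[List[String]] = "#[^\r\n]*".r ^^^ List.empty[String]
  def record: Parser[List[String]] = repsep(field, "\t")

  // comment is tried first, so lines starting with "#" never reach record.
  def file: Parser[List[List[String]]] = repsep(comment | record, eol)

  def parse(s: String): List[List[String]] = parseAll(file, s) match {
    case Success(res, _) => res
    case other           => sys.error(other.toString)
  }
}
```

With this sketch, EmptyListCommentParser.parse("# note\na\tb") should give an empty inner list for the comment line followed by the two fields of the record, so the comment text itself is lost.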

Edit:
To keep the comment text in the parser output (say, if you want to print it later) while still combining the parsers with |, give both alternatives a common supertype.
Define the following classes:

trait Line
case class Comment(text: String) extends Line
case class Record(elements: List[String]) extends Line

Now define comment, record & file parsers as follows:

// TXT matches a single character, so take the rest of the line as the comment text
val comment: Parser[Comment] = "#" ~> "[^\r\n]*".r ^^ Comment
val record: Parser[Line] = repsep(field, TAB) ^^ Record
val file: Parser[List[Line]] = repsep(comment | record, CRLF) <~ (CRLF?)

Now you can define the printing function myfunc:

def myfunc(a: (String, String)) = {
  parseAll(file, a._2).map { lines =>
    lines.foreach {
      case Comment(t)    => println(s"This is a comment: $t")
      case Record(elems) => println(s"This is a record: ${elems.mkString(",")}")
    }
  }
}
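
The question also asked for something like "if (isComment(a)) continue". With the Line hierarchy a pattern match can play that role. The following is a self-contained sketch under assumptions rather than a drop-in replacement: the object name LineParser, the helper records, the simplified field rule (no quoting, no tabs inside a field) and the "rest of the line" comment rule are all hypothetical, and the scala-parser-combinators module is assumed to be on the classpath.

```scala
import scala.util.parsing.combinator._

sealed trait Line
case class Comment(text: String) extends Line
case class Record(elements: List[String]) extends Line

// Hypothetical stripped-down parser that keeps comments as their own case.
object LineParser extends RegexParsers {
  override val skipWhitespace = false   // tabs and newlines are significant

  // Assumption: a field is plain text without quotes, tabs or line breaks.
  def field: Parser[String] = "[^\"\t\r\n]*".r
  def eol: Parser[String] = "\r\n" | "\n"

  // Assumption: the comment text is everything after "#" up to the line end.
  def comment: Parser[Line] = "#" ~> "[^\r\n]*".r ^^ (text => Comment(text))
  def record: Parser[Line] = repsep(field, "\t") ^^ (fields => Record(fields))
  def file: Parser[List[Line]] = repsep(comment | record, eol)

  // Keep only the records; collect skips every Comment, which is the
  // equivalent of "if (isComment(a)) continue".
  def records(s: String): List[List[String]] = parseAll(file, s) match {
    case Success(lines, _) => lines.collect { case Record(elems) => elems }
    case other             => sys.error(other.toString)
  }
}
```

Calling LineParser.records on a file's content returns just the field lists of the record lines. Replacing the collect with a match on Comment(t) would give the comment texts instead.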

3 Comments

I marked it as accepted. Can you answer my second question, too? Is there an elegant way to find out whether the List[String] I want to print is a comment or an actual record? Currently it seems that comments are somehow skipped. Is this because we are converting the comment parser result to an empty list?
Yes. Since a comment line is parsed to an empty List, the parse result is a List[List[String]] with an empty inner List for each comment line. The inner foreach therefore does nothing for those lines, because the List it iterates over is empty.
Great to know I drew the right conclusions. But nevertheless: if I want the comment to be parsed and not thrown away (which is what returning an empty list does), how would I accomplish this?
