Let's say I have a text file with data like this:

my "sample data set" kdf/dfjl/ looks like this

I have a regular expression that can capture all of this into groups. The values I'd like to put into my columns look like this:

desired values from groups

I'd like each group to become its own column in an RDD.

val pattern = """(\S+) "([\S\s]+)\" (\S+) (\S+) (\S+) (\S+)""".r

var myrdd = sc.textFile("my/data/set.txt")
myrdd.map(line => pattern.findAllIn(line))

I've tried several different methods for getting the regex matches out into separate columns, such as toArray and toSeq, but haven't even come close yet.

I'm aware of how the data exists inside the matches:

val answer = pattern.findAllIn(line).matchData
for(m <- answer){
  for(e <- m.subgroups){
    println(e)
  }
}

It's those 'e's that I'm after, but I'm not having much luck getting that data separated out into my RDD.

Thanks

1 Answer

I would suggest using a for-comprehension rather than a for loop: it yields a list of the extracted groups for each line, and a second map then turns the list elements into individual columns:

val rdd = sc.textFile("/path/to/textfile")

val pattern = """(\S+) "([\S\s]+)\" (\S+) (\S+) (\S+) (\S+)""".r

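// For each line, collect every capture group of every match into a list.
rdd.map { line =>
    ( for {
        m <- pattern.findAllIn(line).matchData
        g <- m.subgroups
      } yield g
    ).toList
  }.
  // Turn the six list elements into a 6-tuple, one field per column.
  map(l => (l(0), l(1), l(2), l(3), l(4), l(5)))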
// org.apache.spark.rdd.RDD[(String, String, String, String, String, String)] = ...
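
One caveat with the snippet above: a line that the pattern does not match produces an empty list, so l(0) would throw. As a hedged alternative (not part of the original answer), the regex can be used as an extractor in a pattern match, which binds the six groups directly and lets non-matching lines be skipped. The sketch below runs without Spark; the object name GroupsToColumns and the helper parseLine are made up for illustration, and the input is the sample line from the question. Note that the extractor only succeeds when the whole line matches the pattern.

```scala
import scala.util.matching.Regex

// Hypothetical names (GroupsToColumns, parseLine) used only for this sketch.
object GroupsToColumns {
  // Same pattern as in the question: six capture groups.
  val pattern: Regex = """(\S+) "([\S\s]+)\" (\S+) (\S+) (\S+) (\S+)""".r

  // Returns the six groups as a tuple, or None if the whole line does not match.
  def parseLine(line: String): Option[(String, String, String, String, String, String)] =
    line match {
      case pattern(a, b, c, d, e, f) => Some((a, b, c, d, e, f))
      case _                         => None
    }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      """my "sample data set" kdf/dfjl/ looks like this""",
      "a malformed line"
    )
    // flatMap drops the None produced by the malformed line.
    val rows = lines.flatMap(line => parseLine(line).toList)
    rows.foreach(println)
  }
}
```

On an RDD, the same helper should slot in as something like rdd.flatMap(line => parseLine(line).toList), assuming rdd is the RDD[String] from sc.textFile; that part is untested here.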