Spark Scala Regex -> Creating multiple columns based on regex

Question

Let's say I have a text file with data like this:

my "sample data set" kdf/dfjl/ looks like this

I have a regular expression that captures all of this into groups. The values I'd like to put into my columns would look like this:

desired values from groups

I'd like each group to become its own column in an RDD.

val pattern = """(\S+) "([\S\s]+)\" (\S+) (\S+) (\S+) (\S+)""".r

var myrdd = sc.textFile("my/data/set.txt")
myrdd.map(line => pattern.findAllIn(line))

I've tried several different methods for getting the regex matches out into separate columns, such as toArray and toSeq, but haven't come close yet.

I know how the data is laid out inside the matches:

val answer = pattern.findAllIn(line).matchData
for(m <- answer){
  for(e <- m.subgroups){
    println(e)
  }
}

It's those 'e' values that I'm after, but I'm not having much luck getting that data separated out into my RDD.

Thanks

Leo C · Accepted Answer · 2018-09-02 06:15:08Z

I would suggest using a for-comprehension rather than a for-loop to generate a list of the extracted groups for each line, then mapping the list elements into individual columns:

val rdd = sc.textFile("/path/to/textfile")

val pattern = """(\S+) "([\S\s]+)\" (\S+) (\S+) (\S+) (\S+)""".r

rdd.map { line =>
    // Collect every capture group of every match on this line into a List[String]
    (for {
      m <- pattern.findAllIn(line).matchData
      g <- m.subgroups
    } yield g).toList
  }.
  // Assumes every line matches: a non-matching line yields an empty list, so l(0) would throw
  map(l => (l(0), l(1), l(2), l(3), l(4), l(5)))
// org.apache.spark.rdd.RDD[(String, String, String, String, String, String)] = ...
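
As a possible variation (a sketch, not part of the approach above), the same regex can be used as a pattern-matching extractor, which binds the six groups directly and lets non-matching lines be skipped instead of failing on l(0). The names RegexColumns and parseLine are made up for illustration, and the snippet uses plain Scala so it can be tried without a SparkContext. Note that a Regex extractor only matches when the whole line matches the pattern.

```scala
import scala.util.matching.Regex

// Hypothetical helper object; the name is illustrative only.
object RegexColumns {
  // Same pattern as in the question.
  val pattern: Regex = """(\S+) "([\S\s]+)\" (\S+) (\S+) (\S+) (\S+)""".r

  // Returns the six captured groups as a tuple,
  // or None when the line does not match the pattern in full.
  def parseLine(line: String): Option[(String, String, String, String, String, String)] =
    line match {
      case pattern(a, b, c, d, e, f) => Some((a, b, c, d, e, f))
      case _                         => None
    }
}
```

With an RDD this could presumably be applied as rdd.flatMap(line => RegexColumns.parseLine(line)), which drops the lines that don't match and yields an RDD of 6-tuples.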
