I have an RDD with the following type:

org.apache.spark.rdd.RDD[(String, Array[String])] = MappedRDD[40] at map at <console>:14

Here is what x.take(1) looks like:

Array[(String, Array[String])] = Array((8239427349237423,Array(122641|2|2|1|1421990315711|38|6487985623452037|684|, 1229|2|1|1|1411349089424|87|462966136107937|1568|.....))

For each string in the array, I want to split on "|", take the item at index 6 (zero-based), and return it together with the first element of the tuple, as follows:

8239427349237423-6487985623452037
8239427349237423-4629661361079371

I started as follows:

  def getValues(lines: Array[String]) {
    for(line <- lines) {
      line.split("|")(6)
    }
  }

I also tried the following:

val b= x.map(a => (a._1, a._2.flatMap(y => y.split("|")(6))))

But that ended up giving me the following:

Array[(String, Array[Char])] = Array((8239427349237423,Array(1, 2, 4, |, 9, |, 4, 1, 7, 6, |, 2, 9, 2, 7, 2, |, 7, |,....)))

1 Answer

If you want to do it for the whole x you can use flatMap:

def getValues(x: Array[(String, Array[String])]) =
  x flatMap (line => line._2 map (line._1 + "-" + _.split("\\|")(6)))

Or, maybe a bit more clearly, with for-comprehension:

def getValues(x: Array[(String, Array[String])]) = 
  for {
    (fst, snd) <- x
    line <- snd
  } yield fst + "-" + line.split("\\|")(6)

You have to call split with "\\|" because String.split takes a regular expression, and | is the regex alternation operator, so it must be escaped to match a literal pipe. (Edit: or you can pass '|' as a Char, as suggested by @BenReich.)
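To make the difference concrete, here is a minimal sketch that splits one of the sample lines from the question in four ways. The object name SplitDemo and its field names are made up for illustration. With the unescaped pattern the line is cut into pieces of at most one character, so index 6 of that result is a one-character string, and flatMap over it yields Char values, which would explain the Array[Char] seen in the question.

```scala
import java.util.regex.Pattern

// SplitDemo is a hypothetical name used only for this illustration.
object SplitDemo {
  // One of the sample lines shown in the question.
  val line = "122641|2|2|1|1421990315711|38|6487985623452037|684|"

  // Unescaped: "|" is a regex alternation of two empty patterns, so it
  // matches the empty string at every position and the line is cut
  // into pieces of at most one character.
  val unescaped: Array[String] = line.split("|")

  // Escaped: matches a literal pipe character.
  val escaped: Array[String] = line.split("\\|")

  // Char overload provided by Scala on strings: no regex escaping needed.
  val byChar: Array[String] = line.split('|')

  // Pattern.quote builds a regex that matches its argument literally.
  val quoted: Array[String] = line.split(Pattern.quote("|"))
}
```

The last three variants should agree with each other; which one to use is mostly a matter of taste, with the Char overload being the shortest.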

To answer your comment, you can modify getValues to take a single element from x as an argument:

def getValues(item: (String, Array[String])) =
  item._2 map (item._1 + "-" + _.split('|')(6))

And then call it with

x flatMap getValues
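As a rough end-to-end check, the per-element version can be tried on a local Array built from the two sample lines visible in the question (the elided rows are left out). The object name GetValuesDemo is made up for illustration, and a plain Array stands in for the RDD here as an assumption; with a real RDD the same function would be passed to the RDD's flatMap as shown above, which is not exercised in this sketch because it needs a Spark context.

```scala
// GetValuesDemo is a hypothetical name used only for this illustration.
object GetValuesDemo {
  // Local stand-in for the RDD contents, copied from the question's
  // x.take(1) output; only the two visible lines are included.
  val x: Array[(String, Array[String])] = Array(
    ("8239427349237423", Array(
      "122641|2|2|1|1421990315711|38|6487985623452037|684|",
      "1229|2|1|1|1411349089424|87|462966136107937|1568|"
    ))
  )

  // Per-element version from the answer: prefix the field at index 6
  // of every line with the tuple's first element.
  def getValues(item: (String, Array[String])): Array[String] =
    item._2.map(line => item._1 + "-" + line.split('|')(6))

  // flatMap concatenates the per-element arrays into one flat array.
  val result: Array[String] = x.flatMap(item => getValues(item))

  def main(args: Array[String]): Unit = result.foreach(println)
}
```

Note that line.split('|')(6) throws an ArrayIndexOutOfBoundsException for any line with fewer than seven fields, so real data may need a length check first.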

9 Comments

You can also call split with '|' (the character instead of the string).
You can also use string interpolation here, which is sometimes clearer: x flatMap { case(header, strings) => strings.map(str => s"$header-${str.split('|')(6)}") }
@Kolmar When I pass x to the method I get a type mismatch. My x is an RDD: org.apache.spark.rdd.RDD[(String, Array[String])] = MappedRDD[40] at map at <console>:14
@Null-Hypothesis So change the type of the x parameter to org.apache.spark.rdd.RDD[(String, Array[String])] in the function definition.
@Null-Hypothesis For your second question, look up anonymous function syntax and placeholder syntax in Scala. Each underscore _ becomes a new function argument. So line._1 + "-" + _.split('|')(6) + _.split('|')(5) translates to a function (t1, t2) => line._1 + "-" + t1.split('|')(6) + t2.split('|')(5), but you want only one argument, so in that case you would have to use t => line._1 + "-" + t.split('|')(6) + t.split('|')(5) with an explicit argument t instead of _
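The placeholder rule described in the last comment can be sketched as follows. The object name PlaceholderDemo and the value names are made up for illustration, and the sample row is one of the lines from the question.

```scala
// PlaceholderDemo is a hypothetical name used only for this illustration.
object PlaceholderDemo {
  val header = "8239427349237423"
  val rows = Array("122641|2|2|1|1421990315711|38|6487985623452037|684|")

  // One underscore: shorthand for t => header + "-" + t.split('|')(6)
  val oneField: Array[String] = rows.map(header + "-" + _.split('|')(6))

  // Using the same argument twice requires an explicitly named parameter.
  val twoFields: Array[String] =
    rows.map(t => header + "-" + t.split('|')(6) + t.split('|')(5))

  // Two underscores: each one becomes its own parameter, so this is a
  // function of two strings and cannot be passed to map on Array[String].
  val twoArgs: (String, String) => String =
    header + "-" + _.split('|')(6) + _.split('|')(5)
}
```

Here twoFields reads fields 6 and 5 from the same line, while twoArgs reads field 6 from its first argument and field 5 from its second.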