I am new to Spark and Scala. I want to sum up all the values present in the RDD. Below is an example. The RDD holds key-value pairs, and suppose that after some joins and transformations it has three records as below, where A is the key:

(A, List(1,1,1,1,1,1,1))
(A, List(1,1,1,1,1,1,1))
(A, List(1,1,1,1,1,1,1))

Now I want to sum each value with the corresponding value in the other records, so the output should look like:

(A, List(3,3,3,3,3,3,3))

Can anyone please help me out with this? Is there a way to achieve this using Scala?

Thanks in advance.

  • I tried to group them all and then add the elements by position, but could not get the required result. – Commented Jul 2, 2016 at 12:42

1 Answer


A naive approach is to use reduceByKey:

rdd.reduceByKey(
  (xs, ys) => xs.zip(ys).map { case (x, y) => x + y }
)
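For reference, here is how that plays out on the sample data. This is a minimal sketch, assuming an existing SparkContext named sc:

val rdd = sc.parallelize(Seq(
  ("A", List(1, 1, 1, 1, 1, 1, 1)),
  ("A", List(1, 1, 1, 1, 1, 1, 1)),
  ("A", List(1, 1, 1, 1, 1, 1, 1))
))

rdd.reduceByKey(
  (xs, ys) => xs.zip(ys).map { case (x, y) => x + y }
).collect()
// Array((A,List(3, 3, 3, 3, 3, 3, 3)))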

but it is rather inefficient because it creates a new List on each merge.

You can improve on that by using, for example, aggregateByKey with a mutable buffer:

rdd.aggregateByKey(Array.fill(7)(0))( // Zero value: a mutable buffer
  // For seqOp we mutate the accumulator in place
  (acc, xs) => {
    for {
      (x, i) <- xs.zipWithIndex
    } acc(i) += x
    acc
  },
  // For performance you could modify acc1 in place as above
  (acc1, acc2) => acc1.zip(acc2).map { case (x, y) => x + y }
).mapValues(_.toList)
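If you also want to avoid the allocation in the merge step, the in-place variant of combOp would look like the sketch below. It is safe here because every buffer starts from the same length-7 zero array:

(acc1, acc2) => {
  for (i <- acc2.indices) acc1(i) += acc2(i)  // mutate acc1 in place, no new array
  acc1
}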

It should also be possible to use DataFrames, but recent versions schedule these aggregations separately by default, so without adjusting the configuration it is probably not worth the effort.
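For the record, one possible DataFrame formulation is to explode the lists with their positions, sum per position, and reassemble the array. This is only a sketch, assuming a SparkSession named spark and Spark 2.1+ (for posexplode):

import org.apache.spark.sql.functions._
import spark.implicits._

val df = rdd.toDF("key", "values")

df.select($"key", posexplode($"values"))              // columns: key, pos, col
  .groupBy($"key", $"pos")
  .agg(sum($"col").as("total"))                       // sum each position across records
  .groupBy($"key")
  .agg(sort_array(collect_list(struct($"pos", $"total"))).as("pairs"))
  .select($"key", $"pairs.total".as("sums"))          // totals in position order
// +---+---------------------+
// |key|sums                 |
// +---+---------------------+
// |A  |[3, 3, 3, 3, 3, 3, 3]|
// +---+---------------------+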
