I am reading the Spark source code and am confused by the following code:

/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] =
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)

What is the input data for the map(x => (x, null)) call? Why and when can the input be omitted?

UPDATE:

Here is the link to the source code.

  • Link to the source code? Commented Jun 9, 2015 at 16:53
  • Hi @Daenyth Thanks for the reminder, I've added the link to the source code. Commented Jun 9, 2015 at 16:57

2 Answers

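distinct and map are both methods on the RDD class (source), so distinct is just calling another method on the same RDD. An unqualified call such as map(...) inside a method is shorthand for this.map(...), so the input is not omitted: it is the elements of the RDD on which distinct was called.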

The map function is a higher-order function, i.e. it accepts a function as one of its parameters (f: T => U):

/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

In the case of distinct, the parameter f to map is the anonymous function x => (x, null).

Here's a simple example of using an anonymous function (lambda) in the Scala REPL, using the similar map method on a Scala List rather than a Spark RDD:

scala> List(1,2,3).map(x => x + 1)
res0: List[Int] = List(2, 3, 4)
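
To make the implicit receiver concrete, here is a minimal sketch that mimics the shape of distinct without Spark. MiniRDD is a hypothetical stand-in backed by a plain Seq, and the groupBy step only approximates what reduceByKey((x, y) => x) does; none of this is Spark's actual implementation.

```scala
// Hypothetical stand-in for RDD, backed by a plain Seq (not Spark code).
class MiniRDD[T](val elems: Seq[T]) {
  // Higher-order method: applies f to every element of this collection.
  def map[U](f: T => U): MiniRDD[U] = new MiniRDD(elems.map(f))

  // The unqualified call `map(...)` below means `this.map(...)`,
  // so its input is the elements of the instance distinct was called on.
  def distinct: MiniRDD[T] = {
    val paired = map(x => (x, null)) // same as this.map(x => (x, null))
    // Keep one pair per key, standing in for reduceByKey((x, y) => x).
    val reduced = paired.elems.groupBy(_._1).values.map(_.head).toSeq
    new MiniRDD(reduced).map(_._1)
  }
}

// Element order is not guaranteed, because groupBy builds an unordered Map.
val result = new MiniRDD(Seq(1, 2, 2, 3, 3, 3)).distinct.elems
```

Outside the class there is no implicit this, so the receiver has to be written explicitly, as in someRdd.map(...).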

The map in map(x => (x, null)) is the map method defined by the RDD class itself.

I don't understand your question about omitting the input. In Scala, you can't call a function that expects an argument without supplying that argument.
