Apache Spark and Python lambda

Question

I have the following code:

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

I copied this example from http://spark.apache.org/examples.html.

I am unable to understand this code, especially the keywords

flatMap,
map and
reduceByKey.

Can someone please explain in plain English what's going on?

I'm not an expert, but I think flatMap builds a flat list from a nested structure (a list of lines of words?), map applies the function to every element, and reduceByKey groups the elements by key (here the identical words, I guess) and applies the function (here a sum) pairwise. That probably counts the occurrences of each word in a text.
– user189, Commented Jul 4, 2014 at 13:12
Things get much more concise and easier to read if you use a functional language to do functional programming, i.e. I highly suggest using Scala instead of an OO scripting language. Scala is more powerful, slightly more performant for Spark, and makes digging into the Spark code easier. Your code just becomes: spark.textFile("hdfs://...").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).saveAsTextFile("hdfs://...")
– samthebest, Commented Jul 6, 2014 at 9:38

Community · Accepted Answer · 2017-05-23 12:26:17Z

map is the easiest: it applies the given operation to every element of the sequence and returns the resulting sequence (similar to foreach, except that the results are kept). flatMap is the same thing, but instead of returning exactly one element per input element, the function may return a sequence (which can be empty), and the returned sequences are concatenated into one. Here's an answer explaining the difference between map and flatMap. Lastly, reduceByKey takes an aggregate function (one that takes two arguments of the same type and returns that type; it should also be commutative and associative, otherwise you will get inconsistent results), which is used to aggregate every V for each K in your sequence of (K, V) pairs.
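To make the difference between map and flatMap concrete, here is a minimal sketch in plain Python rather than Spark. It assumes that `itertools.chain.from_iterable` is a fair stand-in for the flattening that flatMap does; the sample `lines` are made up for illustration.

```python
from itertools import chain

# Hypothetical sample input: two "lines" of a text file
lines = ["to be or", "not to be"]

# map: exactly one output element per input element (here, one list per line)
mapped = list(map(lambda line: line.split(" "), lines))

# flatMap stand-in: each input may yield several outputs, flattened into one sequence
flat_mapped = list(chain.from_iterable(map(lambda line: line.split(" "), lines)))

print(mapped)       # [['to', 'be', 'or'], ['not', 'to', 'be']]
print(flat_mapped)  # ['to', 'be', 'or', 'not', 'to', 'be']
```

With map alone, the next step would receive whole lists instead of individual words, which is why the word count needs flatMap.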

EXAMPLE*:
from functools import reduce  # needed in Python 3; reduce is a built-in in Python 2
reduce(lambda a, b: a + b, [1, 2, 3, 4])

This says: aggregate the whole list with +, so it will do

1 + 2 = 3  
3 + 3 = 6  
6 + 4 = 10  
final result is 10
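The same intermediate steps can be printed from Python. This is a small sketch that uses `itertools.accumulate` from the standard library to show every running result, next to `functools.reduce`, which returns only the last one; the names `numbers` and `add` are just for illustration.

```python
from functools import reduce
from itertools import accumulate

numbers = [1, 2, 3, 4]
add = lambda a, b: a + b

running = list(accumulate(numbers, add))  # every intermediate result
total = reduce(add, numbers)              # only the final result

print(running)  # [1, 3, 6, 10]
print(total)    # 10
```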

reduceByKey is the same thing, except that you do a separate reduce for each unique key.
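A rough plain-Python sketch of that idea follows. `reduce_by_key` is a hypothetical helper written for this illustration, not Spark's implementation: it groups the values by key and then runs an ordinary reduce on each group.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(func, pairs):
    """Group values by key, then reduce each group with func (illustrative helper)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}

pairs = [("cat", 1), ("dog", 1), ("cat", 1), ("cat", 1)]
counts = reduce_by_key(lambda a, b: a + b, pairs)

print(counts)  # {'cat': 3, 'dog': 1}
```

Note that the lambda never sees the key: it only ever receives two values that belong to the same key.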

So, to explain it with your example:

file = spark.textFile("hdfs://...")  # open the text file; each element of the RDD is one line of the file
counts = (
    file.flatMap(lambda line: line.split(" "))  # flatMap returns every word (separated by a space) of every line as one flat sequence
        .map(lambda word: (word, 1))  # map each word to a value of 1 so the values can be summed
        .reduceByKey(lambda a, b: a + b)  # get an RDD with the count of every unique word by adding up all the 1s from the previous step
)
counts.saveAsTextFile("hdfs://...")  # save the result to HDFS
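If you want to see what each stage produces without a cluster, the pipeline can be traced on a small in-memory list. This is only a plain-Python imitation of the three stages under the assumption that the input fits in memory; the sample `lines` are made up and nothing here calls Spark.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical stand-in for the lines of the HDFS file
lines = ["the quick fox", "the lazy dog", "the fox"]

# Stage 1 (flatMap): split each line and flatten into one sequence of words
words = list(chain.from_iterable(line.split(" ") for line in lines))

# Stage 2 (map): pair every word with the count 1
pairs = [(word, 1) for word in words]

# Stage 3 (reduceByKey): add up the 1s separately for each word
counts = defaultdict(int)
for word, one in pairs:
    counts[word] = counts[word] + one

print(words[:4])     # ['the', 'quick', 'fox', 'the']
print(pairs[:2])     # [('the', 1), ('quick', 1)]
print(dict(counts))  # {'the': 3, 'quick': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```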

So why count words this way? The reason is that the MapReduce programming paradigm is highly parallelizable and thus scales to doing this computation on terabytes or even petabytes of data.
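The parallelism is also the reason the aggregate function should be commutative and associative. The sketch below imitates splitting the data into chunks that are reduced independently and then combined; `chunked_reduce` is a hypothetical helper for illustration, and real Spark partitioning is more involved than fixed-size slices.

```python
from functools import reduce

def chunked_reduce(func, values, chunk_size):
    """Reduce each chunk independently, then reduce the partial results (illustrative helper)."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    partials = [reduce(func, chunk) for chunk in chunks]
    return reduce(func, partials)

values = [1, 2, 3, 4, 5, 6]
add = lambda a, b: a + b  # associative and commutative
sub = lambda a, b: a - b  # neither

print(reduce(add, values), chunked_reduce(add, values, 2))  # 21 21
print(reduce(sub, values), chunked_reduce(sub, values, 2))  # -19 1
```

Addition gives the same answer however the work is split, while subtraction depends on the grouping, which is the kind of inconsistent result the answer warns about.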

* I don't use Python much; tell me if I made a mistake.

Community · Accepted Answer · 2017-05-23 12:10:39Z

5

See the inline comments:

file = spark.textFile("hdfs://...")  # opens a file
counts = (
    file.flatMap(lambda line: line.split(" "))  # iterate over the lines, split each line by space (into words)
        .map(lambda word: (word, 1))  # for each word, create the tuple (word, 1)
        .reduceByKey(lambda a, b: a + b)  # go over the tuples "by key" (first element) and sum the second elements
)
counts.saveAsTextFile("hdfs://...")

A more detailed explanation of reduceByKey can be found here

edited May 23, 2017 at 12:10

answered Jul 4, 2014 at 13:58

Nir Alfasi

4 Comments

jhon.smith Over a year ago

Sorry, I did not understand reduceByKey. In a normal lambda expression, lambda a, b: a + b means: for an input pair (a, b), give me the sum a + b as the result, doesn't it? But here it does something else; is it some weird syntax?

Nir Alfasi Over a year ago

To understand reduceByKey you first have to understand reduce. A simple reduce example: print(reduce(lambda a, b: a + b, [1, 2, 3])) (in Python 3, import reduce from functools first). It iterates over an iterable and applies the function (the first argument, here the lambda expression) to the first two elements, then applies it to that result and the third element, and so on.

jhon.smith Over a year ago

Hi alfasin, I reread your explanation and I only wish I could award points to you too. Your comment clears up my confusion about reduceByKey.

Nir Alfasi Over a year ago

@jhon.smith I'm glad I could help; the points here are meaningless (I can't use them to buy anything ;) cheers!

Alexk · Accepted Answer · 2014-09-13 23:32:22Z

1

The answers here are accurate at the code level, but it may help to understand what goes on under the hood.

My understanding is that when a reduce operation is called, there is a massive data shuffle: all K-V pairs produced by the map() operation that share the same key are assigned to one task, which sums the values in that collection of K-V pairs. These tasks are distributed across different physical processors, and the results are then collated with another data shuffle.

So if the map operation produces (cat 1) (cat 1) (dog 1) (cat 1) (cat 1) (dog 1)

The reduce operation produces (cat 4) (dog 2)
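As a rough illustration of that flow, the sketch below imitates it in plain Python. It also reflects a detail from Spark's documentation for reduceByKey: values are first merged locally on each partition before results are sent across the network, which keeps the shuffle smaller. The two partitions and both helper functions are hypothetical and only mimic the idea.

```python
def combine_locally(partition, func):
    """Reduce values per key inside one partition, before any data is moved."""
    combined = {}
    for key, value in partition:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined

def shuffle_and_merge(partials, func):
    """Bring the partial results for each key together and reduce them."""
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Two hypothetical partitions holding the (word, 1) pairs from the example above
partitions = [
    [("cat", 1), ("cat", 1), ("dog", 1)],
    [("cat", 1), ("cat", 1), ("dog", 1)],
]
add = lambda a, b: a + b

partials = [combine_locally(p, add) for p in partitions]
result = shuffle_and_merge(partials, add)

print(partials)  # [{'cat': 2, 'dog': 1}, {'cat': 2, 'dog': 1}]
print(result)    # {'cat': 4, 'dog': 2}
```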

Hope this helps

answered Sep 13, 2014 at 23:32

Alexk
