
I ran into a problem while using Spark with Python 3 in my project. In a key-value pair such as ('1', '+1 2,3'), the "2,3" part is the content I want to process. So I wrote the following code (assume this key-value pair is saved in an RDD called p_list):


def add_label(x):
    label = x[1].split()[0]
    value = x[1].split()[1].split(",")
    for i in value:
        return (i, label)

p_list = p_list.map(add_label)

After doing that, I only get the result ('2', '+1'), but it should be both ('2', '+1') and ('3', '+1'). It seems the for loop in the map operation only runs once. How can I make it run multiple times? Or is there another way to implement something like a for loop inside a map or reduce operation?

I want to mention that what I'm really dealing with is a large dataset, so I have to use an AWS cluster and implement the loop with parallelization. The worker nodes in the cluster don't seem to handle the loop. How can I express this with a Spark RDD function, or implement such a loop operation in a pipelined way (which is one of the main design points of Spark RDDs)?

  • I'm tempted to remove the spark tags from this question because nothing about this is really spark-specific other than the RDD.map function Commented Nov 18, 2016 at 21:48
  • Sorry for not mentioning the details of my project. It is really a spark application. The problem above is just an abstract of the main problem I met. There will be a bunch of key-value pairs, like ('1','+1 2,3'), saved in the rdd. I need to compare the label and the following child nodes, and return each (child node, label) for all key-value pairs. The whole operation may be RDD.map().filter().reduceByKey() and others. It should be done on AWS with spark cluster. Commented Nov 18, 2016 at 23:59
  • Have you tried rdd.foreach()? Commented Nov 19, 2016 at 0:47
  • rdd.foreach() applies a function to all elements in RDD. I can also apply a function to every element with rdd.map() or rdd.reduce(). Any difference? Actually I tried to set a for loop in rdd.foreach(), but it still just did once for every element. I need to do an operation multiple times for every element, not just once. Commented Nov 19, 2016 at 1:10
  • I ran this on a single computer, and it worked well. But when I ran it in pseudo-distributed or fully distributed mode (like AWS), the result showed that this operation was only done once. Why didn't the for loop work in distributed computing? Commented Nov 19, 2016 at 1:16

1 Answer


Your return statement cannot be inside the loop; otherwise, it returns after the first iteration and never reaches the second one.

What you could try instead is accumulating the tuples in a list and returning it once, after the loop:

def add_label(x):
    label = x[1].split()[0]
    value = x[1].split()[1].split(",")
    result = []
    for i in value:
        result.append((i, label))
    return result

and then result would be a list of all of the tuples created inside the loop.
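Putting the pieces together, here is a minimal sketch of the corrected mapper, run as a plain Python function on the ('1', '+1 2,3') sample pair from the question (no cluster needed to verify the logic):

```python
def add_label(x):
    # x is a pair like ('1', '+1 2,3'); x[1] holds "<label> <child,child,...>"
    label = x[1].split()[0]                 # e.g. '+1'
    children = x[1].split()[1].split(",")   # e.g. ['2', '3']
    # Build the whole list first, then return it once after the loop
    return [(child, label) for child in children]

print(add_label(('1', '+1 2,3')))  # [('2', '+1'), ('3', '+1')]
```

Note that p_list.map(add_label) would now yield one list per record; to get individual (child, label) pairs back out, you would flatten those lists, e.g. with p_list.flatMap(add_label).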


1 Comment

I get your point and will try to use rdd.flatMap() to flatten a list of results for every element in rdd. Thank you.
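To illustrate what that flattening does, here is a plain-Python mock of the map vs. flatMap behaviour; the two-record data list is a made-up sample, not from the question:

```python
from itertools import chain

def add_label(x):
    # Split "<label> <child,child,...>" into its two parts
    label, children = x[1].split()
    return [(c, label) for c in children.split(",")]

data = [('1', '+1 2,3'), ('2', '-1 4')]

mapped = [add_label(x) for x in data]     # like rdd.map(add_label): nested lists
flat = list(chain.from_iterable(mapped))  # like rdd.flatMap(add_label): flattened
print(flat)  # [('2', '+1'), ('3', '+1'), ('4', '-1')]
```

In Spark itself the last two lines collapse to a single p_list.flatMap(add_label), which runs in parallel across the worker nodes.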
