I ran into a problem while using Spark with Python 3 in my project. In a key-value pair like ('1', '+1 2,3'), the part "2,3" is the content I want to check, so I wrote the following code:
(Assume this key-value pair is stored in an RDD called p_list.)
    def add_label(x):
        # x is a pair like ('1', '+1 2,3'); the label is the first token
        label = x[1].split()[0]
        # the comma-separated values are the second token
        value = x[1].split()[1].split(",")
        for i in value:
            return (i, label)

    p_list = p_list.map(add_label)
After running this, I only get the result ('2', '+1'), when it should be both ('2', '+1') and ('3', '+1'). It seems that the for loop inside the map operation runs only once. How can I make it run for every element? Or is there another way to implement something like a for loop inside a map or reduce operation?
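For reference, here is a self-contained reproduction of what I am seeing (the local[*] master and the toy one-record RDD are just my test setup, not the real AWS job):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "repro")

    # same pair and function as above, on a one-record toy RDD
    p_list = sc.parallelize([('1', '+1 2,3')])

    def add_label(x):
        label = x[1].split()[0]
        value = x[1].split()[1].split(",")
        for i in value:
            return (i, label)  # returns on the first iteration of the loop

    print(p_list.map(add_label).collect())
    # actual output:   [('2', '+1')]
    # expected output: [('2', '+1'), ('3', '+1')]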
I should mention that what I am really dealing with is a large dataset, so I have to run this on an AWS cluster and parallelize the loop. The slave nodes in the cluster do not seem to execute the loop the way I expect. How can I express this with a Spark RDD function? Or how can I get such a loop-like operation in a pipelined way (which is one of the main design points of Spark RDDs)?
Should I be using the RDD.map function here at all, or something like rdd.foreach()?
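For what it is worth, the closest candidate I have found in the docs is RDD.flatMap, which flattens an iterable returned for each record; below is an untested sketch of what I mean (add_label_multi is just a name I made up):

    def add_label_multi(x):
        label = x[1].split()[0]
        values = x[1].split()[1].split(",")
        # return one (value, label) pair per element instead of
        # returning from inside the loop on the first iteration
        return [(v, label) for v in values]

    # flatMap flattens the returned list, so each pair becomes its own record
    p_list = p_list.flatMap(add_label_multi)
    # hoped-for result of p_list.collect(): [('2', '+1'), ('3', '+1')]

Is that the right approach, or is there a more idiomatic way to do this one-to-many expansion on a cluster?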